Google search: SRA download subset number of reads
In the link below, I learned the syntax to download a subset of fastq reads rather than an entire fastq file: https://bioinformatics.ccr.cancer.gov/docs/b4b/Module1_Unix_Biowulf/Lesson6/
scp gchaves@hb.ucsc.edu:/hb/home/gchaves/scripts/Signaling_GSE68952_H3K4me3.sh ./scripts
origin=/hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa
destination=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa
/hb/home/gchaves/programs/sshpass-1.06/sshpass -p "bioinfo" \
scp $origin bioinfo@200.128.7.33:$destination
scp -r gchaves@hb.ucsc.edu:/hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq /home/bioinfo/data/
scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma
scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360.sam bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma
scp -r /hb/home/gchaves/references/hg19_bme237/gtf/gencode.v19.annotation.gtf bioinfo@200.128.7.33:/home/bioinfo/data/references/
cd /home/bioinfo/data/references/hg19/
bwa index -a bwtsw GRCh37.p13.genome.fa
In this part I stored the metadata information to document the SRA file that I am downloading. Later I can match this file with the patient and RNA-Seq sample that it was extracted from.
library(maditr)
sra_run_table_df <- read.table("./scripts/SraRunTable.txt", sep = ",", header = T)
sra_run_table_selected <- sra_run_table_df %>% dplyr::select(Run, BioProject, BioSample, Experiment,
flowcell, Sample.Name, BARCODE)
Check flowcells and Barcodes
library(expss)
# cross_cases(sra_run_table_selected, flowcell)
# cross_cases(sra_run_table_selected, BioProject)
# cross_cases(sra_run_table_selected, BARCODE)
This command loads the sratoolkit to the server environment:
module load sratoolkit/3.0.0
This command downloads a limited number of reads using the SRA ID number that in this case is SRR899360:
## Declare human reference genome
## REFERENCE=~/references/hg19/GRCh37.p13.genome.fa
REFERENCE=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa
## Download fastq file
fastq-dump --split-files -X 1000 SRR899360 --split-files SRR899360 \
--outdir /hb/home/gchaves/results/neuroblastoma
## Specify fastq file
FastqR1=/hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq
## Align
module load bwa
#bwa mem -t 4 $REFERENCE $FastqR1 > /hb/home/gchaves/results/neuroblastoma/SRR899360.sam
bwa mem -t 4 /home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa /home/bioinfo/data/neuroblastoma/SRR899360_1.fastq > /home/bioinfo/data/neuroblastoma/SRR899360.bam
Finally we export the number of counts to the counts file. According to this reference, p is an option for paired-end fastq files:
/home/bioinfo/data/programs/subread-2.0.6-Linux-x86_64/bin/featureCounts -T 16 -a \
/home/bioinfo/data/references/gencode.v19.annotation.gtf -t gene -g \
gene_name -o /home/bioinfo/data/neuroblastoma/SRR899360.counts \
/home/bioinfo/data/neuroblastoma/SRR899360.sam
Inspect the counts file:
cd /hb/home/gchaves/results/neuroblastoma/
head SRR899360.counts
wc -l SRR899360.counts