Google search: SRA download subset number of reads
In the link below, I learned the syntax to download a subset of fastq reads rather than an entire fastq file: https://bioinformatics.ccr.cancer.gov/docs/b4b/Module1_Unix_Biowulf/Lesson6/
scp gchaves@hb.ucsc.edu:/hb/home/gchaves/scripts/Signaling_GSE68952_H3K4me3.sh ./scripts
origin=/hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa
destination=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa
/hb/home/gchaves/programs/sshpass-1.06/sshpass -p "bioinfo" \
scp $origin bioinfo@200.128.7.33:$destination
scp -r gchaves@hb.ucsc.edu:/hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq /home/bioinfo/data/
scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma
scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360.sam bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma
scp -r /hb/home/gchaves/references/hg19_bme237/gtf/gencode.v19.annotation.gtf bioinfo@200.128.7.33:/home/bioinfo/data/references/
cd /home/bioinfo/data/references/hg19/
bwa index -a bwtsw GRCh37.p13.genome.fa
In this part I stored the metadata information to document the SRA file that I am downloading. Later I can match this file with the patient and RNA-Seq sample that it was extracted from.
library(maditr)
sra_run_table_df <- read.table("./scripts/SraRunTable.txt", sep = ",", header = T)
sra_run_table_selected <- sra_run_table_df %>% dplyr::select(Run, BioProject, BioSample, Experiment,
flowcell, Sample.Name, BARCODE)
Check flowcells and Barcodes
library(expss)
##
## Use 'expss_output_viewer()' to display tables in the RStudio Viewer.
## To return to the console output, use 'expss_output_default()'.
# cross_cases(sra_run_table_selected, flowcell)
# cross_cases(sra_run_table_selected, BioProject)
# cross_cases(sra_run_table_selected, BARCODE)
In this part we will analyze data from the Gene Expression Omnibus
(GEO). This is a link to GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi.
Neuroblastoma cell data from Chennakesavalu 2024 can be found in the
reference ID number GSE212882. we use SRA toolkit to access SRA files in
the GEO. This command loads the sratoolkit to the server
environment:
module load sratoolkit/3.0.0
This command downloads a limited number of reads using the SRA ID number that in this case is SRR899360:
## Declare human reference genome
## REFERENCE=~/references/hg19/GRCh37.p13.genome.fa
REFERENCE=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa
The command below downloads a limited number of reads (1000 reads) from fastq file using the SRA ID:
fastq-dump --split-files -X 1000 SRR21471195 --split-files SRR21471195 \
--outdir /hb/home/gchaves/results/neuroblastoma
## Specify fastq file
FastqR1=/hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_1.fastq
FastqR2=/hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_2.fastq
Align fastq file to reference genome using BWA
module load bwa
#bwa mem -t 4 $REFERENCE $FastqR1 > /hb/home/gchaves/results/neuroblastoma/SRR899360.sam
bwa mem -t 4 /hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa /hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_1.fastq /hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_2.fastq > /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195.sam
Sort SAM file
module load samtools
samtools sort /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195.sam > \
/hb/home/gchaves/results/neuroblastoma/sam/SRR21471195_sorted.sam
Convert SAM to BAM file
module load samtools
samtools view -S -b /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195_sorted.sam > \
/hb/home/gchaves/results/neuroblastoma/bam/SRR21471195.bam
Samtools uses reference FASTA to detect “piles” in the alignment
module load bcftools
bcftools mpileup -f /hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa /hb/home/gchaves/results/neuroblastoma/bam/SRR21471195.bam > \
/hb/home/gchaves/results/neuroblastoma/bcf/SRR21471195.bcf
Bcftools extracts SNPs
bcftools view -v snps /hb/home/gchaves/results/neuroblastoma/bcf/SRR21471195.bcf > /hb/home/gchaves/results/neuroblastoma/snps_vcf/SRR21471195_snps.vcf
Transfer VCF file
scp -r gchaves@hb.ucsc.edu:/hb/home/gchaves/results/neuroblastoma/snps_vcf/SRR21471195_snps.vcf ./
Finally we export the number of counts to the counts file. According to this reference, p is an option for paired-end fastq files:
/home/bioinfo/data/programs/subread-2.0.6-Linux-x86_64/bin/featureCounts -T 16 -a \
/home/bioinfo/data/references/gencode.v19.annotation.gtf -t gene -g \
gene_name -o /home/bioinfo/data/neuroblastoma/SRR899360.counts \
/home/bioinfo/data/neuroblastoma/SRR899360.sam
Inspect the counts file:
cd /hb/home/gchaves/results/neuroblastoma/
head SRR899360.counts
wc -l SRR899360.counts