References


Google search: SRA download subset number of reads

In the link below, I learned the syntax to download a subset of fastq reads rather than an entire fastq file: https://bioinformatics.ccr.cancer.gov/docs/b4b/Module1_Unix_Biowulf/Lesson6/


Transfer files


Transfer an alignment pipeline script

scp gchaves@hb.ucsc.edu:/hb/home/gchaves/scripts/Signaling_GSE68952_H3K4me3.sh ./scripts


Transfer the human reference genome

origin=/hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa
destination=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa

/hb/home/gchaves/programs/sshpass-1.06/sshpass -p "bioinfo" \
scp $origin bioinfo@200.128.7.33:$destination


Transfer featureCounts

scp -r gchaves@hb.ucsc.edu:/hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq /home/bioinfo/data/


Transfer fastq file

scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360_1.fastq bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma


Transfer sam file

scp -r /hb/home/gchaves/results/neuroblastoma/SRR899360.sam bioinfo@200.128.7.33:/home/bioinfo/data/neuroblastoma


Transfer GTF file

scp -r /hb/home/gchaves/references/hg19_bme237/gtf/gencode.v19.annotation.gtf  bioinfo@200.128.7.33:/home/bioinfo/data/references/


Build the index

cd /home/bioinfo/data/references/hg19/
bwa index -a bwtsw GRCh37.p13.genome.fa


Look at GEO Metadata File


In this part I stored the metadata information to document the SRA file that I am downloading. Later I can match this file with the patient and RNA-Seq sample that it was extracted from.

library(maditr)
sra_run_table_df <- read.table("./scripts/SraRunTable.txt", sep = ",", header = T)
sra_run_table_selected <- sra_run_table_df %>% dplyr::select(Run, BioProject, BioSample, Experiment, 
                                                             flowcell, Sample.Name, BARCODE)


Check flowcells and Barcodes

library(expss)
## 
## Use 'expss_output_viewer()' to display tables in the RStudio Viewer.
##  To return to the console output, use 'expss_output_default()'.
# cross_cases(sra_run_table_selected, flowcell)
# cross_cases(sra_run_table_selected, BioProject)
# cross_cases(sra_run_table_selected, BARCODE)


Download fastq file


In this part we will analyze data from the Gene Expression Omnibus (GEO). This is a link to GEO: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi. Neuroblastoma cell data from Chennakesavalu 2024 can be found in the reference ID number GSE212882. we use SRA toolkit to access SRA files in the GEO. This command loads the sratoolkit to the server environment:

module load sratoolkit/3.0.0


This command downloads a limited number of reads using the SRA ID number that in this case is SRR899360:

## Declare human reference genome
## REFERENCE=~/references/hg19/GRCh37.p13.genome.fa
REFERENCE=/home/bioinfo/data/references/hg19/GRCh37.p13.genome.fa


The command below downloads a limited number of reads (1000 reads) from fastq file using the SRA ID:

fastq-dump --split-files -X 1000 SRR21471195 --split-files SRR21471195 \
--outdir /hb/home/gchaves/results/neuroblastoma

## Specify fastq file
FastqR1=/hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_1.fastq 
FastqR2=/hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_2.fastq


Alignment and Variant Call


Align fastq file to reference genome using BWA

module load bwa
#bwa mem -t 4 $REFERENCE $FastqR1 > /hb/home/gchaves/results/neuroblastoma/SRR899360.sam
bwa mem -t 4 /hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa /hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_1.fastq /hb/home/gchaves/results/neuroblastoma/fastq/SRR21471195_2.fastq > /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195.sam


Sort SAM file

module load samtools
samtools sort /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195.sam > \
/hb/home/gchaves/results/neuroblastoma/sam/SRR21471195_sorted.sam


Convert SAM to BAM file

module load samtools
samtools view -S -b /hb/home/gchaves/results/neuroblastoma/sam/SRR21471195_sorted.sam > \
/hb/home/gchaves/results/neuroblastoma/bam/SRR21471195.bam


Samtools uses reference FASTA to detect “piles” in the alignment

module load bcftools
bcftools mpileup -f /hb/home/gchaves/references/hg19/GRCh37.p13.genome.fa /hb/home/gchaves/results/neuroblastoma/bam/SRR21471195.bam > \
/hb/home/gchaves/results/neuroblastoma/bcf/SRR21471195.bcf


Bcftools extracts SNPs

bcftools view -v snps /hb/home/gchaves/results/neuroblastoma/bcf/SRR21471195.bcf > /hb/home/gchaves/results/neuroblastoma/snps_vcf/SRR21471195_snps.vcf


Transfer VCF file

scp -r gchaves@hb.ucsc.edu:/hb/home/gchaves/results/neuroblastoma/snps_vcf/SRR21471195_snps.vcf ./



Extract counts with featureCounts


Finally we export the number of counts to the counts file. According to this reference, p is an option for paired-end fastq files:

/home/bioinfo/data/programs/subread-2.0.6-Linux-x86_64/bin/featureCounts -T 16 -a \
/home/bioinfo/data/references/gencode.v19.annotation.gtf -t gene -g \
gene_name -o /home/bioinfo/data/neuroblastoma/SRR899360.counts \
/home/bioinfo/data/neuroblastoma/SRR899360.sam


Inspect the counts file:

cd /hb/home/gchaves/results/neuroblastoma/
head SRR899360.counts
wc -l SRR899360.counts