Gepoliano Chaves, Ph. D.
September 13, 2023
In Chaves (2019), we discuss on the application of knowledge on the Biological Association between genotype (the DNA polymorphisms) and phenotype using statistical software (PLINK). We also had an introduction to programming, with the installation of libraries in R and and Bash and a brief discussion on how to process files using R and Bash. Here, we expand the notion of a text file, to the notion of information stored in the form of text. Therefore, in a file, we can also store a biological sequence. The first biological sequence storage file that we should discuss should be the FASTA file. Our discussion should also include information stored in FASTQ, SAM, BAM, BCF and VCF files. In this notebook, we will use information stored in a VCF file to generate commands that allow us to extract useful information from the files we analyze.
In this notebook, we deepen the analysis of sequences with the beginning of the Identification of Genetic Variants. Here, we treat genetic variants as DNA polymorphisms. A DNA polymorphism is a mutation or variant, which differs from the nucleotide found at the same position in the FASTA sequence used as a reference. Thus, the present pipeline defines the computational protocol for Biological or Genetic Association with a phenotypic trait. In English this computational protocol is called Variant Call.
In this Booklet, we will make a Variant Call using FASTA files that contain the SARS-CoV-2 genome, which is the biological sequence of the virus genome, in a text file, specifically called FASTA (Figure 1). We identified SARS-CoV-2 variants using samtools and bcftools to extract genetic polymorphisms from FASTA files. The FASTA file must be downloaded from GISAID and stored locally, in a folder on the researcher’s local system. Genetic variants, also called SNPs (Single Nucleotide Polymorphisms), are determined by comparing a FASTA file used as a reference, with the isolated sequence of the geographic region of interest.
By comparing the FASTA sequence of each region of the planet to a reference FASTA sequence, for example the first FASTA sequence corresponding to the initial patient presenting symptoms of respiratory syndrome, later identified as caused by SARS-CoV-2, we can note each position where there is a nucleotide different from the FASTA reference sequence obtained from the first Chinese patient. We say, then, that in the other regions, there are virus mutations, or, more technically, polymorphisms of the viral genetic material. FASTA files from virtually every region of the planet can be obtained from the German GISAID (Global Initiative on Sharing Avian Influenza Data) database. Information about the GISAID database can be obtained from its website:
In Computational Biology Notebook Number 6, we will learn how to identify genetic variants using FASTQ files instead of FASTA files. FASTQ files can be analyzed directly from a database called Gene Expression Omnibus (GEO). The advantage of GEO compared to GISAID is that if the proper pipeline of identification of genetic variants in GEO is used, local storage of sequencing files is not required.
Often, the amount of sequencing data to be analyzed is so large that it is necessary to use computers with super-powerful storage capacity. The part below, encoded in Bash, is made using the Linux Operating System. The code represents a type of “header” that must be included for submitting scripts using the Linux system on computer servers.
#!/bin/bash
#SBATCH --partition=128x24
##SBATCH --job-name=Variants_BWA # Job name
##SBATCH --mail-type=ALL # Mail events (NONE, BEGIN, END, FAIL, ALL)
##SBATCH --mail-user=gchaves@ucsc.edu # Where to send mail
##SBATCH --nodes=1 # Use one node
##SBATCH --ntasks=1 # Run a single task
##SBATCH --cpus-per-task=4 # Number of CPU cores per task
##SBATCH --output=Variants_BWA # Standard output and error log
#
# module load gcc/5.4.0
source ~/.bashrc
File organization is essential for clear and organized steps in the various procedures used, often without perfect visualization of how the computer performs. The excerpt below creates folders for the storage of the reference FASTA file, the FASTA files of the region of interest (Brazil), the SAM and BAM alignment files and finally, the BCF and VCF files, which contain information about the location in the linear genome , of the position of the identified mutations. VCF files are extracted in two ways, one containing SNPs and one containing indels.
# Store Figures
if (!dir.exists("../figures")) {
dir.create("../figures")
}
# Store reference FASTA
if (!dir.exists("../fasta_reference_file")) {
dir.create("../fasta_reference_file")
}
# Store reference FASTA
if (!dir.exists("../regions")) {
dir.create("../regions")
}
# Store files from Brazil
if (!dir.exists("../regions/brazil")) {
dir.create("../regions/brazil")
}
# Create the year for samples
if (!dir.exists("../regions/brazil/2020")) {
dir.create("../regions/brazil/2020")
}
# Store FASTA for regions
if (!dir.exists("../regions/brazil/2020/fasta/")) {
dir.create("../regions/brazil/2020/fasta/")
}
# Store BAM files
if (!dir.exists("../regions/brazil/2020/bam")) {
dir.create("../regions/brazil/2020/bam")
}
# Store BCF files
if (!dir.exists("../regions/brazil/2020/bcf")) {
dir.create("../regions/brazil/2020/bcf")
}
# Store FASTA files
if (!dir.exists("../regions/brazil/2020/fasta")) {
dir.create("../regions/brazil/2020/fasta")
}
# Store SAM files
if (!dir.exists("../regions/brazil/2020/sam")) {
dir.create("../regions/brazil/2020/sam")
}
# Store VFC files
if (!dir.exists("../regions/brazil/2020/vcf")) {
dir.create("../regions/brazil/2020/vcf")
}
# Store VCF from SNPs
if (!dir.exists("../regions/brazil/2020/vcf_snp")) {
dir.create("../regions/brazil/2020/vcf_snp")
}
# Store VCF from indels
if (!dir.exists("../regions/brazil/2020/vcf_indel")) {
dir.create("../regions/brazil/2020/vcf_indel")
}
The following steps must be performed by the genetic variant identification pipeline:
Indexing
Alignment
Identification of variants
Extracting and exporting variants
Calculation of genotype or allele frequencies
Indexing only needs to be done once
The same index is used for aligning different samples using a single FASTA reference sequence.
## [bwa_index] Pack FASTA... 0.00 sec
## [bwa_index] Construct BWT for the packed sequence...
## [BWTIncCreate] textLength=59806, availableWord=76494
## [bwt_gen] Finished constructing BWT in 5 iterations.
## [bwa_index] 0.00 seconds elapse.
## [bwa_index] Update BWT... 0.00 sec
## [bwa_index] Pack forward-only FASTA... 0.00 sec
## [bwa_index] Construct SA from BWT and Occ... 0.00 sec
## [main] Version: 0.7.17-r1188
## [main] CMD: /Users/gepolianochaves/anaconda3/bin/bwa index -a bwtsw ../fasta_reference_file/SARS-CoV-2.fasta
## [main] Real time: 0.010 sec; CPU: 0.015 sec
Indices allow the aligner to narrow down the potential origin of a query sequence within the genome, saving both time and memory.
You may have a look at Ben Langmeads youtube channel (creator of the bowtie and bowtie2 alignment software), where he lectures about concepts of sequence alignments, including the idea of indexing.
Download FASTA files by region and save in SARS-CoV-2_Regions folder, to split it by sample;
Below, the FASTA file was downloaded and saved as Brazil_2020_07_01-2020_12_31.fasta, indicating the period and geographic region;
The Brazil_2020_07_01-2020_12_31.fasta file must be saved in SARS-CoV-2_Regions before running chunk below;
The FASTA file downloaded from GISAID is named according to the region (Brazil) and the dates of collection and submission of the files;
It is necessary to install the splitfasta program to “break” the downloaded FASTA file.
Move FASTA file to specific region
We want to see frequency of new variants between July and December
We need to pay attention to the names of created folders and files
## Brazil_2020_07_01-2020_12_31_1.fasta
## Brazil_2020_07_01-2020_12_31_10.fasta
## Brazil_2020_07_01-2020_12_31_100.fasta
## Brazil_2020_07_01-2020_12_31_101.fasta
## Brazil_2020_07_01-2020_12_31_102.fasta
## Brazil_2020_07_01-2020_12_31_103.fasta
## Brazil_2020_07_01-2020_12_31_104.fasta
## Brazil_2020_07_01-2020_12_31_105.fasta
## Brazil_2020_07_01-2020_12_31_106.fasta
## Brazil_2020_07_01-2020_12_31_107.fasta
## Brazil_2020_07_01-2020_12_31_108.fasta
## Brazil_2020_07_01-2020_12_31_109.fasta
## Brazil_2020_07_01-2020_12_31_11.fasta
## Brazil_2020_07_01-2020_12_31_110.fasta
## Brazil_2020_07_01-2020_12_31_111.fasta
## Brazil_2020_07_01-2020_12_31_112.fasta
## Brazil_2020_07_01-2020_12_31_113.fasta
## Brazil_2020_07_01-2020_12_31_114.fasta
## Brazil_2020_07_01-2020_12_31_115.fasta
## Brazil_2020_07_01-2020_12_31_116.fasta
## Brazil_2020_07_01-2020_12_31_117.fasta
## Brazil_2020_07_01-2020_12_31_118.fasta
## Brazil_2020_07_01-2020_12_31_119.fasta
## Brazil_2020_07_01-2020_12_31_12.fasta
## Brazil_2020_07_01-2020_12_31_120.fasta
## Brazil_2020_07_01-2020_12_31_121.fasta
## Brazil_2020_07_01-2020_12_31_122.fasta
## Brazil_2020_07_01-2020_12_31_123.fasta
## Brazil_2020_07_01-2020_12_31_124.fasta
## Brazil_2020_07_01-2020_12_31_125.fasta
## Brazil_2020_07_01-2020_12_31_126.fasta
## Brazil_2020_07_01-2020_12_31_127.fasta
## Brazil_2020_07_01-2020_12_31_128.fasta
## Brazil_2020_07_01-2020_12_31_129.fasta
## Brazil_2020_07_01-2020_12_31_13.fasta
## Brazil_2020_07_01-2020_12_31_130.fasta
## Brazil_2020_07_01-2020_12_31_131.fasta
## Brazil_2020_07_01-2020_12_31_132.fasta
## Brazil_2020_07_01-2020_12_31_133.fasta
## Brazil_2020_07_01-2020_12_31_134.fasta
## Brazil_2020_07_01-2020_12_31_135.fasta
## Brazil_2020_07_01-2020_12_31_136.fasta
## Brazil_2020_07_01-2020_12_31_137.fasta
## Brazil_2020_07_01-2020_12_31_138.fasta
## Brazil_2020_07_01-2020_12_31_139.fasta
## Brazil_2020_07_01-2020_12_31_14.fasta
## Brazil_2020_07_01-2020_12_31_140.fasta
## Brazil_2020_07_01-2020_12_31_141.fasta
## Brazil_2020_07_01-2020_12_31_142.fasta
## Brazil_2020_07_01-2020_12_31_143.fasta
## Brazil_2020_07_01-2020_12_31_144.fasta
## Brazil_2020_07_01-2020_12_31_145.fasta
## Brazil_2020_07_01-2020_12_31_146.fasta
## Brazil_2020_07_01-2020_12_31_147.fasta
## Brazil_2020_07_01-2020_12_31_148.fasta
## Brazil_2020_07_01-2020_12_31_149.fasta
## Brazil_2020_07_01-2020_12_31_15.fasta
## Brazil_2020_07_01-2020_12_31_150.fasta
## Brazil_2020_07_01-2020_12_31_151.fasta
## Brazil_2020_07_01-2020_12_31_152.fasta
## Brazil_2020_07_01-2020_12_31_153.fasta
## Brazil_2020_07_01-2020_12_31_154.fasta
## Brazil_2020_07_01-2020_12_31_155.fasta
## Brazil_2020_07_01-2020_12_31_156.fasta
## Brazil_2020_07_01-2020_12_31_157.fasta
## Brazil_2020_07_01-2020_12_31_158.fasta
## Brazil_2020_07_01-2020_12_31_159.fasta
## Brazil_2020_07_01-2020_12_31_16.fasta
## Brazil_2020_07_01-2020_12_31_160.fasta
## Brazil_2020_07_01-2020_12_31_161.fasta
## Brazil_2020_07_01-2020_12_31_162.fasta
## Brazil_2020_07_01-2020_12_31_163.fasta
## Brazil_2020_07_01-2020_12_31_164.fasta
## Brazil_2020_07_01-2020_12_31_165.fasta
## Brazil_2020_07_01-2020_12_31_166.fasta
## Brazil_2020_07_01-2020_12_31_167.fasta
## Brazil_2020_07_01-2020_12_31_168.fasta
## Brazil_2020_07_01-2020_12_31_169.fasta
## Brazil_2020_07_01-2020_12_31_17.fasta
## Brazil_2020_07_01-2020_12_31_170.fasta
## Brazil_2020_07_01-2020_12_31_171.fasta
## Brazil_2020_07_01-2020_12_31_172.fasta
## Brazil_2020_07_01-2020_12_31_173.fasta
## Brazil_2020_07_01-2020_12_31_174.fasta
## Brazil_2020_07_01-2020_12_31_175.fasta
## Brazil_2020_07_01-2020_12_31_176.fasta
## Brazil_2020_07_01-2020_12_31_177.fasta
## Brazil_2020_07_01-2020_12_31_178.fasta
## Brazil_2020_07_01-2020_12_31_179.fasta
## Brazil_2020_07_01-2020_12_31_18.fasta
## Brazil_2020_07_01-2020_12_31_180.fasta
## Brazil_2020_07_01-2020_12_31_181.fasta
## Brazil_2020_07_01-2020_12_31_182.fasta
## Brazil_2020_07_01-2020_12_31_183.fasta
## Brazil_2020_07_01-2020_12_31_184.fasta
## Brazil_2020_07_01-2020_12_31_185.fasta
## Brazil_2020_07_01-2020_12_31_186.fasta
## Brazil_2020_07_01-2020_12_31_187.fasta
## Brazil_2020_07_01-2020_12_31_188.fasta
## Brazil_2020_07_01-2020_12_31_189.fasta
## Brazil_2020_07_01-2020_12_31_19.fasta
## Brazil_2020_07_01-2020_12_31_190.fasta
## Brazil_2020_07_01-2020_12_31_191.fasta
## Brazil_2020_07_01-2020_12_31_192.fasta
## Brazil_2020_07_01-2020_12_31_193.fasta
## Brazil_2020_07_01-2020_12_31_194.fasta
## Brazil_2020_07_01-2020_12_31_195.fasta
## Brazil_2020_07_01-2020_12_31_196.fasta
## Brazil_2020_07_01-2020_12_31_197.fasta
## Brazil_2020_07_01-2020_12_31_198.fasta
## Brazil_2020_07_01-2020_12_31_199.fasta
## Brazil_2020_07_01-2020_12_31_2.fasta
## Brazil_2020_07_01-2020_12_31_20.fasta
## Brazil_2020_07_01-2020_12_31_200.fasta
## Brazil_2020_07_01-2020_12_31_201.fasta
## Brazil_2020_07_01-2020_12_31_202.fasta
## Brazil_2020_07_01-2020_12_31_203.fasta
## Brazil_2020_07_01-2020_12_31_204.fasta
## Brazil_2020_07_01-2020_12_31_205.fasta
## Brazil_2020_07_01-2020_12_31_206.fasta
## Brazil_2020_07_01-2020_12_31_207.fasta
## Brazil_2020_07_01-2020_12_31_208.fasta
## Brazil_2020_07_01-2020_12_31_209.fasta
## Brazil_2020_07_01-2020_12_31_21.fasta
## Brazil_2020_07_01-2020_12_31_210.fasta
## Brazil_2020_07_01-2020_12_31_211.fasta
## Brazil_2020_07_01-2020_12_31_212.fasta
## Brazil_2020_07_01-2020_12_31_213.fasta
## Brazil_2020_07_01-2020_12_31_214.fasta
## Brazil_2020_07_01-2020_12_31_215.fasta
## Brazil_2020_07_01-2020_12_31_216.fasta
## Brazil_2020_07_01-2020_12_31_217.fasta
## Brazil_2020_07_01-2020_12_31_218.fasta
## Brazil_2020_07_01-2020_12_31_219.fasta
## Brazil_2020_07_01-2020_12_31_22.fasta
## Brazil_2020_07_01-2020_12_31_220.fasta
## Brazil_2020_07_01-2020_12_31_221.fasta
## Brazil_2020_07_01-2020_12_31_222.fasta
## Brazil_2020_07_01-2020_12_31_223.fasta
## Brazil_2020_07_01-2020_12_31_224.fasta
## Brazil_2020_07_01-2020_12_31_225.fasta
## Brazil_2020_07_01-2020_12_31_226.fasta
## Brazil_2020_07_01-2020_12_31_227.fasta
## Brazil_2020_07_01-2020_12_31_228.fasta
## Brazil_2020_07_01-2020_12_31_229.fasta
## Brazil_2020_07_01-2020_12_31_23.fasta
## Brazil_2020_07_01-2020_12_31_230.fasta
## Brazil_2020_07_01-2020_12_31_231.fasta
## Brazil_2020_07_01-2020_12_31_232.fasta
## Brazil_2020_07_01-2020_12_31_233.fasta
## Brazil_2020_07_01-2020_12_31_234.fasta
## Brazil_2020_07_01-2020_12_31_235.fasta
## Brazil_2020_07_01-2020_12_31_236.fasta
## Brazil_2020_07_01-2020_12_31_237.fasta
## Brazil_2020_07_01-2020_12_31_238.fasta
## Brazil_2020_07_01-2020_12_31_239.fasta
## Brazil_2020_07_01-2020_12_31_24.fasta
## Brazil_2020_07_01-2020_12_31_240.fasta
## Brazil_2020_07_01-2020_12_31_241.fasta
## Brazil_2020_07_01-2020_12_31_242.fasta
## Brazil_2020_07_01-2020_12_31_243.fasta
## Brazil_2020_07_01-2020_12_31_244.fasta
## Brazil_2020_07_01-2020_12_31_245.fasta
## Brazil_2020_07_01-2020_12_31_246.fasta
## Brazil_2020_07_01-2020_12_31_247.fasta
## Brazil_2020_07_01-2020_12_31_248.fasta
## Brazil_2020_07_01-2020_12_31_249.fasta
## Brazil_2020_07_01-2020_12_31_25.fasta
## Brazil_2020_07_01-2020_12_31_250.fasta
## Brazil_2020_07_01-2020_12_31_26.fasta
## Brazil_2020_07_01-2020_12_31_27.fasta
## Brazil_2020_07_01-2020_12_31_28.fasta
## Brazil_2020_07_01-2020_12_31_29.fasta
## Brazil_2020_07_01-2020_12_31_3.fasta
## Brazil_2020_07_01-2020_12_31_30.fasta
## Brazil_2020_07_01-2020_12_31_31.fasta
## Brazil_2020_07_01-2020_12_31_32.fasta
## Brazil_2020_07_01-2020_12_31_33.fasta
## Brazil_2020_07_01-2020_12_31_34.fasta
## Brazil_2020_07_01-2020_12_31_35.fasta
## Brazil_2020_07_01-2020_12_31_36.fasta
## Brazil_2020_07_01-2020_12_31_37.fasta
## Brazil_2020_07_01-2020_12_31_38.fasta
## Brazil_2020_07_01-2020_12_31_39.fasta
## Brazil_2020_07_01-2020_12_31_4.fasta
## Brazil_2020_07_01-2020_12_31_40.fasta
## Brazil_2020_07_01-2020_12_31_41.fasta
## Brazil_2020_07_01-2020_12_31_42.fasta
## Brazil_2020_07_01-2020_12_31_43.fasta
## Brazil_2020_07_01-2020_12_31_44.fasta
## Brazil_2020_07_01-2020_12_31_45.fasta
## Brazil_2020_07_01-2020_12_31_46.fasta
## Brazil_2020_07_01-2020_12_31_47.fasta
## Brazil_2020_07_01-2020_12_31_48.fasta
## Brazil_2020_07_01-2020_12_31_49.fasta
## Brazil_2020_07_01-2020_12_31_5.fasta
## Brazil_2020_07_01-2020_12_31_50.fasta
## Brazil_2020_07_01-2020_12_31_51.fasta
## Brazil_2020_07_01-2020_12_31_52.fasta
## Brazil_2020_07_01-2020_12_31_53.fasta
## Brazil_2020_07_01-2020_12_31_54.fasta
## Brazil_2020_07_01-2020_12_31_55.fasta
## Brazil_2020_07_01-2020_12_31_56.fasta
## Brazil_2020_07_01-2020_12_31_57.fasta
## Brazil_2020_07_01-2020_12_31_58.fasta
## Brazil_2020_07_01-2020_12_31_59.fasta
## Brazil_2020_07_01-2020_12_31_6.fasta
## Brazil_2020_07_01-2020_12_31_60.fasta
## Brazil_2020_07_01-2020_12_31_61.fasta
## Brazil_2020_07_01-2020_12_31_62.fasta
## Brazil_2020_07_01-2020_12_31_63.fasta
## Brazil_2020_07_01-2020_12_31_64.fasta
## Brazil_2020_07_01-2020_12_31_65.fasta
## Brazil_2020_07_01-2020_12_31_66.fasta
## Brazil_2020_07_01-2020_12_31_67.fasta
## Brazil_2020_07_01-2020_12_31_68.fasta
## Brazil_2020_07_01-2020_12_31_69.fasta
## Brazil_2020_07_01-2020_12_31_7.fasta
## Brazil_2020_07_01-2020_12_31_70.fasta
## Brazil_2020_07_01-2020_12_31_71.fasta
## Brazil_2020_07_01-2020_12_31_72.fasta
## Brazil_2020_07_01-2020_12_31_73.fasta
## Brazil_2020_07_01-2020_12_31_74.fasta
## Brazil_2020_07_01-2020_12_31_75.fasta
## Brazil_2020_07_01-2020_12_31_76.fasta
## Brazil_2020_07_01-2020_12_31_77.fasta
## Brazil_2020_07_01-2020_12_31_78.fasta
## Brazil_2020_07_01-2020_12_31_79.fasta
## Brazil_2020_07_01-2020_12_31_8.fasta
## Brazil_2020_07_01-2020_12_31_80.fasta
## Brazil_2020_07_01-2020_12_31_81.fasta
## Brazil_2020_07_01-2020_12_31_82.fasta
## Brazil_2020_07_01-2020_12_31_83.fasta
## Brazil_2020_07_01-2020_12_31_84.fasta
## Brazil_2020_07_01-2020_12_31_85.fasta
## Brazil_2020_07_01-2020_12_31_86.fasta
## Brazil_2020_07_01-2020_12_31_87.fasta
## Brazil_2020_07_01-2020_12_31_88.fasta
## Brazil_2020_07_01-2020_12_31_89.fasta
## Brazil_2020_07_01-2020_12_31_9.fasta
## Brazil_2020_07_01-2020_12_31_90.fasta
## Brazil_2020_07_01-2020_12_31_91.fasta
## Brazil_2020_07_01-2020_12_31_92.fasta
## Brazil_2020_07_01-2020_12_31_93.fasta
## Brazil_2020_07_01-2020_12_31_94.fasta
## Brazil_2020_07_01-2020_12_31_95.fasta
## Brazil_2020_07_01-2020_12_31_96.fasta
## Brazil_2020_07_01-2020_12_31_97.fasta
## Brazil_2020_07_01-2020_12_31_98.fasta
## Brazil_2020_07_01-2020_12_31_99.fasta
## 250
cd ../regions/brazil/2020/fasta
for i in Brazil*; do echo $i; done > COVID_List_Region.txt
wc -l COVID_List_Region.txt ## Total number of FASTA files
## 250 COVID_List_Region.txt
for fasta_file in $(cat ../regions/brazil/2020/fasta/COVID_List_Region.txt); do
## Print names of the fasta file
echo $fast_file
## Alignment
bwa mem -M -R \
'@RG\tID:SampleCorona\tLB:sample_1\tPL:ILLUMINA\tPM:HISEQ\tSM:SampleCorona' \
../fasta_reference_file/SARS-CoV-2.fasta \
../regions/brazil/2020/fasta/$fasta_file > \
../regions/brazil/2020/sam/$fasta_file".sam"
## SAM to BAM
samtools view -S -b .$fasta_file".sam" > \
../regions/brazil/2020/bam/$fasta_file".bam"
## This part is different in the VMs
bcftools mpileup --fasta-ref ../fasta_reference_file/SARS-CoV-2.fasta ../regions/brazil/2020/bam/$fasta_file".bam" > \
../regions/brazil/2020/bcf/$fasta_file".bcf"
## Bcftools extracts SNPs
bcftools view -v snps ../regions/brazil/2020/bcf/$fasta_file".bcf" > ../regions/brazil/2020/vcf_snp/$fasta_file"_snps.vcf"
## Bcftools extracts indels
bcftools view -v indels ../regions/brazil/2020/bcf/$fasta_file".bcf" > ../regions/brazil/2020/vcf_indel/$fasta_file"_indels.vcf"
done
cd brazil_vcf_julDec_snp
echo "Mutation N501Y"
grep -w 23063 * | wc -l #2/250 files
grep -w 23064 * | wc -l # 0/250 files
grep -w 23065 * | wc -l # 0/250 files
echo "Mutation P681H"
grep -w 23603 * | wc -l # 0/250 files
grep -w 23604 * | wc -l # 2/250 files
grep -w 23605 * | wc -l # 0/250 files
echo "HV 69–70 deletion"
grep -w 21766 * | wc -l
grep -w 21767 * | wc -l
grep -w 21768 * | wc -l
grep -w 21769 * | wc -l
grep -w 21770 * | wc -l
grep -w 21771 * | wc -l
grep -w 21772 * | wc -l
echo "Mutations from Yin"
grep -w 241 * | wc -l
grep -w 3037 * | wc -l
grep -w 11083 * | wc -l
grep -w 14408 * | wc -l
grep -w 17747 * | wc -l
grep -w 17858 * | wc -l
grep -w 18060 * | wc -l
grep -w 23403 * | wc -l
grep -w 26144 * | wc -l
grep -w 27046 * | wc -l
grep -w 28144 * | wc -l
grep -w 28881 * | wc -l
grep -w 28882 * | wc -l
grep -w 28883 * | wc -l
## bash: line 0: cd: brazil_vcf_julDec_snp: No such file or directory
## Mutation N501Y
## 6
## 2
## 2
## Mutation P681H
## 2
## 2
## 2
## HV 69–70 deletion
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## Mutations from Yin
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
## 2
cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)
echo "${NumeroTotalFASTA}"
cd ~/Desktop/Gepoliano/SARS-CoV-2_Analysis/brazil_vcf_julDec_snp
NumeroN501Y=$(grep 23063 * | wc -l )
echo "${NumeroN501Y}"
echo "scale=3; ${NumeroN501Y} / ${NumeroTotalFASTA} " | bc
##frequency= ( $NumeroTotalFASTA / $NumeroN501Y )
##echo " A Frequência é de N501Y = $frequency"
## bash: line 0: cd: brazil_fasta_julDec: No such file or directory
## 1
## 2
## 2.000
## Count total number of FASTA files
cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)
echo "${NumeroTotalFASTA}"
## Check current directory
pwd
## Go back to the previous directory, because VCF files are stored inside it
cd ..
## Go to VCF file directory and count number of times a mutation is observed
cd brazil_vcf_julDec_snp
NumeroN501Y=$(grep 23063 * | wc -l )
echo "${NumeroN501Y}"
## Print resut of frequency calculation to screen
echo "scale=3; ${NumeroN501Y} / ${NumeroTotalFASTA} " | bc
## bash: line 1: cd: brazil_fasta_julDec: No such file or directory
## 1
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/2023-bioinfo-bridge-course/recombio/2-Variant-Call-Pipeline/scripts
## bash: line 12: cd: brazil_vcf_julDec_snp: No such file or directory
## grep: fasta_reference_file: Is a directory
## grep: figures: Is a directory
## grep: regions: Is a directory
## grep: scripts: Is a directory
## 0
## 0
cd brazil_fasta_julDec
NumeroTotalFASTA=$(for i in Brazil*; do echo $i; done | wc -l)
cd ..
for Variant in $(cat VariantList.txt.tmp); do
Variant_Total_Occurrences=$(grep $Variant brazil_vcf_julDec_snp/* | wc -l )
echo "scale=2; 100*${Variant_Total_Occurrences} / ${NumeroTotalFASTA} " | bc
#echo "scale=3; ${Variant_Total_Occurrences} / ${NumeroTotalFASTA} " | bc
done
## bash: line 0: cd: brazil_fasta_julDec: No such file or directory
## cat: VariantList.txt.tmp: No such file or directory
https://www.the-scientist.com/news-opinion/a-guide-to-emerging-sars-cov-2-variants-68387
https://www.youtube.com/watch?v=hcnnKrlqa9M
https://www.rcsb.org/structure/6VW1
https://www.youtube.com/watch?v=nFY3EjBNPBQ