library(ggplot2)
library(maditr)
library(dplyr)
r2_gse62564_GSVA_Metadata <- readRDS("../../ReComBio Scientific/ReComBio Book English/recombio bookdown/data/r2_gse62564_GSVA_Metadata.rds")
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
mutate_at("HALLMARK_HYPOXIA", as.numeric) %>%
mutate_at("HALLMARK_INFLAMMATORY_RESPONSE", as.numeric) %>%
mutate_at("ADRN_Norm_vs_Hypo_Up_554.txt", as.numeric) %>%
mutate_at("ADRN_Norm_vs_Hypo_Down_635.txt", as.numeric)
Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response
HALLMARK_INFLAMMATORY_RESPONSE vs. HALLMARK_HYPOXIA
qplot(HALLMARK_INFLAMMATORY_RESPONSE, HALLMARK_HYPOXIA,
data = r2_gse62564_GSVA_Metadata,
#colour=quantile,
ylab = "HALLMARK_HYPOXIA",
xlab = "HALLMARK_INFLAMMATORY_RESPONSE")
Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response
HALLMARK_INFLAMMATORY_RESPONSE vs. ADRN_Norm_vs_Hypo_Up_554.txt
qplot(HALLMARK_INFLAMMATORY_RESPONSE, ADRN_Norm_vs_Hypo_Up_554.txt,
data = r2_gse62564_GSVA_Metadata,
# colour=quantile,
xlab = "HALLMARK_INFLAMMATORY_RESPONSE",
ylab = "ADRN_Norm_vs_Hypo_Up_554.txt")
Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response
HALLMARK_INFLAMMATORY_RESPONSE vs. ADRN_Norm_vs_Hypo_Down_635.txt
qplot(HALLMARK_INFLAMMATORY_RESPONSE, ADRN_Norm_vs_Hypo_Down_635.txt,
data = r2_gse62564_GSVA_Metadata,
# colour=quantile,
ylab = "ADRN_Norm_vs_Hypo_Down_635.txt",
xlab = "HALLMARK_INFLAMMATORY_RESPONSE")
From the analysis of questions 1-3, choose the correct option:
HALLMARK_INFLAMMATORY_RESPONSE and HALLMARK_HYPOXIA have a positive correlation because hypoxia is always beneficial in the tumor microenvironment
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Up_554.txt have a positive correlation because hypoxia upregulation in this case, is beneficial in the tumor microenvironment
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Up_554.txt have a positive correlation because hypoxia upregulation has a negative impact on survival
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Down_635.txt have a negative correlation because hypoxia upregulation in this case, is not beneficial in the tumor microenvironment
Which gene expression group has worse survival probability?
Which phenotype score group has worse survival probability?
Which phenotype score group has worse survival probability?
Plot the survival curve of the MYCN status variable. Which MYCN status has worse survival outcome?
Plot the survival curve of the INSS stage variable. Which INSS stage has worse survival outcome?
Mark TRUE or FALSE.
What is a transcriptome?
It is the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition.
It is the complete set of proteins in a cell, and their quantity, for a specific developmental stage or physiological condition.
It is the complete set of interfering RNAs in a cell, and their quantity, for a specific developmental stage or physiological condition.
It is the complete set of ribosomal molecules in a cell, and their quantity, for a specific developmental stage or physiological condition.
What is the main purpose of the transcriptome analysis?
To write machine learning algorithms using gene expression quantification.
To complete sequencing of the sets of proteins in a cell, and their quantity, for a specific developmental stage or physiological condition.
To interpret the functional elements of the genome and reveal the molecular constituents of cells and tissues, and the development and diseases.
To understand the ribosomal molecules in a cell, and their quantity.
Which technology has been used longer to measure gene expression?
RNA-Seq.
Microarrays.
CRISPR/Cas9.
X-ray crystallography.
Mark the option below that contains a useful application for isolation of circulating DNA and RNA for the purpose of longitudinal patient interrogation.
Liquid biopsy
Tumor biopsy
Tissue biopsy.
Mammalian cell biopsy.
X-ray crystallography measurements.
What of the options below are generally not challenges in an RNA sequencing set-up?
Library construction.
Methods to store, retrieve and process large amounts of data.
Sequencing coverage and cost.
mRNA isolation.
Which alternative depicts an alingment file format?
Fasta file.
SAM file.
VCF file.
Fastq file.
BCF file.
What are the correct steps taken in a RNA-Seq analysis pipeline?
Wet lab experiment -> Laboratory or Field Experiment -> Bioinformatic Processing -> biological inference.
Laboratory or field experiment -> Wet Lab Experiment -> Bioinformatic Processing -> Biological inference.
Bioinformatic Processing -> Wet Lab Experiment -> Laboratory or field experiment -> Biological inference.
Biological inference -> Bioinformatic Processing -> Wet Lab Experiment -> Laboratory or field experiment.
Which R package did we use to perform the differential expression step in the transcriptome analysis?
DESeq2.
ggplot2.
dplyr.
pheatmap.
clusterProfiler.
Which R package did we use to perform the enrichment analysis step in the transcriptome analysis?
DESeq2.
ggplot2.
dplyr.
pheatmap.
clusterProfiler.
Which alternative contains factors affecting gene expression levels that need to be normalized?
Sequencing depth only.
Gene length only.
Only RNA isoform composition.
Only sequencing depth and gene length.
Sequencing depth, gene length and RNA isoform composition.
How can you predict what label a new patient that was sequenced using the UCSC nanopore technology will have using the logistic regression model constructed in R?
I can build a vector to provide as input to R so that the model can use the parameters it calculated to make the prediction
If I have more than 2 patients to predict, I can build a dataframe to input to R
It is not possible to know if a patient has high risk neuroblastoma disease without doing a FISH (Fluorescence in situ hybridization) and understanding if the person has MYCN amplification
How can you know if a model represents a good indicator for the high risk status of a patient?
If the model has high accuracy
If the model has medium accuracy
If the model has low accuracy
Write a command to input information about the predictor variables of a patient into R. To come up with a solution, please discuss the strategy to solve this problem in groups in breakout rooms. Please use 10-15 min to discuss a solution.
Write a command to predict the high risk label for the patient based in the patient’s gene expression pattern. To come up with a solution, please discuss the strategy to solve this problem in groups in breakout rooms. Please use 10-15 min to discuss a solution.
Hint 1: Which chunks in the notebook help to explain a possible solution?
Hint 2: Look at the functions that can possibly present a solution?
Please describe the solution in pseudocode (human language + computer language merged)
Write a command to predict the high risk label for the patient based in the patient’s gene expression pattern. To come up with a solution, please discuss the strategy to solve this problem in groups. Please use 10-15 min to discuss a solution.
Hint 1: Which data structures can you use to input the data into the R environment? A dataframe? A vector? A character?
Please present the solution in a command or algorithm (computer language)
Dr. Waterman, who participated in the Summer Bioiformatics Bridge Course a couple years ago, has completed a round of experiments aiming to use the classification model that we developed for the neuroblastoma labels using the Kokak dataset with n=498 patients. Dr. Waterman generated 22 phenotype scores for the set of 105 patients that he assists. The dataframe with the scores are in the data folder of the google drive of the course:
https://drive.google.com/file/d/1Bs6uDvc6i59l5JAjlPbVF7IgrNG5DWfK/view?usp=share_link
Or visit the data folder on Google Drive:
https://drive.google.com/drive/folders/1fr7SkOqMqOqQIW5ttjblGkPHCJY2HDJv?usp=sharing
The dataframe is the file gse73517geo105_GSVA_Metadata_selected.csv in the shared folder.
Hint 1: Use the Waterman dataset as new data in the Machine Learning Classification notebook.
Hint 2: Construct class prediction and probabilities for the Westerman dataset.
Hint 3: Construct the results for westermand dataset using the class prediction and class probabilities for the Westerman dataset.
Watch the video about DNA sequencing technologies that was based on the Shendure et al., 2017 paper, from minutes 14:30 to 15:30. What is another technical name for Next Generation Sequencing (NGS) technologies?
Now watch the video from 19:06 - 22:37. Which step is responsible for the increased amounts of DNA necessary for DNA sequencing?
According to Shendure et al., 2017, how many and what are the names of the DNA sequencing technologies?
What is the name of the process that allows bacteria to incorporate DNA into own their bacterial cells?
What was the main clinical application of DNA sequencing technologies discussed by Shendure et al., 2017 in their paper?
According to Shendure, please identify at least one chemical event in the machanics of DNA sequencing.
According to Shendure, please identify at least one significant event in history of DNA sequencing.
Use the echo command to print a greeting to the class like this one:
echo "Print a greeting to the class:"
## Print a greeting to the class:
Now print your greeting message. Insert the message using the echo command:
echo " "
We can define a variable
test_variable=this_file
and now print the name of the variable also using the echo command:
test_variable=this_file
echo $test_variable
## this_file
Check files that are present in the current directory:
ls
## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## transcriptomics analysis.Rmd
## videos.Rmd
In the previous chunk there are files and folders that are present in that directory. I will print the files again with an echo message so it is more clear what was the command and what was the output:
echo "List the files. These are the files inside this directory:"
ls
## List the files. These are the files inside this directory:
## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## transcriptomics analysis.Rmd
## videos.Rmd
This is the current directory with command pwd:
pwd
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024
Above you can see that that was a directory called exercises inside another folder named bioinformatics-bridge-course_2024.
Let’s create a folder in this directory, using the mkdir command:
mkdir test_directory
If by some reason we type the command again, Bash will throw an error at us:
mkdir test_directory
## mkdir: test_directory: File exists
These are the files present in this directory:
ls
## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## test_directory
## transcriptomics analysis.Rmd
## videos.Rmd
Let’s move into the directory that we just created. To do that we use the cd (change directory command):
cd test_directory
Check the directory where we are at. The pwd (print working directory) command prints the directory inside which we are:
cd test_directory
pwd
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
Note that we are inside the test_directory.
Let’s look at what is inside the test_directory:
cd test_directory
pwd
ls
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
echo - pode ser feito criando-se arquivos de texto
## go inside the test_directory directory
cd test_directory
## Print working directory
pwd
## List what is inside the directory
ls
echo "As you can see, there is nothing inside this directory:"
echo "Because there is nothing here, let's create something, in the part below."
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## As you can see, there is nothing inside this directory:
## Because there is nothing here, let's create something, in the part below.
Let’s create 3 files inside the test_directory and list what is inside the test_directory folder:
cd test_directory
pwd
touch firstFile.txt
touch secondFile.txt
touch thirdFile.txt
ls
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## firstFile.txt
## secondFile.txt
## thirdFile.txt
Let’s have the terminal print the names of the txt files in the test_directory for us:
cd test_directory
pwd
for i in *txt; do echo the name of this file is":" $i ; done
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## the name of this file is: firstFile.txt
## the name of this file is: secondFile.txt
## the name of this file is: thirdFile.txt
Let’s have the terminal print one word, tow words and three words to the files, like this:
cd test_directory
pwd
echo word1 >> firstFile.txt
echo word1 >> secondFile.txt
echo word1 >> thirdFile.txt
echo word2 >> secondFile.txt
echo word2 >> thirdFile.txt
echo word3 >> thirdFile.txt
## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
Let’s look at each one of the txt files to see what we created:
cd test_directory
echo "word counts of firstFile.txt:"
cat firstFile.txt
echo "word counts of secondFile.txt:"
cat secondFile.txt
echo "word counts of thirdFile.txt:"
cat thirdFile.txt
## word counts of firstFile.txt:
## word1
## word counts of secondFile.txt:
## word1
## word2
## word counts of thirdFile.txt:
## word1
## word2
## word3
In a for loop:
cd test_directory
for i in *txt; do cat $i ; done
#cat *txt
## word1
## word1
## word2
## word1
## word2
## word3
Relate these exercises to a terminal command? cat
This is a command to count the lines of a file:
cd test_directory
wc -l firstFile.txt
wc -l secondFile.txt
wc -l thirdFile.txt
## 1 firstFile.txt
## 2 secondFile.txt
## 3 thirdFile.txt
A script to count the number of reads or sequences in a fastq or fasta file.
A for loop script to count the number of fasta files
for i in AF243*; do echo $i; done
Finally, remove the test_directory
rm -r test_directory
What was the package used to download fasta sequences?
What was the function used to download fasta sequences?
What was the function used to save fasta sequences downloaded?
Please read this Wikipedia article about haplogroups and the genetic inheritance of the mitochondria and the Y chromosome: https://en.wikipedia.org/wiki/Haplogroup. What is the difference between a haplogroup and a hyplotype?
Using the fasta sequences that you decided to download, repeat the haplogroup analysis of the fasta sequences downloaded in R using the haplogrep page: https://haplogrep.i-med.ac.at. What are the file formats accepted by the haplogrep tool?
Please note that among the file formats accepted by the Haplogrep tool, is the VCF (variant call file) format. In what part of our course did we use a VCF file? Please try uploading VCF files instead of fasta files to Haplogrep.
Look at the Haplogrep3 haplogroup annotations: https://haplogrep.readthedocs.io/en/latest/annotations/#clusters-and-population-frequencies.
Now compare with the results of the analysis of the 17 fasta sequences (12 passing and 5 warnings) isolated from the self-declared white Brazilians that passed the Haplogrep QC.
How many individuals have African matrilineal ancestry?
How many individuals have European matrilineal ancestry?
How many individuals have Latino/Admixed American matrilineal
ancestry?
How many individuals have East Asian matrilineal ancestry?
How many individuals have South Asian matrilineal ancestry?
Which percentage of European, Latino (admixed) East Asian and African do you see in the mitocondrial haplogroups in the samples sequenced in self-declared white Brazilians?
Compare the results shown in Question 2 with the results from Mitchell 2014, shown in the figure below. What are the similarities in the colors of the ethnical-racial groups classified as African, European and Asian in the samples from Brazil and from the USA? If you need to access the paper, please refer to the shared drive or the attached link: Mitchell et al., 2014.
What is the suggested origin of matrilineal clusters L in self-declared white Brazilians?
What is the suggested origin of matrilineal clusters J, K, HV and U in self-declared white Brazilians?
What is the suggested origin of matrilineal clusters A and B in self-declared white Brazilians?
Which tool performs alignment in the Variant Call Pipeline?
samtools
bcftools
bwa
bamtools
Why was the for loop used in the Variant Call Pipeline?
To save time in the analysis
To allow identification of variants in all mitochondrial fasta files
So that the samples could be aligned to a reference genome
To save space and computational resources in the computer
Using the on-line IDE (https://www.onlinegdb.com/online_c++_compiler), please create 3-5 txt files in the terminal. Please make sure you create the files with different names. Which of the commands below can you use to print the names of the files to the screen of the terminal?
echo
cat
ls
pwd
Write a script to print the names of at least one of the txt files that were created in the previous exercise.
Write a script to print the names of all txt files that were created in question 3. Use a for loop.
In the terminal, which command can you use to print a word to the end of a txt file?
echo “word” >> file.txt
print “word” > file.txt
cat “word” > file.txt
ls “word” > file.txt
pwd “word” >> file.txt
Save some words to the files you created in question number 3. You can use the command below:
echo "word" >> file.txt
Make sure you understand the usage of the >> operator. Do not forget to also make sure the terminal recognizes the name of the txt files you created by using the tab key in your keyboard. Which command can you use to count how many words you provided as input to the text file?
wc -l
echo wc -l
wc -l > file.txt
pwd –word-counts
file.txt –word-counts
Which command can you use to double-check the words you inputed to the files you wrote?
echo
cat
pwd
ls
mkdir
Please read instructions about Launching your own AMI instances (https://datacarpentry.org/genomics-workshop/AMI-setup). The instructor can sign into the AWS EC2 Dashboard (https://console.aws.amazon.com/ec2/) but students do not have to do this or create an AWS account. Creating an AWS account is responsability of the student because you may have to pay for these resources.
Once you are provided with instructions on how to access the AWS instance, do so by typing the ssh command in the terminal:
ssh dcuser@ec2-3-82-97-221.compute-1.amazonaws.com
Once asked, type the password:
data4Carp
Once inside the AWS instance, copy the variant_call modified with your name as I am doing below. This part is similar to what I did when I specified the steps of the variant call pipeline:
cp -r variant_call variant_call_gepoliano
You can now copy the variant call pipeline commands and paste it in the terminal, as I did in the variant call notebook: https://bioinformatics-bridge-course.netlify.app/02-variant_call_pipeline#variant-call-pipeline-using-a-for-loop-strategy:
cd /home/dcuser/variant_call_gepoliano/fasta
REFERENCE=AF243627.fasta
## for loop specific to the individual fasta files
for fastaFile in AF*.fasta; do
## Pick just the eight firt letters of the string
fastaFileName=${fastaFile:0:8}
echo "started sample $fastaFileName"
## Align
bwa mem -t 4 ../references/$REFERENCE $fastaFile > ../sam/$fastaFileName".sam"
echo "Just finished aligning sample $fastaFileName"
## SAM to BAM
samtools view -S -b ../sam/$fastaFileName".sam" > \
../bam/$fastaFileName".bam"
echo "Finished SAM to BAM of sample $fastaFileName"
## Samtools uses reference FASTA to detect "piles" in the alignment
bcftools mpileup -f ../references/$REFERENCE ../bam/$fastaFileName".bam" > \
../bcf/$fastaFileName".bcf"
## Bcftools extracts SNPs
echo "started Bcftools SNPs for sample $fastaFileName"
bcftools view -v snps ../bcf/$fastaFileName".bcf" > ../snps_vcf/$fastaFileName"_snps.vcf"
done
echo "Finished the entire for loop"
What do you see in your screen? How do you know if it run sucessfully?