Part 1: Introduction to R and the R Syntax
- Chapter 3: Data visualization
Part 2: Machine Learning
- Chapter 1: RNA Sequencing (RNA-Seq)
- Chapter 2: Classification
Part 3: DNA Sequencing and Introduction to the Terminal
Part 4: Presentations and Final Considerations

Part 1: Introduction to R and the R Syntax

Chapter 3: Data visualization

Load libraries

library(ggplot2)
library(maditr)
library(dplyr)

Load dataframe

r2_gse62564_GSVA_Metadata <- readRDS("../../ReComBio Scientific/ReComBio Book English/recombio bookdown/data/r2_gse62564_GSVA_Metadata.rds")

Make Variables of Numeric Type

r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>% 
   mutate_at("HALLMARK_HYPOXIA", as.numeric) %>%
   mutate_at("HALLMARK_INFLAMMATORY_RESPONSE", as.numeric) %>%
   mutate_at("ADRN_Norm_vs_Hypo_Up_554.txt", as.numeric) %>%
   mutate_at("ADRN_Norm_vs_Hypo_Down_635.txt", as.numeric)

Question 1

Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response

Solution:

HALLMARK_INFLAMMATORY_RESPONSE vs. HALLMARK_HYPOXIA

qplot(HALLMARK_INFLAMMATORY_RESPONSE, HALLMARK_HYPOXIA, 
      data = r2_gse62564_GSVA_Metadata,
      #colour=quantile,
      ylab = "HALLMARK_HYPOXIA",
      xlab = "HALLMARK_INFLAMMATORY_RESPONSE")

Question 2

Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response

Solution:

HALLMARK_INFLAMMATORY_RESPONSE vs. ADRN_Norm_vs_Hypo_Up_554.txt

qplot(HALLMARK_INFLAMMATORY_RESPONSE, ADRN_Norm_vs_Hypo_Up_554.txt, 
      data = r2_gse62564_GSVA_Metadata,
      # colour=quantile,
      xlab = "HALLMARK_INFLAMMATORY_RESPONSE",
      ylab = "ADRN_Norm_vs_Hypo_Up_554.txt")

Question 3

Use R plot functions to visualize the correlation between Hallmark Hypoxia and Hallmark Inflammatory Response

Solution:

HALLMARK_INFLAMMATORY_RESPONSE vs. ADRN_Norm_vs_Hypo_Down_635.txt

qplot(HALLMARK_INFLAMMATORY_RESPONSE, ADRN_Norm_vs_Hypo_Down_635.txt, 
      data = r2_gse62564_GSVA_Metadata,
      # colour=quantile,
      ylab = "ADRN_Norm_vs_Hypo_Down_635.txt",
      xlab = "HALLMARK_INFLAMMATORY_RESPONSE")

Question 4

From the analysis of questions 1-3, choose the correct option:

HALLMARK_INFLAMMATORY_RESPONSE and HALLMARK_HYPOXIA have a positive correlation because hypoxia is always beneficial in the tumor microenvironment
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Up_554.txt have a positive correlation because hypoxia upregulation in this case, is beneficial in the tumor microenvironment
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Up_554.txt have a positive correlation because hypoxia upregulation has a negative impact on survival
HALLMARK_INFLAMMATORY_RESPONSE and ADRN_Norm_vs_Hypo_Down_635.txt have a negative correlation because hypoxia upregulation in this case, is not beneficial in the tumor microenvironment

Question 5

Which gene expression group has worse survival probability?

High HIF1A expression
Low HIF1A expression

Question 6

Which phenotype score group has worse survival probability?

High Hallmark Hypoxia
Low Hallmark Hypoxia

Question 7

Which phenotype score group has worse survival probability?

High Hallmark Inflammatory Response
Low Hallmark Inflammatory Response

Question 8

Plot the survival curve of the MYCN status variable. Which MYCN status has worse survival outcome?

Individuals with MYCN amplification
Individuals without MYCN amplification
The MYCN group with unknown MYCN status

Question 9

Plot the survival curve of the INSS stage variable. Which INSS stage has worse survival outcome?

Question 10

Mark TRUE or FALSE.

The higher the age at diagnosis the greater the HIF1A expression difference between HR and non-HR
The lower the age at diagnosis the greater the HIF1A expression difference between HR and non-HR

Part 3: DNA Sequencing and Introduction to the Terminal

Chapter 1: DNA Sequencing

Question 1

Watch the video about DNA sequencing technologies that was based on the Shendure et al., 2017 paper, from minutes 14:30 to 15:30. What is another technical name for Next Generation Sequencing (NGS) technologies?

Question 2

Now watch the video from 19:06 - 22:37. Which step is responsible for the increased amounts of DNA necessary for DNA sequencing?

Question 3

According to Shendure et al., 2017, how many and what are the names of the DNA sequencing technologies?

Question 4

What is the name of the process that allows bacteria to incorporate DNA into own their bacterial cells?

Question 5

What was the main clinical application of DNA sequencing technologies discussed by Shendure et al., 2017 in their paper?

Question 6

According to Shendure, please identify at least one chemical event in the machanics of DNA sequencing.

Question 7

According to Shendure, please identify at least one significant event in history of DNA sequencing.

Chapter 2: Introductory Commands in Command Line or Terminal

Use the echo command to print a greeting to the class like this one:

echo "Print a greeting to the class:"

## Print a greeting to the class:

Now print your greeting message. Insert the message using the echo command:

echo " "

We can define a variable

test_variable=this_file

and now print the name of the variable also using the echo command:

test_variable=this_file
echo $test_variable

## this_file

Check files that are present in the current directory:

ls

## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## PO
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## transcriptomics analysis.Rmd
## videos.Rmd

In the previous chunk there are files and folders that are present in that directory. I will print the files again with an echo message so it is more clear what was the command and what was the output:

echo "List the files. These are the files inside this directory:"
ls

## List the files. These are the files inside this directory:
## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## PO
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## transcriptomics analysis.Rmd
## videos.Rmd

This is the current directory with command pwd:

pwd

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024

Above you can see that that was a directory called exercises inside another folder named bioinformatics-bridge-course_2024.

Let’s create a folder in this directory, using the mkdir command:

mkdir test_directory

If by some reason we type the command again, Bash will throw an error at us:

mkdir test_directory

## mkdir: test_directory: File exists

These are the files present in this directory:

ls

## 02-variant_call_pipeline.Rmd
## 02-variant_call_pipeline.html
## 2024-bioinfo-bridge-course
## PO
## README.md
## _site
## _site.yml
## about.Rmd
## about.html
## align-extract-counts.Rmd
## align-extract-counts.html
## assignment-dates.Rmd
## assignment-dates.html
## bioinformatics-bridge-course_2024.Rproj
## certificates
## data
## download_sequences.Rmd
## download_sequences.html
## edited_videos
## exercises
## exercises.Rmd
## exercises_files
## feedback
## final-presentations.Rmd
## images
## index.Rmd
## key_pairs
## last_day
## not_ready_to_render
## notebooks
## papers
## pdf
## recordings.Rmd
## scripts
## site_libs
## slides
## test_directory
## transcriptomics analysis.Rmd
## videos.Rmd

Let’s move into the directory that we just created. To do that we use the cd (change directory command):

cd test_directory

Check the directory where we are at. The pwd (print working directory) command prints the directory inside which we are:

cd test_directory
pwd

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory

Note that we are inside the test_directory.

Let’s look at what is inside the test_directory:

cd test_directory
pwd
ls

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory

echo - pode ser feito criando-se arquivos de texto

## go inside the test_directory directory
cd test_directory

## Print working directory
pwd

## List what is inside the directory
ls

echo "As you can see, there is nothing inside this directory:"
echo "Because there is nothing here, let's create something, in the part below."

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## As you can see, there is nothing inside this directory:
## Because there is nothing here, let's create something, in the part below.

Let’s create 3 files inside the test_directory and list what is inside the test_directory folder:

cd test_directory
pwd
touch  firstFile.txt
touch  secondFile.txt
touch  thirdFile.txt
ls

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## firstFile.txt
## secondFile.txt
## thirdFile.txt

Let’s have the terminal print the names of the txt files in the test_directory for us:

cd test_directory
pwd
for i in *txt; do echo the name of this file is":" $i ; done

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory
## the name of this file is: firstFile.txt
## the name of this file is: secondFile.txt
## the name of this file is: thirdFile.txt

Let’s have the terminal print one word, tow words and three words to the files, like this:

cd test_directory
pwd
echo word1 >> firstFile.txt
echo word1 >> secondFile.txt
echo word1 >> thirdFile.txt

echo word2 >> secondFile.txt
echo word2 >> thirdFile.txt

echo word3 >> thirdFile.txt

## /Users/gepolianochaves/Desktop/Gepoliano/Bioinformatics Bridge Course/bioinformatics-bridge-course_2024/test_directory

Let’s look at each one of the txt files to see what we created:

cd test_directory
echo "word counts of firstFile.txt:"
cat firstFile.txt
echo "word counts of secondFile.txt:"
cat secondFile.txt
echo "word counts of thirdFile.txt:"
cat thirdFile.txt

## word counts of firstFile.txt:
## word1
## word counts of secondFile.txt:
## word1
## word2
## word counts of thirdFile.txt:
## word1
## word2
## word3

In a for loop:

cd test_directory

for i in *txt; do cat $i ; done
#cat *txt

## word1
## word1
## word2
## word1
## word2
## word3

Relate these exercises to a terminal command? cat

This is a command to count the lines of a file:

cd test_directory
wc -l firstFile.txt
wc -l secondFile.txt
wc -l thirdFile.txt

##        1 firstFile.txt
##        2 secondFile.txt
##        3 thirdFile.txt

A script to count the number of reads or sequences in a fastq or fasta file.

A for loop script to count the number of fasta files

for i in AF243*; do echo $i; done

Finally, remove the test_directory

rm -r test_directory

Chapter 3: Download fasta sequences

Question 1

What was the package used to download fasta sequences?

Question 2

What was the function used to download fasta sequences?

Question 3

What was the function used to save fasta sequences downloaded?

Part 4: Presentations and Final Considerations

Chapter 1: Genetic Variants and Haplogroups

Question 1

Please read this Wikipedia article about haplogroups and the genetic inheritance of the mitochondria and the Y chromosome: https://en.wikipedia.org/wiki/Haplogroup. What is the difference between a haplogroup and a hyplotype?

Schematic definition of a haplogroup as presented on the Wikipedia article.

Question 2

Using the fasta sequences that you decided to download, repeat the haplogroup analysis of the fasta sequences downloaded in R using the haplogrep page: https://haplogrep.i-med.ac.at. What are the file formats accepted by the haplogrep tool?

Question 3

Please note that among the file formats accepted by the Haplogrep tool, is the VCF (variant call file) format. In what part of our course did we use a VCF file? Please try uploading VCF files instead of fasta files to Haplogrep.

Question 4

Look at the Haplogrep3 haplogroup annotations: https://haplogrep.readthedocs.io/en/latest/annotations/#clusters-and-population-frequencies.

Question 5

Now compare with the results of the analysis of the 17 fasta sequences (12 passing and 5 warnings) isolated from the self-declared white Brazilians that passed the Haplogrep QC.

Question 6

How many individuals have African matrilineal ancestry?

Question 7

How many individuals have European matrilineal ancestry?

Question 8

How many individuals have Latino/Admixed American matrilineal ancestry?

Question 9

How many individuals have East Asian matrilineal ancestry?

Question 10

How many individuals have South Asian matrilineal ancestry?

Question 11

Which percentage of European, Latino (admixed) East Asian and African do you see in the mitocondrial haplogroups in the samples sequenced in self-declared white Brazilians?

Question 12

Compare the results shown in Question 2 with the results from Mitchell 2014, shown in the figure below. What are the similarities in the colors of the ethnical-racial groups classified as African, European and Asian in the samples from Brazil and from the USA? If you need to access the paper, please refer to the shared drive or the attached link: Mitchell et al., 2014.

Human mitochondrial haplogroups in samples from the USA reported by Mitchell et al., 2014.

Question 13

What is the suggested origin of matrilineal clusters L in self-declared white Brazilians?

Question 14

What is the suggested origin of matrilineal clusters J, K, HV and U in self-declared white Brazilians?

Question 15

What is the suggested origin of matrilineal clusters A and B in self-declared white Brazilians?

Chapter 2: Variant Call Pipeline

Question 1

Which tool performs alignment in the Variant Call Pipeline?

samtools
bcftools
bwa
bamtools

Question 2

Why was the for loop used in the Variant Call Pipeline?

To save time in the analysis
To allow identification of variants in all mitochondrial fasta files
So that the samples could be aligned to a reference genome
To save space and computational resources in the computer

Question 3

Using the on-line IDE (https://www.onlinegdb.com/online_c++_compiler), please create 3-5 txt files in the terminal. Please make sure you create the files with different names. Which of the commands below can you use to print the names of the files to the screen of the terminal?

Question 4

Write a script to print the names of at least one of the txt files that were created in the previous exercise.

Question 5

Write a script to print the names of all txt files that were created in question 3. Use a for loop.

Question 6

In the terminal, which command can you use to print a word to the end of a txt file?

Question 7

Save some words to the files you created in question number 3. You can use the command below:

echo  "word" >> file.txt

Make sure you understand the usage of the >> operator. Do not forget to also make sure the terminal recognizes the name of the txt files you created by using the tab key in your keyboard. Which command can you use to count how many words you provided as input to the text file?

Question 8

Which command can you use to double-check the words you inputed to the files you wrote?

Chapter 3: Accessing the Terminal in AWS Image Instance

Question 1

Please read instructions about Launching your own AMI instances (https://datacarpentry.org/genomics-workshop/AMI-setup). The instructor can sign into the AWS EC2 Dashboard (https://console.aws.amazon.com/ec2/) but students do not have to do this or create an AWS account. Creating an AWS account is responsability of the student because you may have to pay for these resources.

Question 2

Once you are provided with instructions on how to access the AWS instance, do so by typing the ssh command in the terminal:

ssh dcuser@ec2-3-82-97-221.compute-1.amazonaws.com

Once asked, type the password:

data4Carp

Please only insert the commands once instructed because all studets will access the resources at the same time.

Question 3

Once inside the AWS instance, copy the variant_call modified with your name as I am doing below. This part is similar to what I did when I specified the steps of the variant call pipeline:

cp -r variant_call variant_call_gepoliano

Question 4

You can now copy the variant call pipeline commands and paste it in the terminal, as I did in the variant call notebook: https://bioinformatics-bridge-course.netlify.app/02-variant_call_pipeline#variant-call-pipeline-using-a-for-loop-strategy:


cd /home/dcuser/variant_call_gepoliano/fasta

REFERENCE=AF243627.fasta

## for loop specific to the individual fasta files
for fastaFile in AF*.fasta; do
  
  ## Pick just the eight firt letters of the string
  fastaFileName=${fastaFile:0:8}
  echo "started sample $fastaFileName"
  
  ## Align
  bwa mem -t 4 ../references/$REFERENCE $fastaFile > ../sam/$fastaFileName".sam"
  echo "Just finished aligning sample $fastaFileName"
  
  ## SAM to BAM
  samtools view -S -b ../sam/$fastaFileName".sam" > \
  ../bam/$fastaFileName".bam"
  echo "Finished SAM to BAM of sample $fastaFileName"
  
  ## Samtools uses reference FASTA to detect "piles" in the alignment
  bcftools mpileup -f ../references/$REFERENCE ../bam/$fastaFileName".bam" > \
  ../bcf/$fastaFileName".bcf"
  
  ## Bcftools extracts SNPs
  echo "started Bcftools SNPs for sample $fastaFileName"
  bcftools view -v snps ../bcf/$fastaFileName".bcf" > ../snps_vcf/$fastaFileName"_snps.vcf"

done

echo "Finished the entire for loop"

What do you see in your screen? How do you know if it run sucessfully?

Exercises

Research with Computational Biology (ReComBio)

September 7th, 2024

Part 1: Introduction to R and the R Syntax

Chapter 3: Data visualization

Load libraries

Load dataframe

Make Variables of Numeric Type

Question 1

Solution:

Question 2

Solution:

Question 3

Solution:

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Part 2: Machine Learning

Chapter 1: RNA Sequencing (RNA-Seq)

Question 1

Question 2

Question 3

Question 4

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Chapter 2: Classification

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Part 3: DNA Sequencing and Introduction to the Terminal

Chapter 1: DNA Sequencing

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Chapter 2: Introductory Commands in Command Line or Terminal

Chapter 3: Download fasta sequences

Question 1

Question 2

Question 3

Part 4: Presentations and Final Considerations

Chapter 1: Genetic Variants and Haplogroups

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8

Question 9

Question 10

Question 11

Question 12

Question 13

Question 14

Question 15

Chapter 2: Variant Call Pipeline

Question 1

Question 2

Question 3

Question 4

Question 5

Question 6

Question 7

Question 8