1. Introduction
- 1.1. Mitochondrial genome and matrilinear information
- 1.2. Mitochondrial haplogroups
2. Computational Imputation and Databases
3. Alignment
- Load multi-fasta mtDNA file
- Subset from positions 1bp to 302bp
3. Alignment Visualization
5. Visualize Haplogroup Classification with Haplogrep 3
6. Distance Matrix and Phylogenetic Tree of Partial Sequence

1. Introduction

1.1. Mitochondrial genome and matrilinear information

Mitochondria is an organelle that contains the martrilineal genetic information.
Mitochondrial genome is 16569 bp long (Bolze 2021) Bolze (2014)

Classification tree of mitochondrial haplogroups. Figure from Sabrina L. Mitchell (2014).

1.2. Mitochondrial haplogroups

Genetic variants (mutations) in mitochondrial DNA can be organized in haplogroups.
Each matrilineal ancestry has characteristic haplogroups.
This figure from Mitchell et al., 2014 shows the phylogenetic tree of mitochondrial haplogroups in a large population of the USA.

Classification tree of mitochondrial haplogroups. Figure from Sabrina L. Mitchell (2014).

2. Computational Imputation and Databases

A review on genotype imputation was writen by Yun Li and Abecasis (2009). They even mention using the HapMap project as panels for genotype inference. They do not mention an imputation workflow or protocol for genetic variants of Brazilians. An imputation workflow was provided by Kalle Pärn (2018).
I was able to find an R function for imputation in https://search.r-project.org/CRAN/refmans/dotgen/html/imp.html
The function above is in package https://cran.r-project.org/web/packages/dotgen/index.html

Reference panels can be used to create populational references. With milions or billions of sequences deposited in a database, a genetic model can be created or imputed. Figure from tellmeGen (n.d.).

3. Alignment

Load multi-fasta mtDNA file

Load package msa: multiple sequence alignment

library("msa")

Load multi-fasta mtDNA from Brazil.

mt_brazilian <- readDNAStringSet(file="../../DNA do Brasil/matrilineal sequences/data/mt_brazilian.fasta")

## Warning in .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
## reading FASTA file ../../DNA do Brasil/matrilineal
## sequences/data/mt_brazilian.fasta: ignored 425 invalid one-letter sequence
## codes

Subset from positions 1bp to 302bp

Information about subsetting strings was gained from Chapter 4: Manipulating Sequences with Biostrings, found in Gatto (2023).

seq_start_to_end <- subseq(mt_brazilian, start = 100, end = 200)

Print sequence

print(seq_start_to_end)

## DNAStringSet object of length 17:
##      width seq                                              names               
##  [1]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243780
##  [2]   101 ATGAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243781
##  [3]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243782
##  [4]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243783
##  [5]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243784
##  ...   ... ...
## [13]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243792
## [14]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243793
## [15]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243794
## [16]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243795
## [17]   101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243796

Align sequences

start_time <- Sys.time()
alignment_start_to_end <- msa(seq_start_to_end)

## use default substitution matrix

end_time <- Sys.time()
end_time - start_time

## Time difference of 0.069911 secs

alignment_start_to_end

## CLUSTAL 2.1  
## 
## Call:
##    msa(seq_start_to_end)
## 
## MsaDNAMultipleAlignment with 17 rows and 102 columns
##      aln                                                   names
##  [1] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243780
##  [2] ATGAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243781
##  [3] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243789
##  [4] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243786
##  [5] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243795
##  [6] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243796
##  [7] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243785
##  [8] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243787
##  [9] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243794
## [10] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243783
## [11] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243784
## [12] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243791
## [13] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTC-AAAGCCACCC AF243788
## [14] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243793
## [15] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243782
## [16] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243792
## [17] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAATTCCAAAGCCACC- AF243790
##  Con ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- Consensus

3. Alignment Visualization

Load library

Visualization of the Spike Protein Region alignment was made possible with library ggmsa and the work of Zhou (2022).

library(ggmsa)

## Registered S3 methods overwritten by 'ggalt':
##   method                  from   
##   grid.draw.absoluteGrob  ggplot2
##   grobHeight.absoluteGrob ggplot2
##   grobWidth.absoluteGrob  ggplot2
##   grobX.absoluteGrob      ggplot2
##   grobY.absoluteGrob      ggplot2

## Registered S3 method overwritten by 'ggtree':
##   method      from 
##   identify.gg ggfun

## ggmsa v1.8.0  Document: http://yulab-smu.top/ggmsa/
## 
## If you use ggmsa in published research, please cite:
## L Zhou, T Feng, S Xu, F Gao, TT Lam, Q Wang, T Wu, H Huang, L Zhan, L Li, Y Guan, Z Dai*, G Yu* ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Briefings in Bioinformatics. DOI:10.1093/bib/bbac222

Visualization with sample names

Plot alignment at around position 23063bp, showing variant N501Y

ggmsa(mt_brazilian, 
      start = 40, end = 80, 
      char_width = 0.5, 
      seq_name = T) + geom_seqlogo() + geom_msaBar()

## Warning in rbind(c("T", "A", "T", "T", "G", "A", "C", "T", "C", "A", "C", :
## number of columns of result is not a multiple of vector length (arg 9)

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

Visualization without sample names

Plot without sequence names

ggmsa(mt_brazilian, 
      start = 40, end = 80, 
      char_width = 0.5, 
      seq_name = F) + geom_seqlogo() + geom_msaBar()

## Warning in rbind(c("T", "A", "T", "T", "G", "A", "C", "T", "C", "A", "C", :
## number of columns of result is not a multiple of vector length (arg 9)

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

5. Visualize Haplogroup Classification with Haplogrep 3

Visualization is available https://haplogrep.readthedocs.io/en/latest/annotations/#clusters-and-population-frequencies

Haplogroup Classification with Haplogrep 3

6. Distance Matrix and Phylogenetic Tree of Partial Sequence

This is necessary so function dist.ml can take the DNA.bin object and process the Distance Matrix. Function dist.ml() is from package phangorn

library("phangorn")

## Warning: package 'phangorn' was built under R version 4.1.2

## Loading required package: ape

## Warning: package 'ape' was built under R version 4.1.2

## 
## Attaching package: 'ape'

## The following object is masked from 'package:Biostrings':
## 
##     complement

dm_extremo_sul_msa_partial <- dist.ml(as.DNAbin(alignment_start_to_end))

Plot distance matrix

Function table.paint() is from package adegenet. Load library adegenet:

library("adegenet")

## Warning: package 'adegenet' was built under R version 4.1.2

## Loading required package: ade4

## Warning: package 'ade4' was built under R version 4.1.2

## 
## Attaching package: 'ade4'

## The following object is masked from 'package:Biostrings':
## 
##     score

## The following object is masked from 'package:BiocGenerics':
## 
##     score

## 
##    /// adegenet 2.1.10 is loaded ////////////
## 
##    > overview: '?adegenet'
##    > tutorials/doc/questions: 'adegenetWeb()' 
##    > bug reports/feature requests: adegenetIssues()

## 
## Attaching package: 'adegenet'

## The following object is masked from 'package:phangorn':
## 
##     AICc

dm_extremo_sul_msa_partial_df <- as.data.frame(as.matrix(dm_extremo_sul_msa_partial))
table.paint(dm_extremo_sul_msa_partial_df, cleg=0, clabel.row=.5, clabel.col=.5)

Construct Phylogenetic Trees

Phylogenetic tree UPGMA method

treeUPGMA_extremo_sul_partial <- upgma(dm_extremo_sul_msa_partial_df)

Plot Phylogenetic Trees Extremo Sul

Plot UPGMA Extremo Sul

plot(treeUPGMA_extremo_sul_partial, main="UPGMA", col="red")

7. References

Bolze, Alexandre. 2014. Analysis of Mitochondrial DNA Variants in 200,000 Individuals. https://cdn.sanity.io/files/g5irbagy/production/960b88f3fd681a6207e9d18b9973e99fa6df9e1f.pdf.

Gatto, Laurent. 2023. Chapter 4 Manipulating Sequences with Biostrings. https://uclouvain-cbio.github.io/WSBIM1322/sec-biostrings.html.

Kalle Pärn, Javier Nunez Fontarnau, Marita A. Isokallio. 2018. Genotype Imputation Workflow V3.0 v.1. https://www.protocols.io/view/genotype-imputation-workflow-v3-0-e6nvw78dlmkj/v1?step=3.

Sabrina L. Mitchell, Kristin Brown‐Gentry, Robert Goodloe. 2014. Characterization of Mitochondrial Haplogroups in a Large Population‐based Sample from the United States. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4113317/.

tellmeGen. n.d. Genética Preditiva. Accessed May 2, 2024. https://www.tellmegen.com/pt/genetica-preditiva.

Yun Li, Serena Sanna, Cristen Willer, and Gonçalo Abecasis. 2009. Genotype Imputation. https://pubmed.ncbi.nlm.nih.gov/19715440/.

Zhou, Lang. 2022. Ggmsa:a Visual Exploration Tool for Multiple Sequence Alignment and Associated Data. http://yulab-smu.top/ggmsa/index.html.

Phylogenetic Analysis of Brazilian mtDNA Sequences

Gepoliano Chaves

May 23rd, 2024