The mitochondrion is an organelle that carries matrilineally inherited genetic information.
The human mitochondrial genome is 16,569 bp long (Bolze 2021; Bolze 2014).
Genetic variants (mutations) in mitochondrial DNA can be grouped into haplogroups.
Each matrilineal ancestry has characteristic haplogroups.
This figure from Mitchell et al., 2014 shows the phylogenetic tree of mitochondrial haplogroups in a large population of the USA.
A review of genotype imputation was written by Yun Li and Abecasis (2009). They mention using HapMap project data as reference panels for genotype inference, but they do not describe an imputation workflow or protocol for genetic variants in Brazilians. An imputation workflow was provided by Kalle Pärn (2018).
I was able to find an R function for imputation in https://search.r-project.org/CRAN/refmans/dotgen/html/imp.html
The function above is in package https://cran.r-project.org/web/packages/dotgen/index.html
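To make the idea of imputation concrete, here is a minimal base-R sketch of the simplest possible baseline: filling each missing call with the mean of the observed calls for that variant. This is not the method used by dotgen or by the reference-panel approaches discussed in the review; the toy matrix and the helper name impute_column_mean are hypothetical and exist only for illustration.

```r
# Toy genotype matrix (made up): rows are individuals, columns are variants,
# values are allele dosages; NA marks a missing call.
geno <- matrix(c(0, 1, NA,
                 2, NA, 1,
                 0, 1, 1,
                 2, 1, NA),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("ind", 1:4), paste0("var", 1:3)))

# Hypothetical helper: replace each missing call with the mean of the
# observed calls in the same column (variant).
impute_column_mean <- function(g) {
  col_means <- colMeans(g, na.rm = TRUE)
  missing <- which(is.na(g), arr.ind = TRUE)  # columns named "row" and "col"
  g[missing] <- col_means[missing[, "col"]]
  g
}

geno_imputed <- impute_column_mean(geno)
print(geno_imputed)
```

Mean imputation ignores correlation between neighbouring variants, which is exactly the information that panel-based imputation exploits, so it is useful only as a point of comparison.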
Load package msa: multiple sequence alignment
library("msa")
Load multi-fasta mtDNA from Brazil.
mt_brazilian <- readDNAStringSet(file="../../DNA do Brasil/matrilineal sequences/data/mt_brazilian.fasta")
## Warning in .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, :
## reading FASTA file ../../DNA do Brasil/matrilineal
## sequences/data/mt_brazilian.fasta: ignored 425 invalid one-letter sequence
## codes
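The warning above says that some characters in the FASTA file were not recognised as nucleotide codes and were dropped. A rough way to see which characters are involved is to tabulate everything outside the accepted alphabet. The sketch below uses base R only; the helper name count_invalid_codes and the toy input are hypothetical, and the list of valid codes is an assumption meant to mirror the DNA alphabet used by Biostrings.

```r
# Characters accepted for DNA here: IUPAC nucleotide codes plus gap symbols.
# (Assumption: this mirrors the alphabet Biostrings uses for DNA.)
valid_codes <- c(strsplit("ACGTMRWSYKVHDBN", "")[[1]], "-", "+", ".")

# Hypothetical helper: tabulate the characters in FASTA sequence lines
# that fall outside the accepted alphabet. Header lines (">") are skipped.
count_invalid_codes <- function(fasta_lines) {
  seq_lines <- fasta_lines[!startsWith(fasta_lines, ">")]
  chars <- toupper(unlist(strsplit(seq_lines, "")))
  table(chars[!chars %in% valid_codes])
}

# Toy input (made up); for a real file, pass the result of readLines()
# on the FASTA path instead.
toy_fasta <- c(">seq1", "ACGTNNXA", ">seq2", "ACG?TT-A")
invalid <- count_invalid_codes(toy_fasta)
print(invalid)
```

If the table shows characters such as spaces or digits, the file probably carries formatting residue rather than genuine sequence ambiguity.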
The approach to subsetting sequences follows Chapter 4, Manipulating Sequences with Biostrings, in Gatto (2023). Here, subseq() extracts positions 100 to 200 (inclusive, 101 bases) from every sequence.
seq_start_to_end <- subseq(mt_brazilian, start = 100, end = 200)
Print the extracted subsequences
print(seq_start_to_end)
## DNAStringSet object of length 17:
## width seq names
## [1] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243780
## [2] 101 ATGAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243781
## [3] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243782
## [4] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243783
## [5] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243784
## ... ... ...
## [13] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243792
## [14] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243793
## [15] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243794
## [16] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243795
## [17] 101 ATAAAAACCCAATCCACATCAAA...CAACTGCAACTCCAAAGCCACC AF243796
Align sequences
start_time <- Sys.time()
alignment_start_to_end <- msa(seq_start_to_end)
## use default substitution matrix
end_time <- Sys.time()
end_time - start_time
## Time difference of 0.069911 secs
alignment_start_to_end
## CLUSTAL 2.1
##
## Call:
## msa(seq_start_to_end)
##
## MsaDNAMultipleAlignment with 17 rows and 102 columns
## aln names
## [1] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243780
## [2] ATGAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243781
## [3] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243789
## [4] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243786
## [5] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243795
## [6] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243796
## [7] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243785
## [8] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243787
## [9] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243794
## [10] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243783
## [11] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243784
## [12] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243791
## [13] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTC-AAAGCCACCC AF243788
## [14] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243793
## [15] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243782
## [16] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- AF243792
## [17] ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAATTCCAAAGCCACC- AF243790
## Con ATAAAAACCCAATCCACATCAAAAC...ATCAACTGCAACTCCAAAGCCACC- Consensus
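Most rows in the alignment above are identical, so the informative part is the handful of columns where sequences differ. The following base-R sketch shows one way to locate such variable columns. It runs on a made-up toy alignment stored as a character vector, not on the msa object itself.

```r
# Toy aligned sequences (equal length, "-" marks a gap); made up for illustration.
aln <- c(s1 = "ATAAC-",
         s2 = "ATGAC-",
         s3 = "ATAACC",
         s4 = "ATAAT-")

# One row per sequence, one column per alignment position.
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# A column is variable when it holds more than one distinct symbol.
# Gaps count as a symbol here, so indels are reported as well.
is_variable <- apply(aln_matrix, 2, function(column) length(unique(column)) > 1)
variable_sites <- which(is_variable)

print(variable_sites)
print(aln_matrix[, variable_sites, drop = FALSE])
```

Counting gaps as a distinct symbol is a design choice; dropping "-" before calling unique() would report substitutions only.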
library(ggmsa)
## Registered S3 methods overwritten by 'ggalt':
## method from
## grid.draw.absoluteGrob ggplot2
## grobHeight.absoluteGrob ggplot2
## grobWidth.absoluteGrob ggplot2
## grobX.absoluteGrob ggplot2
## grobY.absoluteGrob ggplot2
## Registered S3 method overwritten by 'ggtree':
## method from
## identify.gg ggfun
## ggmsa v1.8.0 Document: http://yulab-smu.top/ggmsa/
##
## If you use ggmsa in published research, please cite:
## L Zhou, T Feng, S Xu, F Gao, TT Lam, Q Wang, T Wu, H Huang, L Zhan, L Li, Y Guan, Z Dai*, G Yu* ggmsa: a visual exploration tool for multiple sequence alignment and associated data. Briefings in Bioinformatics. DOI:10.1093/bib/bbac222
ggmsa(mt_brazilian,
start = 40, end = 80,
char_width = 0.5,
seq_name = TRUE) + geom_seqlogo() + geom_msaBar()
## Warning in rbind(c("T", "A", "T", "T", "G", "A", "C", "T", "C", "A", "C", :
## number of columns of result is not a multiple of vector length (arg 9)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
ggmsa(mt_brazilian,
start = 40, end = 80,
char_width = 0.5,
seq_name = FALSE) + geom_seqlogo() + geom_msaBar()
## Warning in rbind(c("T", "A", "T", "T", "G", "A", "C", "T", "C", "A", "C", :
## number of columns of result is not a multiple of vector length (arg 9)
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
A visualization of haplogroup clusters and population frequencies is available at https://haplogrep.readthedocs.io/en/latest/annotations/#clusters-and-population-frequencies
The alignment is converted with as.DNAbin() so that dist.ml() can take the resulting DNAbin object and compute the distance matrix. The dist.ml() function is from the phangorn package.
library("phangorn")
## Warning: package 'phangorn' was built under R version 4.1.2
## Loading required package: ape
## Warning: package 'ape' was built under R version 4.1.2
##
## Attaching package: 'ape'
## The following object is masked from 'package:Biostrings':
##
## complement
dm_extremo_sul_msa_partial <- dist.ml(as.DNAbin(alignment_start_to_end))
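To illustrate what a model-based pairwise distance looks like, here is a minimal base-R sketch of the Jukes-Cantor (JC69) correction, d = -3/4 * ln(1 - 4p/3), where p is the proportion of differing sites. It assumes that dist.ml() is run with a JC69 model, which is believed to be its default; it is not a reimplementation of phangorn. The toy sequences and the helper name jc69_distance are hypothetical.

```r
# Toy aligned sequences without gaps (made up for illustration).
aln <- c(s1 = "ATAAACCCAATC",
         s2 = "ATGAACCCAATC",
         s3 = "ATAAACCCAATT",
         s4 = "ATGAACCCAGTT")
aln_matrix <- do.call(rbind, strsplit(aln, ""))

# Hypothetical helper: pairwise JC69 distances from a character matrix
# with one row per sequence and one column per aligned site.
jc69_distance <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      p <- mean(m[i, ] != m[j, ])                        # proportion of differing sites
      d[i, j] <- d[j, i] <- -3 / 4 * log(1 - 4 * p / 3)  # JC69 correction
    }
  }
  d
}

d_jc69 <- jc69_distance(aln_matrix)
print(round(d_jc69, 4))
```

The correction grows faster than p because it accounts for multiple substitutions at the same site; for nearly identical sequences such as these, the two values stay close.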
Function table.paint() is from package ade4, which adegenet loads as a dependency. Load library adegenet:
library("adegenet")
## Warning: package 'adegenet' was built under R version 4.1.2
## Loading required package: ade4
## Warning: package 'ade4' was built under R version 4.1.2
##
## Attaching package: 'ade4'
## The following object is masked from 'package:Biostrings':
##
## score
## The following object is masked from 'package:BiocGenerics':
##
## score
##
## /// adegenet 2.1.10 is loaded ////////////
##
## > overview: '?adegenet'
## > tutorials/doc/questions: 'adegenetWeb()'
## > bug reports/feature requests: adegenetIssues()
##
## Attaching package: 'adegenet'
## The following object is masked from 'package:phangorn':
##
## AICc
dm_extremo_sul_msa_partial_df <- as.data.frame(as.matrix(dm_extremo_sul_msa_partial))
table.paint(dm_extremo_sul_msa_partial_df, cleg=0, clabel.row=.5, clabel.col=.5)
Build a phylogenetic tree with the UPGMA method
treeUPGMA_extremo_sul_partial <- upgma(dm_extremo_sul_msa_partial_df)
Plot the UPGMA tree for Extremo Sul
plot(treeUPGMA_extremo_sul_partial, main="UPGMA", col="red")
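UPGMA is average-linkage hierarchical clustering applied to a distance matrix, so the same procedure can be sketched with hclust() from base R. The example below uses a small made-up distance matrix rather than the notebook's data, and is meant only to show how clusters are merged.

```r
# Toy symmetric distance matrix (made up for illustration).
d <- matrix(c( 0,  2,  6, 10,
               2,  0,  6, 10,
               6,  6,  0, 10,
              10, 10, 10,  0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Average linkage: the distance between two clusters is the mean of all
# pairwise distances between their members.
tree <- hclust(as.dist(d), method = "average")

print(tree$merge)   # order in which tips and clusters are joined
print(tree$height)  # cluster distance at each merge
```

Calling plot(tree) draws the dendrogram. In an ultrametric UPGMA tree, each node sits at half the merge height shown in tree$height.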