The R2 platform is a freely accessible online genomics analysis and visualization tool which can analyze a large collection of public data, but also allows shielded analysis of private dataset(s) as indicated by Rijswijk (2024).
R2 consists of a database, storing the genomic information, coupled to an extensive set of tools to analyze/visualize the datasets.
A link to the can be found in the references session (R2 2023) and below:
The file downloaded from R2 comes with the “#” symbols, which need to be removed.
r2_gse62564 <- read.table("../data/ps_avgpres_gse62564geo498_seqcnb1_box1687888611-datagrabber-.txt")
class(r2_gse62564)
saveRDS(r2_gse62564, file = "../data/r2_gse62564.rds")
Removal of “#” will create and save the appropriate data-frame
r2_gse62564 <- read.table("../data/ps_avgpres_gse62564geo498_seqcnb1_box1687888611-datagrabber_modified.txt")
class(r2_gse62564)
saveRDS(r2_gse62564, file = "../data/r2_gse62564.rds")
r2_gse62564.rds loaded needs to be the correct one.
If it was loaded from the first chunk above, R throws an error.
Object r2_gse62564.rds is stored in the data folder of the BBC-2024 Google Drive:
This code allows calculation of the time R takes to load the dataframe into the R environment:
start_time <- Sys.time()
r2_gse62564 <- readRDS("../data/r2_gse62564.rds")
end_time <- Sys.time()
end_time - start_time
## Time difference of 2.061233 secs
Use the head() function
head(r2_gse62564, n = 20)
Use the dim() function
dim(r2_gse62564)
## [1] 24961 500
Load dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Metadata is a set of data that describes and gives information about other data.
Take the first 17 lines to be the metadata.
metadata_gse62564 <- r2_gse62564[1:17,]
Visualize metadata dataframe
## View(metadata_gse62564)
head(metadata_gse62564, n = 20)
Visualize gene expression dataframe
head(r2_gse62564, n = 20)
Note that the second column, in the previous visualization, is repeated
Remove the second column, which is repeated using the select method of the dplyr package
r2_gse62564 <- dplyr::select(r2_gse62564, -V2)
head(r2_gse62564, n = 20)
Remove all metadata information, keep only gene counts in the gene expression dataframe
r2_gse62564 <- r2_gse62564[-c(2:18),]
Give names to columns using row 1
colnames(r2_gse62564) <- r2_gse62564[1, ]
Remove column that contain rows names
r2_gse62564 <- r2_gse62564[-1,]
head(r2_gse62564, n = 20)
library(maditr)
##
## To modify variables or add new variables:
## let(mtcars, new_var = 42, new_var2 = new_var*hp) %>% head()
##
## Attaching package: 'maditr'
## The following objects are masked from 'package:dplyr':
##
## between, coalesce, first, last
## The following object is masked from 'package:base':
##
## sort_by
Remove duplicated rows using the maditr package
distinct() keeps only unique/distinct rows from a data frame.
r2_gse62564_distinct <- r2_gse62564 %>% distinct(`H:hugo`, .keep_all = TRUE)
Look at the gene expression dataset
head(r2_gse62564, n = 20)
Give names to rows
rownames(r2_gse62564_distinct) <- r2_gse62564_distinct$`H:hugo`
Remove the H:hugo column
r2_gse62564_distinct <- dplyr::select(r2_gse62564_distinct, -`H:hugo`)
Look at the gene expression dataset after the distinct method is applied
head(r2_gse62564, n = 20)
names(r2_gse62564_distinct) <- toupper(names(r2_gse62564_distinct))
r2_gse62564_distinct <- r2_gse62564_distinct %>% mutate_if(is.character, as.numeric)
r2_gse62564_matrix <- as.matrix(r2_gse62564_distinct)
r2_gse62564_df <- as.data.frame(r2_gse62564_matrix)
Remove the extra H:hugo column in r2_gse62564
r2_gse62564 <- dplyr::select(r2_gse62564, -`H:hugo`)
Rename columns in r2_gse62564_distinct
names(r2_gse62564_distinct) <- toupper(names(r2_gse62564))
r2_gse62564_distinct <- r2_gse62564_distinct %>% mutate_if(is.character, as.numeric)
r2_gse62564_matrix <- as.matrix(r2_gse62564_distinct)
r2_gse62564_df <- as.data.frame(r2_gse62564_matrix)
In this part, we construct the dataframe to include phenotype scoring with the gene expression values. To calculate phenotype scores, we use the GSVA library.
Load the GSVA and GSEABAse libraries
# install.packages(BiocManager)
# library(BiocManager)
# BiocManager::install("GSVA")
library("GSVA")
library("GSEABase")
GSVA error:
Show in New Window
[1] "data.frame"
Show in New Window
Time difference of 5.320694 secs
Show in New Window
Error in getGmt("../data/cfDNA_genes_PCA.txt") :
could not find function "getGmt"
Show in New Window
Error: package or namespace load failed for ‘GSVA’:
.onLoad failed in loadNamespace() for 'XML', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/XML/libs/XML.so':
dlopen(/Library/Frameworks/R.framework/Versions/4.1/Resources/library/XML/libs/XML.so, 6): Library not loaded: @rpath/libxml2.2.dylib
Referenced from: /Library/Frameworks/R.framework/Versions/4.1/Resources/library/XML/libs/XML.so
Reason: image not found
Reference for GSVA error:
https://support.bioconductor.org/p/9157686/
Re-install Bioconductor and GSVA:
https://bioconductor.org/install/
Load gene sets
cfDNA_PCA_gene_list <- getGmt("../data/cfDNA_genes_PCA.txt")
## Warning in readLines(con, ...): incomplete final line found on
## '../data/cfDNA_genes_PCA.txt'
## Warning in getGmt("../data/cfDNA_genes_PCA.txt"): 321 record(s) contain
## duplicate ids: ADRN_Gene_List_373, ADRN_Gronigen, ...,
## WP_GLYCOLYSIS_AND_GLUCONEOGENESIS_7, WP_GLYCOLYSIS_IN_SENESCENCE_Genes_4
Construct GSVA data-frame. This step needs to be re-run everytime there is a new gene set. From previous code versions, this chunk is now edpracated.
r2_gse62564_GSVA <- gsva(r2_gse62564_matrix,
cfDNA_PCA_gene_list,
min.sz=1, max.sz=Inf,
verbose=TRUE)
## for simplicity, use only a subset of the sample data
ses <- leukemia_eset[1:1000, ]
gsc <- c2BroadSets[1:100]
gp1 <- gsvaParam(ses, gsc, minSize = 2, maxSize = Inf)
gp1
gsva_gp1 <- gsva(gp1, verbose=TRUE)
build GSVA parameter object, according to documentation. In the examples of the gsvaParam class documentation of the leukemia and c2BroadSets, only the ses and gsc objects were used, not the maxDiff set to TRUE.
# gsvapar <- gsvaParam(y, geneSets, maxDiff=TRUE)
gsvapar_gse62564 <- gsvaParam(r2_gse62564_matrix, cfDNA_PCA_gene_list)
## estimate GSVA enrichment scores for the three sets
# gsva_es <- gsva(gsvapar)
gsva_es_gse62564 <- gsva(gsvapar_gse62564)
## Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
## removeNzConstant = removeNzConstant): 164 genes with constant values throughout
## the samples.
## Warning in .filterGenes(dataMatrix, removeConstant = removeConstant,
## removeNzConstant = removeNzConstant): Genes with constant values are discarded.
## No annotation package name available in the input data object.
## Attempting to directly match identifiers in data to gene sets.
## Warning in .filterAndMapGenesAndGeneSets(param): Some gene sets have size one.
## Consider setting 'minSize' to a value > 1.
## Estimating GSVA scores for 312 gene sets.
## Estimating ECDFs with Gaussian kernels
## | | | 0% | | | 1% | |= | 1% | |= | 2% | |== | 2% | |== | 3% | |== | 4% | |=== | 4% | |=== | 5% | |==== | 5% | |==== | 6% | |===== | 7% | |===== | 8% | |====== | 8% | |====== | 9% | |======= | 9% | |======= | 10% | |======= | 11% | |======== | 11% | |======== | 12% | |========= | 12% | |========= | 13% | |========== | 14% | |========== | 15% | |=========== | 15% | |=========== | 16% | |============ | 17% | |============ | 18% | |============= | 18% | |============= | 19% | |============== | 20% | |============== | 21% | |=============== | 21% | |=============== | 22% | |================ | 22% | |================ | 23% | |================= | 24% | |================= | 25% | |================== | 25% | |================== | 26% | |=================== | 27% | |=================== | 28% | |==================== | 28% | |==================== | 29% | |===================== | 29% | |===================== | 30% | |====================== | 31% | |====================== | 32% | |======================= | 32% | |======================= | 33% | |======================== | 34% | |======================== | 35% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |========================== | 38% | |=========================== | 38% | |=========================== | 39% | |============================ | 39% | |============================ | 40% | |============================ | 41% | |============================= | 41% | |============================= | 42% | |============================== | 42% | |============================== | 43% | |=============================== | 44% | |=============================== | 45% | |================================ | 45% | |================================ | 46% | |================================= | 46% | |================================= | 47% | |================================= | 48% | |================================== | 48% | |================================== | 49% | |=================================== | 49% | |=================================== | 50% | |=================================== | 51% | |==================================== | 51% | |==================================== | 52% | |===================================== | 52% | |===================================== | 53% | |===================================== | 54% | |====================================== | 54% | |====================================== | 55% | |======================================= | 55% | |======================================= | 56% | |======================================== | 57% | |======================================== | 58% | |========================================= | 58% | |========================================= | 59% | |========================================== | 59% | |========================================== | 60% | |========================================== | 61% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 62% | |============================================ | 63% | |============================================= | 64% | |============================================= | 65% | |============================================== | 65% | |============================================== | 66% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 68% | |================================================ | 69% | |================================================= | 70% | |================================================= | 71% | |================================================== | 71% | |================================================== | 72% | |=================================================== | 72% | |=================================================== | 73% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |====================================================== | 78% | |======================================================= | 78% | |======================================================= | 79% | |======================================================== | 79% | |======================================================== | 80% | |========================================================= | 81% | |========================================================= | 82% | |========================================================== | 82% | |========================================================== | 83% | |=========================================================== | 84% | |=========================================================== | 85% | |============================================================ | 85% | |============================================================ | 86% | |============================================================= | 87% | |============================================================= | 88% | |============================================================== | 88% | |============================================================== | 89% | |=============================================================== | 89% | |=============================================================== | 90% | |=============================================================== | 91% | |================================================================ | 91% | |================================================================ | 92% | |================================================================= | 92% | |================================================================= | 93% | |================================================================== | 94% | |================================================================== | 95% | |=================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 96% | |==================================================================== | 97% | |==================================================================== | 98% | |===================================================================== | 98% | |===================================================================== | 99% | |======================================================================| 99% | |======================================================================| 100%
Merge gene expression and gsva scores
# r2_gse62564_GSVA_genes <- rbind(r2_gse62564_GSVA, r2_gse62564_matrix)
gsva_es_gse62564_genes <- rbind(gsva_es_gse62564, r2_gse62564_matrix)
# r2_gse62564_GSVA_genes <- rbind(r2_gse62564_GSVA, r2_gse62564_matrix)
gsva_es_gse62564_genes <- rbind(gsva_es_gse62564, r2_gse62564_matrix)
saveRDS(gsva_es_gse62564_genes, file = "../data/gsva_es_gse62564_genes.rds")
gse62564_metadata <- metadata_gse62564
# Repeat row with IDs to confirm matching later
gse62564_metadata[18,] <- gse62564_metadata[1,]
# Remove first column of metadata DF
gse62564_metadata <- gse62564_metadata[,-c(1)]
# Make all IDs in row 2, uppercase
gse62564_metadata[1,2:499] <- toupper(gse62564_metadata[1,2:499])
# Make row 1, column names
library(janitor)
gse62564_metadata <- gse62564_metadata %>%
row_to_names(row_number = 1)
# Make column as row names in metadata
library(tidyverse)
gse62564_metadata <- gse62564_metadata %>% remove_rownames %>% column_to_rownames(var="probeset")
r2_gse62564_GSVA_Metadata <- rbind(gse62564_metadata, r2_gse62564_GSVA_genes)
r2_gse62564_GSVA_Metadata <- t(r2_gse62564_GSVA_Metadata)
r2_gse62564_GSVA_Metadata <- as.data.frame(r2_gse62564_GSVA_Metadata)
saveRDS(r2_gse62564_GSVA_Metadata, file = "../results/r2_gse62564_GSVA_Metadata.rds")
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] GSEABase_1.66.0 graph_1.82.0 annotate_1.82.0
## [4] XML_3.99-0.17 AnnotationDbi_1.66.0 IRanges_2.38.1
## [7] S4Vectors_0.42.1 Biobase_2.64.0 BiocGenerics_0.50.0
## [10] GSVA_1.52.3 maditr_0.8.4 dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 blob_1.2.4
## [3] Biostrings_2.72.1 fastmap_1.2.0
## [5] SingleCellExperiment_1.26.0 digest_0.6.37
## [7] rsvd_1.0.5 lifecycle_1.0.4
## [9] KEGGREST_1.44.1 RSQLite_2.3.7
## [11] magrittr_2.0.3 compiler_4.4.1
## [13] rlang_1.1.4 sass_0.4.9
## [15] tools_4.4.1 utf8_1.2.4
## [17] yaml_2.3.10 data.table_1.15.4
## [19] knitr_1.48 S4Arrays_1.4.1
## [21] bit_4.0.5 DelayedArray_0.30.1
## [23] abind_1.4-5 BiocParallel_1.38.0
## [25] HDF5Array_1.32.1 withr_3.0.1
## [27] grid_4.4.1 fansi_1.0.6
## [29] beachmat_2.20.0 xtable_1.8-4
## [31] Rhdf5lib_1.26.0 SummarizedExperiment_1.34.0
## [33] cli_3.6.3 rmarkdown_2.28
## [35] crayon_1.5.3 generics_0.1.3
## [37] rstudioapi_0.16.0 httr_1.4.7
## [39] rjson_0.2.22 DBI_1.2.3
## [41] cachem_1.1.0 rhdf5_2.48.0
## [43] zlibbioc_1.50.0 parallel_4.4.1
## [45] XVector_0.44.0 matrixStats_1.3.0
## [47] vctrs_0.6.5 Matrix_1.7-0
## [49] jsonlite_1.8.8 BiocSingular_1.20.0
## [51] bit64_4.0.5 irlba_2.3.5.1
## [53] magick_2.8.4 jquerylib_0.1.4
## [55] glue_1.7.0 codetools_0.2-20
## [57] GenomeInfoDb_1.40.1 GenomicRanges_1.56.1
## [59] UCSC.utils_1.0.0 ScaledMatrix_1.12.0
## [61] tibble_3.2.1 pillar_1.9.0
## [63] htmltools_0.5.8.1 rhdf5filters_1.16.0
## [65] GenomeInfoDbData_1.2.12 R6_2.5.1
## [67] sparseMatrixStats_1.16.0 evaluate_0.24.0
## [69] lattice_0.22-6 png_0.1-8
## [71] SpatialExperiment_1.14.0 memoise_2.0.1
## [73] bslib_0.8.0 Rcpp_1.0.13
## [75] SparseArray_1.4.8 xfun_0.47
## [77] MatrixGenerics_1.16.0 pkgconfig_2.0.3