The survival analysis reference used was writen by Zabor
Transfer libraries
The survival analysis reference used was Jaeger.
To add p-values to plots, I used the Kassambara 2017 and 2020 references
library(maditr)
##
## To get total summary skip 'by' argument: take_all(mtcars, mean)
##
## Attaching package: 'maditr'
## The following object is masked from 'package:base':
##
## sort_by
Load the GSVA object back to the R environment. This object now contains all the metadata information.
r2_gse62564_GSVA_Metadata <- readRDS( "../results/r2_gse62564_GSVA_Metadata.rds")
In neuroblastoma, the main risk classification system classifies patients using features such as age, stage, histology and MYCN amplification status.
Prognosis is the association between tumor features and patient survival.
MYCN is a gene that encodes a transcription factor that is very important in tumor prognosis.
Let’s take a look at the MYCN expression compared by the neuroblastoma risk group.
First, let’s see which type of variable is the the MYCN status and the MYCN expression.
Check the class of the mycn_status variable:
class(r2_gse62564_GSVA_Metadata$mycn_status)
## [1] "character"
Check the class of the MYCN variable
class(r2_gse62564_GSVA_Metadata$MYCN)
## [1] "character"
They are both of type “character”. In order to plot, we have to make the MYCN expression to be of type numeric:
r2_gse62564_GSVA_Metadata$MYCN <- as.numeric(r2_gse62564_GSVA_Metadata$MYCN)
Check again the class of the MYCN variable:
class(r2_gse62564_GSVA_Metadata$MYCN)
## [1] "numeric"
Make age_at_diagnosis numeric to check age mean
r2_gse62564_GSVA_Metadata$age_at_diagnosis <- as.numeric(r2_gse62564_GSVA_Metadata$age_at_diagnosis)
Check the mean age at diagnosis
age_diagnosis_days_gse62564_498 <- mean(r2_gse62564_GSVA_Metadata$age_at_diagnosis)
age_diagnosis_days_gse62564_498
## [1] 758.5643
Let’s try and see the boxplot of the MYCN expression vs. MYCN amplification status.
Using the boxplot function that we tried before, we will have an error.
boxplot(data=r2_gse62564_GSVA_Metadata, r2_gse62564_GSVA_Metadata$mycn_status, r2_gse62564_GSVA_Metadata$MYCN)
Maybe we could try the boxplot function making the mycn_status a numeric variable.
boxplot(data=r2_gse62564_GSVA_Metadata, as.numeric(r2_gse62564_GSVA_Metadata$mycn_status), r2_gse62564_GSVA_Metadata$MYCN)
Let’s try and see if ggplot, which we have also used before, takes the mycn_status variable as categorical as it is right now.
library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=mycn_status, y=MYCN, fill=mycn_status)) +
geom_boxplot()
In this figure, we see that group with clinical MYCN amplification, also shows high MYCN expression
What is the difference, in measurement, between MYCN amplification and MYCN expression?
How are the MYCN amplification and MYCN gene expression correlated?
For now, I am not bothering neither with formating of the X and Y labels nor with the N/A values of the MYCN amplification status.
We can plot the high risk status vs. MYCN amplification and leave the fill argument as mycn_status
library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=MYCN, fill=mycn_status)) +
geom_boxplot()
Or we can make it the high_risk status instead
library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=MYCN, fill=high_risk)) +
geom_boxplot()
Here we see that the high_risk status correlates positively with MYCN expression.
Make the OS (overall survival) status numeric, instead of character. Do the same for the EFS.
Construct os_bin_numeric from os_bin
## OS
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
dplyr::mutate(os_bin_numeric =
ifelse(r2_gse62564_GSVA_Metadata$os_bin == "event", "1", "0"))
Make os_bin_numeric as numeric
r2_gse62564_GSVA_Metadata["os_bin_numeric"] <- as.numeric(r2_gse62564_GSVA_Metadata$os_bin_numeric)
Construct os_bin_numeric from os_bin
## EFS
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
dplyr::mutate(efs_bin_numeric =
ifelse(r2_gse62564_GSVA_Metadata$efs_bin == "event", "1", "0"))
Make efs_bin_numeric as numeric
r2_gse62564_GSVA_Metadata["efs_bin_numeric"] <- as.numeric(r2_gse62564_GSVA_Metadata$efs_bin_numeric)
Construct survival curve
library(survival)
fit_os_high_risk <- survfit(Surv(as.numeric(os_day), os_bin_numeric) ~ r2_gse62564_GSVA_Metadata$high_risk,
data = r2_gse62564_GSVA_Metadata)
Plot survival curve w/o percents
library(survminer)
ggsurvplot(fit_os_high_risk,
data = r2_gse62564_GSVA_Metadata,
risk.table = TRUE,
# risk.table = "abs_pct",
pval = TRUE,
pval.coord = c(2000, 1),
)
Plot survival curve w/ percents
ggsurvplot(fit_os_high_risk,
data = r2_gse62564_GSVA_Metadata,
# risk.table = TRUE,
risk.table = "abs_pct",
pval = TRUE,
pval.coord = c(2000, 1),
)
Let’s now establish a relationship between neuroblastoma high risk status and hypoxia
Hypoxia is the condition of low oxygen concentration that is established in the tumor microenvironment as cancer progresses.
HIF1A is a transcription factor sensitive to oxygen levels;
HIF1S is a single gene
We need to make the HIF1A column (variable), numeric:
r2_gse62564_GSVA_Metadata$HIF1A <- as.numeric(r2_gse62564_GSVA_Metadata$HIF1A)
Compare HIF1A means between the high risk groups using the compare_means() function
compare_means(data = r2_gse62564_GSVA_Metadata, HIF1A ~ high_risk, method = "t.test")
Plot high risk vs. HIF1A
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HIF1A, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test", label.x = 1.4)+
theme_linedraw()
This is one of the gene sets we talked about before;
HALLMARK_HYPOXIA represent a group of genes that define response to hypoxia
In the dataframe, we also must make HALLMARK_HYPOXIA as numeric:
r2_gse62564_GSVA_Metadata$HALLMARK_HYPOXIA <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_HYPOXIA)
To then plot:
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill=high_risk)) +
geom_boxplot()
This plot suggests that there are components in the HALLMARK_HYPOXIA phenotype responsible for increased hypoxia phenotype in patients of the non-high risk group.
Surprisingly, these components do not correlate with the HIF1A activation program in the high-risk group
This finding is the type of information worth reporting in a publication
Because this concept will appear in a publication, I want to run statistical analysis and well as want to make the figure look fancy.
In the next chunk, we construct a data-frame with statistics calculations for our plot.
The reference used here was the Kassambara 2023 article:
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
# Statistical test
stat.test <- r2_gse62564_GSVA_Metadata %>%
t_test(HALLMARK_HYPOXIA ~ high_risk) %>%
add_significance()
We can then look at the statistics dataframe
stat.test
In the next chunk, we create the object that will show the statistical results. After, we use the stat_pvalue_manual function.
We also change the theme of the ggplot2 figure plot.
library(ggpubr)
r2_gse62564_GSVA_Metadata$high_risk <- as.factor(r2_gse62564_GSVA_Metadata$high_risk)
Plot high risk vs. HALLMARK_HYPOXIA
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA)) +
geom_boxplot()+
stat_pvalue_manual(stat.test, label = "p", y.position = 0.6) + # y.position is mandatory
scale_y_continuous(expand = expansion(mult = c(0.05, 0.1)))
Make HR factor again
r2_gse62564_GSVA_Metadata$high_risk <- as.factor(r2_gse62564_GSVA_Metadata$high_risk)
Plot high_risk vs. HALLMARK_HYPOXIA again but with statistics:
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill="high_risk")) +
geom_boxplot()+
stat_pvalue_manual(stat.test, label = "p", y.position = 0.6) + # y.position is mandatory
scale_y_continuous(expand = expansion(mult = c(0.05, 0.1)))
Use compare_means function to construct t-test dataframe:
compare_means(data = r2_gse62564_GSVA_Metadata, HALLMARK_HYPOXIA ~ high_risk, method = "t.test")
Plot HR vs. HALLMARK_HYPOXIA with different colors per HR group and different theme:
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test")+
theme_linedraw()
r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Down_635.txt, fill=high_risk)) +
geom_boxplot()
compare_means(data = r2_gse62564_GSVA_Metadata, ADRN_Norm_vs_Hypo_Down_635.txt ~ high_risk, method = "t.test")
r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Down_635.txt, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test")+
theme_linedraw()
Make ADRN_Norm_vs_Hypo_Up_554.txt numeric to avoid error
r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Up_554.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Up_554.txt)
Run the t-test
compare_means(data = r2_gse62564_GSVA_Metadata, ADRN_Norm_vs_Hypo_Up_554.txt ~ high_risk, method = "t.test")
Plot
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Up_554.txt, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test")+
theme_linedraw()
The compare_means function can not deal with spaces in the T cell exhaustion variable name.
We will have to use a trick to deal with the spaces in “T cell exhaustion”.
We will use the package stringr to insert dots (.) replacing the spaces in the r2_gse62564_GSVA_Metadata dataframe.
To start, we have to create the r2_gse62564_GSVA_Metadata_Spaces dataframe, which will have structure names that contained spaces, changed to one of these new characteres: “.” , “,”, or ““.
r2_gse62564_GSVA_Metadata_Spaces <- r2_gse62564_GSVA_Metadata
library(stringr)
names(r2_gse62564_GSVA_Metadata_Spaces)<-str_replace_all(names(r2_gse62564_GSVA_Metadata_Spaces), c(" " = "." , "," = "" ))
r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_INFLAMMATORY_RESPONSE, fill=high_risk)) +
geom_boxplot()
compare_means(data = r2_gse62564_GSVA_Metadata, HALLMARK_INFLAMMATORY_RESPONSE ~ high_risk, method = "t.test")
r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_INFLAMMATORY_RESPONSE, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test")+
theme_linedraw()
Construct the HIF1A quantile variable:
library(tidyverse)
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
mutate(quantile = ntile(HIF1A, 2))
Make the HIF1A quantile variable a factor
r2_gse62564_GSVA_Metadata$quantile <- as.factor(r2_gse62564_GSVA_Metadata$quantile )
Scaterplot HALLMARK_HYPOXIA vs. HIF1A
qplot(HALLMARK_HYPOXIA, HIF1A,
data = r2_gse62564_GSVA_Metadata,
colour=quantile,
ylab = "HIF1A",
xlab = "Hallmark Hypoxia")
Construct survival object with HIF1A quantile variable
library(survival)
fit_os_HIF1A <- survfit(Surv(as.numeric(os_day), os_bin_numeric) ~ r2_gse62564_GSVA_Metadata$quantile,
data = r2_gse62564_GSVA_Metadata)
Plot survival curve OS
library(survminer)
ggsurvplot(fit_os_HIF1A,
data = r2_gse62564_GSVA_Metadata,
risk.table = TRUE,
# risk.table = "abs_pct",
pval = TRUE,
pval.coord = c(4000, 1),
)
Construct the HIF1A quantile variable:
library(tidyverse)
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
mutate(quantile = ntile(HIF1A, 2))
Construct survival object EFS
library(survival)
fit_efs_HIF1A <- survfit(Surv(as.numeric(efs_day), efs_bin_numeric) ~ r2_gse62564_GSVA_Metadata$quantile,
data = r2_gse62564_GSVA_Metadata)
Plot survival curve EFS
library(survminer)
ggsurvplot(fit_efs_HIF1A,
data = r2_gse62564_GSVA_Metadata,
risk.table = TRUE,
# risk.table = "abs_pct",
pval = TRUE,
pval.coord = c(4000, 1),
)
r2_gse62564_GSVA_Metadata_Spaces_survival <- r2_gse62564_GSVA_Metadata_Spaces
Make age_at_diagnosis numeric to plot the quantiles
r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis <- as.numeric(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)
Check maximum age_at_diagnosis
max(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)
## [1] 8983
Check minimum age_at_diagnosis
min(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)
## [1] 0
Construct age_qtls quantiles
r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival %>%
mutate(age_qtls = ntile(age_at_diagnosis, 3))
Make age_qtls factor
r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)
Plot age_qtls vs. HIF1A
ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=high_risk)) +
geom_boxplot()+
stat_compare_means(method = "t.test")+
theme_linedraw()
Divide the age_at_diagnosis variable into several quantiles:
r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival %>%
mutate(age_qtls = ntile(age_at_diagnosis, 3))
Make age_qtls a factor
r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)
Plot age_qtls vs HIF1A Gene Expression
ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=mycn_status)) +
geom_boxplot()+
stat_compare_means(method = "anova")+
theme_linedraw()
Remove the NAs from the mycn_status variable
r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival_age[which(
r2_gse62564_GSVA_Metadata_Spaces_survival_age$mycn_status != "na" ), ]
Construct age_qtls variable; the number of quantiles can change between 3-4 or more.
r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival_age %>%
mutate(age_qtls = ntile(age_at_diagnosis, 3))
Plot age_qtls vs HIF1A Gene Expression
r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)
ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=mycn_status)) +
geom_boxplot()+
stat_compare_means(method = "anova")+
theme_linedraw()
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 dplyr_1.1.4 purrr_1.0.2
## [5] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 tidyverse_2.0.0
## [9] stringr_1.5.1 rstatix_0.7.2 survminer_0.4.9 ggpubr_0.6.0
## [13] survival_3.7-0 ggplot2_3.5.1 maditr_0.8.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.5 xfun_0.47 bslib_0.8.0 lattice_0.22-6
## [5] tzdb_0.4.0 vctrs_0.6.5 tools_4.4.1 generics_0.1.3
## [9] fansi_1.0.6 highr_0.11 pkgconfig_2.0.3 Matrix_1.7-0
## [13] data.table_1.15.4 lifecycle_1.0.4 compiler_4.4.1 farver_2.1.2
## [17] munsell_0.5.1 carData_3.0-5 htmltools_0.5.8.1 sass_0.4.9
## [21] yaml_2.3.10 pillar_1.9.0 car_3.1-2 jquerylib_0.1.4
## [25] cachem_1.1.0 abind_1.4-5 km.ci_0.5-6 commonmark_1.9.1
## [29] tidyselect_1.2.1 digest_0.6.37 stringi_1.8.4 labeling_0.4.3
## [33] splines_4.4.1 fastmap_1.2.0 grid_4.4.1 colorspace_2.1-1
## [37] cli_3.6.3 magrittr_2.0.3 utf8_1.2.4 broom_1.0.6
## [41] withr_3.0.1 scales_1.3.0 backports_1.5.0 timechange_0.3.0
## [45] rmarkdown_2.28 ggtext_0.1.2 gridExtra_2.3 ggsignif_0.6.4
## [49] hms_1.1.3 zoo_1.8-12 evaluate_0.24.0 knitr_1.48
## [53] KMsurv_0.1-5 markdown_1.13 survMisc_0.5.6 rlang_1.1.4
## [57] gridtext_0.1.5 Rcpp_1.0.13 xtable_1.8-4 glue_1.7.0
## [61] xml2_1.3.6 rstudioapi_0.16.0 jsonlite_1.8.8 R6_2.5.1