References


The survival analysis reference used was writen by Zabor

  • (Zabor 2023).

Transfer libraries

The survival analysis reference used was Jaeger.

  • (Jaeger 2015).

To add p-values to plots, I used the Kassambara 2017 and 2020 references

  • (Kassambara 2017)


Load libraries


library(maditr)
## 
## To get total summary skip 'by' argument: take_all(mtcars, mean)
## 
## Attaching package: 'maditr'
## The following object is masked from 'package:base':
## 
##     sort_by


1. Upload processed file (from R2 dataset)


Load the GSVA object back to the R environment. This object now contains all the metadata information.

r2_gse62564_GSVA_Metadata <- readRDS( "../results/r2_gse62564_GSVA_Metadata.rds")


  • In neuroblastoma, the main risk classification system classifies patients using features such as age, stage, histology and MYCN amplification status.

  • Prognosis is the association between tumor features and patient survival.

  • MYCN is a gene that encodes a transcription factor that is very important in tumor prognosis.

  • Let’s take a look at the MYCN expression compared by the neuroblastoma risk group.

  • First, let’s see which type of variable is the the MYCN status and the MYCN expression.


Check the class of the mycn_status variable:

class(r2_gse62564_GSVA_Metadata$mycn_status)
## [1] "character"


Check the class of the MYCN variable

class(r2_gse62564_GSVA_Metadata$MYCN)
## [1] "character"


They are both of type “character”. In order to plot, we have to make the MYCN expression to be of type numeric:

r2_gse62564_GSVA_Metadata$MYCN <- as.numeric(r2_gse62564_GSVA_Metadata$MYCN)


Check again the class of the MYCN variable:

class(r2_gse62564_GSVA_Metadata$MYCN)
## [1] "numeric"


Check the age at diagnosis


Make age_at_diagnosis numeric to check age mean

r2_gse62564_GSVA_Metadata$age_at_diagnosis <- as.numeric(r2_gse62564_GSVA_Metadata$age_at_diagnosis)


Check the mean age at diagnosis

age_diagnosis_days_gse62564_498 <- mean(r2_gse62564_GSVA_Metadata$age_at_diagnosis)
age_diagnosis_days_gse62564_498
## [1] 758.5643


2. First Plots of the Neuroblastoma R2 Dataset

MYCN Amplification status and MYCN expression


Let’s try and see the boxplot of the MYCN expression vs. MYCN amplification status.

Using the boxplot function that we tried before, we will have an error.

boxplot(data=r2_gse62564_GSVA_Metadata, r2_gse62564_GSVA_Metadata$mycn_status, r2_gse62564_GSVA_Metadata$MYCN)


Maybe we could try the boxplot function making the mycn_status a numeric variable.

boxplot(data=r2_gse62564_GSVA_Metadata, as.numeric(r2_gse62564_GSVA_Metadata$mycn_status), r2_gse62564_GSVA_Metadata$MYCN)


Let’s try and see if ggplot, which we have also used before, takes the mycn_status variable as categorical as it is right now.

library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=mycn_status, y=MYCN, fill=mycn_status)) + 
    geom_boxplot()


In this figure, we see that group with clinical MYCN amplification, also shows high MYCN expression

What is the difference, in measurement, between MYCN amplification and MYCN expression?

How are the MYCN amplification and MYCN gene expression correlated?

For now, I am not bothering neither with formating of the X and Y labels nor with the N/A values of the MYCN amplification status.


High Risk status and MYCN expression


We can plot the high risk status vs. MYCN amplification and leave the fill argument as mycn_status

library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=MYCN, fill=mycn_status)) + 
    geom_boxplot()


Or we can make it the high_risk status instead

library(ggplot2)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=MYCN, fill=high_risk)) + 
    geom_boxplot()


Here we see that the high_risk status correlates positively with MYCN expression.


3. Survival Analysis


Processing of variables to allow graph plotting


Make the OS (overall survival) status numeric, instead of character. Do the same for the EFS.

Construct os_bin_numeric from os_bin

## OS
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
  dplyr::mutate(os_bin_numeric =
                  ifelse(r2_gse62564_GSVA_Metadata$os_bin == "event", "1", "0"))


Make os_bin_numeric as numeric

r2_gse62564_GSVA_Metadata["os_bin_numeric"] <- as.numeric(r2_gse62564_GSVA_Metadata$os_bin_numeric)

Construct os_bin_numeric from os_bin

## EFS
r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
  dplyr::mutate(efs_bin_numeric =
                  ifelse(r2_gse62564_GSVA_Metadata$efs_bin == "event", "1", "0"))


Make efs_bin_numeric as numeric

r2_gse62564_GSVA_Metadata["efs_bin_numeric"] <- as.numeric(r2_gse62564_GSVA_Metadata$efs_bin_numeric)


Survival Plots


Construct survival curve

library(survival)
fit_os_high_risk <- survfit(Surv(as.numeric(os_day), os_bin_numeric) ~ r2_gse62564_GSVA_Metadata$high_risk,
                   data = r2_gse62564_GSVA_Metadata)


Plot survival curve w/o percents

library(survminer)
ggsurvplot(fit_os_high_risk, 
           data = r2_gse62564_GSVA_Metadata, 
           risk.table = TRUE,
           # risk.table = "abs_pct",
           pval = TRUE, 
           pval.coord = c(2000, 1),
           )


Plot survival curve w/ percents

ggsurvplot(fit_os_high_risk, 
           data = r2_gse62564_GSVA_Metadata, 
           # risk.table = TRUE,
           risk.table = "abs_pct",
           pval = TRUE, 
           pval.coord = c(2000, 1),
           )

  • Have students generate the EFS plot


4. Hypoxia in the Tumor Microenvironment


  • Let’s now establish a relationship between neuroblastoma high risk status and hypoxia

  • Hypoxia is the condition of low oxygen concentration that is established in the tumor microenvironment as cancer progresses.


4.1. HIF1A


  • HIF1A is a transcription factor sensitive to oxygen levels;

  • HIF1S is a single gene

  • We need to make the HIF1A column (variable), numeric:


r2_gse62564_GSVA_Metadata$HIF1A <- as.numeric(r2_gse62564_GSVA_Metadata$HIF1A)


Compare HIF1A means between the high risk groups using the compare_means() function

compare_means(data = r2_gse62564_GSVA_Metadata, HIF1A ~ high_risk, method = "t.test")


Plot high risk vs. HIF1A

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HIF1A, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test", label.x = 1.4)+
  theme_linedraw()


4.2. HALLMARK_HYPOXIA


  • This is one of the gene sets we talked about before;

  • HALLMARK_HYPOXIA represent a group of genes that define response to hypoxia

  • In the dataframe, we also must make HALLMARK_HYPOXIA as numeric:


r2_gse62564_GSVA_Metadata$HALLMARK_HYPOXIA <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_HYPOXIA)


To then plot:

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill=high_risk)) + 
    geom_boxplot()


  • This plot suggests that there are components in the HALLMARK_HYPOXIA phenotype responsible for increased hypoxia phenotype in patients of the non-high risk group.

  • Surprisingly, these components do not correlate with the HIF1A activation program in the high-risk group

  • This finding is the type of information worth reporting in a publication

  • Because this concept will appear in a publication, I want to run statistical analysis and well as want to make the figure look fancy.


Statistical Analysis


In the next chunk, we construct a data-frame with statistics calculations for our plot.

The reference used here was the Kassambara 2023 article:

  • (Kassambara 2023)
library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
# Statistical test
stat.test <- r2_gse62564_GSVA_Metadata %>%
  t_test(HALLMARK_HYPOXIA ~ high_risk) %>%
  add_significance()


We can then look at the statistics dataframe

stat.test


In the next chunk, we create the object that will show the statistical results. After, we use the stat_pvalue_manual function.

We also change the theme of the ggplot2 figure plot.

  • Make high risk factor:
library(ggpubr)

r2_gse62564_GSVA_Metadata$high_risk <- as.factor(r2_gse62564_GSVA_Metadata$high_risk)


Plot high risk vs. HALLMARK_HYPOXIA

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA)) + 
    geom_boxplot()+
  stat_pvalue_manual(stat.test, label = "p", y.position = 0.6) + # y.position is mandatory
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.1)))


Make HR factor again

r2_gse62564_GSVA_Metadata$high_risk <- as.factor(r2_gse62564_GSVA_Metadata$high_risk)


Plot high_risk vs. HALLMARK_HYPOXIA again but with statistics:

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill="high_risk")) + 
    geom_boxplot()+
  stat_pvalue_manual(stat.test, label = "p", y.position = 0.6) + # y.position is mandatory
  scale_y_continuous(expand = expansion(mult = c(0.05, 0.1)))


Use compare_means function to construct t-test dataframe:

compare_means(data = r2_gse62564_GSVA_Metadata, HALLMARK_HYPOXIA ~ high_risk, method = "t.test")


Plot HR vs. HALLMARK_HYPOXIA with different colors per HR group and different theme:

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_HYPOXIA, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test")+
  theme_linedraw()


Conclusions of the HIF1A and Hallmark Hypoxia analysis


  • The transcriptional program activated by HIF1A is different than the Hallmark Hypoxia program
    • HIF1A expression is higher in high-risk group
    • Hallmark Hypoxia score is higher in non-high risk group


  • Hallmark Hypoxia program does not seem to be lead by HIF1A activation
    • The Hallmark Hypoxia gene set does not have HIF1A
    • Depending on how this gene-set was generated, it used genes that correlate positively or negatively with HIF1A
    • Hypoxia causes increase in HIF1A and
      • improved survival through up-regulated genes
      • worse survival through down-regulated genes


  • Therefore, we should ask if hypoxia activation is also a HIF1A-activated program


4.3. Hypoxia Down-Regulation


No statistical analysis


r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Down_635.txt, fill=high_risk)) + 
    geom_boxplot()


Statistical analysis


compare_means(data = r2_gse62564_GSVA_Metadata, ADRN_Norm_vs_Hypo_Down_635.txt ~ high_risk, method = "t.test")
r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Down_635.txt)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Down_635.txt, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test")+
  theme_linedraw()


4.4. Hypoxia Up-Regulation


Statistical analysis


Make ADRN_Norm_vs_Hypo_Up_554.txt numeric to avoid error

r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Up_554.txt <- as.numeric(r2_gse62564_GSVA_Metadata$ADRN_Norm_vs_Hypo_Up_554.txt)


Run the t-test

compare_means(data = r2_gse62564_GSVA_Metadata, ADRN_Norm_vs_Hypo_Up_554.txt ~ high_risk, method = "t.test")


Plot

ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=ADRN_Norm_vs_Hypo_Up_554.txt, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test")+
  theme_linedraw()


Statistical analysis


The compare_means function can not deal with spaces in the T cell exhaustion variable name.

We will have to use a trick to deal with the spaces in “T cell exhaustion”.

We will use the package stringr to insert dots (.) replacing the spaces in the r2_gse62564_GSVA_Metadata dataframe.

To start, we have to create the r2_gse62564_GSVA_Metadata_Spaces dataframe, which will have structure names that contained spaces, changed to one of these new characteres: “.” , “,”, or ““.

r2_gse62564_GSVA_Metadata_Spaces <- r2_gse62564_GSVA_Metadata
library(stringr)
names(r2_gse62564_GSVA_Metadata_Spaces)<-str_replace_all(names(r2_gse62564_GSVA_Metadata_Spaces), c(" " = "." , "," = "" ))


4.5. HALLMARK_INFLAMMATORY_RESPONSE


No statistical analysis


r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_INFLAMMATORY_RESPONSE, fill=high_risk)) + 
    geom_boxplot()


Statistical analysis


compare_means(data = r2_gse62564_GSVA_Metadata, HALLMARK_INFLAMMATORY_RESPONSE ~ high_risk, method = "t.test")
r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE <- as.numeric(r2_gse62564_GSVA_Metadata$HALLMARK_INFLAMMATORY_RESPONSE)
ggplot(r2_gse62564_GSVA_Metadata, aes(x=high_risk, y=HALLMARK_INFLAMMATORY_RESPONSE, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test")+
  theme_linedraw()


5. Survival analysis of the hypoxia gene sets


5.1. HIF1A


OS (Overall Survival)


Construct the HIF1A quantile variable:

library(tidyverse)

r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
    mutate(quantile = ntile(HIF1A, 2))


Make the HIF1A quantile variable a factor

r2_gse62564_GSVA_Metadata$quantile <- as.factor(r2_gse62564_GSVA_Metadata$quantile )


Scaterplot HALLMARK_HYPOXIA vs. HIF1A

qplot(HALLMARK_HYPOXIA, HIF1A, 
      data = r2_gse62564_GSVA_Metadata,
      colour=quantile,
      ylab = "HIF1A",
      xlab = "Hallmark Hypoxia")


Construct survival object with HIF1A quantile variable

library(survival)
fit_os_HIF1A <- survfit(Surv(as.numeric(os_day), os_bin_numeric) ~ r2_gse62564_GSVA_Metadata$quantile,
                   data = r2_gse62564_GSVA_Metadata)


Plot survival curve OS

library(survminer)
ggsurvplot(fit_os_HIF1A, 
           data = r2_gse62564_GSVA_Metadata, 
           risk.table = TRUE,
           # risk.table = "abs_pct",
           pval = TRUE, 
           pval.coord = c(4000, 1),
           )


EFS (Event Free Survival)


Construct the HIF1A quantile variable:

library(tidyverse)

r2_gse62564_GSVA_Metadata <- r2_gse62564_GSVA_Metadata %>%
    mutate(quantile = ntile(HIF1A, 2))


Construct survival object EFS

library(survival)
fit_efs_HIF1A <- survfit(Surv(as.numeric(efs_day), efs_bin_numeric) ~ r2_gse62564_GSVA_Metadata$quantile,
                   data = r2_gse62564_GSVA_Metadata)


Plot survival curve EFS

library(survminer)
ggsurvplot(fit_efs_HIF1A, 
           data = r2_gse62564_GSVA_Metadata, 
           risk.table = TRUE,
           # risk.table = "abs_pct",
           pval = TRUE, 
           pval.coord = c(4000, 1),
           )


Create variable r2_gse62564_GSVA_Metadata_Spaces_survival


r2_gse62564_GSVA_Metadata_Spaces_survival <- r2_gse62564_GSVA_Metadata_Spaces 


6. HIF1A relationship to age at diagnosis


6.1. HIF1A, age at diagnosis and HR


Make age_at_diagnosis numeric to plot the quantiles

r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis <- as.numeric(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)


Check maximum age_at_diagnosis

max(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)
## [1] 8983


Check minimum age_at_diagnosis

min(r2_gse62564_GSVA_Metadata_Spaces_survival$age_at_diagnosis)
## [1] 0


Construct age_qtls quantiles

r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival %>%
    mutate(age_qtls = ntile(age_at_diagnosis, 3))


Make age_qtls factor

r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)


Plot age_qtls vs. HIF1A

ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=high_risk)) + 
    geom_boxplot()+
  stat_compare_means(method = "t.test")+
  theme_linedraw()


6.2. HIF1A, age at diagnosis and MYCN status


Divide the age_at_diagnosis variable into several quantiles:

r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival %>%
    mutate(age_qtls = ntile(age_at_diagnosis, 3))


Make age_qtls a factor

r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)


Plot age_qtls vs HIF1A Gene Expression

ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=mycn_status)) + 
    geom_boxplot()+
  stat_compare_means(method = "anova")+
  theme_linedraw()


Remove the NAs from the mycn_status variable

r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival_age[which(
  r2_gse62564_GSVA_Metadata_Spaces_survival_age$mycn_status != "na"  ), ]


Construct age_qtls variable; the number of quantiles can change between 3-4 or more.

r2_gse62564_GSVA_Metadata_Spaces_survival_age <- r2_gse62564_GSVA_Metadata_Spaces_survival_age %>%
    mutate(age_qtls = ntile(age_at_diagnosis, 3))


Plot age_qtls vs HIF1A Gene Expression

r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls <- as.factor(r2_gse62564_GSVA_Metadata_Spaces_survival_age$age_qtls)
ggplot(r2_gse62564_GSVA_Metadata_Spaces_survival_age, aes(x=age_qtls, y=HIF1A, fill=mycn_status)) + 
    geom_boxplot()+
  stat_compare_means(method = "anova")+
  theme_linedraw()


Note that the difference in HIF1A expression decreases as the age increases


How does this affect HIF1A activity in younger children?

References


Jaeger, T. Florian. 2015. Transfer Installed Libraries to a Different Version of r. https://hlplab.wordpress.com/2012/06/01/transferring-installed-packages-between-different-installations-of-r/.
Kassambara, Alboukadel. 2017. Add p-Values and Significance Levels to Ggplots. http://www.sthda.com/english/articles/24-ggpubr-publication-ready-plots/76-add-p-values-and-significance-levels-to-ggplots/.
———. 2023. How to Add p-Values onto Basic GGPLOTS. https://www.datanovia.com/en/blog/how-to-add-p-values-onto-basic-ggplots/.
Zabor, Emily C. 2023. Survival Analysis in r. https://www.emilyzabor.com/tutorials/survival_analysis_in_r_tutorial.html.


Session Info


sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] lubridate_1.9.3 forcats_1.0.0   dplyr_1.1.4     purrr_1.0.2    
##  [5] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    tidyverse_2.0.0
##  [9] stringr_1.5.1   rstatix_0.7.2   survminer_0.4.9 ggpubr_0.6.0   
## [13] survival_3.7-0  ggplot2_3.5.1   maditr_0.8.4   
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      xfun_0.47         bslib_0.8.0       lattice_0.22-6   
##  [5] tzdb_0.4.0        vctrs_0.6.5       tools_4.4.1       generics_0.1.3   
##  [9] fansi_1.0.6       highr_0.11        pkgconfig_2.0.3   Matrix_1.7-0     
## [13] data.table_1.15.4 lifecycle_1.0.4   compiler_4.4.1    farver_2.1.2     
## [17] munsell_0.5.1     carData_3.0-5     htmltools_0.5.8.1 sass_0.4.9       
## [21] yaml_2.3.10       pillar_1.9.0      car_3.1-2         jquerylib_0.1.4  
## [25] cachem_1.1.0      abind_1.4-5       km.ci_0.5-6       commonmark_1.9.1 
## [29] tidyselect_1.2.1  digest_0.6.37     stringi_1.8.4     labeling_0.4.3   
## [33] splines_4.4.1     fastmap_1.2.0     grid_4.4.1        colorspace_2.1-1 
## [37] cli_3.6.3         magrittr_2.0.3    utf8_1.2.4        broom_1.0.6      
## [41] withr_3.0.1       scales_1.3.0      backports_1.5.0   timechange_0.3.0 
## [45] rmarkdown_2.28    ggtext_0.1.2      gridExtra_2.3     ggsignif_0.6.4   
## [49] hms_1.1.3         zoo_1.8-12        evaluate_0.24.0   knitr_1.48       
## [53] KMsurv_0.1-5      markdown_1.13     survMisc_0.5.6    rlang_1.1.4      
## [57] gridtext_0.1.5    Rcpp_1.0.13       xtable_1.8-4      glue_1.7.0       
## [61] xml2_1.3.6        rstudioapi_0.16.0 jsonlite_1.8.8    R6_2.5.1