Summary


In this activity, the goal is to build a model that classifies a neuroblastoma patient as having high-risk (HR) or low-risk (LR) disease.

The prediction is made by a machine learning algorithm based on the gene expression pattern of tumor tissue isolated from the patient.

The hypothesis is that such a model can estimate the risk category a patient belongs to with high accuracy using the 22 predictor variables.

The steps in the Chugh (2023) tutorial were followed; other tutorials consulted were An (2024) and Lu (2023).


1. Introduction


1.1. Supervised and Unsupervised Methods


In this part, we distinguish supervised from unsupervised machine learning methods. Supervised methods use the label or category attached to each data point; unsupervised methods aim to find groupings within the dataset without using labels.

  • Detection of a sub-population of cancer cells that drives aggressiveness.

  • Isolation and study of neuroblastoma cells that drive aggressive phenotypes.

    • MES cells are rare and aggressive.
    • ADRN cells are common and more treatable than MES cells.
  • Hierarchical Clustering is one of the ML methods that has been used to identify the gene expression patterns of both ADRN and MES cells.


Classification and Clustering. Classification is a supervised machine learning method while clustering is an unsupervised machine learning method.


  • We will study two unsupervised ML methods: Hierarchical Clustering and Principal Component Analysis (PCA).


Clustering


  • A technique of data segmentation that partitions data into several groups based on similarity.
  • Clustering uses unsupervised learning techniques to organize unlabeled data points into homogeneous groups: points within a “cluster” are similar to one another, while points in different clusters are dissimilar.
  • Since the data are not pre-labeled, there is no training process.
  • Instead, clusters are created using similarity functions that measure the distance between points (see the sketch after this list).
  • Relative to classification, clustering is considered a less complex approach and is well suited for large datasets.
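As a minimal sketch of distance-based clustering (illustrative only, not part of the original analysis; it uses the built-in USArrests dataset), hierarchical clustering in base R needs nothing more than a distance matrix:

scaled <- scale(USArrests)                # standardize so no variable dominates the distance
d      <- dist(scaled)                    # Euclidean distances between all pairs of points
hc     <- hclust(d, method = "complete")  # agglomerative hierarchical clustering
plot(hc)                                  # dendrogram; cut it to obtain discrete clusters
groups <- cutree(hc, k = 4)               # assign each observation to one of 4 clusters
table(groups)                             # cluster sizes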


Algorithms for Clustering


  1. K-means
  2. K-medoids
  3. Density-based
  4. Hierarchical

Common applications:

  • Recommender systems
  • Customer segmentation

1.2. Unsupervised Learning (Hierarchical Clustering) in Neuroblastoma Research


  • Clustering can be used to separate the genetic profiles of cancer cells that drive progression of the disease from those associated with its defeat.

  • ADRN neuroblastoma cells are differentiated, neuron-like cells that are killed by chemotherapy.

  • MES neuroblastoma cells are non-differentiated cells thought to behave like stem cells.

  • MES cells are resistant to chemotherapy.


Heatmap using RNA-Seq and Protein Expression Blots


Clustering of cell differentiation markers in neuroblastoma. Figure from van Groningen et al. (2017).


Heatmap using epigenetic profiles


Super-enhancers as spatial chromatin regulators


Super-enhancers are regions of chromatin that are kept open to facilitate expression of the proteins that maintain the differentiation state of neuroblastoma cells. Super-enhancers are discussed by Pott and Lieb (2015).

Mediator of RNA polymerase II transcription subunit 1 (MED1).


Super-enhancers as spatial chromatin regulators. Figure from Jia et al. (2020).


Algorithmic Steps for Identification of Super-enhancers
Super-enhancers. Figure from Pott and Lieb (2015).


Super-enhancers Sequence and Location on Heatmap
Cell differentiation state is maintained by organization of super-enhancers in neuroblastoma. Figure from van Groningen et al. (2017).


1.3. Foundations of the Logistic Regression model


Maximum Likelihood Estimation (MLE)


  • Articles about machine learning classification concepts include An (2024) and Nguyen (2024).

  • Maximum Likelihood Estimation (MLE) is a method to estimate the parameters of an assumed probability distribution, given some observed data (Wikipedia 2024).

  • This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data are most probable (a toy example follows this list).
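As a toy numeric illustration (not from the original text), we can recover the success probability of simulated coin flips by maximizing the Bernoulli log-likelihood with base R's optimize():

set.seed(1)
flips <- rbinom(100, size = 1, prob = 0.7)    # simulated data with true p = 0.7

# Log-likelihood of p given the observed flips
loglik <- function(p) sum(dbinom(flips, size = 1, prob = p, log = TRUE))

# The MLE is the p that makes the observed data most probable
mle <- optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)
mle$maximum                                   # close to the analytic answer, mean(flips)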


Probability function.
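The figure shows the logistic probability function. As a reconstruction in standard notation (assuming predictors \(x_1, \dots, x_k\) with coefficients \(\beta_1, \dots, \beta_k\)):

\[ P(high \ risk \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}} \]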


Features that contribute to probability function


  • Each feature (column or variable) is a phenotype expression given by a gene expression value or a phenotype score

  • Each variable's coefficient reflects the importance of that variable in calculating the probability

Features that contribute to probability function.
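In the notation above, the features and their coefficients enter through the linear predictor inside the exponent (a reconstruction, using the 22 predictors of this dataset):

\[ z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_{22} x_{22} \]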


2. Data Upload and Processing


Load libraries


library(readr)       # read rectangular data files
library(tidymodels)  # rsample, parsnip, yardstick, broom, and friends
library(stringr)     # string manipulation helpers
library(shiny)       # interactive web applications
library(dplyr)       # data manipulation verbs


Load dataframe


  • This dataframe was constructed by selecting the hypoxia and inflammatory phenotype variables from previous dataframes.
kocak_df <- read.csv("../data/r2_gse62564_GSVA_inss_mycn_status_selected.csv")


Remove problematic columns

kocak_df <- kocak_df %>% dplyr::select(-c(sample.id))


Make Factors

kocak_df$high_risk   <- as.factor(kocak_df$high_risk)
kocak_df$inss_stage  <- as.factor(kocak_df$inss_stage)
kocak_df$mycn_status <- as.factor(kocak_df$mycn_status)


Make numeric

kocak_df$MYCN <- as.numeric(kocak_df$MYCN)


Save the dataframe under a new name to preserve the clinical columns

kocak_df_clinical_maintained_df <- kocak_df


Remove inss_stage and mycn_status information to avoid influencing the model fitting

kocak_df <- kocak_df %>% dplyr::select(-c(inss_stage, mycn_status))


3. Create Model: Look at the Train and Test dataframes


HYPOTHESIS:

By creating a model, we will estimate the risk category a patient belongs to with high accuracy using the 22 predictor variables.


Split data


  • When you generate “random numbers” in R, you are actually generating pseudorandom numbers.

  • These numbers are generated with an algorithm that requires a seed to initialize.

  • Being pseudorandom instead of truly random means that, if you know the seed and the generator, you can predict (and reproduce) the output (RCoder 2024).


set.seed(421)
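To see what the seed buys us, here is an illustrative check (not in the original): re-setting the same seed reproduces the same draws.

set.seed(421)
runif(3)       # three pseudorandom numbers
set.seed(421)
runif(3)       # the same three numbers again, because the seed was reset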


  • Split data into train set and test set

  • The prop argument indicates the proportion of patients used to construct the train set (80% of the Kocak dataset)

  • Note that here we are deciding randomly which samples go into the train set and which go into the test set; the strata = high_risk argument keeps the proportion of high-risk patients similar in both sets

split <- initial_split(kocak_df, prop = 0.8, strata = high_risk)
train <- split %>% 
  training()
test <- split %>% 
  testing()

One exercise here is to check the class and structure of the split object.
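A quick way to inspect it (illustrative; the exact class names come from the rsample package):

class(split)               # expect rsample classes such as "initial_split" and "rsplit"
str(split, max.level = 1)  # the underlying data plus the row indices of each partition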


  • Now we have two newly created datasets that are also dataframes: train and test
class(train)
## [1] "data.frame"
dim(train)
## [1] 397  23
class(test)
## [1] "data.frame"
dim(test)
## [1] 101  23


The chunk that builds the new dataframe is hidden


  • Note that the train set receives both high-risk and non-high-risk patients

  • Therefore the model is trained using the high-risk and non-high-risk labels of the dataframe:

train


Fitting and Evaluating the Model


Before training the model, glmnet needs to be installed, according to Hastie et al. (2023), if you do not have it on your system:

install.packages("glmnet")


  • The next command allows us to train a logistic regression model.

  • Information about the logistic_reg() function can be found in the Kuhn and Vaughan (2024a) reference.

  • Information about the mixture and penalty parameters can be found in the Kuhn and Vaughan (2024b) reference.

  • Note that we are using the line set_mode("classification")

model <- logistic_reg(mixture = double(1), penalty = double(1)) %>%  # double(1) is a numeric zero: penalty = 0 (no regularization), mixture = 0 (ridge end of the elastic net)
  set_engine("glmnet") %>%          # fit via the glmnet package
  set_mode("classification") %>%    # predict a class label rather than a number
  fit(high_risk ~ ., data = train)  # high_risk as outcome, all other columns as predictors


Model summary


Let’s ask R what the model looks like:

model


More information about how parsnip generates the logistic regression model can be found in Chapter 6 ("Fitting Models with parsnip") of Tidy Modeling with R (Kuhn and Silge 2023).

str(model)


  • The tidy() function comes from the broom package, as described by Robinson (2024) in the package documentation.

  • That reference in turn cites the “Tidy Data” paper of Wickham (2014).

tidy(model)
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8

The output shows the estimate column, which contains the coefficients of the predictors. Each estimate therefore measures how much its predictor contributes to the high-risk label: the larger the absolute value, the more the predictor contributes (positive coefficients push toward high risk, negative ones away from it).


4. Making predictions


Class


Predict the risk class or label

pred_class <- predict(model,
                      new_data = test,
                      type = "class")

pred_class


This chunk will be hidden for now: Class Predictions for New Individuals


Class Probability


Calculate the risk class probability

pred_proba <- predict(model,
                      new_data = test,
                      type = "prob")
pred_proba


Evaluate the Model


Evaluate the model using the accuracy() function


The Accuracy Formula


\[ \text{Accuracy} = \frac{\text{Correct Classifications}}{\text{Total Classifications}} \]


Reference for the Accuracy formula:

https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall

Reference for math symbols:

https://datalabzone.com/posts/blogsforr/2023/02/07/math.html


Build the results dataframe by

  • selecting the high_risk column in the test dataframe (note that these are the real labels) and

  • binding the pred_class and pred_proba

results <- test %>%
  dplyr::select(high_risk) %>%       # Step 1: Select the high_risk column in the *test* dataframe
  bind_cols(pred_class, pred_proba)  # Step 2: Bind the *pred_class* and *pred_proba* with the real labels

With these steps, the results object will hold the real labels together with the predicted class and the predicted class probabilities.


Print the results to see the predicted labels and the actual labels

results


Calculate the accuracy of the model using the accuracy function

accuracy(results, truth = high_risk, estimate = .pred_class)


5. Variable Importance


This section can be skipped because these results are not entirely as expected.


Let’s understand the variables impacting the risk classification. In a logistic regression scenario, the coefficients determine how sensitive the target variable is to the individual predictors: the larger the absolute value of a coefficient, the more important the predictor. Sort the variables in descending order of the absolute value of their coefficients and keep only those above a chosen threshold (the two chunks below use thresholds of 0.002 and 0.1).

coeff_model <- tidy(model) %>% 
  arrange(desc(abs(estimate))) %>%  # largest absolute coefficients first
  filter(abs(estimate) > 0.002)     # keep coefficients with |estimate| > 0.002


coeff_model <- tidy(model) %>% 
  arrange(desc(abs(estimate))) %>% 
  filter(estimate > 0.1)            # stricter filter: only positive coefficients > 0.1 (overwrites the previous result)


Plot the feature importance using the ggplot() function.

ggplot(coeff_model, aes(x = term, y = estimate, fill = term)) + geom_col() + coord_flip()


Hyperparameter Tuning
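This section is left as an outline. Below is a minimal sketch of how the penalty and mixture hyperparameters could be tuned with the tune and dials packages (the settings are illustrative assumptions, not part of the original analysis):

tune_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%  # mark both hyperparameters for tuning
  set_engine("glmnet") %>%
  set_mode("classification")

folds <- vfold_cv(train, v = 5, strata = high_risk)      # 5-fold cross-validation on the train set
grid  <- grid_regular(penalty(), mixture(), levels = 5)  # 5 x 5 grid of candidate values

tuned <- workflow() %>%
  add_model(tune_spec) %>%
  add_formula(high_risk ~ .) %>%
  tune_grid(resamples = folds, grid = grid)

best <- select_best(tuned, metric = "accuracy")          # best combination by accuracy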



More Evaluation Metrics


Using the best hyperparameters:

Let’s train a logistic regression model, use it to generate predictions on the test set, and create a confusion matrix using the true values and the estimates.
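Assuming a results dataframe like the one built above (true labels in high_risk, predictions in .pred_class), the confusion matrix comes from yardstick:

conf_mat(results, truth = high_risk, estimate = .pred_class)  # rows are predictions, columns are the truth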


You can calculate the precision (positive predictive value, the number of true positives divided by the number of predicted positives) with the precision() function.


Similarly, you can calculate the recall (sensitivity, the number of true positives divided by the number of actual positives) with the recall() function.
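Both functions live in yardstick and take the same arguments as accuracy(); for example:

precision(results, truth = high_risk, estimate = .pred_class)  # true positives / predicted positives
recall(results, truth = high_risk, estimate = .pred_class)     # true positives / actual positives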




References


An, Jerry. 2024. How to Remember All These Classification Concepts Forever. (Date accessed: 08/30/2024). https://medium.com/swlh/how-to-remember-all-these-classification-concepts-forever-761c065be33.
Chugh, Vidhi. 2023. Logistic Regression in R Tutorial. https://www.datacamp.com/tutorial/logistic-regression-R.
Hastie, Trevor, Kenneth Tay, and Junyang Qian. 2023. An Introduction to glmnet. https://glmnet.stanford.edu/articles/glmnet.html.
Jia, Qunying, Yuan Tan, and Shuhua Chen. 2020. Oncogenic Super-Enhancer Formation in Tumorigenesis and Its Molecular Mechanisms.
Kuhn, Max, and Julia Silge. 2023. Tidy Modeling with R. (Date accessed: 08/30/2024). https://www.tmwr.org/models.
Kuhn, Max, and Davis Vaughan. 2024a. Logistic Regression. https://parsnip.tidymodels.org/reference/logistic_reg.html.
———. 2024b. Logistic Regression via glmnet. https://parsnip.tidymodels.org/reference/details_logistic_reg_glmnet.html.
Lu, Min. 2023. Fast Unified Random Forests with randomForestSRC. https://www.randomforestsrc.org/articles/getstarted.html.
Nguyen, Alexander. 2024. https://levelup.gitconnected.com/the-resume-that-got-a-software-engineer-a-300-000-job-at-google-8c5a1ecff40f.
Pott, Sebastian, and Jason D. Lieb. 2015. What Are Super-Enhancers? Nature Genetics. Vol. 47.
RCoder. 2024. Set Seed in R. (Date accessed: 08/30/2024). https://r-coder.com/set-seed-r/.
Robinson, David. 2024. Broom: Let’s Tidy up a Bit. The Comprehensive R Archive Network. (Date accessed: 08/30/2024). https://cran.r-project.org/web/packages/broom/vignettes/broom.html.
van Groningen, Tim, et al. 2017. Neuroblastoma Is Composed of Two Super-Enhancer-Associated Differentiation States. Nature Genetics. Vol. 49.
Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software. Vol. 59.
Wikipedia. 2024. Maximum Likelihood Estimation. (Date accessed: 08/29/2024). https://en.wikipedia.org/wiki/Maximum_likelihood_estimation.


Session Information


sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] glmnet_4.1-8       Matrix_1.7-0       shiny_1.9.1        stringr_1.5.1     
##  [5] yardstick_1.3.1    workflowsets_1.1.0 workflows_1.1.4    tune_1.2.1        
##  [9] tidyr_1.3.1        tibble_3.2.1       rsample_1.2.1      recipes_1.1.0     
## [13] purrr_1.0.2        parsnip_1.2.1      modeldata_1.4.0    infer_1.0.7       
## [17] ggplot2_3.5.1      dplyr_1.1.4        dials_1.3.0        scales_1.3.0      
## [21] broom_1.0.6        tidymodels_1.2.0   readr_2.1.5       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    timeDate_4032.109   farver_2.1.2       
##  [4] fastmap_1.2.0       promises_1.3.0      digest_0.6.37      
##  [7] rpart_4.1.23        mime_0.12           timechange_0.3.0   
## [10] lifecycle_1.0.4     survival_3.7-0      magrittr_2.0.3     
## [13] compiler_4.4.1      rlang_1.1.4         sass_0.4.9         
## [16] tools_4.4.1         utf8_1.2.4          yaml_2.3.10        
## [19] data.table_1.15.4   knitr_1.48          labeling_0.4.3     
## [22] DiceDesign_1.10     withr_3.0.1         nnet_7.3-19        
## [25] grid_4.4.1          fansi_1.0.6         xtable_1.8-4       
## [28] colorspace_2.1-1    future_1.34.0       globals_0.16.3     
## [31] iterators_1.0.14    MASS_7.3-61         cli_3.6.3          
## [34] rmarkdown_2.28      generics_0.1.3      rstudioapi_0.16.0  
## [37] future.apply_1.11.2 tzdb_0.4.0          cachem_1.1.0       
## [40] splines_4.4.1       parallel_4.4.1      vctrs_0.6.5        
## [43] hardhat_1.4.0       jsonlite_1.8.8      hms_1.1.3          
## [46] listenv_0.9.1       foreach_1.5.2       gower_1.0.1        
## [49] jquerylib_0.1.4     glue_1.7.0          parallelly_1.38.0  
## [52] codetools_0.2-20    shape_1.4.6.1       stringi_1.8.4      
## [55] lubridate_1.9.3     gtable_0.3.5        later_1.3.2        
## [58] munsell_0.5.1       GPfit_1.0-8         pillar_1.9.0       
## [61] furrr_0.3.1         htmltools_0.5.8.1   ipred_0.9-15       
## [64] lava_1.8.0          R6_2.5.1            lhs_1.2.0          
## [67] evaluate_0.24.0     lattice_0.22-6      highr_0.11         
## [70] backports_1.5.0     httpuv_1.6.15       bslib_0.8.0        
## [73] class_7.3-22        Rcpp_1.0.13         prodlim_2024.06.25 
## [76] xfun_0.47           pkgconfig_2.0.3