In this activity the goal is to generate a model that will be able to classify a neuroblastoma patient as having high risk (HR) or low risk (LR) disease.
This prediction is based on the gene expression pattern of tumor isolated from the patient using a Machine Learning algorithm.
The hypothesis is that by creating a model, we will estimate the risk category a patient belongs to with elevated accuracy using the 22 predictor variables.
The steps in the (Chugh 2023) tutorial were followed. Other tutorials used were (An 2024) and (Lu 2023).
In this part, we need to distinguish supervised and unsupervised machine learning methods. Supervised machine learning methods use the information of the label or the category of a pattern of data. Unsupervised machine learning methods aim to find a pattern of groups inside the dataset.
Detection of a sub-population of cancer cells that drive agressiveness.
Isolation and study of neuroblastoma cells that drive phenotypes of agressiveness.
Hierarchical Clustering is one of the ML methods that have been used to identify the pattern of gene expression of both ADRN and MES cells.
Clustering can be used to separate the genetic profiles of cancer cells that drive progression or defeat of the disease.
ADRN neuroblastoma cells are differentiated cells similar to neurons that are killed by chemiotherapy.
MES neuroblastoma cells are non-differentiated cells thought to behave like stem cells.
MES cells are resitent to chemotherapy.
Super-enhancers are regions of chromatin that are maintained open to facilitate expression of proteins that keep the differentiation state of the neuroblastoma cells. Superenhancers are discussed by Sebastian Pott (2015).
Mediator of RNA polymerase II transcription subunit 1 (MED1).
Articles about Machine Learning classification concepts are in An (2024) and Nguyen (2024)
Maximum Likelihood Estimation (MLE) is a method to estimate the parameters of an assumed probability distribution, given some observed data according to Wikipedia (2024).
This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.
Each feature (column or variable) is a phenotype expression given by a gene expression or a phenotype score
Each coeficient of the variable predicts the importance of the variable in calculating the probability
library(readr)
library(tidymodels)
library(stringr)
library(shiny)
library(dplyr)
kocak_df <- read.csv("../data/r2_gse62564_GSVA_inss_mycn_status_selected.csv")
Remove problematic columns
kocak_df <- kocak_df %>% dplyr::select(-c(sample.id))
Make Factors
kocak_df$high_risk <- as.factor(kocak_df$high_risk)
kocak_df$inss_stage <- as.factor(kocak_df$inss_stage)
kocak_df$mycn_status <- as.factor(kocak_df$mycn_status)
Make numeric
kocak_df$MYCN <- as.numeric(kocak_df$MYCN)
Change variable name
kocak_df_clinical_maintained_df <- kocak_df
Remove inss_stage and mycn_status information to avoid influencing the model fitting
kocak_df <- kocak_df %>% dplyr::select(-c(inss_stage, mycn_status))
HIPOTHESIS:
By creating a model, we will estimate the risk category a patient belongs to with elevated accuracy using the 22 predictor variables.
When you generate “random numbers” in R, you are actually generating pseudorandom numbers.
These numbers are generated with an algorithm that requires a seed to initialize.
Being pseudorandom instead of pure random means that, if you know the seed and the generator, you can predict (and reproduce) the output RCoder (2024).
set.seed(421)
Split data into train set and test set
The prop argument indicates the proportion of patients taken to construct the train set (80% of the Kocak dataset)
Note that here we are deciding randomly which samples go into the train set and which samples go into the test set
split <- initial_split(kocak_df, prop = 0.8, strata = high_risk)
train <- split %>%
training()
test <- split %>%
testing()
One exercise here is to check what is the class and structure of the split object.
class(train)
## [1] "data.frame"
dim(train)
## [1] 397 23
class(test)
## [1] "data.frame"
dim(test)
## [1] 101 23
Chunk build new dataframe is hidden
Note that the train set receives high-risk and non-high risk patients
Therefore the model is trained using the high-risk and non-high risk labels of the dataframe:
train
Before training the model, glmnet needs to be installed, according to Trevor Hastie (2023), if you do not have it in your system:
install.packages("glmnet")
The next command allows us to train a logistic regression model.
Information about function logistic_reg() can be found in the Max Kuhn (2024a) reference.
Information about the mixture and penalty parameters can be found in the Max Kuhn (2024b) reference.
Note that we are using the line set_mode(“classification”)
model <- logistic_reg(mixture = double(1), penalty = double(1)) %>%
set_engine("glmnet") %>%
set_mode("classification") %>%
fit(high_risk ~ ., data = train)
Let’s ask R what the model looks like:
model
More information about how parsnip generates the logistic regression model can be found in Chapter 6 of Tidy Modeling with parsnip SILGE (2023).
str(model)
According to Robinson (2024) in the documentation for the Broom package the tidy function is from the Broom package.
In that reference there is a reference to the “Tidy Data” paper of Wickham (2014).
tidy(model)
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
The output is shown with the estimate column representing coefficients of the predictor. The estimate column is therefore an estimation of how much the predictor contributes to the high risk label. The higher the value, the more the predictor contributes to the label.
Predict the risk class or label
pred_class <- predict(model,
new_data = test,
type = "class")
pred_class
This chunk will be hidden for now: Class Predictions for New Individuals
Calculate the risk class probability
pred_proba <- predict(model,
new_data = test,
type = "prob")
pred_proba
Evaluate the model using the accuracy() function
\[ Accuracy = \frac{Correct \ Classifications}{Total \ Classifications} \]
Reference for the Accuracy formula:
https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall
Reference for math symbols:
https://datalabzone.com/posts/blogsforr/2023/02/07/math.html
Build data-frame results by
selecting the high_risk column in the test dataframe (note that these are the real labels) and
binding the pred_class and pred_proba
results <- test %>%
dplyr::select(high_risk) %>% # Step 1: Select the high_risk column in the *test* dataframe
bind_cols(pred_class, pred_proba) # Step 2: Bind the *pred_class* and *pred_proba* with the real labels
With these steps in the construction of the object, results will have the real labels and the predicted class and predicted probability of the class
Print the results to see the predicted labels and the actual labels
results
Calculate the accuracy of the model using the accuracy function
accuracy(results, truth = high_risk, estimate = .pred_class)
This section can be skipped because these results are not totally as I expect them to be.
Let’s understand the variables impacting the subscription buying decision. In a logistic regression scenario, the coefficients decide how sensitive the target variable is to the individual predictors. The higher the value of coefficients the higher their importance is. Sort the variables in descending order of the absolute value of their coefficient values and display only the coefficients with an absolute value greater than 0.5.
coeff_model <- tidy(model) %>%
arrange(desc(abs(estimate))) %>%
filter(abs(estimate) > 0.002)
coeff_model <- tidy(model) %>%
arrange(desc(abs(estimate))) %>%
filter(estimate > 0.1)
Plot the feature importance using the ggplot() function.
ggplot(coeff_model, aes(x = term, y = estimate, fill = term)) + geom_col() + coord_flip()
Using the best hyperparameters:
Let’s train a logistic regression model Use it to generate predictions on test set Create a confusion matrix using the true values, and the estimates
You can calculate the precision (positive predictive value, the number of true positives divided by the number of predicted positives) with the precision() function.
Similarly, you can calculate the recall (sensitivity, the number of true positives divided by the number of actual positives) with the recall() function.
Let’s understand the variables impacting the subscription buying decision. In a logistic regression scenario, the coefficients decide how sensitive the target variable is to the individual predictors. The higher the value of coefficients the higher their importance is. Sort the variables in descending order of the absolute value of their coefficient values and display only the coefficients with an absolute value greater than 0.5.
Plot the feature importance using the ggplot() function.
sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Chicago
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] glmnet_4.1-8 Matrix_1.7-0 shiny_1.9.1 stringr_1.5.1
## [5] yardstick_1.3.1 workflowsets_1.1.0 workflows_1.1.4 tune_1.2.1
## [9] tidyr_1.3.1 tibble_3.2.1 rsample_1.2.1 recipes_1.1.0
## [13] purrr_1.0.2 parsnip_1.2.1 modeldata_1.4.0 infer_1.0.7
## [17] ggplot2_3.5.1 dplyr_1.1.4 dials_1.3.0 scales_1.3.0
## [21] broom_1.0.6 tidymodels_1.2.0 readr_2.1.5
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.1 timeDate_4032.109 farver_2.1.2
## [4] fastmap_1.2.0 promises_1.3.0 digest_0.6.37
## [7] rpart_4.1.23 mime_0.12 timechange_0.3.0
## [10] lifecycle_1.0.4 survival_3.7-0 magrittr_2.0.3
## [13] compiler_4.4.1 rlang_1.1.4 sass_0.4.9
## [16] tools_4.4.1 utf8_1.2.4 yaml_2.3.10
## [19] data.table_1.15.4 knitr_1.48 labeling_0.4.3
## [22] DiceDesign_1.10 withr_3.0.1 nnet_7.3-19
## [25] grid_4.4.1 fansi_1.0.6 xtable_1.8-4
## [28] colorspace_2.1-1 future_1.34.0 globals_0.16.3
## [31] iterators_1.0.14 MASS_7.3-61 cli_3.6.3
## [34] rmarkdown_2.28 generics_0.1.3 rstudioapi_0.16.0
## [37] future.apply_1.11.2 tzdb_0.4.0 cachem_1.1.0
## [40] splines_4.4.1 parallel_4.4.1 vctrs_0.6.5
## [43] hardhat_1.4.0 jsonlite_1.8.8 hms_1.1.3
## [46] listenv_0.9.1 foreach_1.5.2 gower_1.0.1
## [49] jquerylib_0.1.4 glue_1.7.0 parallelly_1.38.0
## [52] codetools_0.2-20 shape_1.4.6.1 stringi_1.8.4
## [55] lubridate_1.9.3 gtable_0.3.5 later_1.3.2
## [58] munsell_0.5.1 GPfit_1.0-8 pillar_1.9.0
## [61] furrr_0.3.1 htmltools_0.5.8.1 ipred_0.9-15
## [64] lava_1.8.0 R6_2.5.1 lhs_1.2.0
## [67] evaluate_0.24.0 lattice_0.22-6 highr_0.11
## [70] backports_1.5.0 httpuv_1.6.15 bslib_0.8.0
## [73] class_7.3-22 Rcpp_1.0.13 prodlim_2024.06.25
## [76] xfun_0.47 pkgconfig_2.0.3