Summary
1. Introduction
2. Data Upload and Processing
- Load libraries
- Load dataframe
3. Create Model: Look at the Train and Test dataframes
- Split data
- Fitting and Evaluating the Model
4. Making predictions
5. Variable Importance
- Hyperparameter Tuning
- More Evaluation Metrics
References
Session Information

Summary

In this activity the goal is to generate a model that will be able to classify a neuroblastoma patient as having high risk (HR) or low risk (LR) disease.

This prediction is based on the gene expression pattern of tumor isolated from the patient using a Machine Learning algorithm.

The hypothesis is that by creating a model, we will estimate the risk category a patient belongs to with elevated accuracy using the 22 predictor variables.

The steps in the (Chugh 2023) tutorial were followed. Other tutorials used were (An 2024) and (Lu 2023).

1. Introduction

1.1. Supervised and Unsupervised Methods

In this part, we need to distinguish supervised and unsupervised machine learning methods. Supervised machine learning methods use the information of the label or the category of a pattern of data. Unsupervised machine learning methods aim to find a pattern of groups inside the dataset.

Detection of a sub-population of cancer cells that drive agressiveness.
Isolation and study of neuroblastoma cells that drive phenotypes of agressiveness.
- MES cells are rare and aggressive.
- ADRN cells are common and more treatable than MES cells.
Hierarchical Clustering is one of the ML methods that have been used to identify the pattern of gene expression of both ADRN and MES cells.

Classification and Clustering. Classification is a supervised machine learning method while clustering is an unsupervised machine learning method.

We will study two unsupervised ML methods: Hierarchical Clustering and Principal Component Analysis (PCA)

Clustering

A technique of data segmentation that partitions data into several groups based on their similarity.
Clustering uses unsupervised learning techniques to organize unlabeled data points into homogeneous groups where those within a “cluster” have similarities and data points out of these “clusters” have dissimilarities or differences.
Since data is not pre-labeled, there is no training process.
Instead, clusters are created using similarity functions that measure the distance between points.
Relative to classification, clustering is considered a less complex approach and well suited for large datasets.

Algorithms for Clustering

K-means
K-medoids
Density Based
Hierarchical
Common Applications:
Recommender System
Customer Segmentation

1.2. Unsupervised Learning (Hierarchical Clustering) in Neuroblastoma Research

Clustering can be used to separate the genetic profiles of cancer cells that drive progression or defeat of the disease.
ADRN neuroblastoma cells are differentiated cells similar to neurons that are killed by chemiotherapy.
MES neuroblastoma cells are non-differentiated cells thought to behave like stem cells.
MES cells are resitent to chemotherapy.

Heatmap using RNA-Seq and Protein Expression Blots

Clustering of cell differentiation markers in neuroblastoma. Figure from Groningen (2017).

Heatmap using epigenetic profiles

Super-enhancers as spacial chromatin regulators

Super-enhancers are regions of chromatin that are maintained open to facilitate expression of proteins that keep the differentiation state of the neuroblastoma cells. Superenhancers are discussed by Sebastian Pott (2015).

Mediator of RNA polymerase II transcription subunit 1 (MED1).

Super-enhancers as spacial chromatin regulators. Figure from Qunying Jia (2020).

Algorithmic Steps for Identification of Super-enhancers

Superenhancers. Figure from Sebastian Pott (2015).

Super-enhancers Sequence and Location on Heatmap

Cell differentiation state is maintained by organization of super-enhancers in neuroblastoma. Figure from Groningen (2017).

1.3. Foundations of the Logistic Regression model

Maximum Likelihood Estimation (MLE)

Articles about Machine Learning classification concepts are in An (2024) and Nguyen (2024)
Maximum Likelihood Estimation (MLE) is a method to estimate the parameters of an assumed probability distribution, given some observed data according to Wikipedia (2024).
This is achieved by maximizing a likelihood function so that, under the assumed statistical model, the observed data is most probable.

Probability function.

Features that contribute to probability function

Each feature (column or variable) is a phenotype expression given by a gene expression or a phenotype score
Each coeficient of the variable predicts the importance of the variable in calculating the probability

Features that contribute to probability function.

2. Data Upload and Processing

Load libraries

library(readr)
library(tidymodels)
library(stringr)
library(shiny)
library(dplyr)

Load dataframe

This dataframe was constructed selecting the hypoxia and inflammatory phenotype variables from previous dataframes.

kocak_df <- read.csv("../data/r2_gse62564_GSVA_inss_mycn_status_selected.csv")

Remove problematic columns

kocak_df <- kocak_df %>% dplyr::select(-c(sample.id))

Make Factors

kocak_df$high_risk   <- as.factor(kocak_df$high_risk)
kocak_df$inss_stage  <- as.factor(kocak_df$inss_stage)
kocak_df$mycn_status <- as.factor(kocak_df$mycn_status)

Make numeric

kocak_df$MYCN <- as.numeric(kocak_df$MYCN)

Change variable name

kocak_df_clinical_maintained_df <- kocak_df

Remove inss_stage and mycn_status information to avoid influencing the model fitting

kocak_df <- kocak_df %>% dplyr::select(-c(inss_stage, mycn_status))

3. Create Model: Look at the Train and Test dataframes

HIPOTHESIS:

By creating a model, we will estimate the risk category a patient belongs to with elevated accuracy using the 22 predictor variables.

Split data

When you generate “random numbers” in R, you are actually generating pseudorandom numbers.
These numbers are generated with an algorithm that requires a seed to initialize.
Being pseudorandom instead of pure random means that, if you know the seed and the generator, you can predict (and reproduce) the output RCoder (2024).

set.seed(421)

Split data into train set and test set
The prop argument indicates the proportion of patients taken to construct the train set (80% of the Kocak dataset)
Note that here we are deciding randomly which samples go into the train set and which samples go into the test set

split <- initial_split(kocak_df, prop = 0.8, strata = high_risk)
train <- split %>% 
  training()
test <- split %>% 
  testing()

One exercise here is to check what is the class and structure of the split object.

Now we have two newly created datasets that are also dataframes: train and test

class(train)

## [1] "data.frame"

dim(train)

## [1] 397  23

class(test)

## [1] "data.frame"

dim(test)

## [1] 101  23

Chunk build new dataframe is hidden

Note that the train set receives high-risk and non-high risk patients
Therefore the model is trained using the high-risk and non-high risk labels of the dataframe:

train

ABCDEFGHIJ0123456789

MYCN <dbl>	STC1 <dbl>	P4HA1 <dbl>	BHLHE40 <dbl>	HIF1A <dbl>	ELVIDGE_HIF1A_TARGETS_UP <dbl>	Quiescence.Down.Cheung <dbl>
111.049	9.317	13.553	24.425	105.257	-0.251472520	0.419832141
43.224	49.302	30.283	119.521	143.518	-0.182710749	-0.564851712
40.919	9.718	13.764	49.604	58.720	-0.355559669	-0.732969824
88.791	14.059	11.534	46.791	57.886	0.169756000	0.304580765
32.757	12.456	18.256	272.160	86.806	-0.249712253	-0.574011591
4.605	8.485	13.372	70.214	90.511	0.220475665	-0.607272700
25.437	22.541	15.642	125.586	68.602	0.095250494	-0.491713973
14.718	12.852	19.988	231.241	114.457	-0.030011981	-0.348165623
45.140	29.286	18.448	119.850	121.259	0.129029810	-0.326677923
17.078	100.526	22.978	303.754	221.250	0.014482470	-0.464458962

Fitting and Evaluating the Model

Before training the model, glmnet needs to be installed, according to Trevor Hastie (2023), if you do not have it in your system:

install.packages("glmnet")

The next command allows us to train a logistic regression model.
Information about function logistic_reg() can be found in the Max Kuhn (2024a) reference.
Information about the mixture and penalty parameters can be found in the Max Kuhn (2024b) reference.
Note that we are using the line set_mode(“classification”)

model <- logistic_reg(mixture = double(1), penalty = double(1)) %>%
  set_engine("glmnet") %>%
  set_mode("classification") %>%
  fit(high_risk ~ ., data = train)

Model summary

Let’s ask R what the model looks like:

model

More information about how parsnip generates the logistic regression model can be found in Chapter 6 of Tidy Modeling with parsnip SILGE (2023).

str(model)

According to Robinson (2024) in the documentation for the Broom package the tidy function is from the Broom package.
In that reference there is a reference to the “Tidy Data” paper of Wickham (2014).

tidy(model)

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

## Loaded glmnet 4.1-8

ABCDEFGHIJ0123456789

term <chr>	estimate <dbl>	penalty <dbl>
(Intercept)	-0.762926885	0
MYCN	0.002232123	0
STC1	0.018141782	0
P4HA1	0.013830180	0
BHLHE40	-0.001106689	0
HIF1A	-0.002891470	0
ELVIDGE_HIF1A_TARGETS_UP	1.425445531	0
Quiescence.Down.Cheung	0.458795803	0
T.cell.exhaustion	0.291974268	0
HALLMARK_INFLAMMATORY_RESPONSE	-0.970836794	0

The output is shown with the estimate column representing coefficients of the predictor. The estimate column is therefore an estimation of how much the predictor contributes to the high risk label. The higher the value, the more the predictor contributes to the label.

4. Making predictions

Class

Predict the risk class or label

pred_class <- predict(model,
                      new_data = test,
                      type = "class")

pred_class

ABCDEFGHIJ0123456789

.pred_class <fct>
yes
yes
no
no
no
no
no
no
no
no

This chunk will be hidden for now: Class Predictions for New Individuals

Class Probability

Calculate the risk class probability

pred_proba <- predict(model,
                      new_data = test,
                      type = "prob")
pred_proba

ABCDEFGHIJ0123456789

.pred_no <dbl>	.pred_yes <dbl>
0.0962015640	0.903798436
0.0415423791	0.958457621
0.9289635193	0.071036481
0.9023574563	0.097642544
0.9893081953	0.010691805
0.9310281595	0.068971840
0.9298698703	0.070130130
0.9344081675	0.065591833
0.9432419442	0.056758056
0.9518278641	0.048172136

Evaluate the Model

Evaluate the model using the accuracy() function

The Accuracy Formula

Reference for the Accuracy formula:

https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall

Reference for math symbols:

https://datalabzone.com/posts/blogsforr/2023/02/07/math.html

Build data-frame results by

selecting the high_risk column in the test dataframe (note that these are the real labels) and
binding the pred_class and pred_proba

results <- test %>%
  dplyr::select(high_risk) %>%       # Step 1: Select the high_risk column in the *test* dataframe
  bind_cols(pred_class, pred_proba)  # Step 2: Bind the *pred_class* and *pred_proba* with the real labels

With these steps in the construction of the object, results will have the real labels and the predicted class and predicted probability of the class

Print the results to see the predicted labels and the actual labels

results

ABCDEFGHIJ0123456789

high_risk <fct>	.pred_class <fct>	.pred_no <dbl>	.pred_yes <dbl>
yes	yes	0.0962015640	0.903798436
yes	yes	0.0415423791	0.958457621
no	no	0.9289635193	0.071036481
no	no	0.9023574563	0.097642544
no	no	0.9893081953	0.010691805
no	no	0.9310281595	0.068971840
no	no	0.9298698703	0.070130130
no	no	0.9344081675	0.065591833
no	no	0.9432419442	0.056758056
no	no	0.9518278641	0.048172136

Calculate the accuracy of the model using the accuracy function

accuracy(results, truth = high_risk, estimate = .pred_class)

ABCDEFGHIJ0123456789

.metric <chr>	.estimator <chr>	.estimate <dbl>
accuracy	binary	0.9108911

5. Variable Importance

This section can be skipped because these results are not totally as I expect them to be.

Let’s understand the variables impacting the subscription buying decision. In a logistic regression scenario, the coefficients decide how sensitive the target variable is to the individual predictors. The higher the value of coefficients the higher their importance is. Sort the variables in descending order of the absolute value of their coefficient values and display only the coefficients with an absolute value greater than 0.5.

coeff_model <- tidy(model) %>% 
  arrange(desc(abs(estimate))) %>% 
  filter(abs(estimate) > 0.002)

coeff_model <- tidy(model) %>% 
  arrange(desc(abs(estimate))) %>% 
  filter(estimate > 0.1)

Plot the feature importance using the ggplot() function.

ggplot(coeff_model, aes(x = term, y = estimate, fill = term)) + geom_col() + coord_flip()

Hyperparameter Tuning

More Evaluation Metrics

Using the best hyperparameters:

Let’s train a logistic regression model Use it to generate predictions on test set Create a confusion matrix using the true values, and the estimates

You can calculate the precision (positive predictive value, the number of true positives divided by the number of predicted positives) with the precision() function.

Similarly, you can calculate the recall (sensitivity, the number of true positives divided by the number of actual positives) with the recall() function.

Plot the feature importance using the ggplot() function.

References

An, Jerry. 2024. How to Remember All These Classification Concepts Forever. (Date accessed: 08/30/2024). https://medium.com/swlh/how-to-remember-all-these-classification-concepts-forever-761c065be33.

Chugh, Vidhi. 2023. Logistic Regression in r Tutorial. https://www.datacamp.com/tutorial/logistic-regression-R.

Groningen, Tim van. 2017. Neuroblastoma Is Composed of Two Super-Enhancer Associated Differentiation States. Nature Genetics. Vol. 49.

Lu, Min. 2023. Fast Unified Random Forests with randomForestSRC. https://www.randomforestsrc.org/articles/getstarted.html.

Max Kuhn, Davis Vaughan. 2024a. Logistic Regression. https://parsnip.tidymodels.org/reference/logistic_reg.html.

———. 2024b. Logistic Regression via Glmnet. https://parsnip.tidymodels.org/reference/details_logistic_reg_glmnet.html.

Nguyen, Alexander. 2024. https://levelup.gitconnected.com/the-resume-that-got-a-software-engineer-a-300-000-job-at-google-8c5a1ecff40f.

Qunying Jia, Yuan Tan, Shuhua Chen. 2020. Oncogenic Super-Enhancer Formation in Tumorigenesis and Its Molecular Mechanisms.

RCoder. 2024. Set Seed in r. (Date accessed: 08/30/2024). https://r-coder.com/set-seed-r/.

Robinson, David. 2024. Broom: Let’s Tidy up a Bit. The Comprehensive R Archive Network. (Date accessed: 08/30/2024). https://cran.r-project.org/web/packages/broom/vignettes/broom.html.

Sebastian Pott, Jason D Lieb. 2015. What Are Super-Enhancers? Nature Genetics. Vol. 47.

SILGE, MAX KUHN AND JULIA. 2023. Set Seed in r. (Date accessed: 08/30/2024). https://www.tmwr.org/models.

Trevor Hastie, Kenneth Tay, Junyang Qian. 2023. Forest Plot for Cox Proportional Hazards Model. https://glmnet.stanford.edu/articles/glmnet.html.

Wickham, Hadley. 2014. Tidy Data. Journal of Statistical Software. Vol. 59.

Wikipedia. 2024. Maximum Likelihood Estimation. (Date accessed: 08/29/2024). https://en.wikipedia.org/wiki/Maximum_likelihood_estimation.

Session Information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Big Sur 11.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] glmnet_4.1-8       Matrix_1.7-0       shiny_1.9.1        stringr_1.5.1     
##  [5] yardstick_1.3.1    workflowsets_1.1.0 workflows_1.1.4    tune_1.2.1        
##  [9] tidyr_1.3.1        tibble_3.2.1       rsample_1.2.1      recipes_1.1.0     
## [13] purrr_1.0.2        parsnip_1.2.1      modeldata_1.4.0    infer_1.0.7       
## [17] ggplot2_3.5.1      dplyr_1.1.4        dials_1.3.0        scales_1.3.0      
## [21] broom_1.0.6        tidymodels_1.2.0   readr_2.1.5       
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1    timeDate_4032.109   farver_2.1.2       
##  [4] fastmap_1.2.0       promises_1.3.0      digest_0.6.37      
##  [7] rpart_4.1.23        mime_0.12           timechange_0.3.0   
## [10] lifecycle_1.0.4     survival_3.7-0      magrittr_2.0.3     
## [13] compiler_4.4.1      rlang_1.1.4         sass_0.4.9         
## [16] tools_4.4.1         utf8_1.2.4          yaml_2.3.10        
## [19] data.table_1.15.4   knitr_1.48          labeling_0.4.3     
## [22] DiceDesign_1.10     withr_3.0.1         nnet_7.3-19        
## [25] grid_4.4.1          fansi_1.0.6         xtable_1.8-4       
## [28] colorspace_2.1-1    future_1.34.0       globals_0.16.3     
## [31] iterators_1.0.14    MASS_7.3-61         cli_3.6.3          
## [34] rmarkdown_2.28      generics_0.1.3      rstudioapi_0.16.0  
## [37] future.apply_1.11.2 tzdb_0.4.0          cachem_1.1.0       
## [40] splines_4.4.1       parallel_4.4.1      vctrs_0.6.5        
## [43] hardhat_1.4.0       jsonlite_1.8.8      hms_1.1.3          
## [46] listenv_0.9.1       foreach_1.5.2       gower_1.0.1        
## [49] jquerylib_0.1.4     glue_1.7.0          parallelly_1.38.0  
## [52] codetools_0.2-20    shape_1.4.6.1       stringi_1.8.4      
## [55] lubridate_1.9.3     gtable_0.3.5        later_1.3.2        
## [58] munsell_0.5.1       GPfit_1.0-8         pillar_1.9.0       
## [61] furrr_0.3.1         htmltools_0.5.8.1   ipred_0.9-15       
## [64] lava_1.8.0          R6_2.5.1            lhs_1.2.0          
## [67] evaluate_0.24.0     lattice_0.22-6      highr_0.11         
## [70] backports_1.5.0     httpuv_1.6.15       bslib_0.8.0        
## [73] class_7.3-22        Rcpp_1.0.13         prodlim_2024.06.25 
## [76] xfun_0.47           pkgconfig_2.0.3

Classification Using a Logistic Regression Model

Gepoliano Chaves

August 30th, 2024

Summary

1. Introduction

1.1. Supervised and Unsupervised Methods

Clustering

Algorithms for Clustering

1.2. Unsupervised Learning (Hierarchical Clustering) in Neuroblastoma Research

Heatmap using RNA-Seq and Protein Expression Blots

Heatmap using epigenetic profiles

Super-enhancers as spacial chromatin regulators

Algorithmic Steps for Identification of Super-enhancers

Super-enhancers Sequence and Location on Heatmap

1.3. Foundations of the Logistic Regression model

Maximum Likelihood Estimation (MLE)

Features that contribute to probability function

2. Data Upload and Processing

Load libraries

Load dataframe

3. Create Model: Look at the Train and Test dataframes

Split data

Fitting and Evaluating the Model

Model summary

4. Making predictions

Class

Class Probability

Evaluate the Model

The Accuracy Formula

5. Variable Importance

Hyperparameter Tuning

More Evaluation Metrics

References

Session Information