9 Model Validation

9.1 Cross-Validation Techniques

Cross-validation is a method used to evaluate the performance and generalization ability of predictive models. It involves partitioning the available data into multiple subsets, known as folds. The model is trained on a portion of the data (training set) and then evaluated on the remaining data (validation set). This process is repeated multiple times, with each fold serving as the validation set exactly once. Cross-validation provides a more robust estimate of the model’s performance compared to a single train-test split and helps identify overfitting.

In this section, we explore several cross-validation techniques. Understanding these methods matters because they play a central role in the hyperparameter tuning covered later: cross-validation provides a robust estimate of model performance and guides the choice of hyperparameters.

9.1.1 K-Fold Cross-Validation

K-Fold Cross-Validation divides the data into K equal-sized folds. The model is trained K times, each time using K-1 folds as the training set and the remaining fold as the validation set; the K performance scores are then averaged into a single estimate.

# Example of K-Fold Cross-Validation
kfcv_ctrl <- caret::trainControl(method = "cv", number = 10)
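
The resulting control object is then handed to train() through its trControl argument. A minimal sketch of that step, assuming the data_imputed data frame and two-level factor outcome target used later in this chapter:

# Fit a logistic regression with 10-fold cross-validation
kfcv_model <- caret::train(target ~ ., data = data_imputed,
                           method = "glm", family = binomial,
                           trControl = kfcv_ctrl)
kfcv_model$resample  # per-fold Accuracy and Kappa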

9.1.2 Leave-One-Out Cross-Validation (LOOCV)

Leave-One-Out Cross-Validation uses a single observation as the validation set and the remaining observations as the training set. This process is repeated once for each observation, making it equivalent to K-Fold Cross-Validation with K equal to the number of observations; it has low bias but can be computationally expensive for large datasets.

# Example of Leave-One-Out Cross-Validation
loocv_ctrl <- caret::trainControl(method = "LOOCV")
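
Because LOOCV fits one model per observation, the cost grows directly with the number of rows. For illustration, the same procedure written out with base R's glm() (a sketch that assumes a modest-sized data_imputed with a two-level factor outcome target):

# LOOCV by hand: one model fit per held-out row
n <- nrow(data_imputed)
loocv_hits <- sapply(seq_len(n), function(i) {
  fit  <- glm(target ~ ., data = data_imputed[-i, ], family = binomial)
  prob <- predict(fit, newdata = data_imputed[i, ], type = "response")
  pred <- ifelse(prob > 0.5,
                 levels(data_imputed$target)[2],
                 levels(data_imputed$target)[1])
  pred == as.character(data_imputed$target[i])
})
mean(loocv_hits)  # LOOCV estimate of accuracy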

9.1.3 Stratified Cross-Validation

Stratified Cross-Validation ensures that each fold maintains roughly the same class distribution as the original dataset, which is particularly useful for imbalanced datasets. For classification problems caret already samples within the levels of the outcome when it builds the folds, so the control below mainly adds class probabilities and a two-class summary (ROC, sensitivity, specificity); twoClassSummary requires the class levels to be valid R variable names.

# Example of Stratified Cross-Validation
strcv_ctrl <- trainControl(method = "cv", 
                     number = 10, 
                     classProbs = TRUE, 
                     summaryFunction = twoClassSummary)
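
To make the stratification explicit, the folds can be constructed with createFolds(), which samples within each level of a factor outcome, and passed to trainControl() through its index argument. A minimal sketch, assuming the factor outcome data_imputed$target:

# Build stratified training folds by sampling within each class of the outcome
set.seed(123)
strat_folds <- caret::createFolds(data_imputed$target, k = 10, returnTrain = TRUE)

strcv_ctrl_explicit <- caret::trainControl(method = "cv",
                                           index = strat_folds,
                                           classProbs = TRUE,
                                           summaryFunction = caret::twoClassSummary)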

9.1.4 Nested Cross-Validation

Nested Cross-Validation is used to tune hyperparameters and evaluate model performance simultaneously. It involves an outer loop for model evaluation using K-Fold Cross-Validation and an inner loop for hyperparameter tuning.

# Example of Nested Cross-Validation
nestcv_ctrl_outer <- trainControl(method = "cv", number = 5)
nestcv_ctrl_inner <- trainControl(method = "cv", number = 3)
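
One way to combine the two controls is to write the outer loop by hand (the nestcv_ctrl_outer object documents the intended 5-fold outer scheme) and let train() run the inner tuning loop. A minimal sketch, assuming a data frame data_imputed with a two-level factor outcome target; the glmnet model and tuneLength = 5 are purely illustrative:

# Outer loop: 5 stratified outer folds for performance estimation
set.seed(123)
outer_folds <- caret::createFolds(data_imputed$target, k = 5, returnTrain = TRUE)

outer_accuracy <- sapply(outer_folds, function(train_idx) {
  outer_train <- data_imputed[train_idx, ]
  outer_test  <- data_imputed[-train_idx, ]

  # Inner loop: 3-fold CV inside the outer training set tunes the hyperparameters
  tuned_fit <- caret::train(target ~ ., data = outer_train,
                            method = "glmnet", tuneLength = 5,
                            trControl = nestcv_ctrl_inner)

  # The tuned model is scored once on the untouched outer fold
  preds <- predict(tuned_fit, newdata = outer_test)
  mean(preds == outer_test$target)
})

mean(outer_accuracy)  # performance estimate that is not biased by the tuning step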

9.2 Holdout Validation

Holdout validation involves splitting the dataset into two subsets: a training set used to train the model and a separate validation set used to evaluate its performance. This technique is straightforward and computationally efficient but may suffer from high variance if the validation set is small.

library(caret)

# Use the imputed dataset prepared earlier (replace this with your own dataset)
data <- data_imputed

# Holdout Validation
set.seed(123)  # for reproducibility

# 80% of the rows for training, the remaining 20% for validation
train_indices <- sample(1:nrow(data), floor(0.8 * nrow(data)))
train_data <- data[train_indices, ]
validation_data <- data[-train_indices, ]

# Fit a logistic regression on the training set (target is a two-level factor)
model <- train(target ~ ., data = train_data, method = "glm", family = binomial)

# Make predictions on the validation data
predictions <- predict(model, newdata = validation_data)

# Evaluate model performance on the held-out validation set
evaluation_metrics <- confusionMatrix(predictions, validation_data$target)
print(evaluation_metrics)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 16  6
         1 11 12
                                          
               Accuracy : 0.6222          
                 95% CI : (0.4654, 0.7623)
    No Information Rate : 0.6             
    P-Value [Acc > NIR] : 0.4436          
                                          
                  Kappa : 0.2478          
                                          
 Mcnemar's Test P-Value : 0.3320          
                                          
            Sensitivity : 0.5926          
            Specificity : 0.6667          
         Pos Pred Value : 0.7273          
         Neg Pred Value : 0.5217          
             Prevalence : 0.6000          
         Detection Rate : 0.3556          
   Detection Prevalence : 0.4889          
      Balanced Accuracy : 0.6296          
                                          
       'Positive' Class : 0               
                                          
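If the classes are imbalanced, a purely random split can leave the validation set with a distorted class mix. caret's createDataPartition() samples within the levels of a factor outcome, so a stratified 80/20 holdout split is a drop-in alternative (using the same data and target as above):

# Stratified 80/20 holdout split
set.seed(123)
strat_indices    <- createDataPartition(data$target, p = 0.8, list = FALSE)
strat_train      <- data[strat_indices, ]
strat_validation <- data[-strat_indices, ]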

9.3 Bootstrapping

Bootstrapping is a resampling technique in which multiple datasets are generated by sampling with replacement from the original data. A model is trained on each bootstrap sample and evaluated on the observations that were not drawn into that sample (the out-of-bag observations); aggregating these out-of-bag predictions gives an estimate of the model’s performance. Bootstrapping provides robust performance estimates and can handle small datasets effectively.

library(caret)

# Sample dataset (replace this with your own dataset)
data <- data_imputed

# Perform bootstrapping: each resample trains its own model, which is then
# evaluated on the out-of-bag (OOB) rows not drawn into that resample
set.seed(123)  # for reproducibility
boot <- createResample(y = data$target, times = 5)

boot_results <- lapply(boot, function(index) {
  train_data <- data[index, ]
  oob_data   <- data[-index, ]
  model <- train(target ~ ., data = train_data, method = "glm", family = binomial)
  data.frame(pred = predict(model, newdata = oob_data),
             obs  = oob_data$target)
})

# Aggregate the out-of-bag predictions together with their observed classes
boot_results <- do.call(rbind, boot_results)

# Evaluate bootstrapped (out-of-bag) model performance
evaluation_metrics <- confusionMatrix(boot_results$pred, boot_results$obs)
print(evaluation_metrics)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 49 52
         1 74 47
                                          
               Accuracy : 0.4324          
                 95% CI : (0.3663, 0.5004)
    No Information Rate : 0.5541          
    P-Value [Acc > NIR] : 0.99989         
                                          
                  Kappa : -0.1242         
                                          
 Mcnemar's Test P-Value : 0.06137         
                                          
            Sensitivity : 0.3984          
            Specificity : 0.4747          
         Pos Pred Value : 0.4851          
         Neg Pred Value : 0.3884          
             Prevalence : 0.5541          
         Detection Rate : 0.2207          
   Detection Prevalence : 0.4550          
      Balanced Accuracy : 0.4366          
                                          
       'Positive' Class : 0
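
Instead of managing the resamples by hand, the bootstrap can also be delegated to caret by setting method = "boot" in trainControl(); a minimal sketch with the same data and target:

# Let caret run 25 bootstrap resamples and aggregate the resampled metrics
boot_ctrl  <- trainControl(method = "boot", number = 25)
boot_model <- train(target ~ ., data = data, method = "glm",
                    family = binomial, trControl = boot_ctrl)
boot_model$results  # Accuracy and Kappa averaged over the resamples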