DCA and ROC curves for multivariable analysis #10
Hi all, I'm new to R (and especially to the `%>%` pipe). I wonder whether someone could point me in the right direction? I'm interested in creating a 10-fold cross-validation set for DCA and also ROC analysis for 3 or 4 models. The code below, which is provided in the tutorial, looks at one logistic regression model; I wonder how I could incorporate more than one model (and also ROC curves)?

```r
# create a 10-fold cross validation set
rsample::vfold_cv(df_cancer_dx, v = 10, repeats = 25) %>%
  # for each cut of the data, build logistic regression on the 90% (analysis set),
  # and perform DCA on the 10% (assessment set)
  rowwise() %>%
  # (intermediate steps omitted in the original post)
  # pool results from the 10-fold cross validation
  pull(dca_assessment) %>%
  # (intermediate steps omitted in the original post)
  # plot cross validated net benefit values
  ggplot(aes(x = threshold, y = net_benefit, color = label)) +
  # (remainder omitted in the original post)
```

Many thanks in advance!!
Hi Miguel,

After discussing with Dr. Vickers, it seems the cross-validation from the tutorial is not ideal: it recalculates the net benefit for each cross-validation assessment set and then takes the average for each patient from all sets at the end. Instead, we want to perform the cross-validation, concatenate all of the resulting predictions, and generate mean predictions for each row. Only after doing so should we generate metrics such as net benefit scores and ROC/AUC values. Thank you for asking this question! I will update the tutorial accordingly.

I have written code below that iterates over 3 different models and generates both DCA and ROC curves. If you are confused by the dplyr 'pipe' (`%>%`), I would highly advise you to read the dplyr pipe documentation or watch explanatory videos, as the pipe is a fundamental aspect of modern R programming.

R Code:
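A minimal sketch of the workflow described above (this is not the author's actual listing; the model formulas and the column names `cancer`, `famhistory`, `marker`, and `age` are assumptions based on the dcurves tutorial data `df_cancer_dx`):

```r
# Sketch only: assumes df_cancer_dx with binary outcome `cancer` and
# predictors `famhistory`, `marker`, `age` (as in the dcurves tutorial data).
library(dplyr)
library(purrr)
library(rsample)
library(dcurves)
library(pROC)

# Three candidate model formulas (illustrative, not the tutorial's)
formulas <- list(
  cv_pred_1 = cancer ~ famhistory,
  cv_pred_2 = cancer ~ marker,
  cv_pred_3 = cancer ~ famhistory + marker + age
)

# Stable row id so out-of-fold predictions can be matched back per patient
df <- df_cancer_dx %>% mutate(.row_id = row_number())
cv <- vfold_cv(df, v = 10, repeats = 25)

# For one formula: fit on each analysis set, predict on the assessment set,
# concatenate all out-of-fold predictions, then average them per row
# (with repeats = 25, each row receives 25 out-of-fold predictions)
oof_mean <- function(formula) {
  map_dfr(cv$splits, function(split) {
    fit <- glm(formula, data = analysis(split), family = binomial)
    dat <- assessment(split)
    tibble(.row_id = dat$.row_id,
           pred = predict(fit, newdata = dat, type = "response"))
  }) %>%
    group_by(.row_id) %>%
    summarise(pred = mean(pred), .groups = "drop")
}

# Add one mean-prediction column per model
for (nm in names(formulas)) {
  df[[nm]] <- oof_mean(formulas[[nm]]) %>%
    arrange(.row_id) %>%
    pull(pred)
}

# Only now compute the metrics: DCA on the pooled, averaged predictions
dca(cancer ~ cv_pred_1 + cv_pred_2 + cv_pred_3, data = df) %>%
  plot(smooth = TRUE)

# ROC/AUC for one of the models
roc_1 <- roc(df$cancer, df$cv_pred_1)
auc(roc_1)
```

The key design point is that `dca()` and `roc()` are each called once, on predictions averaged across all folds and repeats, rather than once per assessment set.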
Thanks so much, Shaun. I suppose I can then plot ROC curves easily with the values of cv_pred_1, cv_pred_2, etc.
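Yes — assuming a data frame `df` holding the outcome and the averaged cross-validated prediction columns (the names `cv_pred_1`, `cv_pred_2` come from this thread; the data frame itself is an assumption), overlaid ROC curves can be drawn with pROC, for example:

```r
library(pROC)

# Assumes df with binary outcome `cancer` and mean cross-validated
# predictions in columns cv_pred_1 and cv_pred_2
roc_1 <- roc(df$cancer, df$cv_pred_1)
roc_2 <- roc(df$cancer, df$cv_pred_2)

# Overlay the two curves on one plot, with AUCs in the legend
plot(roc_1, col = "blue")
plot(roc_2, col = "red", add = TRUE)
legend("bottomright",
       legend = c(sprintf("Model 1 (AUC = %.2f)", auc(roc_1)),
                  sprintf("Model 2 (AUC = %.2f)", auc(roc_2))),
       col = c("blue", "red"), lwd = 2)
```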