
Conversation

hfrick (Member) commented on Oct 13, 2025

I've written about how we make the automatic calibration splits. The article covers the guiding principles of our approach, both the how and the why (although not in academic-paper depth).

The goal is also to let people understand in detail what happens for the sliding resamples. It has gotten relatively lengthy, though. I'm wondering whether it should stay here or, e.g., go into a separate article, similar to how we split out the details of how we deal with censoring for the dynamic survival metrics. I think there's value in working through those details somewhere (other than in the source code directly), but we could also experiment with collapsible text. Do you have any preferences or suggestions? Or do you think the length is fine as it is?
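For the collapsible option: in a Quarto article this could be as simple as a collapsible callout. A minimal sketch (the section title is made up):

```markdown
::: {.callout-note collapse="true"}
## Details: splitting the sliding resamples

The lengthy walk-through would live here, collapsed by default, and readers
can expand it if they want the full details.
:::
```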


EmilHvitfeldt (Member) left a comment


I think this is very valuable and high-quality work.

I think this would be a good place for this content. It would also fit on the rsample pkgdown site, for that matter.

There are small comments and nitpicks, but I find the overall structure and style very nice.

I'm also fine with the length. It is a complicated topic, and without the prose and diagrams it would be hard to understand.

```yaml
description: |
  Learn how tidymodels generates automatic calibration sets for post-processing.
toc: true
toc-depth: 2
```

Suggested change

```diff
-toc-depth: 2
+toc-depth: 3
```

Due to the length of the article, I think upping the TOC depth is nice; otherwise the TOC doesn't update that often as you scroll.


While preprocessing is the transformation of the predictors prior to a model fit, post-processing is the transformation of the predictions after the model fit. This could range from something as straightforward as limiting predictions to a certain range of values to something as complicated as transforming them based on a separate calibration model.

A calibration model is used to model the relationship between the predictions based on the primary model and the true outcomes. An additional model means an additional chance to accidentially overfit. So when working with calibration, this is crucial: we cannot use the same data to fit our calibration model as we use to assess the combination of primary and calibration model. Using the same data to fit the primary model and the calibration model means the predictions used to fit the calibration model are re-predictions of the same observations used to fit the primary model. Hence they are closer to the true values than predictions on new data would be and the calibration model doesn't have accurate information to estimate the right trends (so that they then can be removed).

Suggested change

```diff
-A calibration model is used to model the relationship between the predictions based on the primary model and the true outcomes. An additional model means an additional chance to accidentially overfit. So when working with calibration, this is crucial: we cannot use the same data to fit our calibration model as we use to assess the combination of primary and calibration model. Using the same data to fit the primary model and the calibration model means the predictions used to fit the calibration model are re-predictions of the same observations used to fit the primary model. Hence they are closer to the true values than predictions on new data would be and the calibration model doesn't have accurate information to estimate the right trends (so that they then can be removed).
+A calibration model is used to model the relationship between the predictions based on the primary model and the true outcomes. An additional model means an additional chance to accidentally overfit. So when working with calibration, this is crucial: we cannot use the same data to fit our calibration model as we use to assess the combination of primary and calibration model. Using the same data to fit the primary model and the calibration model means the predictions used to fit the calibration model are re-predictions of the same observations used to fit the primary model. Hence they are closer to the true values than predictions on new data would be and the calibration model doesn't have accurate information to estimate the right trends (so that they then can be removed).
```
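To make the idea of a calibration model concrete, here is a minimal sketch using the probably package and its built-in `segment_logistic` predictions. This is only an illustration, not code from the article; the half/half split is made up.

```r
library(probably)

# segment_logistic ships with probably: held-out predictions from a logistic
# regression, with the true class in `Class` and predicted probabilities in
# `.pred_good` / `.pred_poor`
n <- nrow(segment_logistic)

# Pretend the first half is our calibration set and the second half is the
# data we assess on -- the key point is that they are different rows
cal_preds  <- segment_logistic[seq_len(n %/% 2), ]
test_preds <- segment_logistic[(n %/% 2 + 1):n, ]

# Fit the calibration model on the calibration predictions only
cal_mod <- cal_estimate_logistic(cal_preds, truth = Class)

# Apply it to the remaining predictions to assess the combination of
# primary model and calibration model
calibrated <- cal_apply(test_preds, cal_mod)
```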


While preprocessing is the transformation of the predictors prior to a model fit, post-processing is the transformation of the predictions after the model fit. This could range from something as straightforward as limiting predictions to a certain range of values to something as complicated as transforming them based on a separate calibration model.


Below we use the term primary model, which we just started using. I like the term, but I think it would be nice to properly define it in terms of the pre/model/post diagram/terminology.


A calibration model is used to model the relationship between the predictions based on the primary model and the true outcomes. An additional model means an additional chance to accidentially overfit. So when working with calibration, this is crucial: we cannot use the same data to fit our calibration model as we use to assess the combination of primary and calibration model. Using the same data to fit the primary model and the calibration model means the predictions used to fit the calibration model are re-predictions of the same observations used to fit the primary model. Hence they are closer to the true values than predictions on new data would be and the calibration model doesn't have accurate information to estimate the right trends (so that they then can be removed).

rsample provides a collection of functions to make resamples for empirical validation of prediction models. So far, the assumption was that the prediction model is the only model that needs fitting, i.e., a resample consists of an analysis set and an assessment set.
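As a quick illustration of those two sets (not code from the article), any rsample split exposes them via `analysis()` and `assessment()`:

```r
library(rsample)

# A resample pairs an analysis set (used to fit the model) with an
# assessment set (used to evaluate it)
folds <- vfold_cv(mtcars, v = 5)

analysis(folds$splits[[1]])    # roughly 4/5 of the rows, for fitting
assessment(folds$splits[[1]])  # the remaining rows, for evaluation
```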

Do we have an rsample doc page for the analysis set and the assessment set?


rsample provides a collection of functions to make resamples for empirical validation of prediction models. So far, the assumption was that the prediction model is the only model that needs fitting, i.e., a resample consists of an analysis set and an assessment set.

If we include calibration into our workflow (bundeling preprocessing, (primary) model, and post-processing), we want an analysis set, a calibration set, and an assessment set.

Suggested change

```diff
-If we include calibration into our workflow (bundeling preprocessing, (primary) model, and post-processing), we want an analysis set, a calibration set, and an assessment set.
+If we include calibration into our workflow (bundling preprocessing, (primary) model, and post-processing), we want an analysis set, a calibration set, and an assessment set.
```


Let's start with the row-based splitting done by `sliding_window()`. We'll use a very small example dataset. This will make it easier to illustrate how the different subsets of the data are created but note that it is too small for real-world purposes. Let's use a data frame with 11 rows and say we want to use 5 for the analysis set, 3 for the assessment set, and leave a gap of 2 in between those two sets. We can make two such resamples from our data frame.
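A rough sketch of how that setup might look as a `sliding_window()` call, assuming the usual lookback / assess_start / assess_stop arguments (the article's actual code may differ):

```r
library(rsample)

# 11 rows, as in the diagram below
dat <- data.frame(row = 1:11)

resamples <- sliding_window(
  dat,
  lookback = 4,      # analysis set: the current row plus the 4 rows before it
  assess_start = 3,  # leave a gap of 2 rows after the analysis set
  assess_stop = 5    # assessment set: 3 rows (rows i + 3 to i + 5)
)

# With 11 rows, only two such windows fit, giving the two resamples
analysis(resamples$splits[[1]])
assessment(resamples$splits[[1]])
```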

![](images/calibration-split-window.jpg)

Could we do 4 for the assessment set and a gap of 1?

Right now there is very little air inside the assessment set with regard to the text.


I'm realizing that this would be a huge undertaking.


![](images/calibration-split-index.jpg)

We still get two resamples; however, the analysis set contains only 4 rows because only those fall into the window defined by the index.
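A rough sketch of this with `sliding_index()` (made-up data, not necessarily the article's example; the point is only that one value of the index is unobserved):

```r
library(rsample)

# Index values 1 to 11, but day 3 was never observed
dat <- data.frame(day = setdiff(1:11, 3))

resamples <- sliding_index(
  dat,
  index = day,
  lookback = 4,      # analysis window: index values in [i - 4, i]
  assess_start = 3,
  assess_stop = 5
)

# The window [1, 5] spans five index values, but only four of them are
# observed, so this analysis set contains four rows
analysis(resamples$splits[[1]])
```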

Would it be beneficial to place the missing value at 1 or 5 such that the analysis sets have different lengths?


Or, I guess, that isn't interesting at all, because then we do the same as in the previous section.

```r
analysis(r_split)
```

The sliding splits slide over _the data_, meaning they slide over observed values of the index and they slide only within the boundaries of the observed index values. So here, we can only slide within [3, 6] and thus cannot fit an inner analysis set of three and a calibration set of two into it. As established earlier, we fall back onto an empty calibration set in such a situation.

We mention a couple of times that we fall back. Should we mention that we fall back with a warning?


Calculating the length of the calibration set rather than the gap, together with rounding up when translating proportions to new lengths within the outer analysis set, means that we prioritize allocating observations to the (inner) analysis and calibration set over allocating them to the gap. In this example, this means that we are not leaving a gap between the analysis set and the calibration set.

However, rounding up for both (inner) analysis and calibration set when we don't have a gap could mean we end up allocating more observations than we actually have. So in that case, we try to take from the calibration set if possible and thus prioritzing fitting the prediction model over the calibration model.

Suggested change

```diff
-However, rounding up for both (inner) analysis and calibration set when we don't have a gap could mean we end up allocating more observations than we actually have. So in that case, we try to take from the calibration set if possible and thus prioritzing fitting the prediction model over the calibration model.
+However, rounding up for both (inner) analysis and calibration set when we don't have a gap could mean we end up allocating more observations than we actually have. So in that case, we try to take from the calibration set if possible and thus prioritizing fitting the prediction model over the calibration model.
```
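To make the rounding rule concrete, a made-up example (the numbers are illustrative, not from the article):

```r
# Say the outer analysis set has 5 rows and we want a 75% / 25% split into
# the inner analysis set and the calibration set, with no gap
n_outer <- 5

n_analysis    <- ceiling(0.75 * n_outer)  # 4
n_calibration <- ceiling(0.25 * n_outer)  # 2

# Rounding both up over-allocates: 4 + 2 = 6 > 5 ...
overshoot <- n_analysis + n_calibration - n_outer

# ... so, following the rule described above, the excess is taken from the
# calibration set, prioritizing the prediction model: 4 rows vs. 1 row
n_calibration <- n_calibration - overshoot
c(n_analysis, n_calibration)
```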


![](images/calibration-split-period.jpg)

The principle of how to contruct a calibration split on the (outer) analysis set remains the same. The challenges of abstracting away from the rows, as illustrated for sliding over observed instances of an index also remain. Here, we slide over observed periods. We observe a period, if we observe an index within that period.

Suggested change

```diff
-The principle of how to contruct a calibration split on the (outer) analysis set remains the same. The challenges of abstracting away from the rows, as illustrated for sliding over observed instances of an index also remain. Here, we slide over observed periods. We observe a period, if we observe an index within that period.
+The principle of how to construct a calibration split on the (outer) analysis set remains the same. The challenges of abstracting away from the rows, as illustrated for sliding over observed instances of an index also remain. Here, we slide over observed periods. We observe a period, if we observe an index within that period.
```
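For completeness, a rough `sliding_period()` sketch (made-up data, not the article's example), where the sliding happens over observed months rather than rows or index values:

```r
library(rsample)

# Daily data covering several months
dat <- data.frame(date = as.Date("2025-01-01") + 0:149, y = rnorm(150))

resamples <- sliding_period(
  dat,
  index = date,
  period = "month",
  lookback = 1,      # analysis set: the current month plus the month before
  assess_start = 2,  # leave a gap of one month
  assess_stop = 2    # assessment set: one month
)
```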
