Description
I often run into data-modeling situations where the existing rsample functions for setting up a proper test/train or assessment/analysis split do not suffice.
Example: a multivariate regression problem in which most observations of the numeric predictors are concentrated around one central region and only a few lie farther out, while the goal is to learn from all of the data, especially the effects that appear when moving outside that dense central region.
With a random test/train or assessment/analysis split, or even with some univariate stratification, there is a high risk that the model only learns the effect in the center. The risk of getting inconsistent model performance results is also higher.
I suggest adding functionality to rsample with extended sampling capabilities for these cases, i.e. methods that ensure maximum coverage of the data space in both the test and train sets (or the assessment and analysis sets).
This problem is addressed by calibration sampling methods.
Some of them are described here:
https://cran.r-project.org/web/packages/prospectr/vignettes/prospectr.html#duplex-duplex
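To make the idea concrete, here is a minimal sketch of a DUPLEX-style split (the method from Snee, 1977, that the link above points to). It is written in plain Python purely as an illustration; it is not the prospectr implementation and not a proposed rsample API. The function name `duplex_split`, its arguments, and the toy data are all hypothetical, and plain Euclidean distance on unscaled predictors is an assumption made for brevity.

```python
import math


def duplex_split(points, n_test):
    """Illustrative DUPLEX-style split; returns (train_idx, test_idx).

    Assumes Euclidean distance on the raw predictors. The name and
    signature are hypothetical, chosen only for this sketch.
    """
    n = len(points)
    if n < 4 or not 2 <= n_test <= n - 2:
        raise ValueError("need at least 4 points and 2 <= n_test <= n - 2")
    n_train = n - n_test
    remaining = set(range(n))

    def dist(i, j):
        return math.dist(points[i], points[j])

    def farthest_pair(candidates):
        # The two candidates that are farthest apart from each other.
        cand = sorted(candidates)
        pairs = ((i, j) for a, i in enumerate(cand) for j in cand[a + 1:])
        return max(pairs, key=lambda p: dist(*p))

    def take_farthest_from(selected):
        # Remaining point whose nearest already-selected point is farthest away.
        best = max(sorted(remaining),
                   key=lambda i: min(dist(i, j) for j in selected))
        remaining.remove(best)
        return best

    train, test = [], []
    # Seed each set with the farthest pair among the points still available.
    for subset in (train, test):
        i, j = farthest_pair(remaining)
        subset.extend([i, j])
        remaining -= {i, j}

    # Alternate between the sets until each has reached its target size.
    while remaining:
        if len(train) < n_train:
            train.append(take_farthest_from(train))
        if remaining and len(test) < n_test:
            test.append(take_farthest_from(test))
    return train, test


# Toy data: a dense cluster near the origin plus four distant observations.
points = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (0.05, 0.05),
          (5, 5), (-5, -5), (4, -4), (-4, 4)]
train_idx, test_idx = duplex_split(points, n_test=3)
```

In this toy example the two most extreme points seed the train set and the next most extreme pair seeds the test set, so both sets reach into the sparse outer region instead of being drawn almost entirely from the dense center.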
Literature:
- Snee, R. D. (1977). Validation of regression models: methods and examples. Technometrics, 19(4), 415-428.
- https://delwende.github.io/thesis-final.pdf#page12
- https://www.scielo.br/j/cr/a/9SCp8CFXPRVtWgZHCxnSGwj/?format=html&lang=en