Add order statistic diagnostic for compare() #237
base: main
Conversation
```python
if candidate_sd == 0 or not np.isfinite(candidate_sd):
    warnings.warn(
        "All models have nearly identical performance.",
        UserWarning,
    )
    return
```
Not sure if we want to add more to this warning? It's mostly there to avoid the SciPy runtime error that happens when sd = 0. Probably not very realistic in practice, but theoretically possible.
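For context, here is a minimal, hypothetical illustration of the degenerate case this guard handles; the variable name candidate_sd mirrors the diff, but everything around it is an assumed stand-in, not the actual compare() internals:

```python
import numpy as np
from scipy import stats

# Assumed stand-in, not the compare() internals: if every candidate model's
# ELPD differences are identical, their standard deviation is zero, and a
# normal null distribution with scale=0 cannot be evaluated sensibly, so the
# guard bails out with a warning instead of calling SciPy with that scale.
elpd_diffs = np.zeros(5)           # all candidate models tie exactly
candidate_sd = elpd_diffs.std()    # -> 0.0

if candidate_sd == 0 or not np.isfinite(candidate_sd):
    print("skipping diagnostic: degenerate null distribution")
else:
    tail_prob = stats.norm.sf(elpd_diffs.max(), scale=candidate_sd)
```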
Codecov Report
❌ Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
+ Coverage   84.72%   84.78%   +0.05%
==========================================
  Files          41       41
  Lines        4950     4974      +24
==========================================
+ Hits         4194     4217      +23
- Misses        756      757       +1
Documentation build overview
Files changed (12 in total): 📝 12 modified | ➕ 0 added | ➖ 0 deleted
This implements the order statistic diagnostic for detecting selection-induced bias when comparing many models in compare(). When more than 11 models are compared using full PSIS-LOO-CV, the diagnostic estimates whether the observed performance difference could be due to chance by comparing the best model's ELPD difference against the expected maximum order statistic under a null hypothesis of equal performance.

The diagnostic is not applied in the subsampling case. As far as I can tell, the R implementation skips it there as well (https://github.com/stan-dev/loo/blob/d6fe380161fcd3ba07065ce0a525146abdb2c1d7/R/loo_compare.psis_loo_ss_list.R). I'm not exactly sure why, but my guess is that the theoretical assumptions underlying the test don't account for the additional variance and approximation bias introduced by subsampling, making it unclear whether the null distribution would be properly calibrated in that setting.
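To make the mechanics concrete, here is a minimal sketch of that check using a Monte Carlo estimate of the expected maximum order statistic; the function names and the simulation approach are illustrative assumptions, not the implementation in this PR:

```python
import numpy as np


def expected_max_order_statistic(n_models, diff_sd, n_sims=10_000, seed=0):
    """Monte Carlo estimate of E[max] of (n_models - 1) iid N(0, diff_sd) draws.

    Under the null of equal predictive performance, the ELPD differences of the
    non-best models behave approximately like mean-zero normal noise.
    """
    rng = np.random.default_rng(seed)
    draws = rng.normal(0.0, diff_sd, size=(n_sims, n_models - 1))
    return draws.max(axis=1).mean()


def selection_bias_plausible(best_elpd_diff, n_models, diff_sd):
    """True when the best model's observed advantage is no larger than the
    advantage that picking the best of n_models equal models would produce."""
    return abs(best_elpd_diff) <= expected_max_order_statistic(n_models, diff_sd)


# Example: with 15 models, an observed best ELPD difference of 2.1 (sd 1.5)
# is within what chance alone would predict, so the advantage is suspect.
print(selection_bias_plausible(best_elpd_diff=2.1, n_models=15, diff_sd=1.5))
```

The point of the sketch is only the shape of the comparison: the observed best-model difference is weighed against what selecting the "best" of many equally good models would yield by chance.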
Resolves #234