@jordandeklerk jordandeklerk commented Oct 30, 2025

This implements the order statistic diagnostic for detecting selection-induced bias when comparing many models in compare(). When more than 11 models are compared using full PSIS-LOO-CV, the diagnostic estimates whether the observed performance difference could be due to chance: it compares the best model's ELPD difference against the expected maximum order statistic under a null hypothesis of equal performance.
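For readers following along, here is a minimal sketch of the comparison described above. It is not the code in this PR. The function names are hypothetical, and the quantile heuristic (taking the 1 - 1/(2K) quantile of a zero-mean normal as the expected maximum of K null differences) is an assumption about how the null threshold can be approximated. It uses only the standard library.

```python
from statistics import NormalDist


def expected_max_under_null(n_candidates, candidate_sd):
    """Approximate the largest of n_candidates ELPD differences under the null.

    Assumed heuristic: the 1 - 1/(2K) quantile of N(0, candidate_sd),
    where K is the number of candidate models.
    """
    p = 1.0 - 1.0 / (2.0 * n_candidates)
    return NormalDist(mu=0.0, sigma=candidate_sd).inv_cdf(p)


def selection_bias_suspected(best_diff, n_candidates, candidate_sd):
    """Return True when the best difference is no larger than chance alone could produce."""
    return best_diff <= expected_max_under_null(n_candidates, candidate_sd)
```

With a unit spread and 11 candidates, a best difference of 1.0 would be flagged while a best difference of 5.0 would not. The threshold scales linearly with `candidate_sd`, so a noisier set of models needs a larger winning margin before the result is trusted.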

This diagnostic is not run in the subsampling case. As far as I can tell, the loo R package doesn't run it there either: https://github.com/stan-dev/loo/blob/d6fe380161fcd3ba07065ce0a525146abdb2c1d7/R/loo_compare.psis_loo_ss_list.R. I'm not exactly sure why, but my guess is that the theoretical assumptions underlying the test don't account for the additional variance and approximation bias introduced by subsampling, so it is unclear whether the null distribution would be properly calibrated in that setting.


Resolves #234

Comment on lines +584 to +589
if candidate_sd == 0 or not np.isfinite(candidate_sd):
    warnings.warn(
        "All models have nearly identical performance.",
        UserWarning,
    )
    return

@jordandeklerk jordandeklerk Oct 30, 2025


Not sure if we want to add more to this warning? This is mostly to avoid the SciPy runtime error that occurs when sd=0. That's probably not very realistic in practice, but it is theoretically possible.
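One way to exercise this guard in a test is sketched below. This is a standalone illustration, not the PR's code: `safe_candidate_sd` is a hypothetical helper, the root-mean-square spread estimate is an assumption, and `math.isfinite` stands in for `np.isfinite` so the snippet needs only the standard library.

```python
import math
import warnings


def safe_candidate_sd(elpd_diffs):
    """Return the spread of the ELPD differences, or None if it is unusable."""
    # Assumption: root-mean-square of the differences as the spread estimate.
    sd = math.sqrt(sum(d * d for d in elpd_diffs) / len(elpd_diffs))
    if sd == 0 or not math.isfinite(sd):
        warnings.warn(
            "All models have nearly identical performance.",
            UserWarning,
        )
        return None
    return sd


# Identical models give a zero spread, so the guard should warn and bail out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = safe_candidate_sd([0.0, 0.0, 0.0])
```

After this runs, `result` is `None` and `caught` holds a single `UserWarning`. A non-finite input such as `float("inf")` takes the same branch.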

@codecov-commenter

codecov-commenter commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 96.66667% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.78%. Comparing base (d6c4f58) to head (7c9560e).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/arviz_stats/loo/compare.py 96.66% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
+ Coverage   84.72%   84.78%   +0.05%     
==========================================
  Files          41       41              
  Lines        4950     4974      +24     
==========================================
+ Hits         4194     4217      +23     
- Misses        756      757       +1     


@read-the-docs-community

read-the-docs-community bot commented Oct 30, 2025

Documentation build overview

📚 arviz-stats | 🛠️ Build #30151928 | 📁 Comparing 7c9560e against latest (85cca24)



Show files changed (12 files in total): 📝 12 modified | ➕ 0 added | ➖ 0 deleted
File Status
_modules/arviz_stats/sampling_diagnostics.html 📝 modified
api/generated/arviz_stats.compare.html 📝 modified
api/generated/arviz_stats.eti.html 📝 modified
api/generated/arviz_stats.hdi.html 📝 modified
api/generated/arviz_stats.histogram.html 📝 modified
api/generated/arviz_stats.kde.html 📝 modified
api/generated/arviz_stats.loo_kfold.html 📝 modified
api/generated/arviz_stats.mode.html 📝 modified
api/generated/arviz_stats.qds.html 📝 modified
api/generated/arviz_stats.rhat.html 📝 modified
_modules/arviz_stats/loo/compare.html 📝 modified
_modules/arviz_stats/loo/loo_moment_match.html 📝 modified

@jordandeklerk jordandeklerk marked this pull request as ready for review October 30, 2025 19:48
Development

Successfully merging this pull request may close these issues.

Add order statistic check for compare()
