
Conversation

@jordandeklerk (Member) commented Oct 30, 2025

This implements the order statistic diagnostic for detecting selection-induced bias when comparing many models in compare(). When more than 11 models are compared using full PSIS-LOO-CV, the diagnostic estimates whether the observed performance difference could be due to chance by comparing the best model's ELPD difference against the expected maximum order statistic under a null hypothesis of equal performance.

This diagnostic is not applied in the subsampling case. As far as I can tell, it isn't done in loo either: https://github.com/stan-dev/loo/blob/d6fe380161fcd3ba07065ce0a525146abdb2c1d7/R/loo_compare.psis_loo_ss_list.R. I'm not exactly sure why, but my guess is that the theoretical assumptions underlying the test don't account for the additional variance and approximation bias introduced by subsampling, making it unclear whether the null distribution would be properly calibrated in that setting.
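
For intuition, here is a minimal sketch of the idea. This is not the actual arviz_stats implementation: the function name `order_statistic_check`, its signature, the Monte Carlo approximation of the expected maximum, and the example numbers are all illustrative assumptions.

```python
import numpy as np


def order_statistic_check(elpd_diff, se_diff, n_models, n_draws=10_000, seed=None):
    """Compare the best model's standardized ELPD difference to the expected
    maximum order statistic under a null of equal predictive performance.

    If all models perform equally well, the standardized difference of the
    "best" model behaves roughly like the maximum of (n_models - 1) standard
    normal draws, so an observed difference at or below that expectation may
    simply reflect selection-induced noise.
    """
    rng = np.random.default_rng(seed)
    # Monte Carlo estimate of E[max of (n_models - 1) iid standard normals].
    null_max = rng.standard_normal((n_draws, n_models - 1)).max(axis=1)
    expected_max = null_max.mean()
    observed = elpd_diff / se_diff
    return observed, expected_max, observed <= expected_max


# Illustrative numbers: 12 models, best model ahead by 2.5 ELPD with SE 1.8.
obs, exp_max, possibly_chance = order_statistic_check(2.5, 1.8, 12)
print(f"observed z = {obs:.2f}, expected max under null = {exp_max:.2f}, "
      f"difference possibly due to chance: {possibly_chance}")
```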


Resolves #234

@codecov-commenter commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.77%. Comparing base (d6c4f58) to head (685157b).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/arviz_stats/loo/compare.py | 96.55% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
+ Coverage   84.72%   84.77%   +0.05%     
==========================================
  Files          41       41              
  Lines        4950     4973      +23     
==========================================
+ Hits         4194     4216      +22     
- Misses        756      757       +1     


@read-the-docs-community bot commented Oct 30, 2025

@jordandeklerk jordandeklerk marked this pull request as ready for review October 30, 2025 19:48
@jordandeklerk (Member, Author) commented:

Removed a test that is essentially testing the same thing as the first order statistic test. With the centered_eight data we only have 8 observations, so it's a little challenging to compare more than 11 models without raising the warning from the diagnostic, because the ELPDs are typically all quite close.

@jordandeklerk jordandeklerk merged commit a7d8868 into main Nov 4, 2025
11 checks passed
@jordandeklerk jordandeklerk deleted the stat-check branch November 4, 2025 19:04

Development

Successfully merging this pull request may close these issues:
• Add order statistic check for compare()
