
Conversation

@jordandeklerk (Member) commented Oct 30, 2025

This implements the order statistic diagnostic for detecting selection-induced bias when comparing many models in compare(). When more than 11 models are compared using full PSIS-LOO-CV, the diagnostic estimates whether the observed performance difference could be due to chance by comparing the best model's ELPD difference against the expected maximum order statistic under a null hypothesis of equal performance.

This diagnostic is not applied in the subsampling case. As far as I can tell, it isn't done in loo either: https://github.com/stan-dev/loo/blob/d6fe380161fcd3ba07065ce0a525146abdb2c1d7/R/loo_compare.psis_loo_ss_list.R. I'm not exactly sure why, but my guess is that the theoretical assumptions underlying the test don't account for the additional variance and approximation bias introduced by subsampling, making it unclear whether the null distribution would be properly calibrated in that setting.
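
For intuition, here is a minimal sketch of the idea. This is not the actual arviz_stats implementation: the function name `order_statistic_check`, its signature, the Monte Carlo approximation of the expected maximum, and the example numbers are all illustrative assumptions.

```python
import numpy as np


def order_statistic_check(elpd_diff, se_diff, n_models, n_draws=10_000, seed=None):
    """Compare the best model's standardized ELPD difference to the expected
    maximum order statistic under a null of equal predictive performance.

    If all models perform equally well, the standardized difference of the
    "best" model behaves roughly like the maximum of (n_models - 1) standard
    normal draws, so an observed difference at or below that expectation may
    simply reflect selection-induced noise.
    """
    rng = np.random.default_rng(seed)
    # Monte Carlo estimate of E[max of (n_models - 1) iid standard normals].
    null_max = rng.standard_normal((n_draws, n_models - 1)).max(axis=1)
    expected_max = null_max.mean()
    observed = elpd_diff / se_diff
    return observed, expected_max, observed <= expected_max


# Illustrative numbers: 12 models, best model ahead by 2.5 ELPD with SE 1.8.
obs, exp_max, possibly_chance = order_statistic_check(2.5, 1.8, 12)
print(f"observed z = {obs:.2f}, expected max under null = {exp_max:.2f}, "
      f"difference possibly due to chance: {possibly_chance}")
```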


Resolves #234

@codecov-commenter commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 96.55172% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.77%. Comparing base (d6c4f58) to head (685157b).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/arviz_stats/loo/compare.py | 96.55% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
+ Coverage   84.72%   84.77%   +0.05%     
==========================================
  Files          41       41              
  Lines        4950     4973      +23     
==========================================
+ Hits         4194     4216      +22     
- Misses        756      757       +1     


@read-the-docs-community bot commented Oct 30, 2025

@jordandeklerk jordandeklerk marked this pull request as ready for review October 30, 2025 19:48
@jordandeklerk (Member, Author) commented:

Removed a test that is essentially testing the same thing as the first order statistic test. With the centered_eight data we only have 8 observations, so it's a little challenging to compare more than 11 models without raising the warning from the diagnostic, because the ELPDs are typically all quite close.

@jordandeklerk jordandeklerk merged commit a7d8868 into main Nov 4, 2025
11 checks passed
@jordandeklerk jordandeklerk deleted the stat-check branch November 4, 2025 19:04

Development

Successfully merging this pull request may close these issues:
• Add order statistic check for compare()
