Evaluation, Reproducibility, Benchmarks Meeting 40
Date: 28th January, 2026

Attendees:
- Annika
- Olivier
- Carole
- Nick
- Carole has a conflict; maybe we should move to March 4th? (done)
- Nick will have many conflicts moving forward. When more people are present, we should choose a new member for the secretary role
- Paper is submitted!
- Carole had some comments. If it goes to revisions, those can be integrated
- The thinking is that it could be integrated into Metrics Reloaded
- This is also a good stage for warnings to be presented to the user
- Metrics Reloaded does not depend on scikit-learn, but SciPy and (we think) statsmodels are already imported
- In contrast with Metrics Reloaded, the CI project doesn't really give recommendations (yet)
- Maybe we should think of it more as just elucidating "risks" of poor statistical power etc. for the user
- Can also have task-specific "defaults" that are not exactly recommendations, but would nudge the user in that direction
- One of the main findings is that with very low sample sizes, parametric methods work a bit better. This could be an example of a branching workflow in the implementation
- A warning/default here
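
A rough sketch of what such a branching default could look like. The sample-size threshold and the choice of a t-interval here are invented for illustration; the real cutoff and methods would come from the paper's findings:

```python
import numpy as np
from scipy import stats

# Hypothetical cutoff below which the parametric default kicks in; the real
# threshold would come from the paper's simulation results.
SMALL_SAMPLE_THRESHOLD = 30

def mean_ci(scores, confidence_level=0.95):
    """Branching default: t-interval for small n, BCa bootstrap otherwise."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    if n < SMALL_SAMPLE_THRESHOLD:
        # Surface the "default" as a warning so the user knows a choice was made
        print(f"warning: n={n} is small; using a parametric (t) interval")
        return stats.t.interval(confidence_level, n - 1,
                                loc=scores.mean(), scale=stats.sem(scores))
    res = stats.bootstrap((scores,), np.mean,
                          confidence_level=confidence_level, method="BCa")
    return (res.confidence_interval.low, res.confidence_interval.high)
```
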
- Also check for micro- vs. macro-averaging
- A subtle difference but often overlooked
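
A toy example (invented numbers) of how far the two averages can drift apart under class imbalance, which is why the check matters:

```python
import numpy as np

# Toy multi-class labels with heavy imbalance: micro-averaging pools all
# decisions, macro-averaging weights each class equally
y_true = np.array([0] * 90 + [1] * 9 + [2] * 1)
y_pred = np.array([0] * 90 + [0] * 9 + [2] * 1)  # class 1 is always missed

classes = np.unique(y_true)
per_class_recall = np.array([
    np.mean(y_pred[y_true == c] == c) for c in classes
])

micro = np.mean(y_pred == y_true)  # pooled recall equals accuracy here: 0.91
macro = per_class_recall.mean()    # equal class weight: (1 + 0 + 1) / 3 = 0.67
print(f"micro-averaged recall: {micro:.2f}, macro-averaged recall: {macro:.2f}")
```
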
- There is a known issue with approximating average precision in finite/small samples
- Re. approximating an integral of a function that is not monotonic
- Could be a warning here, but no great solution
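
A small self-contained illustration of the issue, with toy labels and scores. Step-wise AP follows the usual sum-over-recall-levels definition; the trapezoidal version linearly interpolates the same non-monotonic curve, and the two visibly disagree at this sample size:

```python
import numpy as np

# Toy labels and scores; precision is not monotonic in recall, so how the
# area under the PR curve is approximated matters in small samples
y_true = np.array([1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.3, 0.1])

order = np.argsort(-scores)
tp = np.cumsum(y_true[order])
precision = tp / np.arange(1, tp.size + 1)
recall = tp / y_true.sum()

# Step-wise AP (the usual definition): precision at each new recall level
ap_step = np.sum(np.diff(np.concatenate(([0.0], recall))) * precision)

# Trapezoidal approximation of the same area, with the conventional
# (recall=0, precision=1) starting point prepended
r = np.concatenate(([0.0], recall))
p = np.concatenate(([1.0], precision))
ap_trap = np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2)

print(f"step-wise AP: {ap_step:.3f}, trapezoidal AP: {ap_trap:.3f}")
# ~0.833 vs ~0.792 here -- and neither is "the" answer in finite samples
```
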
- With small sample sizes, a distribution that is very narrow by chance can violate the i.i.d. assumption and lead to erroneously narrow CIs
- It is OK if we have to tell the user that computing the CI based on the data they provide is not possible -- or at least not possible to do well
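
A minimal demo of that failure mode, with invented numbers: five nearly identical per-case scores make a naive percentile bootstrap report an absurdly narrow interval:

```python
import numpy as np

rng = np.random.default_rng(0)

# Five per-case scores that happen, by chance, to be nearly identical
dice = np.array([0.81, 0.81, 0.82, 0.81, 0.82])

# Naive percentile bootstrap of the mean
boot_means = np.array([
    rng.choice(dice, size=dice.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: [{lo:.3f}, {hi:.3f}]")
# Roughly [0.810, 0.818] -- misleadingly narrow for n=5; a warning, or refusing
# to compute the CI at all, is the more honest output here
```
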
- Want to make sure we decouple Metrics Reloaded from the CI tool -- one can use both, but each can also be used independently
- Specifics
- For segmentation -- just CIs over the mean and median for now (sketch below); CIs over differences can be implemented later
- The complication is that the tool currently accepts just one CSV per model. Differences would require cross-matching these files and handling asymmetric missing data, column names, etc.
- There would also be many pairwise differences, so multiplicity may need to be handled
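
A sketch of the simple mean/median case using the SciPy implementation (see the Strategy notes below). The file name and the "dsc" column are assumptions; the actual one-CSV-per-model format is whatever the tool settles on:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical layout: one CSV per model, one row per case, with a per-case
# metric column; "dsc" is an assumed column name
dsc = pd.read_csv("model_a_scores.csv")["dsc"].to_numpy()

# CIs over the mean and the median of the per-case scores
for statistic in (np.mean, np.median):
    res = stats.bootstrap((dsc,), statistic, confidence_level=0.95,
                          method="percentile")
    ci = res.confidence_interval
    print(f"{statistic.__name__}: 95% CI [{ci.low:.3f}, {ci.high:.3f}]")

# CIs over *differences* between models would need a second CSV, a join on
# some case identifier (handling asymmetric missing data), and possibly a
# multiplicity correction across the many model pairs -- deferred for now
```
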
- Strategy
- If Olivier can share the code, then Carole can integrate these things into Metrics Reloaded
- The paper has associated code that's public, but documentation is somewhat lacking
- Most of the code was for simulation studies, which obviously will not be included
- Better to use the SciPy implementation rather than the custom implementation (both are tested/good, but no reason to reinvent the wheel)
- After this is implemented, it could be a feature of the Metrics Reloaded software paper
- The way we write this depends heavily on which journal we target
- JMLR is a well-respected journal that has a software section (4 pages only?)
- We did some brainstorming about which journal to target back in July of 2024
- Who are we trying to reach with this paper? Just the medical imaging community, or the machine learning community more generally?
- Could maybe send it to TMI or MedIA and see what happens? They have no software track, but they are familiar with the tooling
- If they don't want software papers, then the decision would (probably) come back fast