
Evaluation, Reproducibility, Benchmarks Meeting 40


Minutes of Meeting 40

Date: 28th January, 2026

Present

  • Annika
  • Olivier
  • Carole
  • Nick

Next Meeting

  • Carole has a conflict; we will move the next meeting to March 4th (done)
  • Nick will have many conflicts moving forward. When more people are present, we should choose a new member for the secretary role

Update From CI Project

  • Paper is submitted!
  • Carole had some comments. If it goes to revisions, those can be integrated

Brainstorming Session re. CI Project Implementation

  • The thinking is that it could be integrated into Metrics Reloaded
    • This would also be a good stage at which to present warnings to the user
    • Metrics Reloaded does not depend on scikit-learn, but SciPy and (we think) statsmodels are already imported
  • In contrast with Metrics Reloaded, the CI project doesn't really give recommendations (yet)
    • Maybe we think of it more as just elucidating "risks" of poor statistical power etc. for the user
    • Can also have task-specific "defaults" that are not exactly recommendations, but would nudge the user in that direction (see the sketch below)
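A minimal sketch of what such risk warnings and task-specific defaults could look like. All names here (`CIConfig`, `check_sample_size`, the threshold of 30) are hypothetical illustrations, not part of the Metrics Reloaded API:

```python
import warnings
from dataclasses import dataclass

@dataclass
class CIConfig:
    """Hypothetical container for task-specific defaults (names are illustrative)."""
    task: str = "segmentation"   # assumed task label
    method: str = "bootstrap"    # default CI method; could differ per task
    min_n_warn: int = 30         # assumed threshold below which we warn

def check_sample_size(n: int, config: CIConfig) -> None:
    """Surface a 'risk' warning rather than a hard recommendation."""
    if n < config.min_n_warn:
        warnings.warn(
            f"Only {n} samples for task '{config.task}': statistical power may "
            f"be low and '{config.method}' CIs may be unreliable.",
            UserWarning,
        )
```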
  • One of the main findings is that with very low sample sizes, parametric methods work a bit better. This could be an example of a branching workflow in the implementation
    • A warning/default here (see the sketch below)
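A sketch of the branching idea, assuming a 1-D array of per-case scores: fall back to a parametric t-interval below an (assumed) sample-size threshold and warn the user, otherwise use a percentile bootstrap. The threshold of 10 and the bootstrap settings are illustrative, not values from the paper:

```python
import warnings

import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95, parametric_below=10, n_boot=10_000, seed=0):
    """Branching workflow sketch: parametric t-interval at very low n,
    percentile bootstrap otherwise. The threshold is an assumption."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    if n < parametric_below:
        warnings.warn(f"n={n} is very small; using a parametric t-interval "
                      "instead of the bootstrap.")
        return stats.t.interval(confidence, df=n - 1,
                                loc=scores.mean(), scale=stats.sem(scores))
    rng = np.random.default_rng(seed)
    boot_means = rng.choice(scores, size=(n_boot, n), replace=True).mean(axis=1)
    alpha = 1.0 - confidence
    return tuple(np.quantile(boot_means, [alpha / 2, 1 - alpha / 2]))
```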
  • Also check for micro- vs. macro-averaging
    • A subtle difference, but often overlooked (see the sketch below)
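A toy illustration of why the distinction matters, using hypothetical per-case counts: micro-averaging pools the counts before computing the metric, macro-averaging averages the per-case values, and the two can diverge noticeably:

```python
import numpy as np

# Hypothetical per-case counts for a binary metric such as precision.
tp = np.array([90, 1, 50])   # true positives per case
fp = np.array([10, 0, 50])   # false positives per case

micro_precision = tp.sum() / (tp.sum() + fp.sum())  # pool counts first: ~0.701
macro_precision = np.mean(tp / (tp + fp))           # average per-case values: 0.8

print(micro_precision, macro_precision)
```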
  • There is a known issue with approximating average precision in finite/small samples
    • Relates to approximating the integral of a function (the precision-recall curve) that is not monotonic
    • Could be a warning here, but there is no great solution (see the sketch below)
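A sketch of the issue on a tiny hypothetical sample: a step-wise sum and a trapezoidal rule over the same non-monotonic precision-recall points give noticeably different average-precision estimates, and how to anchor the curve at recall 0 is itself a judgment call:

```python
import numpy as np

def pr_points(y_true, scores):
    """Precision and recall at every score threshold, highest score first."""
    order = np.argsort(scores)[::-1]
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    return precision, recall

def ap_step(precision, recall):
    """Step-wise sum: precision weighted by recall increments (no interpolation)."""
    d_recall = np.diff(np.concatenate([[0.0], recall]))
    return float(np.sum(precision * d_recall))

def ap_trapezoid(precision, recall):
    """Trapezoidal rule, anchored at (recall=0, first precision) -- the anchor
    choice is itself an assumption when the curve is non-monotonic."""
    r = np.concatenate([[0.0], recall])
    p = np.concatenate([[precision[0]], precision])
    return float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2))

# Five hypothetical cases are already enough for the estimates to diverge.
y_true = np.array([1, 0, 1, 0, 0])
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.2])
p, r = pr_points(y_true, scores)
print(ap_step(p, r), ap_trapezoid(p, r))  # ~0.833 vs ~0.792
```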
  • With small sample sizes, distributions that are very narrow by chance can violate the i.i.d. assumption and lead to erroneously narrow CIs (see the simulation sketch below)
  • It is OK if we have to tell the user that computing a CI from the data they provide is not possible -- or at least not possible to do well
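A small simulation illustrating the risk (the distribution, sample size, and trial counts are arbitrary assumptions): at n = 5, a nominal 95% percentile-bootstrap CI for the mean covers the true mean noticeably less often than 95% of the time:

```python
import numpy as np

# Coverage check for a 95% percentile-bootstrap CI of the mean at tiny n.
rng = np.random.default_rng(0)
n, n_trials, n_boot = 5, 2000, 1000
true_mean = 0.0
covered = 0
for _ in range(n_trials):
    sample = rng.normal(true_mean, 1.0, size=n)
    boot_means = rng.choice(sample, size=(n_boot, n), replace=True).mean(axis=1)
    lo, hi = np.quantile(boot_means, [0.025, 0.975])
    covered += (lo <= true_mean <= hi)

print(covered / n_trials)  # typically well below the nominal 0.95 at n=5
```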
  • Want to make sure we decouple Metrics Reloaded from the CI tool -- the two can be used together, but each can also be used independently
  • Specifics
    • For segmentation: just CIs over the mean and median for now (see the sketch below); CIs over differences can be implemented later
      • Complication: the tool currently accepts just one CSV per model, so we would need to cross-match files and handle asymmetric missing data, column names, etc.
      • Also open: with so many pairwise differences, we may need to handle multiplicity etc.
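A sketch of the segmentation case using `scipy.stats.bootstrap` (consistent with the point below about preferring the SciPy implementation over custom code). The file name and the per-case "dice" column are assumptions about the CSV layout:

```python
import numpy as np
import pandas as pd
from scipy.stats import bootstrap

# Hypothetical input: one CSV per model, one row per case, with a per-case
# metric column named "dice" (names are illustrative only).
scores = pd.read_csv("model_a.csv")["dice"].to_numpy()

for stat in (np.mean, np.median):
    res = bootstrap(
        (scores,), stat,
        confidence_level=0.95,
        n_resamples=9999,
        method="percentile",  # SciPy's default is the BCa interval
        random_state=np.random.default_rng(0),
    )
    print(stat.__name__, res.confidence_interval)
```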
  • Strategy
    • If Olivier can share the code, then Carole can integrate these things into Metrics Reloaded
      • The paper has associated code that's public, but documentation is somewhat lacking
      • Most of the code was for simulation studies, which obviously will not be included
      • Better to use the SciPy implementation rather than the custom implementation (both are tested/good, but no reason to reinvent the wheel)
  • After this is implemented, it could be a feature of the Metrics Reloaded software paper
    • The way we write this depends heavily on which journal we target
    • JMLR is a well-respected journal that has a software section (4 pages only?)
    • We did some brainstorming about which journal to target back in July of 2024
    • Who are we trying to reach with this paper? Just medical imaging community, or machine learning community more generally?
      • Could maybe send to TMI or MedIA and see what happens? No software track, but they are familiar with the tooling
        • If they don't want software papers, then it would be fast (probably)
