
Evaluation, Reproducibility, Benchmarks Meeting 39


Minutes of Meeting 39

Date: 26th November, 2025

Present

  • Rucha
  • Annika
  • Olivier
  • Carole
  • Nick
  • Michela

Update From CI Project

  • Olivier anticipates that the paper will be ready for feedback/submission by the next meeting
  • Next steps
    • Something community-oriented to define guidelines?
    • Link to MONAI implementation
    • Ideally sometime around January (the next meeting will be in January, since the December meeting falls on Christmas Eve)
  • In the implementation, perhaps allow users to choose whichever approach they prefer, but raise warnings when the choice might not be the optimal one
  • Olivier to create a brainstorming document and share it with the group (a few days before the next meeting) so that we don't start from scratch for our January meeting

Update From Data Licensing Project

  • Hoping to submit the paper before the end of this year (might not be realistic)

Update From Updated BIAS Guidelines Project

  • Survey results have come in, but there hasn't been time yet to dig into them -- plan to talk about this in detail in February

Update From Benchmark Dataset Project

  • Goal is to come up with a sort of "identity card" for each benchmark dataset that is out there, containing relevant/useful information
  • Michela presented some progress on organizing datasets and certain metadata from each one
  • Could potentially assign a quality score to each dataset within a collection that could be endorsed or even hosted by a MONAI mirror
  • Aggregate datasets for broad benchmarking are becoming much more common now
  • Michela to create and share a Word document with progress so far, and Carole will add some relevant references that she has found recently
  • LLMs could be useful to extract this information automatically on a large scale
  • Could follow up on this in January
  • Croissant format could be an interesting thing to look at (from Google, a metadata format for ML-ready datasets)
  • For each item, it would be nice to include some sort of justification/notes for why certain characteristics are important
    • E.g., if you want to look at label uncertainty, then you need to look at multiple annotations per case, etc.
  • Upstream of the above, it would be good to define the use cases that the criteria are meant to support
  • Generative tasks might be a good idea to include here as well. There's a dearth of benchmark datasets in this space, but it's an important emerging area
  • Would it be possible to probe dataset and case "representativeness", e.g., with something like an outlier score applied to the segmentation mask/object shape? (A rough sketch follows this list.)
    • Relevant because datasets often have artificially boosted prevalence of certain features/classes for practicality purposes
    • Could be much more general than just the shape; can look at demographics etc.
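
As a rough illustration of the outlier-score idea above (not something the group has agreed on), the Python sketch below scores each case by how atypical its segmentation-mask shape is within a collection. The feature set (log volume, elongation, bounding-box fill) and the use of scikit-learn's IsolationForest are placeholder assumptions; demographic attributes could be appended as extra feature columns in the same way.

```python
# Minimal sketch: per-case "representativeness" via an outlier score on simple
# shape descriptors of binary segmentation masks. Assumes non-empty 3D boolean
# numpy arrays; the feature choice and scoring method are placeholders only.
import numpy as np
from sklearn.ensemble import IsolationForest


def shape_features(mask: np.ndarray) -> np.ndarray:
    """Return a small vector of shape descriptors for one binary mask."""
    coords = np.argwhere(mask)                      # voxel coordinates of the object
    volume = len(coords)
    centroid = coords.mean(axis=0)
    # Principal-axis spread captures elongation/flatness of the object.
    cov = np.cov((coords - centroid).T)
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    elongation = np.sqrt(eigvals[0] / max(eigvals[-1], 1e-8))
    bbox_extent = coords.max(axis=0) - coords.min(axis=0) + 1
    bbox_fill = volume / np.prod(bbox_extent)       # fraction of the bounding box filled
    return np.array([np.log(max(volume, 1)), elongation, bbox_fill])


def outlier_scores(masks: list[np.ndarray]) -> np.ndarray:
    """Higher score = more atypical case within the collection."""
    feats = np.stack([shape_features(m) for m in masks])
    forest = IsolationForest(random_state=0).fit(feats)
    return -forest.score_samples(feats)             # negate so larger means more outlying
```

Running `outlier_scores` over all masks in a dataset would give one number per case; the distribution of these scores could serve as a first, crude probe of how skewed a collection is toward over-represented shapes.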

Agenda for Future Meetings

  • Brainstorm next steps for CI project -- led by Olivier (January)
  • Talk about BIAS survey results -- led by Annika (February)
  • Talk more about quantifying representativeness in either January or February
