Evaluation, Reproducibility, Benchmarks Meeting 39
Date: 26th November, 2025
Attendees:
- Rucha
- Annika
- Olivier
- Carole
- Nick
- Michela
- Olivier anticipates that the paper will be ready for feedback/submission by the next meeting
- Next steps
- Something community-oriented to define guidelines?
- Link to MONAI implementation
- Ideally sometime around January (the next meeting will be in January, since the December meeting falls on Christmas Eve)
- In the implementation, perhaps allowing users to choose whichever approach they prefer but raising warnings when the choice might not be the optimal one (see the sketch after this section)
- Olivier to create a brainstorming document and share it with the group (a few days before the next meeting) so that we don't start from scratch for our January meeting
- Hoping to submit the paper before the end of this year (might not be realistic)
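
As a rough illustration of the "allow any choice, but warn when it may be suboptimal" idea mentioned above, here is a minimal Python sketch. This is not MONAI code: the function name, its arguments, and the specific empty-reference check are all hypothetical placeholders.

```python
import warnings

import numpy as np


def aggregate_dice(per_case_scores, method="mean", has_empty_reference=None):
    """Aggregate per-case Dice scores with a user-chosen method (hypothetical sketch).

    The user keeps full control over the aggregation choice, but a warning is
    raised when that choice may be suboptimal for the data at hand.
    """
    scores = np.asarray(per_case_scores, dtype=float)

    # Example check: plain averaging can be misleading when some cases have an
    # empty reference mask, so warn (rather than error) in that situation.
    if method == "mean" and has_empty_reference is not None and np.any(has_empty_reference):
        warnings.warn(
            "Some cases have an empty reference mask; a plain mean may be "
            "misleading. Consider reporting those cases separately.",
            UserWarning,
        )

    if method == "mean":
        return float(np.nanmean(scores))
    if method == "median":
        return float(np.nanmedian(scores))
    raise ValueError(f"Unknown aggregation method: {method}")
```
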
- Survey results came in, but there hasn't been time yet to dig into them -- plan to discuss them in detail in February
- Goal is to come up with a sort of "identity card" for each benchmark dataset that is out there, collecting relevant/useful information (a rough sketch of such a card appears after the dataset discussion below)
- Michela presented some progress on organizing datasets and certain metadata from each one
- Could potentially assign a quality score to each dataset within a collection that could be endorsed or even hosted by a MONAI mirror
- Aggregate datasets for broad benchmarking are becoming much more common now
- Michela to create and share a Word document with progress so far, and Carole will add some recent references she has found that are relevant to this topic
- LLMs could be useful to extract this information automatically on a large scale
- Could follow-up on this in January
- Croissant format could be an interesting thing to look at (from Google, a metadata format for ML-ready datasets)
- For each item, it would be nice to include some sort of justification/notes for why certain characteristics are important
- E.g., if you want to look at label uncertainty, then you need to look at multiple annotations per case, etc.
- Upstream of the above, it would be good to define the use cases that the criteria are meant to support
- Generative tasks might be a good idea to include here as well. There's a dearth of benchmark datasets in this space, but it's an important emerging area
- Would it be possible to probe dataset- and case-level "representativeness", e.g., with something like an outlier score applied to the segmentation mask/object shape? (see the sketch after this section)
- Relevant because datasets often have an artificially boosted prevalence of certain features/classes for practical reasons
- Could be much more general than just the shape; could also look at demographics, etc.
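
As a rough sketch of what such a dataset "identity card" could capture, here is an illustrative Python dataclass. Every field name is a placeholder drawn from the points above (multiple annotations per case, demographics, a possible quality score, justification notes), not a settled schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetIdentityCard:
    """Illustrative only; the actual criteria and use cases are still to be defined."""
    name: str
    modality: str                           # e.g., "CT", "MRI", "histopathology"
    task: str                               # e.g., "segmentation", "classification", "generation"
    num_cases: int
    annotations_per_case: int               # >1 is needed for studying label uncertainty
    demographics_reported: bool             # relevant for probing representativeness
    license: Optional[str] = None
    quality_score: Optional[float] = None   # possible per-dataset score within a collection
    notes: str = ""                         # justification for why certain characteristics matter
```

For example, the point above that label-uncertainty studies require multiple annotations per case would translate into a simple filter such as `annotations_per_case > 1`.
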
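And here is a minimal sketch of the outlier-score idea for probing case "representativeness", using only a few simple shape descriptors of a binary segmentation mask. Both the feature choice and the z-score detector are assumptions; a real analysis could add demographics or other metadata and use a more robust detector.

```python
import numpy as np


def shape_features(mask):
    """Simple shape descriptors for one binary segmentation mask (any dimensionality)."""
    voxels = float(mask.sum())
    if voxels == 0:
        return np.array([0.0, 0.0, 0.0])
    coords = np.argwhere(mask)
    extent = coords.max(axis=0) - coords.min(axis=0) + 1   # bounding-box size per axis
    bbox_volume = float(np.prod(extent))
    compactness = voxels / bbox_volume                     # fill fraction of the bounding box
    return np.array([voxels, bbox_volume, compactness])


def outlier_scores(masks):
    """Per-case outlier score: largest absolute z-score across the shape features."""
    feats = np.stack([shape_features(m) for m in masks])
    mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-8  # avoid division by zero
    z = np.abs((feats - mu) / sigma)
    return z.max(axis=1)
```

Cases with a large score would be flagged for review as potentially over- or under-represented shapes, rather than excluded automatically.
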
Upcoming topics:
- Brainstorm next steps for the CI project -- led by Olivier (January)
- Talk about the BIAS survey results -- led by Annika (February)
- Talk more about quantifying representativeness in either January or February