Evaluation, Reproducibility, Benchmarks Meeting 39
Date: 26th November, 2025
Attendees:
- Rucha
- Annika
- Olivier
- Carole
- Nick
- Michela
- Olivier anticipates that the paper will be ready for feedback/submission by the next meeting
- Next steps
- Something community-oriented to define guidelines?
- Link to MONAI implementation
- Ideally sometime around January (the next meeting will be in January, since the December meeting falls on Christmas Eve)
- In the implementation, perhaps allowing users to choose whichever approach they prefer but raising warnings when the choice might not be the optimal one (see the sketch after this section)
- Olivier to create a brainstorming document and share it with the group (a few days before the next meeting) so that we don't start from scratch for our January meeting
- Hoping to submit the paper before the end of this year (might not be realistic)
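
As a rough illustration of the "allow any choice, but warn when it may be suboptimal" idea mentioned above, here is a minimal Python sketch. This is not MONAI code: the function name, its arguments, and the specific empty-reference check are all hypothetical placeholders.

```python
import warnings

import numpy as np


def aggregate_dice(per_case_scores, method="mean", has_empty_reference=None):
    """Aggregate per-case Dice scores with a user-chosen method (hypothetical sketch).

    The user keeps full control over the aggregation choice, but a warning is
    raised when that choice may be suboptimal for the data at hand.
    """
    scores = np.asarray(per_case_scores, dtype=float)

    # Example check: plain averaging can be misleading when some cases have an
    # empty reference mask, so warn (rather than error) in that situation.
    if method == "mean" and has_empty_reference is not None and np.any(has_empty_reference):
        warnings.warn(
            "Some cases have an empty reference mask; a plain mean may be "
            "misleading. Consider reporting those cases separately.",
            UserWarning,
        )

    if method == "mean":
        return float(np.nanmean(scores))
    if method == "median":
        return float(np.nanmedian(scores))
    raise ValueError(f"Unknown aggregation method: {method}")
```
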
- Survey results came in, but there hasn't been time yet to dig into them -- plan to discuss them in detail in February
- Goal is to come up with a sort of "identity card" for each benchmark dataset that is out there, collecting relevant/useful information (a rough sketch of such a card appears after the dataset discussion below)
- Michela presented some progress on organizing datasets and certain metadata from each one
- Could potentially assign a quality score to each dataset within a collection that could be endorsed or even hosted by a MONAI mirror
- Aggregate datasets for broad benchmarking are becoming much more common now
- Michela to create and share a Word document with progress so far, and Carole will add some recent references she has found that are relevant to this topic
- LLMs could be useful to extract this information automatically on a large scale
- Could follow-up on this in January
- Croissant format could be an interesting thing to look at (from Google, a metadata format for ML-ready datasets)
- For each item, it would be nice to include some sort of justification/notes for why certain characteristics are important
- E.g., if you want to look at label uncertainty, then you need to look at multiple annotations per case, etc.
- Upstream of the above, it would be good to define the use cases that the criteria are meant to support
- Generative tasks might be a good idea to include here as well. There's a dearth of benchmark datasets in this space, but it's an important emerging area
- Would it be possible to probe dataset- and case-level "representativeness", e.g., with something like an outlier score applied to the segmentation mask/object shape? (see the sketch after this section)
- Relevant because datasets often have an artificially boosted prevalence of certain features/classes for practical reasons
- Could be much more general than just the shape; could also look at demographics, etc.
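
As a rough sketch of what such a dataset "identity card" could capture, here is an illustrative Python dataclass. Every field name is a placeholder drawn from the points above (multiple annotations per case, demographics, a possible quality score, justification notes), not a settled schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetIdentityCard:
    """Illustrative only; the actual criteria and use cases are still to be defined."""
    name: str
    modality: str                           # e.g., "CT", "MRI", "histopathology"
    task: str                               # e.g., "segmentation", "classification", "generation"
    num_cases: int
    annotations_per_case: int               # >1 is needed for studying label uncertainty
    demographics_reported: bool             # relevant for probing representativeness
    license: Optional[str] = None
    quality_score: Optional[float] = None   # possible per-dataset score within a collection
    notes: str = ""                         # justification for why certain characteristics matter
```

For example, the point above that label-uncertainty studies require multiple annotations per case would translate into a simple filter such as `annotations_per_case > 1`.
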
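And here is a minimal sketch of the outlier-score idea for probing case "representativeness", using only a few simple shape descriptors of a binary segmentation mask. Both the feature choice and the z-score detector are assumptions; a real analysis could add demographics or other metadata and use a more robust detector.

```python
import numpy as np


def shape_features(mask):
    """Simple shape descriptors for one binary segmentation mask (any dimensionality)."""
    voxels = float(mask.sum())
    if voxels == 0:
        return np.array([0.0, 0.0, 0.0])
    coords = np.argwhere(mask)
    extent = coords.max(axis=0) - coords.min(axis=0) + 1   # bounding-box size per axis
    bbox_volume = float(np.prod(extent))
    compactness = voxels / bbox_volume                     # fill fraction of the bounding box
    return np.array([voxels, bbox_volume, compactness])


def outlier_scores(masks):
    """Per-case outlier score: largest absolute z-score across the shape features."""
    feats = np.stack([shape_features(m) for m in masks])
    mu, sigma = feats.mean(axis=0), feats.std(axis=0) + 1e-8  # avoid division by zero
    z = np.abs((feats - mu) / sigma)
    return z.max(axis=1)
```

Cases with a large score would be flagged for review as potentially over- or under-represented shapes, rather than excluded automatically.
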
Upcoming topics:
- Brainstorm next steps for the CI project -- led by Olivier (January)
- Talk about the BIAS survey results -- led by Annika (February)
- Talk more about quantifying representativeness in either January or February