Problem
Current benchmark datasets in tests/data/benchmarks/ have quality issues that limit their effectiveness for evaluation:
Known Gaps
- Non-existent HPO terms - Annotations reference HPO IDs that don't exist in the ontology
- Unrealistic language - Text is unrealistically polished and carefully written compared to real clinical notes
- Dataset-specific issues - The English PhenoBERT dataset exhibits both of the problems above
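The invalid-HPO-ID gap could be detected mechanically. Below is a minimal sketch of such a check; the `hpo_id` field name and the tiny hard-coded valid-ID set are illustrative assumptions, and a real audit would load the full ID set from the HPO ontology (e.g. hp.obo).

```python
import re

# Sketch: flag annotation HPO IDs that are malformed or absent from the
# ontology. The valid-ID set here is a stand-in; in practice it would be
# loaded from hp.obo. The "hpo_id" field name is an assumed schema.
HPO_ID_PATTERN = re.compile(r"^HP:\d{7}$")

def find_invalid_hpo_ids(annotations, valid_ids):
    """Return annotation IDs that are malformed or not in the ontology."""
    invalid = []
    for ann in annotations:
        hpo_id = ann.get("hpo_id", "")
        if not HPO_ID_PATTERN.match(hpo_id) or hpo_id not in valid_ids:
            invalid.append(hpo_id)
    return invalid

# Illustrative usage with made-up annotations and a tiny valid-ID set
valid = {"HP:0001250", "HP:0002119"}  # seizure, ventriculomegaly
anns = [{"hpo_id": "HP:0001250"}, {"hpo_id": "HP:9999999"}, {"hpo_id": "HP:123"}]
print(find_invalid_hpo_ids(anns, valid))  # ['HP:9999999', 'HP:123']
```

A format check alone is not enough: `HP:9999999` above is syntactically valid but does not exist, which is exactly the failure mode reported for the PhenoBERT annotations.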
Goal
Systematically evaluate existing benchmark datasets and document:
- Coverage gaps (missing languages, phenotypes, clinical scenarios)
- Quality issues (invalid HPO IDs, unrealistic text, annotation errors)
- Recommendations for dataset improvements or replacements
Datasets to Evaluate
- tests/data/benchmarks/en/phenobert/ - known issues with text quality and annotations
- Current German datasets (tiny_v1.json, 70cases_gemini_v1.json, 200cases_gemini_v1.json)
- Identify gaps for other languages (ES, FR, NL)
Related Issues
- feat(benchmark): Design and implement benchmark for full clinical text HPO extraction #17 - Full clinical text HPO extraction benchmarking
- feat(evaluation): Benchmark impact of different text chunking strategies #25 - Text chunking strategy benchmarking
Deliverables
- Quality audit report for each dataset
- List of invalid HPO term references
- Assessment of how closely each dataset's text matches real clinical notes
- Recommendations for dataset curation or replacement
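To make the audit reports comparable across datasets, each could be reduced to a few summary metrics. A minimal sketch follows; the case/annotation JSON schema shown is an assumption for illustration, not the repository's actual benchmark format.

```python
import json

# Hypothetical audit sketch: summarize a benchmark dataset's annotation
# quality. The schema (a list of cases, each with "text" and
# "annotations" carrying "hpo_id") is assumed for illustration.
def audit_dataset(cases, valid_ids):
    total = sum(len(c["annotations"]) for c in cases)
    invalid = [a["hpo_id"] for c in cases for a in c["annotations"]
               if a["hpo_id"] not in valid_ids]
    return {
        "cases": len(cases),
        "annotations": total,
        "invalid_hpo_ids": sorted(set(invalid)),
        "invalid_rate": round(len(invalid) / total, 3) if total else 0.0,
    }

cases = [
    {"text": "Patient with seizures.", "annotations": [{"hpo_id": "HP:0001250"}]},
    {"text": "Enlarged ventricles.", "annotations": [{"hpo_id": "HP:9999999"}]},
]
report = audit_dataset(cases, {"HP:0001250", "HP:0002119"})
print(json.dumps(report, indent=2))
```

Running the same summary over each of the datasets listed above would give a per-dataset baseline against which curation or replacement can be judged.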
Labels
research, benchmarking, data-quality