
Research: Evaluate existing benchmark datasets for quality and coverage gaps #130

@janpower

Problem

Current benchmark datasets in tests/data/benchmarks/ have quality issues that limit their effectiveness for evaluation:

Known Gaps

  1. Non-existent HPO terms - Annotations reference HPO IDs that don't exist in the ontology (see the validation sketch after this list)
  2. Unrealistic language - Text is noticeably more polished and grammatical than real clinical notes, which are typically terse, abbreviation-heavy, and inconsistently punctuated
  3. Dataset-specific issues - The English phenobert dataset exhibits both of the problems above
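
A minimal sketch of the invalid-ID check, assuming pronto is installed and hp.obo has been downloaded locally. Because the benchmark JSON schema isn't documented in this issue, the helper (invalid_hpo_ids is a hypothetical name) just takes a collection of IDs:

```python
import pronto

# Assumption: hp.obo downloaded from https://purl.obolibrary.org/obo/hp.obo
hpo = pronto.Ontology("hp.obo")

def invalid_hpo_ids(annotated_ids):
    """Return the subset of IDs that are missing from HPO or obsolete."""
    bad = set()
    for hpo_id in annotated_ids:
        try:
            if hpo[hpo_id].obsolete:
                bad.add(hpo_id)  # term exists but is deprecated
        except KeyError:
            bad.add(hpo_id)  # term is not in the ontology at all
    return bad
```

Obsolete terms are flagged alongside unknown IDs, since annotations pointing at deprecated terms are also evaluation noise.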

Goal

Systematically evaluate existing benchmark datasets and document:

  • Coverage gaps (missing languages, phenotypes, clinical scenarios)
  • Quality issues (invalid HPO IDs, unrealistic text, annotation errors)
  • Recommendations for dataset improvements or replacements

Datasets to Evaluate

  • tests/data/benchmarks/en/phenobert/ - Known issues with text quality and annotations
  • Current German datasets (tiny_v1.json, 70cases_gemini_v1.json, 200cases_gemini_v1.json)
  • Identify gaps for other languages (ES, FR, NL); a rough coverage scan is sketched below
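
As a starting point for the coverage audit, a rough scan could count cases per language directory. This sketch assumes each dataset file's top level is either a list of cases or a dict with a "cases" key, which is a guess at the actual schema:

```python
import json
from collections import Counter
from pathlib import Path

def count_cases(root="tests/data/benchmarks"):
    """Count benchmark cases per language directory (e.g. en, de)."""
    cases_per_lang = Counter()
    for path in Path(root).rglob("*.json"):
        lang = path.relative_to(root).parts[0]
        data = json.loads(path.read_text(encoding="utf-8"))
        n = len(data) if isinstance(data, list) else len(data.get("cases", []))
        cases_per_lang[lang] += n
    # Languages with few or no entries (e.g. es, fr, nl) are coverage gaps.
    return cases_per_lang
```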

Deliverables

  1. Quality audit report for each dataset
  2. List of invalid HPO term references
  3. Assessment of text realism vs. real clinical notes (see the heuristic sketch after this list)
  4. Recommendations for dataset curation or replacement
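
For the realism assessment, one hedged starting point is to compare simple surface statistics between benchmark text and a reference sample of real clinical notes: polished synthetic text tends to have longer sentences, richer vocabulary, and fewer abbreviations. These are heuristics only, not a substitute for manual review:

```python
import re
import statistics

def surface_stats(text):
    """Crude surface statistics for one note; assumes non-empty text."""
    sentences = [s for s in re.split(r"[.!?\n]+", text) if s.strip()]
    tokens = re.findall(r"\S+", text)
    return {
        "avg_sentence_len": statistics.mean(len(s.split()) for s in sentences),
        "type_token_ratio": len({t.lower() for t in tokens}) / len(tokens),
        # Short all-caps tokens as a rough proxy for clinical abbreviations.
        "abbrev_density": sum(t.isupper() and 2 <= len(t) <= 5 for t in tokens) / len(tokens),
    }
```

Comparing these numbers for the phenobert texts against a sample of real notes should quantify the "too eloquent" impression noted under Known Gaps.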

Labels

research, benchmarking, data-quality
