Problem
Current benchmark datasets in tests/data/benchmarks/ have quality issues that limit their effectiveness for evaluation:
Known Gaps
- Non-existent HPO terms - Annotations reference HPO IDs that don't exist in the ontology
- Unrealistic language - Text is unrealistically polished and carefully written compared to real clinical notes
- Dataset-specific issues - The English PhenoBERT dataset exhibits both of the problems above
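The invalid-HPO-ID gap could be detected mechanically. Below is a minimal sketch of such a check; the `hpo_id` field name and the tiny hard-coded valid-ID set are illustrative assumptions, and a real audit would load the full ID set from the HPO ontology (e.g. hp.obo).

```python
import re

# Sketch: flag annotation HPO IDs that are malformed or absent from the
# ontology. The valid-ID set here is a stand-in; in practice it would be
# loaded from hp.obo. The "hpo_id" field name is an assumed schema.
HPO_ID_PATTERN = re.compile(r"^HP:\d{7}$")

def find_invalid_hpo_ids(annotations, valid_ids):
    """Return annotation IDs that are malformed or not in the ontology."""
    invalid = []
    for ann in annotations:
        hpo_id = ann.get("hpo_id", "")
        if not HPO_ID_PATTERN.match(hpo_id) or hpo_id not in valid_ids:
            invalid.append(hpo_id)
    return invalid

# Illustrative usage with made-up annotations and a tiny valid-ID set
valid = {"HP:0001250", "HP:0002119"}  # seizure, ventriculomegaly
anns = [{"hpo_id": "HP:0001250"}, {"hpo_id": "HP:9999999"}, {"hpo_id": "HP:123"}]
print(find_invalid_hpo_ids(anns, valid))  # ['HP:9999999', 'HP:123']
```

A format check alone is not enough: `HP:9999999` above is syntactically valid but does not exist, which is exactly the failure mode reported for the PhenoBERT annotations.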
Goal
Systematically evaluate existing benchmark datasets and document:
- Coverage gaps (missing languages, phenotypes, clinical scenarios)
- Quality issues (invalid HPO IDs, unrealistic text, annotation errors)
- Recommendations for dataset improvements or replacements
Datasets to Evaluate
- tests/data/benchmarks/en/phenobert/ - known issues with text quality and annotations
- Current German datasets (tiny_v1.json, 70cases_gemini_v1.json, 200cases_gemini_v1.json)
- Identify gaps for other languages (ES, FR, NL)
Related Issues
- feat(benchmark): Design and implement benchmark for full clinical text HPO extraction #17 - Full clinical text HPO extraction benchmarking
- feat(evaluation): Benchmark impact of different text chunking strategies #25 - Text chunking strategy benchmarking
Deliverables
- Quality audit report for each dataset
- List of invalid HPO term references
- Assessment of how closely each dataset's text matches real clinical notes
- Recommendations for dataset curation or replacement
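To make the audit reports comparable across datasets, each could be reduced to a few summary metrics. A minimal sketch follows; the case/annotation JSON schema shown is an assumption for illustration, not the repository's actual benchmark format.

```python
import json

# Hypothetical audit sketch: summarize a benchmark dataset's annotation
# quality. The schema (a list of cases, each with "text" and
# "annotations" carrying "hpo_id") is assumed for illustration.
def audit_dataset(cases, valid_ids):
    total = sum(len(c["annotations"]) for c in cases)
    invalid = [a["hpo_id"] for c in cases for a in c["annotations"]
               if a["hpo_id"] not in valid_ids]
    return {
        "cases": len(cases),
        "annotations": total,
        "invalid_hpo_ids": sorted(set(invalid)),
        "invalid_rate": round(len(invalid) / total, 3) if total else 0.0,
    }

cases = [
    {"text": "Patient with seizures.", "annotations": [{"hpo_id": "HP:0001250"}]},
    {"text": "Enlarged ventricles.", "annotations": [{"hpo_id": "HP:9999999"}]},
]
report = audit_dataset(cases, {"HP:0001250", "HP:0002119"})
print(json.dumps(report, indent=2))
```

Running the same summary over each of the datasets listed above would give a per-dataset baseline against which curation or replacement can be judged.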
Labels
research, benchmarking, data-quality