A toolkit for building an extended synonym dictionary for the DrivenData SNOMED CT Entity Linking challenge. It consolidates multiple terminology sources into a single lookup table and trains a section-aware, precision-filtered dictionary following the approach of the 1st-place KIRIs solution from the original challenge.
This project builds directly on the winning solution by Team KIRIs (Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover):
1st Place -- SNOMED CT Entity Linking Challenge https://github.com/drivendataorg/snomed-ct-entity-linking/tree/main/1st%20Place Licensed under the MIT License.
The KIRIs approach uses dictionary matching rather than ML models: it maps (section header, mention) pairs to SNOMED CT concept IDs, builds two dictionaries (case-sensitive and case-insensitive) from training data and SNOMED synonyms, resolves overlaps by preferring longer and section-specific matches, and applies post-processing with SNOMED CT relational data. Their solution achieved ~0.62 macro character-level IoU on the original challenge split.
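The overlap-resolution rule described above (prefer longer matches, then section-specific ones) can be sketched as follows. This is a hypothetical illustration, not the KIRIs code; the tuple layout and function name are made up:

```python
# Hypothetical sketch of overlap resolution: rank candidate matches by span
# length (longer first), then by section specificity, and keep each match
# only if it does not overlap an already-kept one.

def resolve_overlaps(matches):
    """matches: list of (start, end, concept_id, section_specific) tuples."""
    ranked = sorted(
        matches,
        key=lambda m: (-(m[1] - m[0]), not m[3]),  # longer, then section-specific
    )
    kept, occupied = [], []
    for start, end, concept_id, specific in ranked:
        # Two spans overlap iff each starts before the other ends.
        if not any(start < e and s < end for s, e in occupied):
            kept.append((start, end, concept_id, specific))
            occupied.append((start, end))
    return sorted(kept)

matches = [
    (0, 7, 111, False),   # shorter "femur" match, section-agnostic
    (0, 16, 222, True),   # longer "femur fracture" match, section-specific
    (20, 25, 333, False), # non-overlapping match elsewhere in the note
]
print(resolve_overlaps(matches))  # keeps (0, 16, 222, True) and (20, 25, 333, False)
```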
This repository extends the KIRIs dictionary with additional synonym sources and re-implements the trained-dictionary pipeline as a standalone toolkit. The key changes are:
The "Super Dictionary" merges three sources of SNOMED CT synonyms, compared to the KIRIs baseline which used SNOMED concept names + OMOP synonyms only:
- Athena OMOP vocabularies -- all SNOMED synonyms available in Athena's `CONCEPT_SYNONYM.csv` (English, active concepts only)
- SNOMED CT RF2 release descriptions -- official Fully Specified Names (FSN) and Synonyms (SYN) from the SNOMED CT International Edition snapshot files, which contain descriptions not present in Athena
- Training annotation spans -- domain-specific terms extracted directly from the annotated clinical notes provided by the challenge
The combined dictionary yields ~1.1M rows (~500k--800k unique concept/term pairs after case-folded deduplication), substantially larger than the KIRIs default synonym set.
`train_dictionary.py` re-implements the core KIRIs dictionary-training loop as a single standalone script:
- Section-aware annotation of training notes using the super dictionary
- Precision filtering with configurable thresholds (20% for section-specific entries, 30% for section-agnostic entries)
- Separate uppercase-only dictionary for case-sensitive matching
- SNOMED synonym enrichment (2--5 token terms from RF2 + Athena)
- Linguistic variant expansion (abbreviations, fracture permutations)
- Dynamic word-frequency blacklisting of overly common terms
- Standalone dictionary matcher (`test_old_challenge_split.py`) for evaluating raw dictionary coverage without any training step
- Scoring, error analysis, and delta reporting scripts for comparing dictionary variants
- False-positive concept analysis for identifying terms to blocklist
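The precision-filtering step listed above can be sketched as follows. This is a hypothetical illustration of the thresholding rule only (20% for section-specific, 30% for section-agnostic entries); the field names are invented, not the repository's actual data structures:

```python
# Hypothetical sketch of precision filtering: keep a dictionary entry only if
# its precision on the training annotations clears the relevant threshold.

def filter_by_precision(stats, section_thresh=0.20, agnostic_thresh=0.30):
    """stats: list of dicts with 'term', 'concept_id', 'section',
    'hits' (correct matches on training data) and 'total' (all matches).
    A 'section' of None marks a section-agnostic entry."""
    kept = []
    for s in stats:
        if s["total"] == 0:
            continue
        precision = s["hits"] / s["total"]
        threshold = section_thresh if s["section"] else agnostic_thresh
        if precision >= threshold:
            kept.append(s)
    return kept

stats = [
    # 1/4 = 0.25 precision: passes the 0.20 section-specific bar...
    {"term": "fx", "concept_id": 1, "section": "history", "hits": 1, "total": 4},
    # ...but fails the stricter 0.30 section-agnostic bar.
    {"term": "fx", "concept_id": 1, "section": None, "hits": 1, "total": 4},
]
print([(s["term"], s["section"]) for s in filter_by_precision(stats)])
# [('fx', 'history')]
```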
```
.
├── run.py                          # Full pipeline: build dict + train
├── src/
│   ├── engine.py                   # Core: SNOMED loading, linguistic rules, section segmentation
│   ├── build_super_dictionary.py   # Build full super dictionary (Athena + RF2 + spans)
│   ├── build_from_athena.py        # Build CHV->SNOMED synonym table from Athena
│   ├── build_snomed_from_athena.py # Build SNOMED synonym table from Athena alone
│   ├── train_dictionary.py         # Train section-aware dictionary (KIRI-style)
│   ├── export_to_kiri.py           # Convert super dict to KIRI synonym format
│   ├── test_old_challenge_split.py # Standalone dictionary matcher + evaluation
│   ├── runtime_scoring.py          # Character-level IoU scoring
│   ├── error_analysis.py           # Categorize prediction errors
│   ├── kiri_delta_report.py        # Compare default vs super at note/concept level
│   ├── fp_concept_analysis.py      # Identify and quantify false positive concepts
│   └── fp_concept_triage.py        # Risk-classify FP concepts for blocklisting
├── data/                           # External data (not tracked, see below)
│   └── interim/                    # Generated intermediate files
└── outputs/                        # Evaluation outputs (not tracked)
```
The scripts require external data files placed under data/.
All paths below are relative to the repository root.
Location: data/athena/
| File | Required By | Description |
|---|---|---|
| `CONCEPT.csv` | `build_super_dictionary.py`, `engine.py` | Concept definitions with `concept_id`, `concept_name`, `vocabulary_id`, `concept_code`, `invalid_reason` |
| `CONCEPT_SYNONYM.csv` | `build_super_dictionary.py`, `engine.py` | Synonym names with `concept_id`, `concept_synonym_name`, `language_concept_id` |
| `CONCEPT_RELATIONSHIP.csv` | `build_from_athena.py` (optional CHV mapping) | Concept-to-concept mappings with `concept_id_1`, `concept_id_2`, `relationship_id` |
How to obtain: Download from OHDSI Athena.
- Create an account and go to the Download tab.
- Select at minimum the SNOMED vocabulary. The code filters `CONCEPT.csv` to `vocabulary_id = 'SNOMED'` with `invalid_reason` empty (active concepts only), then joins `CONCEPT_SYNONYM.csv` on `concept_id`, filtered to `language_concept_id = '4180186'` (English).
- Tick the "Include Concept Synonym" checkbox before downloading -- `CONCEPT_SYNONYM.csv` is not included by default.
- `CONCEPT_RELATIONSHIP.csv` is only needed if you run `build_from_athena.py` for CHV-to-SNOMED mapping (not used by the main build pipeline).
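The filter-and-join described above can be sketched in pandas. Tiny in-memory stand-ins replace the real files here so the example is self-contained; Athena's `.csv` exports are typically tab-delimited, so in practice you would read them with `pd.read_csv(path, sep="\t", dtype=str)`:

```python
import pandas as pd

# In-memory stand-ins for CONCEPT.csv and CONCEPT_SYNONYM.csv (illustrative rows).
concepts = pd.DataFrame({
    "concept_id": ["1", "2", "3"],
    "vocabulary_id": ["SNOMED", "SNOMED", "LOINC"],
    "invalid_reason": [None, "D", None],  # empty invalid_reason = active concept
})
synonyms = pd.DataFrame({
    "concept_id": ["1", "1", "2"],
    "concept_synonym_name": ["femur fracture", "fracture du fémur", "old term"],
    "language_concept_id": ["4180186", "4180190", "4180186"],
})

# Active SNOMED concepts only.
active_snomed = concepts[
    (concepts["vocabulary_id"] == "SNOMED") & concepts["invalid_reason"].isna()
]
# English synonyms only, then inner-join on concept_id.
english = synonyms[synonyms["language_concept_id"] == "4180186"]
merged = english.merge(active_snomed[["concept_id"]], on="concept_id")
print(merged["concept_synonym_name"].tolist())  # ['femur fracture']
```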
The version used for development was downloaded on 2026-02-13 with SNOMED content dated through 2025-04-09 (~412k active SNOMED concepts, ~1.1M synonym rows after combining with RF2).
Location: `data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z/` (or similar; the exact directory name varies by release date)
| File | Required By | Description |
|---|---|---|
| `Snapshot/Terminology/sct2_Description_Snapshot-en_INT_*.txt` | `build_super_dictionary.py` | SNOMED descriptions with `conceptId`, `term`, `typeId`, `active` |
How to obtain: Download the SNOMED CT International Edition from SNOMED International. Requires a license (free for many countries via their National Release Center).
| File | Required By | Description |
|---|---|---|
| `data/train_annotations.csv` | `build_super_dictionary.py` | Training annotations with `concept_id` and span columns |
| `data/old-challenge-split/test_notes.csv` | `test_old_challenge_split.py`, `error_analysis.py` | Test notes with `note_id`, `text` |
| `data/old-challenge-split/test_annotations.csv` | `test_old_challenge_split.py`, `error_analysis.py` | Test annotations for the split |
| `data/old-challenge-split/train_annotations.csv` | `test_old_challenge_split.py` (with `--train-boost`) | Training annotations for domain adaptation |
| File | Required By | Description |
|---|---|---|
| `data/flattened_terminology.csv` | `export_to_kiri.py`, `kiri_delta_report.py`, `train_dictionary.py` | SNOMED concept names with semantic tags (for section limiting) |
| File | Generated By | Description |
|---|---|---|
| `data/interim/super_dictionary_full.tsv` | `build_super_dictionary.py` | Full super dictionary (~78 MB) |
| `data/interim/trained_dict.pkl` | `train_dictionary.py` | Trained dictionary pickle |
| `outputs/old_challenge_split/` | `test_old_challenge_split.py` | Predictions and per-concept IoU |
Requires Python 3.12. Install dependencies with uv:

```
uv sync
```

Optional (for dependency-parser coordination splitting):

```
uv sync --extra dev
uv run python -m spacy download en_core_web_sm
```

`run.py` builds the super dictionary and trains the section-aware dictionary in one step:

```
uv run python run.py \
    --athena-dir data/athena \
    --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
    --train-annotations data/train_annotations.csv \
    --train-notes data/train_notes.csv
```

Output: `data/interim/super_dictionary_full.tsv` and `data/interim/trained_dict.pkl`.
All scripts live under `src/` and can be run individually with `uv run python -m src.<module>`.
```
uv run python -m src.build_super_dictionary \
    --athena-dir data/athena \
    --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
    --train-annotations data/train_annotations.csv \
    --out data/interim/super_dictionary_full.tsv \
    --include-concept-name
```

You can disable any source with `--no-athena`, `--no-snomed-rf2`, or `--no-train-spans`.
Output TSV columns: `snomed_concept_id`, `term`, `source`, `source_detail`.
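A minimal pandas sketch of consuming this TSV and applying the case-folded deduplication mentioned earlier. The `io.StringIO` stand-in (with illustrative rows) mirrors the four columns above; swap in the real file path in practice:

```python
import io
import pandas as pd

# Illustrative rows matching the documented TSV columns.
tsv = io.StringIO(
    "snomed_concept_id\tterm\tsource\tsource_detail\n"
    "71620000\tFracture of femur\tsnomed_rf2\tFSN\n"
    "71620000\tfracture of femur\tathena\tCONCEPT_SYNONYM\n"
)
df = pd.read_csv(tsv, sep="\t", dtype=str)

# Case-folded deduplication: the two rows above collapse to one unique pair.
df["term_folded"] = df["term"].str.casefold()
unique = df.drop_duplicates(subset=["snomed_concept_id", "term_folded"])
print(len(df), "rows ->", len(unique), "unique concept/term pair(s)")
```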
CHV-to-SNOMED mapping (requires `CONCEPT_RELATIONSHIP.csv`):

```
uv run python -m src.build_from_athena \
    --athena-dir data/athena \
    --out data/interim/super_dictionary_chv_athena.tsv
```

SNOMED synonyms from Athena:

```
uv run python -m src.build_snomed_from_athena \
    --athena-dir data/athena \
    --out data/interim/super_dictionary_snomed_athena.tsv \
    --include-concept-name
```

Convert the super dictionary to the KIRI synonym format:

```
uv run python -m src.export_to_kiri \
    --super-dict data/interim/super_dictionary_full.tsv \
    --flattened-terminology data/flattened_terminology.csv \
    --out data/interim/flattened_terminology_syn_super.csv
```

Runs the standalone dictionary matcher on the old challenge train/test split and reports macro-averaged character-level IoU:
```
uv run python -m src.test_old_challenge_split \
    --super-dict data/interim/super_dictionary_full.tsv \
    --split-dir data/old-challenge-split \
    --output-dir outputs/old_challenge_split \
    --train-boost
```

Key options for tuning precision/recall:
| Flag | Default | Description |
|---|---|---|
| `--single-token-min-len` | 8 | Min chars for single-token dictionary terms. Higher = fewer FPs. Use 100 to disable single-token matching entirely. |
| `--min-term-len` | 4 | Min chars for any dictionary term |
| `--max-term-tokens` | 10 | Max tokens per dictionary term |
| `--train-boost` | off | Add training annotation spans to dictionary |
| `--no-blacklist` | off | Disable internal FP-term blacklist |
| `--skip-preamble` | 100 | Skip matches in first N chars of each note |
Typical results on the old challenge split:
| `--single-token-min-len` | Predictions | Macro char IoU |
|---|---|---|
| 6 | ~30,000 | ~0.15 |
| 8 (default) | ~22,600 | ~0.20 |
| 10 | ~15,700 | ~0.24 |
| 100 (multi-token only) | ~10,200 | ~0.28 |
Note: The standalone matcher achieves ~0.20--0.28 IoU depending on filtering aggressiveness. The gap vs KIRI (~0.62) is due to KIRI's training-based concept filtering, section-aware matching, and postprocessing. The standalone matcher is useful for evaluating raw dictionary coverage without any ML pipeline.
```
uv run python -m src.runtime_scoring \
    path/to/predictions.csv \
    path/to/ground_truth.csv
```

Both CSVs must have columns: `note_id`, `start`, `end`, `concept_id`.
```
uv run python -m src.error_analysis \
    --pred outputs/old_challenge_split/super_dict_pred.csv \
    --gold data/old-challenge-split/test_annotations.csv \
    --notes data/old-challenge-split/test_notes.csv \
    --output-dir outputs/error_analysis
```

```
uv run python -m src.kiri_delta_report \
    --default-pred outputs/kiri_default_pred.csv \
    --super-pred outputs/kiri_super_pred.csv \
    --gold data/old-challenge-split/train_annotations.csv \
    --default-class-iou outputs/kiri_default_class_iou.csv \
    --super-class-iou outputs/kiri_super_class_iou.csv \
    --concept-names data/flattened_terminology.csv \
    --out-dir outputs/delta_report
```

```
uv run python -m src.fp_concept_analysis \
    --test-pred outputs/old_challenge_split/super_dict_pred.csv \
    --test-gold data/old-challenge-split/test_annotations.csv \
    --oof-pred outputs/kiri_super_oof_pred.csv \
    --train-gold data/old-challenge-split/train_annotations.csv \
    --test-notes data/old-challenge-split/test_notes.csv \
    --train-notes data/old-challenge-split/train_notes.csv \
    --dict-path data/interim/flattened_terminology_syn_super.csv \
    --output-dir outputs/fp_analysis
```

The engine includes several linguistic transformations to improve recall:
- Abbreviation expansion: `pt` -> `patient`, `fx` -> `fracture`, `L` -> `Left`
- Coordination splitting: `"fracture of left and right femur"` -> two variants
- Fracture permutations: `"femur fracture"` <-> `"fracture of femur"` <-> `"fx of femur"`
- Stopword-transparent matching: `"fracture femur"` matches `"fracture of the femur"`
- Dependency-parser coordination (optional, requires spaCy): uses the parse tree to find conjuncts
The primary metric is macro-averaged character-level IoU:
- For each concept, compute the intersection and union of character positions across all notes
- Per-concept IoU = intersection / union
- Macro-average across all concepts with non-zero union
The implementation uses coordinate compression to avoid allocating dense character arrays for large clinical notes (100k+ characters), operating in O(k) memory where k is the number of unique span boundaries.
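The metric and the coordinate-compression idea can be sketched as follows. This is an independent illustration of per-concept character IoU, not the repository's implementation: merge each side's spans into disjoint intervals, then sweep only the unique span boundaries instead of allocating one flag per character:

```python
# Per-concept character-level IoU via coordinate compression.

def merge(spans):
    """Merge (start, end) spans into disjoint, sorted intervals."""
    out = []
    for s, e in sorted(spans):
        if out and s <= out[-1][1]:
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return out

def char_iou(pred_spans, gold_spans):
    pred, gold = merge(pred_spans), merge(gold_spans)
    # Coordinate compression: only span boundaries matter, so sweep those.
    points = sorted({p for s, e in pred + gold for p in (s, e)})
    inter = union = 0
    for a, b in zip(points, points[1:]):
        in_pred = any(s <= a and b <= e for s, e in pred)
        in_gold = any(s <= a and b <= e for s, e in gold)
        inter += (b - a) * (in_pred and in_gold)
        union += (b - a) * (in_pred or in_gold)
    return inter / union if union else 0.0

print(char_iou([(0, 10)], [(5, 15)]))  # 5 shared chars / 15 total = 0.333...
```

The macro metric is then the mean of `char_iou` over all concepts with a non-zero union.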
- Linguistic rules may slightly increase false positives from generic matches
- The standalone matcher (`test_old_challenge_split.py`) does not perform training-based filtering, so it produces more false positives than the trained pipeline
The dictionary-training approach in this repository is derived from the KIRIs 1st-place solution, which is released under the MIT License (Copyright 2024 Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover).