IHTSDO/SNOMED-CT-Entity-Linking-Challenge-Super-Dictionary

SNOMED CT Entity Linking -- Super Dictionary

A toolkit for building an extended synonym dictionary for the DrivenData SNOMED CT Entity Linking challenge. It consolidates multiple terminology sources into a single lookup table and trains a section-aware, precision-filtered dictionary following the approach of the 1st-place KIRIs solution from the original challenge.

Acknowledgements

This project builds directly on the winning solution by Team KIRIs (Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover):

1st Place -- SNOMED CT Entity Linking Challenge https://github.com/drivendataorg/snomed-ct-entity-linking/tree/main/1st%20Place Licensed under the MIT License.

The KIRIs approach uses dictionary matching rather than ML models: it maps (section header, mention) pairs to SNOMED CT concept IDs, builds two dictionaries (case-sensitive and case-insensitive) from training data and SNOMED synonyms, resolves overlaps by preferring longer and section-specific matches, and applies post-processing with SNOMED CT relational data. Their solution achieved ~0.62 macro character-level IoU on the original challenge split.
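The lookup and overlap-resolution steps described above can be sketched as follows. This is illustrative only, not the KIRIs code: the dictionary shape, helper names, and concept IDs are placeholder assumptions.

```python
from typing import Optional

# (section header or None, lowercased mention) -> SNOMED CT concept id.
# Placeholder ids for illustration only.
DICT = {
    ("physical exam", "femur fracture"): 111,
    (None, "fracture"): 222,  # section-agnostic fallback
}

def lookup(section: Optional[str], mention: str) -> Optional[int]:
    """Prefer a section-specific entry over a section-agnostic one."""
    key = mention.lower()
    return DICT.get((section, key), DICT.get((None, key)))

def resolve_overlaps(spans):
    """Keep non-overlapping spans, preferring longer matches first.
    spans: list of (start, end, concept_id) with end exclusive."""
    chosen = []
    for s in sorted(spans, key=lambda x: x[1] - x[0], reverse=True):
        if all(s[1] <= c[0] or s[0] >= c[1] for c in chosen):
            chosen.append(s)
    return sorted(chosen)
```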

What This Repo Changes

This repository extends the KIRIs dictionary with additional synonym sources and re-implements the trained-dictionary pipeline as a standalone toolkit. The key changes are:

Extended dictionary sources

The "Super Dictionary" merges three sources of SNOMED CT synonyms, compared to the KIRIs baseline which used SNOMED concept names + OMOP synonyms only:

  1. Athena OMOP vocabularies -- all SNOMED synonyms available in Athena's CONCEPT_SYNONYM.csv (English, active concepts only)
  2. SNOMED CT RF2 release descriptions -- official Fully Specified Names (FSN) and Synonyms (SYN) from the SNOMED CT International Edition snapshot files, which contain descriptions not present in Athena
  3. Training annotation spans -- domain-specific terms extracted directly from the annotated clinical notes provided by the challenge

The combined dictionary yields ~1.1M rows (~500k--800k unique concept/term pairs after case-folded deduplication), substantially larger than the KIRIs default synonym set.
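The merge-and-deduplicate step can be sketched in plain Python. The column order follows the TSV schema described later; the function name and row values are invented for illustration.

```python
def merge_sources(*sources):
    """Concatenate (concept_id, term, source) rows and drop duplicate
    (concept_id, casefolded term) pairs, keeping the first source seen."""
    seen, merged = set(), []
    for rows in sources:
        for concept_id, term, source in rows:
            key = (concept_id, term.casefold())
            if key not in seen:
                seen.add(key)
                merged.append((concept_id, term, source))
    return merged

# Invented example rows: the rf2 entry is a case-folded duplicate and is dropped.
athena = [("22298006", "Heart attack", "athena")]
rf2 = [("22298006", "heart attack", "snomed_rf2")]
spans = [("22298006", "MI", "train_spans")]
rows = merge_sources(athena, rf2, spans)
```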

Re-implemented training pipeline

train_dictionary.py re-implements the core KIRIs dictionary-training loop as a single standalone script:

  • Section-aware annotation of training notes using the super dictionary
  • Precision filtering with configurable thresholds (20% for section-specific entries, 30% for section-agnostic entries)
  • Separate uppercase-only dictionary for case-sensitive matching
  • SNOMED synonym enrichment (2--5 token terms from RF2 + Athena)
  • Linguistic variant expansion (abbreviations, fracture permutations)
  • Dynamic word-frequency blacklisting of overly common terms
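The precision filter above can be sketched as follows; the function name is an assumption, and the thresholds are the values quoted in the bullet list.

```python
def keep_entry(true_positives: int, total_matches: int, section_specific: bool) -> bool:
    """Keep a dictionary entry only if its precision on the training
    notes clears the threshold: 20% for section-specific entries,
    30% for section-agnostic ones."""
    if total_matches == 0:
        return False
    threshold = 0.20 if section_specific else 0.30
    return true_positives / total_matches >= threshold
```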

Additional tooling

  • Standalone dictionary matcher (test_old_challenge_split.py) for evaluating raw dictionary coverage without any training step
  • Scoring, error analysis, and delta reporting scripts for comparing dictionary variants
  • False-positive concept analysis for identifying terms to blocklist

Directory Structure

.
├── run.py                              # Full pipeline: build dict + train
├── src/
│   ├── engine.py                       # Core: SNOMED loading, linguistic rules, section segmentation
│   ├── build_super_dictionary.py       # Build full super dictionary (Athena + RF2 + spans)
│   ├── build_from_athena.py            # Build CHV->SNOMED synonym table from Athena
│   ├── build_snomed_from_athena.py     # Build SNOMED synonym table from Athena alone
│   ├── train_dictionary.py             # Train section-aware dictionary (KIRI-style)
│   ├── export_to_kiri.py               # Convert super dict to KIRI synonym format
│   ├── test_old_challenge_split.py     # Standalone dictionary matcher + evaluation
│   ├── runtime_scoring.py              # Character-level IoU scoring
│   ├── error_analysis.py               # Categorize prediction errors
│   ├── kiri_delta_report.py            # Compare default vs super at note/concept level
│   ├── fp_concept_analysis.py          # Identify and quantify false positive concepts
│   └── fp_concept_triage.py            # Risk-classify FP concepts for blocklisting
├── data/                               # External data (not tracked, see below)
│   └── interim/                        # Generated intermediate files
└── outputs/                            # Evaluation outputs (not tracked)

External Resources Required

The scripts require external data files placed under data/.

Data Files

All paths below are relative to the repository root.

Athena OMOP Vocabulary Files

Location: data/athena/

  • CONCEPT.csv -- required by build_super_dictionary.py, engine.py -- Concept definitions with concept_id, concept_name, vocabulary_id, concept_code, invalid_reason
  • CONCEPT_SYNONYM.csv -- required by build_super_dictionary.py, engine.py -- Synonym names with concept_id, concept_synonym_name, language_concept_id
  • CONCEPT_RELATIONSHIP.csv -- required by build_from_athena.py (optional CHV mapping) -- Concept-to-concept mappings with concept_id_1, concept_id_2, relationship_id

How to obtain: Download from OHDSI Athena.

  1. Create an account and go to the Download tab.
  2. Select at minimum the SNOMED vocabulary. The code filters CONCEPT.csv to vocabulary_id = 'SNOMED' with invalid_reason empty (active concepts only), then joins CONCEPT_SYNONYM.csv on concept_id filtered to language_concept_id = '4180186' (English).
  3. Tick the "Include Concept Synonym" checkbox before downloading -- CONCEPT_SYNONYM.csv is not included by default.
  4. CONCEPT_RELATIONSHIP.csv is only needed if you run build_from_athena.py for CHV-to-SNOMED mapping (not used by the main build pipeline).
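The filter-and-join from step 2 can be sketched in plain Python, with row dicts standing in for the parsed CSV rows (note that Athena's .csv files are actually tab-delimited):

```python
def filter_snomed_synonyms(concepts, synonyms):
    """Active English SNOMED synonyms: CONCEPT rows filtered to
    vocabulary_id == 'SNOMED' with empty invalid_reason, joined to
    CONCEPT_SYNONYM rows with language_concept_id == '4180186'."""
    active = {
        c["concept_id"]
        for c in concepts
        if c["vocabulary_id"] == "SNOMED" and not c["invalid_reason"]
    }
    return [
        s for s in synonyms
        if s["concept_id"] in active and s["language_concept_id"] == "4180186"
    ]
```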

The version used for development was downloaded on 2026-02-13 with SNOMED content dated through 2025-04-09 (~412k active SNOMED concepts, ~1.1M synonym rows after combining with RF2).

SNOMED CT RF2 Release

Location: data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z/ (or similar; the exact directory name varies by release date)

  • Snapshot/Terminology/sct2_Description_Snapshot-en_INT_*.txt -- required by build_super_dictionary.py -- SNOMED descriptions with conceptId, term, typeId, active

How to obtain: Download the SNOMED CT International Edition from SNOMED International. Requires a license (free for many countries via their National Release Center).
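RF2 snapshot files are tab-separated; the row filter can be sketched as below. The typeId constants are the standard RF2 description types; the filter itself is a simplified assumption about what build_super_dictionary.py keeps, and the example rows are invented.

```python
# Standard RF2 description type ids
FSN = "900000000000003001"  # Fully Specified Name
SYN = "900000000000013009"  # Synonym

def filter_descriptions(rows):
    """Keep active FSN/SYN rows from a parsed sct2_Description file."""
    return [
        (r["conceptId"], r["term"], r["typeId"])
        for r in rows
        if r["active"] == "1" and r["typeId"] in (FSN, SYN)
    ]
```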

Training Data

  • data/train_annotations.csv -- required by build_super_dictionary.py -- Training annotations with concept_id, span columns
  • data/old-challenge-split/test_notes.csv -- required by test_old_challenge_split.py, error_analysis.py -- Test notes with note_id, text
  • data/old-challenge-split/test_annotations.csv -- required by test_old_challenge_split.py, error_analysis.py -- Test annotations for the split
  • data/old-challenge-split/train_annotations.csv -- required by test_old_challenge_split.py (with --train-boost) -- Training annotations for domain adaptation

KIRI Intermediate Files (optional)

  • data/flattened_terminology.csv -- required by export_to_kiri.py, kiri_delta_report.py, train_dictionary.py -- SNOMED concept names with semantic tags (for section limiting)

Generated Output Files

  • data/interim/super_dictionary_full.tsv -- generated by build_super_dictionary.py -- Full super dictionary (~78 MB)
  • data/interim/trained_dict.pkl -- generated by train_dictionary.py -- Trained dictionary pickle
  • outputs/old_challenge_split/ -- generated by test_old_challenge_split.py -- Predictions, per-concept IoU

Setup

Requires Python 3.12. Install dependencies with uv:

uv sync

Optional (for dependency-parser coordination splitting):

uv sync --extra dev
uv run python -m spacy download en_core_web_sm

Usage

Quick start -- full pipeline

run.py builds the super dictionary and trains the section-aware dictionary in one step:

uv run python run.py \
  --athena-dir data/athena \
  --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
  --train-annotations data/train_annotations.csv \
  --train-notes data/train_notes.csv

Output: data/interim/super_dictionary_full.tsv and data/interim/trained_dict.pkl.

Individual scripts

All scripts live under src/ and can be run individually with uv run python -m src.<module>.

1. Build the Super Dictionary

uv run python -m src.build_super_dictionary \
  --athena-dir data/athena \
  --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
  --train-annotations data/train_annotations.csv \
  --out data/interim/super_dictionary_full.tsv \
  --include-concept-name

You can disable any source with --no-athena, --no-snomed-rf2, or --no-train-spans.

Output TSV columns: snomed_concept_id, term, source, source_detail

2. Build from Athena Only

CHV-to-SNOMED mapping (requires CONCEPT_RELATIONSHIP.csv):

uv run python -m src.build_from_athena \
  --athena-dir data/athena \
  --out data/interim/super_dictionary_chv_athena.tsv

SNOMED synonyms from Athena:

uv run python -m src.build_snomed_from_athena \
  --athena-dir data/athena \
  --out data/interim/super_dictionary_snomed_athena.tsv \
  --include-concept-name

3. Export to KIRI Format

uv run python -m src.export_to_kiri \
  --super-dict data/interim/super_dictionary_full.tsv \
  --flattened-terminology data/flattened_terminology.csv \
  --out data/interim/flattened_terminology_syn_super.csv

4. Test Dictionary on Old Challenge Split

Runs the standalone dictionary matcher on the old challenge train/test split and reports macro-averaged character-level IoU:

uv run python -m src.test_old_challenge_split \
  --super-dict data/interim/super_dictionary_full.tsv \
  --split-dir data/old-challenge-split \
  --output-dir outputs/old_challenge_split \
  --train-boost

Key options for tuning precision/recall:

  • --single-token-min-len (default: 8) -- Min chars for single-token dictionary terms; higher = fewer FPs. Use 100 to disable single-token matching entirely.
  • --min-term-len (default: 4) -- Min chars for any dictionary term
  • --max-term-tokens (default: 10) -- Max tokens per dictionary term
  • --train-boost (default: off) -- Add training annotation spans to the dictionary
  • --no-blacklist (default: off) -- Disable the internal FP-term blacklist
  • --skip-preamble (default: 100) -- Skip matches in the first N chars of each note

Typical results on the old challenge split:

  • --single-token-min-len 6: ~30,000 predictions, ~0.15 macro char IoU
  • --single-token-min-len 8 (default): ~22,600 predictions, ~0.20 macro char IoU
  • --single-token-min-len 10: ~15,700 predictions, ~0.24 macro char IoU
  • --single-token-min-len 100 (multi-token only): ~10,200 predictions, ~0.28 macro char IoU

Note: The standalone matcher achieves ~0.20--0.28 IoU depending on filtering aggressiveness. The gap vs KIRI (~0.62) is due to KIRI's training-based concept filtering, section-aware matching, and postprocessing. The standalone matcher is useful for evaluating raw dictionary coverage without any ML pipeline.

5. Score Predictions

uv run python -m src.runtime_scoring \
  path/to/predictions.csv \
  path/to/ground_truth.csv

Both CSVs must have columns: note_id, start, end, concept_id.

6. Error Analysis

uv run python -m src.error_analysis \
  --pred outputs/old_challenge_split/super_dict_pred.csv \
  --gold data/old-challenge-split/test_annotations.csv \
  --notes data/old-challenge-split/test_notes.csv \
  --output-dir outputs/error_analysis

7. Delta Report

uv run python -m src.kiri_delta_report \
  --default-pred outputs/kiri_default_pred.csv \
  --super-pred outputs/kiri_super_pred.csv \
  --gold data/old-challenge-split/train_annotations.csv \
  --default-class-iou outputs/kiri_default_class_iou.csv \
  --super-class-iou outputs/kiri_super_class_iou.csv \
  --concept-names data/flattened_terminology.csv \
  --out-dir outputs/delta_report

8. FP Concept Analysis

uv run python -m src.fp_concept_analysis \
  --test-pred outputs/old_challenge_split/super_dict_pred.csv \
  --test-gold data/old-challenge-split/test_annotations.csv \
  --oof-pred outputs/kiri_super_oof_pred.csv \
  --train-gold data/old-challenge-split/train_annotations.csv \
  --test-notes data/old-challenge-split/test_notes.csv \
  --train-notes data/old-challenge-split/train_notes.csv \
  --dict-path data/interim/flattened_terminology_syn_super.csv \
  --output-dir outputs/fp_analysis

Linguistic Rules (engine.py)

The engine includes several linguistic transformations to improve recall:

  • Abbreviation expansion: pt -> patient, fx -> fracture, L -> Left
  • Coordination splitting: "fracture of left and right femur" -> two variants
  • Fracture permutations: "femur fracture" <-> "fracture of femur" <-> "fx of femur"
  • Stopword-transparent matching: "fracture femur" matches "fracture of the femur"
  • Dependency-parser coordination (optional, requires spaCy): uses parse tree for conjuncts
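The fracture-permutation rule can be sketched as follows; this is illustrative only, and engine.py's actual rule set (abbreviations, coordination splitting, stopword transparency) is richer.

```python
def fracture_variants(term: str) -> set[str]:
    """Expand 'X fracture' into its common surface permutations."""
    variants = {term}
    if term.endswith(" fracture"):
        site = term[: -len(" fracture")]
        variants |= {f"fracture of {site}", f"fx of {site}", f"{site} fx"}
    return variants
```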

Scoring Methodology

The primary metric is macro-averaged character-level IoU:

  1. For each concept, compute the intersection and union of character positions across all notes
  2. Per-concept IoU = intersection / union
  3. Macro-average across all concepts with non-zero union

The implementation uses coordinate compression to avoid allocating dense character arrays for large clinical notes (100k+ characters), operating in O(k) memory where k is the number of unique span boundaries.
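A simplified per-concept sketch of the interval arithmetic (this version is quadratic in the number of spans; the repository's coordinate-compression implementation is more efficient):

```python
def merge_spans(spans):
    """Merge overlapping [start, end) spans into a sorted disjoint list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def char_iou(pred, gold):
    """Character-level IoU between two sets of [start, end) spans."""
    p, g = merge_spans(pred), merge_spans(gold)
    inter = sum(max(0, min(pe, ge) - max(ps, gs)) for ps, pe in p for gs, ge in g)
    total = sum(e - s for s, e in p) + sum(e - s for s, e in g)
    union = total - inter  # inclusion-exclusion over character counts
    return inter / union if union else 0.0
```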

Limitations

  • Linguistic rules may slightly increase false positives from generic matches
  • The standalone matcher (test_old_challenge_split.py) does not perform training-based filtering, so it produces more false positives than the trained pipeline

License

The dictionary-training approach in this repository is derived from the KIRIs 1st-place solution, which is released under the MIT License (Copyright 2024 Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover).

About

SNOMED CT entity linking via extended synonym dictionary and section-aware trained matching, building on the previous (2023) challenge 1st-place KIRI solution
