A toolkit for building an extended synonym dictionary for the DrivenData SNOMED CT Entity Linking challenge. It consolidates multiple terminology sources into a single lookup table and trains a section-aware, precision-filtered dictionary following the approach of the 1st-place KIRIs solution from the original challenge.
This project builds directly on the winning solution by Team KIRIs (Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover):
1st Place -- SNOMED CT Entity Linking Challenge https://github.com/drivendataorg/snomed-ct-entity-linking/tree/main/1st%20Place Licensed under the MIT License.
The KIRIs approach uses dictionary matching rather than ML models: it maps (section header, mention) pairs to SNOMED CT concept IDs, builds two dictionaries (case-sensitive and case-insensitive) from training data and SNOMED synonyms, resolves overlaps by preferring longer and section-specific matches, and applies post-processing with SNOMED CT relational data. Their solution achieved ~0.62 macro character-level IoU on the original challenge split.
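The overlap-resolution rule described above (prefer longer matches, then section-specific ones) can be sketched as follows. This is a hypothetical illustration, not the KIRIs code; the tuple layout and function name are made up:

```python
# Hypothetical sketch of overlap resolution: rank candidate matches by span
# length (longer first), then by section specificity, and keep each match
# only if it does not overlap an already-kept one.

def resolve_overlaps(matches):
    """matches: list of (start, end, concept_id, section_specific) tuples."""
    ranked = sorted(
        matches,
        key=lambda m: (-(m[1] - m[0]), not m[3]),  # longer, then section-specific
    )
    kept, occupied = [], []
    for start, end, concept_id, specific in ranked:
        # Two spans overlap iff each starts before the other ends.
        if not any(start < e and s < end for s, e in occupied):
            kept.append((start, end, concept_id, specific))
            occupied.append((start, end))
    return sorted(kept)

matches = [
    (0, 7, 111, False),   # shorter "femur" match, section-agnostic
    (0, 16, 222, True),   # longer "femur fracture" match, section-specific
    (20, 25, 333, False), # non-overlapping match elsewhere in the note
]
print(resolve_overlaps(matches))  # keeps (0, 16, 222, True) and (20, 25, 333, False)
```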
This repository extends the KIRIs dictionary with additional synonym sources and re-implements the trained-dictionary pipeline as a standalone toolkit. The key changes are:
The "Super Dictionary" merges three sources of SNOMED CT synonyms, compared to the KIRIs baseline which used SNOMED concept names + OMOP synonyms only:
- Athena OMOP vocabularies -- all SNOMED synonyms available in Athena's `CONCEPT_SYNONYM.csv` (English, active concepts only)
- SNOMED CT RF2 release descriptions -- official Fully Specified Names (FSN) and Synonyms (SYN) from the SNOMED CT International Edition snapshot files, which contain descriptions not present in Athena
- Training annotation spans -- domain-specific terms extracted directly from the annotated clinical notes provided by the challenge
The combined dictionary yields ~1.1M rows (~500k--800k unique concept/term pairs after case-folded deduplication), substantially larger than the KIRIs default synonym set.
`train_dictionary.py` re-implements the core KIRIs dictionary-training loop as a single standalone script:
- Section-aware annotation of training notes using the super dictionary
- Precision filtering with configurable thresholds (20% for section-specific entries, 30% for section-agnostic entries)
- Separate uppercase-only dictionary for case-sensitive matching
- SNOMED synonym enrichment (2--5 token terms from RF2 + Athena)
- Linguistic variant expansion (abbreviations, fracture permutations)
- Dynamic word-frequency blacklisting of overly common terms
- Standalone dictionary matcher (`test_old_challenge_split.py`) for evaluating raw dictionary coverage without any training step
- Scoring, error analysis, and delta reporting scripts for comparing dictionary variants
- False-positive concept analysis for identifying terms to blocklist
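The precision-filtering step listed above can be sketched as follows. This is a hypothetical illustration of the thresholding rule only (20% for section-specific, 30% for section-agnostic entries); the field names are invented, not the repository's actual data structures:

```python
# Hypothetical sketch of precision filtering: keep a dictionary entry only if
# its precision on the training annotations clears the relevant threshold.

def filter_by_precision(stats, section_thresh=0.20, agnostic_thresh=0.30):
    """stats: list of dicts with 'term', 'concept_id', 'section',
    'hits' (correct matches on training data) and 'total' (all matches).
    A 'section' of None marks a section-agnostic entry."""
    kept = []
    for s in stats:
        if s["total"] == 0:
            continue
        precision = s["hits"] / s["total"]
        threshold = section_thresh if s["section"] else agnostic_thresh
        if precision >= threshold:
            kept.append(s)
    return kept

stats = [
    # 1/4 = 0.25 precision: passes the 0.20 section-specific bar...
    {"term": "fx", "concept_id": 1, "section": "history", "hits": 1, "total": 4},
    # ...but fails the stricter 0.30 section-agnostic bar.
    {"term": "fx", "concept_id": 1, "section": None, "hits": 1, "total": 4},
]
print([(s["term"], s["section"]) for s in filter_by_precision(stats)])
# [('fx', 'history')]
```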
```
.
├── run.py                          # Full pipeline: build dict + train
├── src/
│   ├── engine.py                   # Core: SNOMED loading, linguistic rules, section segmentation
│   ├── build_super_dictionary.py   # Build full super dictionary (Athena + RF2 + spans)
│   ├── build_from_athena.py        # Build CHV->SNOMED synonym table from Athena
│   ├── build_snomed_from_athena.py # Build SNOMED synonym table from Athena alone
│   ├── train_dictionary.py         # Train section-aware dictionary (KIRI-style)
│   ├── export_to_kiri.py           # Convert super dict to KIRI synonym format
│   ├── test_old_challenge_split.py # Standalone dictionary matcher + evaluation
│   ├── runtime_scoring.py          # Character-level IoU scoring
│   ├── error_analysis.py           # Categorize prediction errors
│   ├── kiri_delta_report.py        # Compare default vs super at note/concept level
│   ├── fp_concept_analysis.py      # Identify and quantify false positive concepts
│   └── fp_concept_triage.py        # Risk-classify FP concepts for blocklisting
├── data/                           # External data (not tracked, see below)
│   └── interim/                    # Generated intermediate files
└── outputs/                        # Evaluation outputs (not tracked)
```
The scripts require external data files placed under data/.
All paths below are relative to the repository root.
Location: data/athena/
| File | Required By | Description |
|---|---|---|
| `CONCEPT.csv` | `build_super_dictionary.py`, `engine.py` | Concept definitions with `concept_id`, `concept_name`, `vocabulary_id`, `concept_code`, `invalid_reason` |
| `CONCEPT_SYNONYM.csv` | `build_super_dictionary.py`, `engine.py` | Synonym names with `concept_id`, `concept_synonym_name`, `language_concept_id` |
| `CONCEPT_RELATIONSHIP.csv` | `build_from_athena.py` (optional CHV mapping) | Concept-to-concept mappings with `concept_id_1`, `concept_id_2`, `relationship_id` |
How to obtain: Download from OHDSI Athena.
- Create an account and go to the Download tab.
- Select at minimum the SNOMED vocabulary. The code filters `CONCEPT.csv` to `vocabulary_id = 'SNOMED'` with `invalid_reason` empty (active concepts only), then joins `CONCEPT_SYNONYM.csv` on `concept_id`, filtered to `language_concept_id = '4180186'` (English).
- Tick the "Include Concept Synonym" checkbox before downloading -- `CONCEPT_SYNONYM.csv` is not included by default.
- `CONCEPT_RELATIONSHIP.csv` is only needed if you run `build_from_athena.py` for CHV-to-SNOMED mapping (not used by the main build pipeline).
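The filter-and-join described above can be sketched in pandas. Tiny in-memory stand-ins replace the real files here so the example is self-contained; Athena's `.csv` exports are typically tab-delimited, so in practice you would read them with `pd.read_csv(path, sep="\t", dtype=str)`:

```python
import pandas as pd

# In-memory stand-ins for CONCEPT.csv and CONCEPT_SYNONYM.csv (illustrative rows).
concepts = pd.DataFrame({
    "concept_id": ["1", "2", "3"],
    "vocabulary_id": ["SNOMED", "SNOMED", "LOINC"],
    "invalid_reason": [None, "D", None],  # empty invalid_reason = active concept
})
synonyms = pd.DataFrame({
    "concept_id": ["1", "1", "2"],
    "concept_synonym_name": ["femur fracture", "fracture du fémur", "old term"],
    "language_concept_id": ["4180186", "4180190", "4180186"],
})

# Active SNOMED concepts only.
active_snomed = concepts[
    (concepts["vocabulary_id"] == "SNOMED") & concepts["invalid_reason"].isna()
]
# English synonyms only, then inner-join on concept_id.
english = synonyms[synonyms["language_concept_id"] == "4180186"]
merged = english.merge(active_snomed[["concept_id"]], on="concept_id")
print(merged["concept_synonym_name"].tolist())  # ['femur fracture']
```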
The version used for development was downloaded on 2026-02-13 with SNOMED content dated through 2025-04-09 (~412k active SNOMED concepts, ~1.1M synonym rows after combining with RF2).
Location: `data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z/` (or similar; the exact directory name varies by release date)
| File | Required By | Description |
|---|---|---|
| `Snapshot/Terminology/sct2_Description_Snapshot-en_INT_*.txt` | `build_super_dictionary.py` | SNOMED descriptions with `conceptId`, `term`, `typeId`, `active` |
How to obtain: Download the SNOMED CT International Edition from SNOMED International. Requires a license (free for many countries via their National Release Center).
| File | Required By | Description |
|---|---|---|
| `data/train_annotations.csv` | `build_super_dictionary.py` | Training annotations with `concept_id` and span columns |
| `data/old-challenge-split/test_notes.csv` | `test_old_challenge_split.py`, `error_analysis.py` | Test notes with `note_id`, `text` |
| `data/old-challenge-split/test_annotations.csv` | `test_old_challenge_split.py`, `error_analysis.py` | Test annotations for the split |
| `data/old-challenge-split/train_annotations.csv` | `test_old_challenge_split.py` (with `--train-boost`) | Training annotations for domain adaptation |
| File | Required By | Description |
|---|---|---|
| `data/flattened_terminology.csv` | `export_to_kiri.py`, `kiri_delta_report.py`, `train_dictionary.py` | SNOMED concept names with semantic tags (for section limiting) |
| File | Generated By | Description |
|---|---|---|
| `data/interim/super_dictionary_full.tsv` | `build_super_dictionary.py` | Full super dictionary (~78 MB) |
| `data/interim/trained_dict.pkl` | `train_dictionary.py` | Trained dictionary pickle |
| `outputs/old_challenge_split/` | `test_old_challenge_split.py` | Predictions and per-concept IoU |
Requires Python 3.12. Install dependencies with uv:

```
uv sync
```

Optional (for dependency-parser coordination splitting):

```
uv sync --extra dev
uv run python -m spacy download en_core_web_sm
```

`run.py` builds the super dictionary and trains the section-aware dictionary in one step:

```
uv run python run.py \
    --athena-dir data/athena \
    --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
    --train-annotations data/train_annotations.csv \
    --train-notes data/train_notes.csv
```

Output: `data/interim/super_dictionary_full.tsv` and `data/interim/trained_dict.pkl`.
All scripts live under `src/` and can be run individually with `uv run python -m src.<module>`.
```
uv run python -m src.build_super_dictionary \
    --athena-dir data/athena \
    --snomed-dir data/SnomedCT_InternationalRF2_PRODUCTION_20251101T120000Z \
    --train-annotations data/train_annotations.csv \
    --out data/interim/super_dictionary_full.tsv \
    --include-concept-name
```

You can disable any source with `--no-athena`, `--no-snomed-rf2`, or `--no-train-spans`.
Output TSV columns: `snomed_concept_id`, `term`, `source`, `source_detail`.
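A minimal pandas sketch of consuming this TSV and applying the case-folded deduplication mentioned earlier. The `io.StringIO` stand-in (with illustrative rows) mirrors the four columns above; swap in the real file path in practice:

```python
import io
import pandas as pd

# Illustrative rows matching the documented TSV columns.
tsv = io.StringIO(
    "snomed_concept_id\tterm\tsource\tsource_detail\n"
    "71620000\tFracture of femur\tsnomed_rf2\tFSN\n"
    "71620000\tfracture of femur\tathena\tCONCEPT_SYNONYM\n"
)
df = pd.read_csv(tsv, sep="\t", dtype=str)

# Case-folded deduplication: the two rows above collapse to one unique pair.
df["term_folded"] = df["term"].str.casefold()
unique = df.drop_duplicates(subset=["snomed_concept_id", "term_folded"])
print(len(df), "rows ->", len(unique), "unique concept/term pair(s)")
```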
CHV-to-SNOMED mapping (requires `CONCEPT_RELATIONSHIP.csv`):

```
uv run python -m src.build_from_athena \
    --athena-dir data/athena \
    --out data/interim/super_dictionary_chv_athena.tsv
```

SNOMED synonyms from Athena:

```
uv run python -m src.build_snomed_from_athena \
    --athena-dir data/athena \
    --out data/interim/super_dictionary_snomed_athena.tsv \
    --include-concept-name
```

Convert the super dictionary to the KIRI synonym format:

```
uv run python -m src.export_to_kiri \
    --super-dict data/interim/super_dictionary_full.tsv \
    --flattened-terminology data/flattened_terminology.csv \
    --out data/interim/flattened_terminology_syn_super.csv
```

Runs the standalone dictionary matcher on the old challenge train/test split and reports macro-averaged character-level IoU:
```
uv run python -m src.test_old_challenge_split \
    --super-dict data/interim/super_dictionary_full.tsv \
    --split-dir data/old-challenge-split \
    --output-dir outputs/old_challenge_split \
    --train-boost
```

Key options for tuning precision/recall:
| Flag | Default | Description |
|---|---|---|
| `--single-token-min-len` | 8 | Min chars for single-token dictionary terms. Higher = fewer FPs. Use 100 to disable single-token matching entirely. |
| `--min-term-len` | 4 | Min chars for any dictionary term |
| `--max-term-tokens` | 10 | Max tokens per dictionary term |
| `--train-boost` | off | Add training annotation spans to dictionary |
| `--no-blacklist` | off | Disable internal FP-term blacklist |
| `--skip-preamble` | 100 | Skip matches in first N chars of each note |
Typical results on the old challenge split:
| `--single-token-min-len` | Predictions | Macro char IoU |
|---|---|---|
| 6 | ~30,000 | ~0.15 |
| 8 (default) | ~22,600 | ~0.20 |
| 10 | ~15,700 | ~0.24 |
| 100 (multi-token only) | ~10,200 | ~0.28 |
Note: The standalone matcher achieves ~0.20--0.28 IoU depending on filtering aggressiveness. The gap vs KIRI (~0.62) is due to KIRI's training-based concept filtering, section-aware matching, and postprocessing. The standalone matcher is useful for evaluating raw dictionary coverage without any ML pipeline.
```
uv run python -m src.runtime_scoring \
    path/to/predictions.csv \
    path/to/ground_truth.csv
```

Both CSVs must have columns: `note_id`, `start`, `end`, `concept_id`.
```
uv run python -m src.error_analysis \
    --pred outputs/old_challenge_split/super_dict_pred.csv \
    --gold data/old-challenge-split/test_annotations.csv \
    --notes data/old-challenge-split/test_notes.csv \
    --output-dir outputs/error_analysis
```

```
uv run python -m src.kiri_delta_report \
    --default-pred outputs/kiri_default_pred.csv \
    --super-pred outputs/kiri_super_pred.csv \
    --gold data/old-challenge-split/train_annotations.csv \
    --default-class-iou outputs/kiri_default_class_iou.csv \
    --super-class-iou outputs/kiri_super_class_iou.csv \
    --concept-names data/flattened_terminology.csv \
    --out-dir outputs/delta_report
```

```
uv run python -m src.fp_concept_analysis \
    --test-pred outputs/old_challenge_split/super_dict_pred.csv \
    --test-gold data/old-challenge-split/test_annotations.csv \
    --oof-pred outputs/kiri_super_oof_pred.csv \
    --train-gold data/old-challenge-split/train_annotations.csv \
    --test-notes data/old-challenge-split/test_notes.csv \
    --train-notes data/old-challenge-split/train_notes.csv \
    --dict-path data/interim/flattened_terminology_syn_super.csv \
    --output-dir outputs/fp_analysis
```

The engine includes several linguistic transformations to improve recall:
- Abbreviation expansion: `pt` -> `patient`, `fx` -> `fracture`, `L` -> `Left`
- Coordination splitting: `"fracture of left and right femur"` -> two variants
- Fracture permutations: `"femur fracture"` <-> `"fracture of femur"` <-> `"fx of femur"`
- Stopword-transparent matching: `"fracture femur"` matches `"fracture of the femur"`
- Dependency-parser coordination (optional, requires spaCy): uses the parse tree to find conjuncts
The primary metric is macro-averaged character-level IoU:
- For each concept, compute the intersection and union of character positions across all notes
- Per-concept IoU = intersection / union
- Macro-average across all concepts with non-zero union
The implementation uses coordinate compression to avoid allocating dense character arrays for large clinical notes (100k+ characters), operating in O(k) memory where k is the number of unique span boundaries.
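The metric and the coordinate-compression idea can be sketched as follows. This is an independent illustration of per-concept character IoU, not the repository's implementation: merge each side's spans into disjoint intervals, then sweep only the unique span boundaries instead of allocating one flag per character:

```python
# Per-concept character-level IoU via coordinate compression.

def merge(spans):
    """Merge (start, end) spans into disjoint, sorted intervals."""
    out = []
    for s, e in sorted(spans):
        if out and s <= out[-1][1]:
            out[-1][1] = max(out[-1][1], e)
        else:
            out.append([s, e])
    return out

def char_iou(pred_spans, gold_spans):
    pred, gold = merge(pred_spans), merge(gold_spans)
    # Coordinate compression: only span boundaries matter, so sweep those.
    points = sorted({p for s, e in pred + gold for p in (s, e)})
    inter = union = 0
    for a, b in zip(points, points[1:]):
        in_pred = any(s <= a and b <= e for s, e in pred)
        in_gold = any(s <= a and b <= e for s, e in gold)
        inter += (b - a) * (in_pred and in_gold)
        union += (b - a) * (in_pred or in_gold)
    return inter / union if union else 0.0

print(char_iou([(0, 10)], [(5, 15)]))  # 5 shared chars / 15 total = 0.333...
```

The macro metric is then the mean of `char_iou` over all concepts with a non-zero union.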
- Linguistic rules may slightly increase false positives from generic matches
- The standalone matcher (`test_old_challenge_split.py`) does not perform training-based filtering, so it produces more false positives than the trained pipeline
The dictionary-training approach in this repository is derived from the KIRIs 1st-place solution, which is released under the MIT License (Copyright 2024 Guy Amit, Yonatan Bilu, Irena Girshovitz & Chen Yanover).