WIP - fuzzy strain name matching#273

Closed
jameshadfield wants to merge 1 commit into master from james/strain-name-fuzzer

@jameshadfield (Member) commented Dec 8, 2025

With our move away from fauna to a curated all-influenza ingest pipeline, we have introduced a number of strain name changes. This immediately presents a problem because we have myriad lists of hardcoded strain names, such as outliers to drop and force-include lists. This workflow (which is rather ad hoc!) attempts to match the old (i.e. fauna) strain names with their updated strain names.

We use a combination of fuzzy-matching and a hardcoded map of fauna strain names to new strain names.
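
As a rough illustration of how such a combination might be ordered (this is a sketch, not the code in this PR): try an exact match first, then the hardcoded map, then fall back to fuzzy matching. The function name `match_strain`, the `cutoff` value, and the use of Python's `difflib` are all assumptions made for this example.

```python
import difflib


def match_strain(old_name, current_names, lookup, cutoff=0.9):
    """Classify one old (fauna) strain name against the current names.

    current_names: set of strain names in the curated data
    lookup:        hardcoded map of old name -> new name
    cutoff:        hypothetical similarity threshold for fuzzy matching
    Returns a (category, new_name) tuple.
    """
    # 1. The name did not change at all.
    if old_name in current_names:
        return ("direct", old_name)

    # 2. The hardcoded map knows the new name.
    mapped = lookup.get(old_name)
    if mapped is not None and mapped in current_names:
        return ("lookup", mapped)

    # 3. Fall back to the closest name by character similarity.
    #    Sorting makes the result independent of set ordering.
    candidates = difflib.get_close_matches(
        old_name, sorted(current_names), n=1, cutoff=cutoff
    )
    if candidates:
        return ("fuzzy", candidates[0])

    return ("none", None)
```

Fuzzy hits are only suggestions and still need a manual check before they replace an entry in a hardcoded list.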

See the added README for how to run it (not very user-friendly at the moment!).

Hardcoded strain maps

For seasonal-flu datasets we can query the EPI_ISLs of the fauna data against the curated data to create lookups. For avian-flu it's a little trickier, so we leverage the existing diff-avian-flu script to create lookups.
This table reports how many of the fauna strain names have matches in our data:

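| Dataset   | Updated | Missing | Unchanged |
|-----------|--------:|--------:|----------:|
| H1N1pdm   |   1,729 |   1,659 |   149,143 |
| H3N2      |   2,125 |   2,674 |   177,009 |
| vic       |   1,338 |     286 |    66,283 |
| yam       |     367 |      10 |    21,946 |
| avian-flu |  34,426 |   1,531 |    26,754 |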
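
The EPI_ISL-based lookup for the seasonal-flu datasets could be sketched roughly as follows. This is not the PR's implementation. It assumes both datasets can be reduced to a mapping of EPI_ISL accession to strain name, and it assumes that "Updated" means the name differs, "Missing" means the accession is absent from the curated data, and "Unchanged" means the names are identical. The function name `build_lookup` is hypothetical.

```python
def build_lookup(fauna, curated):
    """Build an old-name -> new-name lookup keyed on shared EPI_ISLs.

    fauna, curated: dicts mapping EPI_ISL accession -> strain name
    Returns (lookup, counts), where counts uses the assumed category
    meanings described in the text above.
    """
    lookup = {}
    counts = {"updated": 0, "missing": 0, "unchanged": 0}
    for epi_isl, old_name in fauna.items():
        new_name = curated.get(epi_isl)
        if new_name is None:
            # Accession not present in the curated data.
            counts["missing"] += 1
        elif new_name == old_name:
            counts["unchanged"] += 1
        else:
            # Same accession, different strain name: record the rename.
            counts["updated"] += 1
            lookup[old_name] = new_name
    return lookup, counts
```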

Example results

Not able to be matched:

  • Fauna strain name A/Environment/InnerMongolia/23285/2019 is now A/Environment/Neimenggu/23285/2019 (the location change is via this fauna TSV).

Automatically matched:

  • Fauna A/Indiana/08/2012 remapped to A/Indiana/8/2012 via fuzzy matching (this looks right)
  • Fauna A/India/3/2019 remapped to A/Indiana/3/2019 via fuzzy matching (this is wrong!)
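
The two automatically matched examples show why a character-level similarity score alone is risky: both pairs differ by only one or two characters, so a ratio-based matcher rates them as near-identical. One possible guard, sketched here as an assumption and not as something this PR implements, is to accept a fuzzy match without review only when the names are equal after field-wise normalization (case folded, leading zeros dropped from numeric fields). The helper names below are hypothetical.

```python
import difflib


def similarity(a, b):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def normalize_strain(name):
    """Normalize each '/'-separated field of a strain name."""
    fields = []
    for field in name.strip().split("/"):
        if field.isascii() and field.isdigit():
            # "08" and "8" should compare equal.
            fields.append(str(int(field)))
        else:
            fields.append(field.casefold())
    return "/".join(fields)


def is_safe_rename(old_name, new_name):
    """True if the names differ only by case or zero-padding."""
    return normalize_strain(old_name) == normalize_strain(new_name)


pairs = [
    ("A/Indiana/08/2012", "A/Indiana/8/2012"),  # zero-padding only
    ("A/India/3/2019", "A/Indiana/3/2019"),     # different location
]
for old, new in pairs:
    print(old, "->", new, round(similarity(old, new), 3), is_safe_rename(old, new))
```

Under this check the Indiana pair passes and the India/Indiana pair is flagged for manual review, even though both score highly on raw similarity.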

Full Results

$ grep '=== Summary' -A 6 results/strain-name-matching/*/*
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/dropped_strains_h5n1.txt ===
Direct matches:         0 / 40 (0.0%)
Changed via lookoup:    34 / 40 (85.0%)
Perfect fuzzy matches:  1 / 40 (2.5%)
Fuzzy matches:          0 / 40 (0.0%)
No matches:             5 / 40 (12.5%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/include_strains_h5n1_2y.txt ===
Direct matches:         0 / 382 (0.0%)
Changed via lookoup:    347 / 382 (90.8%)
Perfect fuzzy matches:  3 / 382 (0.8%)
Fuzzy matches:          7 / 382 (1.8%)
No matches:             25 / 382 (6.5%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5n1/include_strains_h5n1_all-time.txt ===
Direct matches:         1 / 66 (1.5%)
Changed via lookoup:    56 / 66 (84.8%)
Perfect fuzzy matches:  0 / 66 (0.0%)
Fuzzy matches:          1 / 66 (1.5%)
No matches:             8 / 66 (12.1%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/dropped_strains_h5nx.txt ===
Direct matches:         0 / 55 (0.0%)
Changed via lookoup:    45 / 55 (81.8%)
Perfect fuzzy matches:  2 / 55 (3.6%)
Fuzzy matches:          0 / 55 (0.0%)
No matches:             8 / 55 (14.5%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/include_strains_h5nx_2y.txt ===
Direct matches:         0 / 386 (0.0%)
Changed via lookoup:    348 / 386 (90.2%)
Perfect fuzzy matches:  3 / 386 (0.8%)
Fuzzy matches:          10 / 386 (2.6%)
No matches:             25 / 386 (6.5%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h5nx/include_strains_h5nx_all-time.txt ===
Direct matches:         1 / 56 (1.8%)
Changed via lookoup:    46 / 56 (82.1%)
Perfect fuzzy matches:  0 / 56 (0.0%)
Fuzzy matches:          1 / 56 (1.8%)
No matches:             8 / 56 (14.3%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h7n9/dropped_strains_h7n9.txt ===
Direct matches:         0 / 63 (0.0%)
Changed via lookoup:    2 / 63 (3.2%)
Perfect fuzzy matches:  0 / 63 (0.0%)
Fuzzy matches:          0 / 63 (0.0%)
No matches:             61 / 63 (96.8%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h7n9/include_strains_h7n9_all-time.txt ===
Direct matches:         2 / 2 (100.0%)
Changed via lookoup:    0 / 2 (0.0%)
Perfect fuzzy matches:  0 / 2 (0.0%)
Fuzzy matches:          0 / 2 (0.0%)
No matches:             0 / 2 (0.0%)
--
=== Summary .snakemake/storage/http/raw.githubusercontent.com/nextstrain/avian-flu/refs/heads/master/config/h9n2/dropped_strains_h9n2.txt ===
Direct matches:         0 / 8 (0.0%)
Changed via lookoup:    3 / 8 (37.5%)
Perfect fuzzy matches:  0 / 8 (0.0%)
Fuzzy matches:          0 / 8 (0.0%)
No matches:             5 / 8 (62.5%)
--
=== Summary ../config/h1n1pdm/outliers.txt ===
Direct matches:         145 / 168 (86.3%)
Changed via lookoup:    6 / 168 (3.6%)
Perfect fuzzy matches:  0 / 168 (0.0%)
Fuzzy matches:          0 / 168 (0.0%)
No matches:             17 / 168 (10.1%)
--
=== Summary ../config/h1n1pdm/reference_strains.txt ===
Direct matches:         176 / 219 (80.4%)
Changed via lookoup:    37 / 219 (16.9%)
Perfect fuzzy matches:  0 / 219 (0.0%)
Fuzzy matches:          0 / 219 (0.0%)
No matches:             6 / 219 (2.7%)
--
=== Summary ../config/h3n2/ha/prioritized_seqs_file.tsv ===
Direct matches:         2 / 4 (50.0%)
Changed via lookoup:    2 / 4 (50.0%)
Perfect fuzzy matches:  0 / 4 (0.0%)
Fuzzy matches:          0 / 4 (0.0%)
No matches:             0 / 4 (0.0%)
--
=== Summary ../config/h3n2/outliers.txt ===
Direct matches:         511 / 566 (90.3%)
Changed via lookoup:    27 / 566 (4.8%)
Perfect fuzzy matches:  1 / 566 (0.2%)
Fuzzy matches:          6 / 566 (1.1%)
No matches:             21 / 566 (3.7%)
--
=== Summary ../config/h3n2/reference_strains.txt ===
Direct matches:         215 / 254 (84.6%)
Changed via lookoup:    33 / 254 (13.0%)
Perfect fuzzy matches:  0 / 254 (0.0%)
Fuzzy matches:          0 / 254 (0.0%)
No matches:             6 / 254 (2.4%)
--
=== Summary ../config/vic/outliers.txt ===
Direct matches:         54 / 60 (90.0%)
Changed via lookoup:    1 / 60 (1.7%)
Perfect fuzzy matches:  0 / 60 (0.0%)
Fuzzy matches:          2 / 60 (3.3%)
No matches:             3 / 60 (5.0%)
--
=== Summary ../config/vic/reference_strains.txt ===
Direct matches:         182 / 209 (87.1%)
Changed via lookoup:    25 / 209 (12.0%)
Perfect fuzzy matches:  0 / 209 (0.0%)
Fuzzy matches:          1 / 209 (0.5%)
No matches:             1 / 209 (0.5%)
--
=== Summary ../config/yam/outliers.txt ===
Direct matches:         15 / 15 (100.0%)
Changed via lookoup:    0 / 15 (0.0%)
Perfect fuzzy matches:  0 / 15 (0.0%)
Fuzzy matches:          0 / 15 (0.0%)
No matches:             0 / 15 (0.0%)
--
=== Summary ../config/yam/reference_strains.txt ===
Direct matches:         29 / 33 (87.9%)
Changed via lookoup:    3 / 33 (9.1%)
Perfect fuzzy matches:  0 / 33 (0.0%)
Fuzzy matches:          0 / 33 (0.0%)
No matches:             1 / 33 (3.0%)

cc @joverlee521 - you may want to adapt this for titers matching?

@jameshadfield force-pushed the james/strain-name-fuzzer branch from cb2e978 to fcda4b1 (December 9, 2025 00:26)
@jameshadfield force-pushed the james/snakemake-surgery branch 2 times, most recently from a26269f to 7e76787 (December 15, 2025 22:12)
Base automatically changed from james/snakemake-surgery to master (December 16, 2025 21:22)
@jameshadfield force-pushed the james/strain-name-fuzzer branch from fcda4b1 to ceed6b8 (December 17, 2025 00:36)

@jameshadfield (Member, Author) commented:

Superseded by #291 and #292
