Cross reference titer strains against metadata strains by jameshadfield · Pull Request #292 · nextstrain/seasonal-flu

jameshadfield · 2026-01-12T19:59:56Z

Match titer strain names against fauna (the current status quo) and against the newly curated metadata. For the latter, do both an exact match and a more complex match, with the idea being that we can add a simple strain-name remapping layer when we download the titers from fauna.

The "more complex" match works in one of two ways. If the titer strain previously matched fauna, we use the EPI_ISL to link the old strain name to the new one. Otherwise we simplify (normalize) the strain name by lowercasing, removing punctuation, etc., and compare it against similarly simplified metadata strain names.

Match rate against all titers

The following plot shows that the match rate for CDC & Crick titers is now almost 100%, especially for recent years. Match rates for NIID and VIDRL titers improve only slightly. Circle size corresponds to the number of titers for that year (the year is extracted from the titer strain name).
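
As a rough illustration of how a per-year match rate like the one plotted could be computed, here is a minimal sketch. The function names are hypothetical, and it assumes the year is the final four-digit field of the strain name (e.g. `A/Texas/50/2012`); names with two-digit or missing years are simply skipped, which may differ from what the actual scripts do.

```python
import re
from collections import defaultdict


def year_from_strain(name):
    """Return the trailing 4-digit year of a strain name, or None if absent."""
    m = re.search(r"/(\d{4})$", name.strip())
    return int(m.group(1)) if m else None


def match_rate_by_year(titer_strains, metadata_strains):
    """Map year -> (number of titer strains, fraction with an exact metadata match)."""
    known = set(metadata_strains)
    total, matched = defaultdict(int), defaultdict(int)
    for name in titer_strains:
        year = year_from_strain(name)
        if year is None:
            continue  # assumption: names without a parseable year are ignored
        total[year] += 1
        matched[year] += name in known  # True counts as 1
    return {year: (total[year], matched[year] / total[year]) for year in sorted(total)}
```

The first element of each tuple would drive the circle size and the second the plotted rate.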

Are we losing some titers and gaining others?

TL;DR - if we incorporate a thin titer-strain-remapping layer we essentially keep all titers we (fauna) currently match.

To check that we're not seeing big changes in which titers are matched, despite the overall percentage improvements above, we log summary stats as well as verbose per-strain log files. The following is a summary of the summary.

subtype   contrib year      will_lose(simple_match)  will_gain(simple_match)  will_lose(after_remap)   will_gain(after_remap)   
h3n2      cdc     ALL       9                        180                      4                        191                      
h3n2      vidrl   ALL       21                       410                      2                        736                      
h3n2      niid    ALL       4                        30                       4                        32                       
h3n2      crick   ALL       27                       210                      3                        232                      
h3n2      ALL     ALL       61                       830                      13                       1191                     
h1n1pdm   vidrl   ALL       4                        80                       0                        156                      
h1n1pdm   niid    ALL       1                        8                        0                        8                        
h1n1pdm   cdc     ALL       9                        272                      3                        276                      
h1n1pdm   crick   ALL       27                       157                      2                        186                      
h1n1pdm   ALL     ALL       41                       517                      5                        626                      
vic       vidrl   ALL       9                        32                       0                        66                       
vic       niid    ALL       0                        8                        0                        9                        
vic       cdc     ALL       5                        223                      0                        237                      
vic       crick   ALL       19                       77                       0                        91                       
vic       ALL     ALL       33                       340                      0                        403                      
yam       vidrl   ALL       3                        35                       0                        195                      
yam       niid    ALL       1                        1                        0                        1                        
yam       cdc     ALL       4                        350                      0                        351                      
yam       crick   ALL       2                        70                       0                        92                       
yam       ALL     ALL       10                       456                      0                        639
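
The `will_lose` and `will_gain` columns can be thought of as set differences between what matches today (fauna) and what matches against the curated metadata. The sketch below is illustrative only: the function name is hypothetical, and it counts distinct strain names, whereas the real summary may count individual titer measurements.

```python
def lose_gain(titer_strains, fauna_strains, metadata_strains):
    """Compare which titer strains match fauna (old) vs curated metadata (new)."""
    titers = set(titer_strains)
    old_matches = titers & set(fauna_strains)
    new_matches = titers & set(metadata_strains)
    return {
        "will_lose": sorted(old_matches - new_matches),  # matched before, not any more
        "will_gain": sorted(new_matches - old_matches),  # newly matched
    }
```

Running this once with the raw titer strain names and once with the remapped names would give the "simple_match" and "after_remap" pairs of columns respectively.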

joverlee521

The Snakemake workflow changes make sense to me! Also reviewed the remap-titer-strain-names script since I was here.

scripts/remap-titer-strain-names.py

workflow/snakemake_rules/download_from_fauna.smk

Exploratory scripts which match titer strain names against fauna (the current status quo) and against the newly curated metadata. For the latter, do both an exact match and a more complex match, with the idea being that we can add a simple strain-name remapping layer when we download the titers from fauna. These can be removed in due course, but are useful to have around for a while.

Titers and metadata are matched entirely on strain names. We recently changed our approach to creating (curating) metadata, which inevitably resulted in some strain name changes. To preserve our previous matches we use a lookup from the previous fauna strain names to EPI_ISL, which allows titer strain names to be updated. There are a number of other comparison approaches we can use, such as lowercasing and removing punctuation. If such a match is found we log it and report a "maybe" match in the TSV, the idea being that we can come back and add manual fixes for those flagged candidate strain names.
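
A minimal sketch of the EPI_ISL-based lookup described above, assuming two plain dictionaries are already available: old fauna strain name to EPI_ISL, and EPI_ISL to curated strain name. The function and argument names are hypothetical and not taken from `scripts/remap-titer-strain-names.py`.

```python
def build_remap(fauna_strain_to_epi, epi_to_curated_strain):
    """Return {old fauna strain name: curated strain name} for names that changed."""
    remap = {}
    for old_name, epi_isl in fauna_strain_to_epi.items():
        new_name = epi_to_curated_strain.get(epi_isl)
        # If the EPI_ISL is missing from the curated metadata we cannot remap.
        if new_name is not None and new_name != old_name:
            remap[old_name] = new_name
    return remap


def remap_name(name, remap):
    """Rename a titer strain if a remapping exists, otherwise return it unchanged."""
    return remap.get(name, name)
```

Strains whose EPI_ISL is absent from the curated metadata fall through unchanged, which is consistent with the caveat that not every previous match can be preserved.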

Visualises the stats from the matching approach in the previous commit

jameshadfield · 2026-01-14T03:57:21Z

I've updated this PR. It's ready to review (and expand on, as needed), but we should probably hold off merging until after this week's report has been done.

The automated upload_all_titers rule (or a dev rule dev_only_all_titers if running locally) will now correct titer strain names so that strains which previously matched fauna titers are renamed¹ and thus can be used for downstream analysis. We also identify a (small) number of "maybe" strain-name matches by comparing simplified versions² of the strain names; we log such matches as potential fixes, but don't actually change the strain names in the titers TSV. We add two columns to the titers TSV: virus_strain_match and serum_strain_match, with values of "yes", "maybe" or "no".
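
To make the "yes" / "maybe" / "no" values concrete, here is a hedged sketch of how the two new columns might be filled in. It assumes the EPI_ISL-based renaming has already been applied, that titer rows are dictionaries with `virus_strain` and `serum_strain` keys (assumed column names), and that `simplify` is any callable that normalizes a strain name; `str.lower` is only a stand-in default.

```python
def classify(name, known, simplified_known, simplify):
    """Return 'yes' for an exact match, 'maybe' for a simplified match, else 'no'."""
    if name in known:
        return "yes"
    if simplify(name) in simplified_known:
        return "maybe"
    return "no"


def annotate(rows, metadata_strains, simplify=str.lower):
    """Return copies of the titer rows with the two match columns added."""
    known = set(metadata_strains)
    simplified_known = {simplify(s) for s in known}
    annotated = []
    for row in rows:
        row = dict(row)  # do not mutate the caller's rows
        for role in ("virus", "serum"):
            row[f"{role}_strain_match"] = classify(
                row[f"{role}_strain"], known, simplified_known, simplify
            )
        annotated.append(row)
    return annotated
```

Note that a "maybe" only flags the row; the strain name itself is left untouched, as described above.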

We visualize the matching results (one figure per lineage) and upload PNGs to S3. The visualisation uses small multiples by center, passage and assay, but could be further split by host if required.

I don't have a great answer for why specific titer strains don't match.

¹ Not all of them, but nearly all of them. Since we use EPI_ISL as the merge key, if the fauna sample's EPI_ISL wasn't in our curated metadata then we can't match it.

² Simplification means: lowercase, punctuation removed, leading zeros removed from numbers, and some other ad-hoc fixes.
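
A sketch of what such a simplification could look like, covering only the three listed steps (the ad-hoc fixes are not reproduced). Keeping the `/` field separator is an assumption made here so that adjacent fields cannot run together; the actual script may treat it differently.

```python
import re


def simplify(name):
    """Lowercase, strip leading zeros from numbers, and drop punctuation and spaces."""
    name = name.lower()
    # Strip leading zeros first, while digit runs are still delimited ("045" -> "45").
    name = re.sub(r"\d+", lambda m: str(int(m.group())), name)
    # Assumption: keep letters, digits and "/" only.
    return re.sub(r"[^a-z0-9/]", "", name)
```

With this, `A/Hong Kong/045/2019` and `A/Hong-Kong/45/2019` both simplify to the same string, so they would be reported as a "maybe" match.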

joverlee521 · 2026-01-16T01:26:53Z

scripts/remap-titer-strain-names.py

+    # TODO XXX add a way to provide a hardcoded mapping of titer-strain-name to new-strain-name
+    # and apply those matches here.


I'd push to implement this as part of this PR. We don't need to add any fixes to it just yet, but I think it'd be nice to have the mapping option available if we notice important titer strains are mismatched.
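
One possible shape for such a mapping, sketched under assumptions: a two-column TSV with headers `titer_strain` and `new_strain` (both hypothetical names), loaded once and applied before any other matching. This is not the implementation in the PR.

```python
import csv
import io


def load_manual_remap(tsv_text):
    """Parse a TSV of manual fixes into {titer strain name: new strain name}."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    remap = {}
    for row in reader:
        old, new = row["titer_strain"].strip(), row["new_strain"].strip()
        if old in remap and remap[old] != new:
            raise ValueError(f"conflicting manual remap for {old!r}")
        remap[old] = new
    return remap


def apply_manual_remap(name, remap):
    """Return the manually mapped name if one exists, else the original name."""
    return remap.get(name, name)
```

An empty file (header only) yields an empty mapping, so the option can ship without any fixes in it yet.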

jameshadfield mentioned this pull request Jan 12, 2026

WIP - fuzzy strain name matching #273

Closed

Base automatically changed from james/update-strain-names to master January 12, 2026 20:19

joverlee521 linked an issue Jan 12, 2026 that may be closed by this pull request

Check strain names in titer data #286

Open

jameshadfield force-pushed the james/cross-reference-titer-strains branch from d2f05d7 to 261e421 January 13, 2026 03:58

joverlee521 reviewed Jan 13, 2026

jameshadfield added 3 commits January 14, 2026 14:13

[titer download] viz for strain name matches

b3f7d45

jameshadfield force-pushed the james/cross-reference-titer-strains branch from 261e421 to b3f7d45 January 14, 2026 03:43

joverlee521 reviewed Jan 16, 2026

jameshadfield requested a review from huddlej January 18, 2026 21:05

jameshadfield wants to merge 3 commits into master from james/cross-reference-titer-strains
