Cross reference titer strains against metadata strains #292

Open
jameshadfield wants to merge 3 commits into master from james/cross-reference-titer-strains

@jameshadfield
Member

Match titer strain names against fauna (the current status-quo) and against the newly curated metadata. For the latter, do both an exact match and a more complex match, with the idea being that we can add a simple strain-name remapping layer when we download the titers from fauna.

The "more complex" match works in one of two ways. If the titer strain previously matched fauna, it uses the EPI_ISL to link the fauna strain name to the metadata strain name. Alternatively, it simplifies (normalizes) the strain name by removing punctuation, lowercasing, etc., and compares the result against similarly simplified metadata strain names.
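As a rough illustration of the EPI_ISL linking idea, here is a minimal sketch. It assumes two plain dictionaries (fauna strain name to EPI_ISL, and EPI_ISL to curated metadata strain name); the function name, dictionary names and example strains are all hypothetical and are not taken from the scripts in this PR.

```python
def remap_via_epi_isl(titer_strain, fauna_strain_to_epi_isl, epi_isl_to_metadata_strain):
    """Return the curated metadata strain name for a titer strain, or None.

    Hypothetical helper: both lookups are assumed to be plain dicts.
    """
    epi_isl = fauna_strain_to_epi_isl.get(titer_strain)
    if epi_isl is None:
        # The titer strain never matched fauna, so there is no EPI_ISL to link through.
        return None
    # Returns None when the EPI_ISL is absent from the curated metadata.
    return epi_isl_to_metadata_strain.get(epi_isl)


# Made-up example data (not real strains or accessions).
fauna_strain_to_epi_isl = {
    "A/Example/01/2020": "EPI_ISL_000001",
    "A/Other/7/2019": "EPI_ISL_000002",
}
epi_isl_to_metadata_strain = {"EPI_ISL_000001": "A/Example/1/2020"}

remapped = remap_via_epi_isl(
    "A/Example/01/2020", fauna_strain_to_epi_isl, epi_isl_to_metadata_strain
)
```

Strains whose EPI_ISL is missing from the curated metadata simply come back as `None` here and would have to fall through to the simplified-name comparison.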

Match rate against all titers

The following plot shows that the match rate for CDC & Crick titers is now almost 100%, especially for recent years. Match rates for NIID and VIDRL titers improve slightly but are almost unchanged. Circle size corresponds to the number of titers for that year (the year is extracted from the titer strain name).

[Figure: titer-matching]
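The year used for the plot is pulled out of the titer strain name. A minimal sketch of one way to do that follows, assuming the year is a 4-digit number at the start of the last `/`-separated field; the function name is hypothetical, and the actual script may handle more cases (for example 2-digit years).

```python
import re


def year_from_strain(strain):
    """Return the 4-digit year from the final '/'-separated field, or None.

    Assumption: names look like 'A/Place/123/2020'; anything else returns None.
    """
    last_field = strain.rsplit("/", 1)[-1]
    match = re.match(r"\s*(\d{4})\b", last_field)
    return int(match.group(1)) if match else None
```

Strain names that don't fit the assumed pattern yield `None`, so they could be counted separately rather than being assigned a wrong year.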

Are we losing some titers and gaining others?

TL;DR - if we incorporate a thin titer-strain-remapping layer we essentially keep all titers we (fauna) currently match.

To check we're not seeing big changes in which titers are matched despite the overall percentage improvements (above), we log summary stats as well as verbose per-strain log files. The following is a summary of the summary.

subtype   contrib year      will_lose(simple_match)  will_gain(simple_match)  will_lose(after_remap)   will_gain(after_remap)   
h3n2      cdc     ALL       9                        180                      4                        191                      
h3n2      vidrl   ALL       21                       410                      2                        736                      
h3n2      niid    ALL       4                        30                       4                        32                       
h3n2      crick   ALL       27                       210                      3                        232                      
h3n2      ALL     ALL       61                       830                      13                       1191                     
h1n1pdm   vidrl   ALL       4                        80                       0                        156                      
h1n1pdm   niid    ALL       1                        8                        0                        8                        
h1n1pdm   cdc     ALL       9                        272                      3                        276                      
h1n1pdm   crick   ALL       27                       157                      2                        186                      
h1n1pdm   ALL     ALL       41                       517                      5                        626                      
vic       vidrl   ALL       9                        32                       0                        66                       
vic       niid    ALL       0                        8                        0                        9                        
vic       cdc     ALL       5                        223                      0                        237                      
vic       crick   ALL       19                       77                       0                        91                       
vic       ALL     ALL       33                       340                      0                        403                      
yam       vidrl   ALL       3                        35                       0                        195                      
yam       niid    ALL       1                        1                        0                        1                        
yam       cdc     ALL       4                        350                      0                        351                      
yam       crick   ALL       2                        70                       0                        92                       
yam       ALL     ALL       10                       456                      0                        639 
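One way to read the columns above is as set differences between what fauna currently matches and what the curated metadata matches, before and after remapping. The sketch below assumes the counts are over strain names (they may in fact be over individual titer measurements); all names and example strains are hypothetical.

```python
def lose_gain(fauna_matched, metadata_matched):
    """Return (will_lose, will_gain) as sets of titer strain names."""
    will_lose = fauna_matched - metadata_matched  # matched today, unmatched afterwards
    will_gain = metadata_matched - fauna_matched  # unmatched today, matched afterwards
    return will_lose, will_gain


# Made-up example data.
titer_strains = {"A/X/01/2020", "A/Y/2/2021", "A/Z/3/2022"}
fauna_strains = {"A/X/01/2020", "A/Y/2/2021"}
metadata_strains = {"A/X/1/2020", "A/Y/2/2021", "A/Z/3/2022"}
remap = {"A/X/01/2020": "A/X/1/2020"}  # hypothetical remapping layer

fauna_matched = titer_strains & fauna_strains
simple_matched = titer_strains & metadata_strains
remap_matched = {s for s in titer_strains if remap.get(s, s) in metadata_strains}

lose_simple, gain_simple = lose_gain(fauna_matched, simple_matched)
lose_remap, gain_remap = lose_gain(fauna_matched, remap_matched)
```

In this toy example the remapping layer recovers the one strain that a simple match would lose, which is the same pattern the table shows at a larger scale.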

Base automatically changed from james/update-strain-names to master January 12, 2026 20:19
@joverlee521 joverlee521 linked an issue Jan 12, 2026 that may be closed by this pull request
@jameshadfield jameshadfield force-pushed the james/cross-reference-titer-strains branch from d2f05d7 to 261e421 Compare January 13, 2026 03:58
Contributor

@joverlee521 joverlee521 left a comment

The Snakemake workflow changes make sense to me! Also reviewed the remap-titer-strain-names script since I was here.

Exploratory scripts which match titer strain names against fauna (the
current status-quo) and against the newly curated metadata. For the
latter, do both an exact match and a more complex match, with the idea
being that we can add a simple strain-name remapping layer when we
download the titers from fauna. These can be removed in due course, but
are useful to have around for a while.

Titers and metadata are entirely matched on strain names. Recently we
changed our approach to creating (curating) metadata which inevitably
resulted in some strain name changes. To preserve our previous matches
we use a lookup of the previous fauna strain names to EPI_ISL to allow
titer strain names to be updated.

There are a number of other comparison approaches we can use, such as
lowercasing & removing punctuation. If such a match is found we log it
and report a "maybe" match in the TSV, the idea being that we can come
back and add manual fixes for those flagged / candidate strain names.

Visualises the stats from the matching approach in the previous commit
@jameshadfield jameshadfield force-pushed the james/cross-reference-titer-strains branch from 261e421 to b3f7d45 Compare January 14, 2026 03:43
@jameshadfield
Member Author

I've updated this PR - it's good to review (and expand on, as needed), but we're probably better off holding off on merging until after this week's report is done.

The automated upload_all_titers rule (or a dev rule dev_only_all_titers if running locally) will now correct titer strain names so that strains which previously matched fauna titers are renamed¹ and thus can be used for downstream analysis. We also identify a (small) number of "maybe" strain-name matches by comparing simplified versions² of the strain names; we log such matches as potential fixes, but don't actually change the strain names in the titers TSV. We add two columns to the titers TSV: virus_strain_match and serum_strain_match, with values of "yes", "maybe" or "no".

We visualize the matching results (one figure per lineage) and upload PNGs to S3. The visualisation is in small-multiples of center, passage and assay, but could be further split by host if required.

I don't have a great answer for why specific titer strains don't match.


¹ Not all of them, but nearly all of them. Since we use EPI_ISL as the merge key, if the fauna sample's EPI_ISL wasn't in our curated metadata then we can't match it.

² Simplification means: lowercase, punctuation removed, leading zeros removed from numbers, and some other ad-hoc fixes.
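To make the "yes" / "maybe" / "no" values concrete, here is a minimal sketch of the simplified comparison. It covers only lowercasing, stripping leading zeros from numbers and removing punctuation; the "other ad-hoc fixes" are not reproduced, and the function names and example strains are hypothetical rather than taken from the script.

```python
import re


def simplify(strain):
    """Lowercase, strip leading zeros from digit runs, drop non-alphanumerics."""
    s = strain.lower()
    # Remove zeros that start a run of digits, e.g. '005' -> '5' (but '100' is untouched).
    s = re.sub(r"(?<!\d)0+(?=\d)", "", s)
    # Drop punctuation and whitespace, including the '/' separators.
    return re.sub(r"[^a-z0-9]", "", s)


def classify(strain, metadata_strains, simplified_metadata):
    """Return 'yes' for an exact match, 'maybe' for a simplified match, else 'no'."""
    if strain in metadata_strains:
        return "yes"
    if simplify(strain) in simplified_metadata:
        return "maybe"
    return "no"


# Made-up example data.
metadata_strains = {"A/Example/5/2020", "B/Sample-City/12/2019"}
simplified_metadata = {simplify(s): s for s in metadata_strains}

results = {
    name: classify(name, metadata_strains, simplified_metadata)
    for name in ["A/Example/5/2020", "a/example/005/2020", "A/Unknown/1/2001"]
}
```

Keeping the simplified-name lookup as a dict means a "maybe" hit can also report which metadata strain it resembles, which is handy when logging candidate fixes.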

Comment on lines +125 to +126
# TODO XXX add a way to provide a hardcoded mapping of titer-strain-name to new-strain-name
# and apply those matches here.
Contributor

I'd push to implement this as part of this PR. We don't need to add any fixes to it just yet, but I think it'd be nice to have the mapping option available if we notice important titer strains are mismatched.
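For what it's worth, a hardcoded mapping could be as small as a two-column TSV that is loaded into a dict and applied before matching. The sketch below is only an illustration: the column names `titer_strain` and `new_strain`, the function names and the example row are all assumptions, not something that exists in the repo.

```python
import csv
import io


def load_strain_fixes(handle):
    """Read a TSV with hypothetical columns 'titer_strain' and 'new_strain'."""
    fixes = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        fixes[row["titer_strain"]] = row["new_strain"]
    return fixes


def apply_strain_fixes(strain, fixes):
    """Return the manually fixed name if one exists, else the name unchanged."""
    return fixes.get(strain, strain)


# Made-up example file contents; a real file would be opened with open(path, newline="").
example_tsv = "titer_strain\tnew_strain\nA/Example/01/2020\tA/Example/1/2020\n"
fixes = load_strain_fixes(io.StringIO(example_tsv))
fixed = apply_strain_fixes("A/Example/01/2020", fixes)
```

An empty file (header only) yields an empty dict, so the option can ship without any fixes in it yet.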

@jameshadfield jameshadfield requested a review from huddlej January 18, 2026 21:05

Linked issue (may be closed by merging this pull request): Check strain names in titer data
