Cross reference titer strains against metadata strains #292
jameshadfield wants to merge 3 commits into master
joverlee521
left a comment
The Snakemake workflow changes make sense to me! Also reviewed the remap-titer-strain-names script since I was here.
Exploratory scripts which match titer strain names against fauna (the current status-quo) and against the newly curated metadata. For the latter, do both an exact match and a more complex match, with the idea being that we can add a simple strain-name remapping layer when we download the titers from fauna. These can be removed in due course, but are useful to have around for a while.
Titers and metadata are matched entirely on strain names. We recently changed our approach to creating (curating) metadata, which inevitably resulted in some strain name changes. To preserve our previous matches we use a lookup of the previous fauna strain names to EPI_ISL, which allows titer strain names to be updated. There are a number of other comparison approaches we can use, such as lowercasing and removing punctuation. If such a match is found we log it and report a "maybe" match in the TSV, the idea being that we can come back and add manual fixes for those flagged (candidate) strain names.
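For readers skimming the commit message, the matching order described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the function name, the argument names, the status labels other than "maybe", and the placeholder EPI_ISL value are all assumptions.

```python
import string

# Translation table that deletes ASCII punctuation (illustrative choice).
_DROP_PUNCT = str.maketrans("", "", string.punctuation)


def _loose(name):
    """Lowercase and remove punctuation, as in the 'other comparison approaches'."""
    return name.lower().translate(_DROP_PUNCT)


def classify_titer_strain(titer_strain, metadata_strains, fauna_epi_isl, metadata_by_epi_isl):
    """Return (status, curated_name); status is 'exact', 'epi_isl', 'maybe' or 'none'.

    metadata_strains    : set of curated strain names
    fauna_epi_isl       : dict of previous fauna strain name -> EPI_ISL
    metadata_by_epi_isl : dict of EPI_ISL -> curated strain name
    (all three structures are hypothetical stand-ins for the real inputs)
    """
    # 1. Exact strain-name match against the curated metadata.
    if titer_strain in metadata_strains:
        return ("exact", titer_strain)
    # 2. Previous fauna strain name -> EPI_ISL -> curated strain name.
    epi_isl = fauna_epi_isl.get(titer_strain)
    if epi_isl is not None and epi_isl in metadata_by_epi_isl:
        return ("epi_isl", metadata_by_epi_isl[epi_isl])
    # 3. Loose comparison; a hit is only a candidate for manual review.
    loose_index = {_loose(s): s for s in metadata_strains}
    candidate = loose_index.get(_loose(titer_strain))
    if candidate is not None:
        return ("maybe", candidate)
    return ("none", None)


# Illustrative inputs; "EPI_ISL_1" is a placeholder, not a real accession.
metadata_strains = {"A/Texas/50/2012", "A/Hong Kong/4801/2014"}
fauna_epi_isl = {"A/HongKong/4801/2014": "EPI_ISL_1"}
metadata_by_epi_isl = {"EPI_ISL_1": "A/Hong Kong/4801/2014"}
```

Keeping the "maybe" status separate from the other two means a loose hit never silently renames a titer strain; it only flags a candidate.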
Visualises the stats from the matching approach in the previous commit
I've updated this PR - it's good to review (and expand on, as needed), but we're probably better holding off merging until after this week's report has been done. The automated

We visualize the matching results (one figure per lineage) and upload PNGs to S3. The visualisation is in small-multiples of center, passage and assay, but could be further split by host if required. I don't have a great answer for why specific titer strains don't match.

¹ Not all of them, but nearly all of them. Since we use EPI_ISL as the merge key, if the fauna sample's EPI_ISL wasn't in our curated metadata then we can't match it.

² Simplification means: lowercase, punctuation removed, leading zeros removed from numbers, and some other ad-hoc fixes.
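To make the "simplification" in footnote ² concrete, here is a rough sketch of the three named steps. It is an assumption-laden illustration, not the script's implementation: the function name is made up, the ad-hoc fixes are omitted, and stripping whitespace alongside punctuation is my own addition.

```python
import re
import string

# Delete ASCII punctuation; also deleting spaces is an assumption of this sketch.
_DROP = str.maketrans("", "", string.punctuation + " ")

# Zeros that start a run of digits and are followed by at least one more digit.
_LEADING_ZEROS = re.compile(r"(?<!\d)0+(?=\d)")


def simplify_strain_name(name):
    """Lowercase, strip leading zeros from numbers, then remove punctuation."""
    lowered = name.lower()
    # Zeros are removed before punctuation so field boundaries are still visible.
    without_zeros = _LEADING_ZEROS.sub("", lowered)
    return without_zeros.translate(_DROP)
```

For example, `simplify_strain_name("A/Texas/050/2012")` and `simplify_strain_name("a/texas/50/2012")` both give `"atexas502012"`. Dropping the separators could in principle make two different names collide, which is one more reason to treat these hits as candidates instead of confirmed matches.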
# TODO XXX add a way to provide a hardcoded mapping of titer-strain-name to new-strain-name
# and apply those matches here.
I'd push to implement this as part of this PR. We don't need to add any fixes to it just yet, but I think it'd be nice to have the mapping option available if we notice important titer strains are mismatched.
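One possible shape for that mapping option, offered only as a sketch: a two-column TSV of titer-strain-name to new-strain-name that is loaded once and applied before any other matching. The file layout, the comment syntax, and both function names are assumptions, not anything that exists in this PR.

```python
import csv
import io


def load_strain_map(handle):
    """Read a two-column TSV (titer strain name, new strain name).

    Blank lines and lines starting with '#' are skipped (assumed convention).
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(row)}: {row!r}")
        old, new = (field.strip() for field in row)
        mapping[old] = new
    return mapping


def apply_strain_map(titer_strains, mapping):
    """Replace any titer strain name found in the mapping; leave the rest untouched."""
    return [mapping.get(name, name) for name in titer_strains]


# Illustrative usage with an in-memory file standing in for the real TSV.
example_tsv = io.StringIO("# titer-strain-name\tnew-strain-name\nA/texas/50/2012\tA/Texas/50/2012\n\n")
example_map = load_strain_map(example_tsv)
```

Starting with an empty file would keep the behaviour unchanged until a fix is actually needed.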
Match titer strain names against fauna (the current status-quo) and against the newly curated metadata. For the latter, do both an exact match and a more complex match, with the idea being that we can add a simple strain-name remapping layer when we download the titers from fauna.
The "more complex" match either uses EPI_ISL to link strain names (if the titer strain was a fauna match) or simplifies (normalizes?) the strain name by removing punctuation, lowercasing, etc., and compares it against similarly simplified metadata strain names.
Match rate against all titers
The following plot shows that the match rate for CDC & Crick titers is now almost 100%, especially for recent years. Match rates for NIID and VIDRL titers improve slightly but are otherwise almost unchanged. Circle size corresponds to the number of titers for that year (extracted from the titer strain name).
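As a rough illustration of the numbers behind each circle, the per-year count and match rate could be computed along these lines. This is a sketch under assumptions: it treats a trailing four-digit field in the strain name as the year and skips names without one, which may not be how the plotting script extracts it.

```python
import re
from collections import defaultdict

# Assumes the year is the final "/"-separated field and has four digits.
_YEAR = re.compile(r"/(\d{4})$")


def match_rate_by_year(titers):
    """titers: iterable of (titer_strain_name, matched) pairs.

    Returns {year: (n_titers, fraction_matched)}, ordered by year.
    """
    counts = defaultdict(lambda: [0, 0])  # year -> [n_titers, n_matched]
    for strain, matched in titers:
        found = _YEAR.search(strain)
        if found is None:
            continue  # no recognisable year in the strain name
        year = int(found.group(1))
        counts[year][0] += 1
        counts[year][1] += int(bool(matched))
    return {year: (n, hits / n) for year, (n, hits) in sorted(counts.items())}
```

Here `n_titers` would drive the circle size and `fraction_matched` the position on the match-rate axis.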
Are we losing some titers and gaining others?
TL;DR - if we incorporate a thin titer-strain-remapping layer we essentially keep all titers we (fauna) currently match.
To check that we're not seeing big changes in which titers are matched, despite the overall percentage improvements (above), we log summary stats as well as verbose per-strain log files. The following is a summary of the summary.
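The underlying check amounts to a set comparison between what fauna matches today and what the curated metadata matches. A minimal sketch, with hypothetical function names and labels that are not taken from the scripts' actual log format:

```python
def compare_matched_sets(fauna_matched, curated_matched):
    """Split titer strain names into kept / lost / gained relative to fauna."""
    before, after = set(fauna_matched), set(curated_matched)
    return {
        "kept": sorted(before & after),    # matched by both approaches
        "lost": sorted(before - after),    # matched by fauna only
        "gained": sorted(after - before),  # matched by curated metadata only
    }


def summarise(comparison):
    """Collapse the per-strain lists into counts for the summary stats."""
    return {label: len(names) for label, names in comparison.items()}
```

The sorted per-strain lists correspond to the verbose logs and the counts to the summary stats; a large "lost" count would be the signal that the overall percentages are hiding churn.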