Adds the framework for prioritizing records for duplicate strains. This currently just replicates the previous behavior of choosing the hardcoded prioritized id or the first record in the NDJSON.
For records that have the same strain name, prioritize records that have HA segments and then records that have NA segments. Otherwise, keep the first record in the NDJSON.
# Compare against already prioritized record
if prioritized := prioritized_ids.get(strain):
    new_reasons = get_reasons(sequences, first_record = False)
    if use_new_record(prioritized["reasons"], new_reasons):
        prioritized_ids[strain] = {
            "id": record_id,
            "reasons": new_reasons
        }
else:
    prioritized_ids[strain] = {
        "id": record_id,
        "reasons": get_reasons(sequences, first_record = True)
    }
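The hunk above calls `get_reasons` and `use_new_record` without showing their bodies. As a rough sketch only, here is one way they could work, assuming `sequences` maps segment names such as `"ha"` and `"na"` to sequence strings and that reasons are compared as ranked tuples. The tuple layout and the key names are assumptions for illustration, not the PR's actual implementation, and the hardcoded prioritized ids are left out.

```python
def get_reasons(sequences, first_record):
    """Return a priority tuple for one record; larger tuples rank higher.

    Assumption: `sequences` maps segment names ("ha", "na", ...) to
    sequence strings, and a missing or empty value means "no segment".
    """
    has_ha = bool(sequences.get("ha"))
    has_na = bool(sequences.get("na"))
    # Order matters: HA outranks NA, and being first only breaks ties.
    return (has_ha, has_na, first_record)


def use_new_record(current_reasons, new_reasons):
    """Replace the stored record only if the new one ranks strictly higher."""
    return new_reasons > current_reasons
```

Because the comparison is strict, a later record with the same segments never displaces the one already stored, which keeps the "first record in the NDJSON" fallback.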
When reading this I wondered about a slightly different design. Instead of keeping track of a single "prioritised" ID per strain, store a list of IDs and their reasons for each strain. You can then do a second pass through this list to decide which ID to pick. E.g.:
- Clear winner? (e.g. hardcoded, only one with HA, only one with HA + NA etc)
- Multiple potential winners? (e.g. multiple with both HA & NA)
We could start by logging the multiple potential winners ("unresolved duplicates"), and over time expand our decision algorithm to incorporate the number of segments sequenced, date submitted, etc. At the least this would be interesting information to log.
P.S. This would allow you to write out a list of EPI_ISLs to take and then replace ./scripts/dedup-by-strain with a simple filter.
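To make the two-pass idea concrete, here is a minimal sketch. It assumes each record is a dict with hypothetical `id`, `strain`, and `sequences` keys, and it ranks only on HA/NA presence; none of these names are confirmed by the PR.

```python
from collections import defaultdict


def collect_candidates(records):
    """First pass: group (id, reasons) by strain, preserving input order."""
    candidates = defaultdict(list)
    for record in records:
        # Hypothetical field names; adjust to the real NDJSON schema.
        sequences = record.get("sequences", {})
        reasons = (bool(sequences.get("ha")), bool(sequences.get("na")))
        candidates[record["strain"]].append((record["id"], reasons))
    return candidates


def pick_ids(candidates):
    """Second pass: pick one id per strain and collect unresolved ties."""
    picked, unresolved = {}, {}
    for strain, entries in candidates.items():
        best = max(reasons for _, reasons in entries)
        winners = [record_id for record_id, reasons in entries if reasons == best]
        picked[strain] = winners[0]  # fall back to the first record seen
        if len(winners) > 1:
            unresolved[strain] = winners  # candidates worth logging
    return picked, unresolved
```

The `unresolved` mapping is what could be logged as "unresolved duplicates", and `picked.values()` is the list of ids that a later filter step would keep.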
Ah, I can see that being interesting to log...but realistically, I'm not sure who would dig into these details? In practice, we generally only care to pick specific records for titer references (at least in seasonal-flu).
Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers) but it's not a big deal if we're not interested in looking at duplicate-resolution more systematically.
Coming back to this because I'm realizing we probably want a similar prioritization for the INSDC data.
> Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers)
I think we are thinking the same thing here; I just need to rename/update ./scripts/dedup-by-strain to do a simple filter on the prioritized ids. Will do this tomorrow.
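For reference, a "simple filter" over the NDJSON might look roughly like the sketch below. The function name `filter_prioritized` and the `id` field are assumptions for illustration; this is not the current content of ./scripts/dedup-by-strain.

```python
import json


def filter_prioritized(lines, keep_ids):
    """Yield only the NDJSON lines whose record id is in `keep_ids`."""
    keep = set(keep_ids)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Hypothetical field name; the real records may use another key.
        if record.get("id") in keep:
            yield line
```

Used as a script, this could read from `sys.stdin` and write each kept line to `sys.stdout`, with `keep_ids` built from the `"id"` values stored in `prioritized_ids`.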
I didn't get to this, but I'm happy to continue when I get back. If the need arises to prioritize by segments for flu reports, I think this is also okay to merge as-is.
Description of proposed changes
For records that have the same strain name, prioritize records that have HA segments and then records that have NA segments. Otherwise, keep the first record in the NDJSON.
Related issue(s)
Resolves #253
Checklist