ingest: Prioritize records by segment #294

Open
joverlee521 wants to merge 2 commits into master from prioritize-records-by-segment

@joverlee521
Contributor

Description of proposed changes

For records that have the same strain name, prioritize records that have HA segments and then records that have NA segments. Otherwise, keep the first record in the NDJSON.
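
As a minimal sketch of this rule (not the code in this PR), the comparison could look like the following. The record fields `strain`, `id`, and `segments` and both function names are hypothetical, and treating NA as a tie-breaker among records that agree on HA is only one reading of the description.

```python
def segment_priority(segments):
    """Return a sortable priority: HA presence first, then NA presence."""
    present = {segment.upper() for segment in segments}
    return ("HA" in present, "NA" in present)


def pick_records(records):
    """Return {strain: id}, keeping the best record seen for each strain.

    `records` is an iterable of dicts with hypothetical keys
    "strain", "id", and "segments" (an iterable of segment names).
    """
    best = {}
    for record in records:
        priority = segment_priority(record["segments"])
        current = best.get(record["strain"])
        # Strict ">" means ties keep the earlier record,
        # i.e. the first record in the NDJSON wins.
        if current is None or priority > current[0]:
            best[record["strain"]] = (priority, record["id"])
    return {strain: record_id for strain, (_, record_id) in best.items()}
```

Because the comparison is strict, two records with identical segment coverage never replace one another, which preserves the "keep the first record" fallback.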

Related issue(s)

Resolves #253

Checklist

  • Checks pass
  • Update changelog

Adds the framework for prioritizing records for duplicate strains.
This currently just replicates the previous behavior of choosing the
hardcoded prioritized id or the first record in the NDJSON.
For records that have the same strain name, prioritize records that
have HA segments and then records that have NA segments. Otherwise,
keep the first record in the NDJSON.
@joverlee521 linked an issue on Jan 15, 2026 that may be closed by this pull request
Comment on lines +79 to +91
# Compare against already prioritized record
if prioritized := prioritized_ids.get(strain):
    new_reasons = get_reasons(sequences, first_record = False)
    if use_new_record(prioritized["reasons"], new_reasons):
        prioritized_ids[strain] = {
            "id": record_id,
            "reasons": new_reasons
        }
else:
    prioritized_ids[strain] = {
        "id": record_id,
        "reasons": get_reasons(sequences, first_record = True)
    }
Member

When reading this I wondered about a slightly different design. Instead of keeping track of a single "prioritised" ID per strain, for each strain store a list of IDs and their reasons. You can then do a second pass through this to make a decision about the ID to pick. E.g.:

  • Clear winner? (e.g. hardcoded, only one with HA, only one with HA + NA etc)
  • Multiple potential winners? (e.g. multiple with both HA & NA)

We could start by logging the multiple potential winners ("unresolved duplicates"), and over time expand our decision algorithm to incorporate num segments sequenced, date submitted etc. At the least this would be interesting information to log.
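
A rough sketch of that two-pass idea, under stated assumptions: the record keys `strain`, `id`, and `reasons`, the reason labels `"hardcoded"`, `"HA"`, and `"NA"`, and all function names here are hypothetical and do not come from the script under review.

```python
from collections import defaultdict


def collect_candidates(records):
    """First pass: group every (id, reasons) pair by strain, in input order."""
    candidates = defaultdict(list)
    for record in records:
        entry = (record["id"], frozenset(record["reasons"]))
        candidates[record["strain"]].append(entry)
    return candidates


def score(reasons):
    # Hypothetical ordering: hardcoded beats HA, which beats NA.
    return ("hardcoded" in reasons, "HA" in reasons, "NA" in reasons)


def resolve(candidates):
    """Second pass: choose one id per strain and report unresolved ties."""
    chosen, unresolved = {}, {}
    for strain, entries in candidates.items():
        top = max(score(reasons) for _, reasons in entries)
        winners = [rid for rid, reasons in entries if score(reasons) == top]
        # Fall back to the first record in input order when there is a tie.
        chosen[strain] = winners[0]
        if len(winners) > 1:
            unresolved[strain] = winners
    return chosen, unresolved
```

The `unresolved` mapping is what could be logged as "unresolved duplicates"; extra tie-breakers such as the number of segments or the submission date would extend the tuple returned by `score`.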

P.S. This would allow you to write out a list of EPI_ISLs to take and then replace ./scripts/dedup-by-strain with a simple filter.

Contributor Author

Ah, I can see that being interesting to log...but realistically, I'm not sure who would dig into these details? In practice, we generally only care to pick specific records for titer references (at least in seasonal-flu).

Member

Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers) but it's not a big deal if we're not interested in looking at duplicate-resolution more systematically.

Contributor Author

Coming back to this because I'm realizing we probably want a similar prioritization for the INSDC data.

> Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers)

I think we are thinking the same thing here; I just need to rename/update ./scripts/dedup-by-strain to do a simple filter on the prioritized ids. Will do this tomorrow.
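
For illustration only, such a filter might be as small as the sketch below. It is not the existing ./scripts/dedup-by-strain; the function name and the `id` field are assumptions, and the real record id field may be named differently.

```python
import json


def filter_prioritized(ndjson_lines, prioritized_ids, id_field="id"):
    """Yield only the NDJSON lines whose record id was prioritized.

    `ndjson_lines` is any iterable of strings (e.g. an open file),
    `prioritized_ids` is an iterable of ids to keep, and `id_field`
    is a hypothetical name for the record id key.
    """
    keep = set(prioritized_ids)
    for line in ndjson_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines rather than failing to parse them
        if json.loads(line)[id_field] in keep:
            yield line
```

Since it is a generator, it can stream a large NDJSON file line by line without loading every record into memory.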

Contributor Author

I didn't get to this, happy to continue this when I get back. If the need arises for prioritizing by segments for flu reports, I think this is also okay to merge as-is.

Development

Successfully merging this pull request may close these issues.

ingest: prioritize records by available segments
