ingest: Prioritize records by segment by joverlee521 · Pull Request #294 · nextstrain/seasonal-flu

joverlee521 · 2026-01-15T00:11:45Z

Description of proposed changes

For records that have the same strain name, prioritize records that have HA segments and then records that have NA segments. Otherwise, keep the first record in the NDJSON.

Related issue(s)

Resolves #253

Checklist

Checks pass
Update changelog

jameshadfield · 2026-01-15T03:04:41Z

ingest/scripts/prioritize_id_per_strain

+        # Compare against already prioritized record
+        if prioritized := prioritized_ids.get(strain):
+            new_reasons = get_reasons(sequences, first_record = False)
+            if use_new_record(prioritized["reasons"], new_reasons):
+                prioritized_ids[strain] = {
+                    "id": record_id,
+                    "reasons": new_reasons
+                }
+        else:
+            prioritized_ids[strain] = {
+                "id": record_id,
+                "reasons": get_reasons(sequences, first_record = True)
+            }
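
The diff above calls `get_reasons` and `use_new_record` without showing their bodies. As a rough illustration of the HA-then-NA rule from the PR description, here is a minimal sketch. Everything in it is an assumption: `sequences` is treated as a collection of lowercase segment names, the reason labels are invented, and the hardcoded prioritized ids mentioned in the first commit are left out. It is not the code in `ingest/scripts/prioritize_id_per_strain`.

```python
# Hypothetical sketch: function names match the diff, bodies are assumed.

def get_reasons(sequences, first_record):
    """Collect the reasons a record might be prioritized."""
    reasons = set()
    if "ha" in sequences:
        reasons.add("has_ha")
    if "na" in sequences:
        reasons.add("has_na")
    if first_record:
        reasons.add("first_record")
    return reasons


def priority_key(reasons):
    # Tuples compare left to right, so HA outranks NA,
    # and NA outranks being the first record in the NDJSON.
    return ("has_ha" in reasons, "has_na" in reasons, "first_record" in reasons)


def use_new_record(current_reasons, new_reasons):
    """True only when the new record strictly outranks the current one."""
    return priority_key(new_reasons) > priority_key(current_reasons)


def prioritize(records):
    """Single pass over (record_id, strain, sequences) tuples, as in the diff."""
    prioritized_ids = {}
    for record_id, strain, sequences in records:
        if prioritized := prioritized_ids.get(strain):
            new_reasons = get_reasons(sequences, first_record=False)
            if use_new_record(prioritized["reasons"], new_reasons):
                prioritized_ids[strain] = {"id": record_id, "reasons": new_reasons}
        else:
            prioritized_ids[strain] = {
                "id": record_id,
                "reasons": get_reasons(sequences, first_record=True),
            }
    return prioritized_ids
```

Because the comparison is strict, two records with the same segments leave the earlier one in place, which matches the "otherwise, keep the first record" fallback.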


When reading this I wondered about a slightly different design. Instead of keeping track of a single "prioritised" ID per strain, for each strain store a list of IDs and their reasons. You can then do a second pass through this to make a decision about the ID to pick. E.g.:

Clear winner? (e.g. hardcoded, only one with HA, only one with HA + NA etc)

Multiple potential winners? (e.g. multiple with both HA & NA)

We could start by logging the multiple potential winners ("unresolved duplicates"), and over time expand our decision algorithm to incorporate num segments sequenced, date submitted etc. At the least this would be interesting information to log.

P.S. This would allow you to write out a list of EPI_ISLs to take and then replace ./scripts/dedup-by-strain with a simple filter.
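
The two-pass idea in this comment could look roughly like the sketch below: a first pass groups every candidate by strain, and a second pass picks a winner and reports ties as "unresolved duplicates". The record shape `(record_id, strain, segments)`, the function names, and the `hardcoded_ids` parameter are all hypothetical, and the ids in any usage are made up.

```python
import sys
from collections import defaultdict


def segment_rank(segments):
    # HA outranks NA; equal ranks mean this rule cannot separate the records.
    return ("ha" in segments, "na" in segments)


def collect_candidates(records):
    """First pass: group (record_id, strain, segments) tuples by strain."""
    candidates = defaultdict(list)
    for record_id, strain, segments in records:
        candidates[strain].append((record_id, set(segments)))
    return candidates


def resolve(candidates, hardcoded_ids=frozenset()):
    """Second pass: choose one id per strain and collect unresolved ties."""
    chosen, unresolved = {}, {}
    for strain, entries in candidates.items():
        hardcoded = [rid for rid, _ in entries if rid in hardcoded_ids]
        if hardcoded:
            chosen[strain] = hardcoded[0]  # clear winner: hardcoded id
            continue
        best = max(segment_rank(segs) for _, segs in entries)
        winners = [rid for rid, segs in entries if segment_rank(segs) == best]
        chosen[strain] = winners[0]  # fall back to input order
        if len(winners) > 1:
            unresolved[strain] = winners
    return chosen, unresolved


def log_unresolved(unresolved, stream=sys.stderr):
    """Report strains where several records tied for the top rank."""
    for strain, winners in unresolved.items():
        print(f"unresolved duplicates for {strain}: {', '.join(winners)}", file=stream)
```

The values of `chosen` are the list of ids to keep, so a later step only needs to filter on them. Extra tie-breakers such as number of segments or submission date would extend `segment_rank` without touching the first pass.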

joverlee521 · Jan 16, 2026

Ah, I can see that being interesting to log... but realistically, I'm not sure who would dig into these details? In practice, we generally only care to pick specific records for titer references (at least in seasonal-flu).

jameshadfield · Jan 20, 2026

Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers) but it's not a big deal if we're not interested in looking at duplicate-resolution more systematically.

joverlee521 · Feb 5, 2026

Coming back to this because I'm realizing we probably want a similar prioritization for the INSDC data.

> Fair enough. I still think I like the idea of doing all the dedup in this script (i.e. pick-strains + filter, rather than prioritisation + dedup layers)

I think we are thinking the same thing here, I just need to rename/update ./scripts/dedup-by-strain to do a simple filter on the prioritized ids. Will do this tomorrow.
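
A "simple filter on the prioritized ids" could be as small as the sketch below. It is only an illustration: the function name and the `gisaid_epi_isl` id field are assumptions, not the actual interface of `./scripts/dedup-by-strain`.

```python
import json


def filter_ndjson(lines, keep_ids, id_field="gisaid_epi_isl"):
    """Yield only the NDJSON lines whose record id is in keep_ids.

    id_field is a hypothetical field name; adjust it to the real schema.
    """
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if record.get(id_field) in keep_ids:
            yield line
```

Input order is preserved and each kept line is passed through unchanged, so the step stays a pure filter with no prioritization logic of its own.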

joverlee521 · Feb 10, 2026

I didn't get to this, happy to continue this when I get back. If the need arises for prioritizing by segments for flu reports, I think this is also okay to merge as-is.

joverlee521 added 2 commits January 14, 2026 15:43

ingest: Add step to prioritize id per strain

4a37a25

Adds the framework for prioritizing records for duplicate strains. This currently just replicates the previous behavior of choosing the hardcoded prioritized id or the first record in the NDJSON.

ingest: prioritize records by segment

affb798

For records that have the same strain name, prioritize records that have HA segments and then records that have NA segments. Otherwise, keep the first record in the NDJSON.

joverlee521 linked an issue Jan 15, 2026 that may be closed by this pull request

ingest: prioritize records by available segments #253

Open

jameshadfield reviewed Jan 15, 2026

joverlee521 wants to merge 2 commits into master from prioritize-records-by-segment
