@jameshadfield
Member

WIP - here for discussion with @joverlee521

The only "disagreements" (which I haven't yet resolved) are a handful of strains which have multiple sequences for (all) segments. So that's reassuring!

The phylo workflow hasn't yet been updated to use the new metadata format.

DAG is a bit simpler (before: above, after: below):

[image: workflow DAG, before (above) and after (below)]

@jameshadfield
Member Author

jameshadfield commented Oct 7, 2024

Here are the 3 (yes, only 3) strains which were dropped:

Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb2. Accessions: PP761255, PP761574. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb1. Accessions: PP761260, PP761572. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pa. Accessions: PP761262, PP761577. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ha. Accessions: PP761257, PP761548, PP761557, PP761576. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment np. Accessions: PP761261, PP761550, PP761553, PP761571. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment na. Accessions: PP761256, PP761552, PP761555, PP761578. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment mp. Accessions: PP761259, PP761551, PP761554, PP761573. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ns. Accessions: PP761258, PP761549, PP761556, PP761575. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.

Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb2. Accessions: PP761569, PP766982. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb1. Accessions: PP761570, PP766984. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pa. Accessions: PP761563, PP766987. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ha. Accessions: PP761566, PP766985. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment np. Accessions: PP761567, PP766983. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment na. Accessions: PP761568, PP766981. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment mp. Accessions: PP761564, PP766980. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ns. Accessions: PP761565, PP766986. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.

Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb2. Accessions: PP862906, PQ367318. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb1. Accessions: PP862905, PQ367313. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pa. Accessions: PP862901, PQ367316. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ha. Accessions: PP862902, PQ367314. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment np. Accessions: PP862907, PQ367312. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment na. Accessions: PP862903, PQ367315. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment mp. Accessions: PP862904, PQ367317. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ns. Accessions: PP862908, PQ367311. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had zero or multiple accessions for all segments. Dropping this entire strain.
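
The rule these messages describe can be sketched in a few lines of Python. This is only an illustration inferred from the log text, not the code in this PR: the function name `group_segments`, the `(strain, segment, accession)` input shape, and the second strain in the usage example are all assumptions made for the sketch.

```python
from collections import defaultdict

# Segment order as it appears in the log messages above.
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]


def group_segments(records, segments=SEGMENTS):
    """Group (strain, segment, accession) records by strain (hypothetical helper).

    A segment is kept only when the strain has exactly one accession for it.
    A strain with no unambiguous segment at all is dropped.
    Returns (kept, messages): kept maps strain -> {segment: accession}.
    """
    by_strain = defaultdict(lambda: defaultdict(list))
    for strain, segment, accession in records:
        by_strain[strain][segment].append(accession)

    kept = {}
    messages = []
    for strain, by_segment in by_strain.items():
        unambiguous = {}
        for segment in segments:
            accessions = sorted(by_segment.get(segment, []))
            if len(accessions) == 1:
                unambiguous[segment] = accessions[0]
            elif len(accessions) > 1:
                messages.append(
                    f"Strain {strain!r} had multiple accessions for segment {segment}. "
                    f"Accessions: {', '.join(accessions)}. Skipping this segment."
                )
        if unambiguous:
            kept[strain] = unambiguous
        else:
            messages.append(
                f"Strain {strain!r} had zero or multiple accessions for all segments. "
                "Dropping this entire strain."
            )
    return kept, messages


records = [
    # Two pb2 accessions for the same strain, taken from the log above.
    ("A/sanderling/Virginia/W24-190K/2024", "pb2", "PP862906"),
    ("A/sanderling/Virginia/W24-190K/2024", "pb2", "PQ367318"),
    # Hypothetical strain and accession, for illustration only.
    ("A/hypothetical/Example/1/2024", "ha", "XX000001"),
]
kept, messages = group_segments(records)
for message in messages:
    print(message)
```

In this toy input the sanderling strain has only an ambiguous pb2, so it is dropped, while the hypothetical strain keeps its single ha accession.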

@joverlee521 and I discussed this today. We're going to leave this PR open for the moment and explore NCBI's new API in #82, which promises to group segments together, then compare those results to ours from this PR.

@jameshadfield changed the title from "James/dedup ncbi segments" to "[on hold] dedup ncbi segments" on Oct 20, 2024