Fix BLAST results protein to taxonomic accession assignment #317

@chasemc

Description

Currently, the documentation instructs users to download prot.accession2taxid.gz, and the code downloads the same file, but it doesn't contain all of the nr accessions.
Proteins that aren't found in prot.accession2taxid.gz are assigned to root, which results in contigs becoming unclassified.
For now this can be mitigated by using prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz, as shown in the image below. But the code still needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned None and then dropped before being handed over to LCA.

image

Assignment to root that needs to be changed:

# If we still have missing taxids, we will set the sseqid value to the root taxid
# fill missing taxids with root_taxid
sseqid_to_taxid_df["cleaned_taxid"] = sseqid_to_taxid_df.merged_taxid.fillna(
    root_taxid
)
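
A minimal sketch of the change proposed above, assuming the frame keeps the `merged_taxid` and `cleaned_taxid` column names from the snippet: hits with no taxid are dropped rather than filled with `root_taxid`, so they never reach LCA. The helper name `drop_unmapped_hits` and the example accessions are hypothetical, not taken from the codebase.

```python
import pandas as pd


def drop_unmapped_hits(sseqid_to_taxid_df: pd.DataFrame) -> pd.DataFrame:
    """Drop BLAST hits whose accession has no taxid (hypothetical helper).

    Instead of filling missing taxids with the root taxid, rows where
    `merged_taxid` is missing are removed, so they cannot pull a contig's
    LCA up to root.
    """
    return (
        sseqid_to_taxid_df
        # Remove hits whose accession was not found in the accession2taxid file
        .dropna(subset=["merged_taxid"])
        # With no NaN left, the taxid column can be stored as integers
        .assign(cleaned_taxid=lambda df: df["merged_taxid"].astype("int64"))
    )


# Illustrative data only: the second accession has no taxid mapping
hits = pd.DataFrame(
    {
        "sseqid": ["ACC_0001.1", "ACC_0002.1", "ACC_0003.1"],
        "merged_taxid": [562.0, None, 1280.0],
    }
)
kept = drop_unmapped_hits(hits)
```

Dropping the rows outright is one option. If the number of unmapped hits should be reported, the missing rows could be counted with `sseqid_to_taxid_df["merged_taxid"].isna().sum()` before the drop.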
