Open
Description
Currently the documentation instructs users to download, and the code itself downloads, prot.accession2taxid.gz, which does not contain all of the nr accessions.
Proteins that aren't found in prot.accession2taxid.gz are assigned to root, which results in contigs becoming unclassified.
Currently this is ameliorated by using prot.accession2taxid.FULL.gz instead of prot.accession2taxid.gz, as shown below. However, the code still needs to be changed to handle missing accessions. Per our meeting today, these should probably be assigned to None and then dropped before being handed over to LCA.
Assignment to root that needs to be changed:
Autometa/autometa/taxonomy/ncbi.py
Lines 453 to 457 in baf61c0
    # If we still have missing taxids, we will set the sseqid value to the root taxid
    # fill missing taxids with root_taxid
    sseqid_to_taxid_df["cleaned_taxid"] = sseqid_to_taxid_df.merged_taxid.fillna(
        root_taxid
    )
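One possible shape for the proposed change is sketched below. It is not the Autometa implementation: it only assumes a pandas DataFrame with the `sseqid` and `merged_taxid` columns seen in the snippet above. The helper name `drop_missing_taxids` is hypothetical, and the accessions and taxids in the example frame are made up for illustration. Instead of filling missing taxids with the root taxid, the sketch drops those rows so they never reach the LCA step.

```python
import pandas as pd


def drop_missing_taxids(sseqid_to_taxid_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop hits whose accession has no taxid.

    Rows with a missing ``merged_taxid`` are removed rather than being
    assigned to root, so they cannot pull a contig's LCA up to root.
    """
    missing = sseqid_to_taxid_df["merged_taxid"].isna()
    cleaned = sseqid_to_taxid_df.loc[~missing].copy()
    # Safe to cast to int now that no missing values remain
    cleaned["cleaned_taxid"] = cleaned["merged_taxid"].astype(int)
    return cleaned


# Illustrative data only: the second accession has no taxid
df = pd.DataFrame(
    {
        "sseqid": ["WP_000001.1", "WP_000002.1", "WP_000003.1"],
        "merged_taxid": [562.0, None, 1280.0],
    }
)
cleaned_df = drop_missing_taxids(df)
```

Dropping the rows keeps the remaining hits for a contig available to LCA. Whether a contig whose hits are all dropped should then be treated as unclassified is a separate decision that this sketch does not address.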
