How to deal with binning artifacts in the database? #159

@alephreish

What are the recommended ways of dealing with contamination in the database? A scaffold closely related to the query automatically yields a high-scoring match. If that scaffold's taxonomic assignment stems from a binning artifact, the query is misidentified, even when the database also contains matches to correctly identified but more distantly related scaffolds.

Take, for example, this contig: an archaeal contig that ended up in a flavobacterial bin. It is present in gtdb226db and therefore receives a high-scoring match when searched against that database:

#query_id  name               taxID      query_length  score    rank     lineage                                                                                                                             taxID:match_count
1          JARRKS010000295.1  258006470  4863          0.86387  no rank  d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Flavobacteriales;f_Flavobacteriaceae;g_DATGBV01;s_DATGBV01 sp037387865;-_GCA_037387865.1  258006470:464 

Is there a way to exclude certain genomes (e.g. by taxID) or scaffolds from the search? One strategy I envision is two rounds of classify searches: the first against the complete database, the second excluding the genomes matched in the first round. This would help with singleton binning artifacts such as JARRKS010000295.1.
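
To make the two-round idea concrete, here is a minimal sketch of the bookkeeping between the rounds: collecting the taxIDs and genome accessions matched in the first round so they can serve as an exclusion list for the second. It assumes the report is tab-separated with the columns shown in the example above, and that the last lineage element carries the genome accession after a `-_` prefix. The function names `first_round_hits` and `exclusion_list` are hypothetical, and the sketch does not assume the classifier has any exclusion option; how the list would be applied is exactly the open question.

```python
def first_round_hits(report_text):
    """Map each query name to (taxID, genome accession) from a first-round report.

    Assumption: tab-separated columns in the order
    query_id, name, taxID, query_length, score, rank, lineage, taxID:match_count.
    """
    hits = {}
    for line in report_text.splitlines():
        # Skip blank lines and the header line starting with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # not a complete result row
        name = fields[1].strip()
        tax_id = fields[2].strip()
        lineage = fields[6].strip()
        # Assumption: the genome-level lineage element looks like "-_GCA_...".
        last = lineage.split(";")[-1]
        accession = last[2:] if last.startswith("-_") else None
        hits[name] = (tax_id, accession)
    return hits


def exclusion_list(hits):
    """Return the sorted, de-duplicated genome accessions matched in round one."""
    return sorted({acc for _, acc in hits.values() if acc is not None})
```

For the row shown above, this would put the genome of the flavobacterial bin on the exclusion list, so a second search could only match the remaining, more distantly related scaffolds.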
