How to deal with binning artifacts in the database?

What are the recommended ways of dealing with contamination in the database? A scaffold closely related to the query automatically yields a high-scoring match, but if its taxonomic assignment rests on a binning artifact, the query is misidentified, even when the database also contains matches to correctly identified but more distantly related scaffolds.

Take, for example, [this contig](https://www.ncbi.nlm.nih.gov/nuccore/JARRKS010000295.1). It is an archaeal contig that ended up in a flavobacterial bin. Because the contig is present in [gtdb226db](https://hulk.mmseqs.com/jaebeom/gtdb226db/), it receives a high-scoring identification when searched against that database:

```
#query_id  name               taxID      query_length  score    rank     lineage                                                                                                                             taxID:match_count
1          JARRKS010000295.1  258006470  4863          0.86387  no rank  d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Flavobacteriales;f_Flavobacteriaceae;g_DATGBV01;s_DATGBV01 sp037387865;-_GCA_037387865.1  258006470:464 
```

Is there a way to exclude certain genomes (e.g. by taxID) or scaffolds from the search? One strategy I envision is two rounds of `classify` searches: the first against the complete database, the second excluding the genomes matched in the first round. This would help with singleton binning artifacts such as JARRKS010000295.1.
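
To make the two-round idea concrete, here is a minimal sketch of the bookkeeping between the rounds: it reads first-round results and collects the taxIDs that were matched, which would form the exclusion list for the second round. It assumes the results file is tab-separated with the columns shown in the header above; the function name `matched_taxids` and the `min_score` parameter are hypothetical. Whether `classify` can actually take such an exclusion list is exactly what I am asking, so that step is not shown.

```python
# Sample row copied from the results above, assuming tab-separated columns.
SAMPLE = (
    "#query_id\tname\ttaxID\tquery_length\tscore\trank\tlineage\ttaxID:match_count\n"
    "1\tJARRKS010000295.1\t258006470\t4863\t0.86387\tno rank\t"
    "d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Flavobacteriales;"
    "f_Flavobacteriaceae;g_DATGBV01;s_DATGBV01 sp037387865;-_GCA_037387865.1\t"
    "258006470:464 \n"
)


def matched_taxids(tsv_text, min_score=0.0):
    """Return the set of taxIDs listed in the taxID:match_count column.

    Hypothetical helper: column positions follow the header shown above
    (score in column 5, taxID:match_count in column 8).
    """
    taxids = set()
    for line in tsv_text.splitlines():
        # Skip blank lines and the commented header line.
        if not line.strip() or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split("\t")]
        if float(fields[4]) < min_score:
            continue
        # The last column may hold several space-separated taxID:count pairs.
        for pair in fields[7].split():
            taxid, _, _count = pair.partition(":")
            taxids.add(taxid)
    return taxids


print(sorted(matched_taxids(SAMPLE)))  # ['258006470']
```

The resulting set could be written one taxID per line and used to filter the database or the second-round hits, depending on what the tool supports.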
