What are the recommended ways of dealing with contamination in the database? A database scaffold closely related to the query automatically yields a high-scoring match. If that scaffold's taxonomic assignment stems from a binning artifact, the query is misidentified, even when the database also contains matches to correctly identified but more distantly related scaffolds.
Take, for example, the contig below. It is an archaeal contig that ended up in a flavobacterial bin. Because the contig is present in gtdb226db, it receives a high-scoring identification when searched against that database:
#query_id name taxID query_length score rank lineage taxID:match_count
1 JARRKS010000295.1 258006470 4863 0.86387 no rank d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Flavobacteriales;f_Flavobacteriaceae;g_DATGBV01;s_DATGBV01 sp037387865;-_GCA_037387865.1 258006470:464
Is there a way to exclude certain genomes (e.g. based on their taxIDs) or scaffolds from the search? One strategy I envision is to run classify twice: first against the complete database, then again with the genomes matched in the first round excluded. This would help with singleton binning artifacts such as JARRKS010000295.1.
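To make the first half of that two-round idea concrete, here is a minimal sketch of how the taxIDs to exclude could be collected from the first-round output. It assumes the result file is tab-separated with the eight columns shown in the header above; the function names (`parse_results`, `first_round_hits`, `exclusion_list`) are hypothetical helpers written for this post and are not part of the tool. How such a list would then be passed to the second search is exactly the open question.

```python
# Column names follow the header line of the classify output shown above.
COLUMNS = [
    "query_id", "name", "taxID", "query_length",
    "score", "rank", "lineage", "match_counts",
]

# The single result from this issue, rebuilt with tabs (assumed delimiter).
SAMPLE = "\n".join([
    "\t".join(["#query_id", "name", "taxID", "query_length",
               "score", "rank", "lineage", "taxID:match_count"]),
    "\t".join([
        "1", "JARRKS010000295.1", "258006470", "4863", "0.86387", "no rank",
        "d_Bacteria;p_Bacteroidota;c_Bacteroidia;o_Flavobacteriales;"
        "f_Flavobacteriaceae;g_DATGBV01;s_DATGBV01 sp037387865;"
        "-_GCA_037387865.1",
        "258006470:464",
    ]),
])


def parse_results(text):
    """Yield one dict per result row, skipping blank and '#' header lines."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        yield dict(zip(COLUMNS, line.split("\t")))


def first_round_hits(records):
    """Map each query name to (matched taxID, last entry of the lineage)."""
    hits = {}
    for rec in records:
        leaf = rec["lineage"].split(";")[-1]
        hits[rec["name"]] = (rec["taxID"], leaf)
    return hits


def exclusion_list(hits):
    """Return the sorted, de-duplicated taxIDs matched in the first round."""
    return sorted({taxid for taxid, _ in hits.values()})


if __name__ == "__main__":
    hits = first_round_hits(parse_results(SAMPLE))
    for taxid in exclusion_list(hits):
        print(taxid)
```

Keeping the last lineage entry next to the taxID makes it easy to see which genome assembly each excluded taxID corresponds to before running the second round.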