Fixing maximum kmer count kmc flag #10
Hi @oxygen311
It's been a while since I last worked with the CoLoRd code, so I have to think carefully about this. Oh, I found in an internal conversation of our group from the end of 2020 that we initially had
I guess you are referring to this part of the paper:
I think this statement is an oversimplification, and the -H filtering is performed later. The initial idea remains the same, i.e., we don't want to have non-informative (frequent) k-mers. In such a case, frequent k-mers are informative, and discarding them at the KMC stage leads to worse reference-reads finding (as far as I remember :)). On the other hand, keeping them all increases the size of the graph, so we still do some kind of filtering, i.e., adding them to the graph only up to the value of the -H parameter.
I wonder what your data is. Could you share it?
Hello @marekkokot, Sorry for the late response!
The problem we've faced can be reproduced with the following data: a long reference genome (e.g., human) and relatively short reads (e.g., Illumina or ONT adaptive sampling). I understand that this is not the typical use case, but the slowdown there is extreme, up to 100x, and the same may happen in some other applications.
I understand this change may require additional testing, but I think it's crucial for the algorithm to work correctly.
Best regards,
Dear CoLoRd developers team!
Hope you are doing well.
Issue:
While working with CoLoRd in reference mode, we noticed strange behaviour when the reads are much shorter than the reference (length of reads << length of reference). Our investigation traced it to huge nodes in the similarity graph, which are far more frequent than expected.
The only reason it usually works is this line:
colord/src/colord/reads_sim_graph.cpp
Line 390 in 25b2860
But this condition is supposed to always hold when filtering works properly.
We noticed the problem in reference-based mode because there is no such condition when adding a node to the graph from pseudo-reads.
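To illustrate the effect described above, here is a toy Python model. It is not CoLoRd's code: the graph layout, the function names `add_read` and `add_pseudo_read`, and the constant `MAX_KMER_COUNT` are all hypothetical. It only shows how a per-node size check hides over-frequent k-mers on one insertion path, while a path without that check lets the node grow without bound.

```python
from collections import defaultdict

MAX_KMER_COUNT = 3  # hypothetical stand-in for the maximal k-mer count limit


def kmers(seq, k):
    """Return all k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def add_read(graph, read_id, seq, k):
    """Guarded path: a node stops growing once it holds MAX_KMER_COUNT reads."""
    for kmer in set(kmers(seq, k)):
        if len(graph[kmer]) < MAX_KMER_COUNT:
            graph[kmer].append(read_id)


def add_pseudo_read(graph, read_id, seq, k):
    """Unguarded path (assumed): every occurrence is appended to the node."""
    for kmer in set(kmers(seq, k)):
        graph[kmer].append(read_id)


guarded = defaultdict(list)
unguarded = defaultdict(list)
for i in range(10):
    # The same k-mer appears in ten different sequences.
    add_read(guarded, i, "ACGT", 4)
    add_pseudo_read(unguarded, i, "ACGT", 4)

print(len(guarded["ACGT"]), len(unguarded["ACGT"]))
```

In this toy setup the guarded node is capped at the limit, while the unguarded node keeps every occurrence. That is the kind of huge node that would only be avoided if over-frequent k-mers were filtered out earlier.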
Proposed Fix:
The fix is just to switch to the right flag of the kmc tool: -cx instead of -cs. Here is a quotation from kmc help:
It is also supposed to fix the logic of compression, since k-mers are supposed to be chosen based on count as well as hash.
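The difference between the two flags can be sketched in a few lines of Python. This is a minimal illustration, not KMC's implementation. It assumes, as I read the KMC help, that -cs sets the maximal value of a counter (counts saturate at it) while -cx excludes k-mers occurring more than the given number of times. The helper names `count_kmers`, `apply_cs` and `apply_cx` are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def apply_cs(counts, cs):
    """Model of -cs: counters saturate at cs, but no k-mer is removed."""
    return {kmer: min(c, cs) for kmer, c in counts.items()}


def apply_cx(counts, cx):
    """Model of -cx: k-mers occurring more than cx times are dropped."""
    return {kmer: c for kmer, c in counts.items() if c <= cx}


reads = ["ACGTACGTAC", "TTACGTTT"]
counts = count_kmers(reads, 3)

capped = apply_cs(counts, 2)    # frequent k-mers survive with a clipped count
excluded = apply_cx(counts, 2)  # frequent k-mers are gone

print(sorted(capped))
print(sorted(excluded))
```

With a saturating cap, a frequent k-mer still reaches the later stages, only with a clipped count, so a filter of the form "count <= limit" accepts it. With exclusion it never reaches the graph at all.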
Testing:
We have performed thorough testing, including specific scenarios with short reads and long references. Feel free to test it yourself.
Acknowledgments:
A special thanks to @iam28th for their assistance in tracing back to the kmc flag.
I hope this improvement will be helpful and improve results.
Best regards,
Alexey