Fixing maximum kmer count kmc flag by oxygen311 · Pull Request #10 · refresh-bio/colord

oxygen311 · 2024-02-03T00:42:49Z

Dear CoLoRd developers team!

Hope you are doing well.

Issue:

While working with CoLoRd in reference mode, we noticed strange behaviour when the reads are much shorter than the reference (read length << reference length). Our investigation traced it to huge nodes in the similarity graph, i.e. k-mers that are far more frequent than expected.

The only reason it usually works is this line:

colord/src/colord/reads_sim_graph.cpp

Line 390 in 25b2860

if (acceptRefRead && kmer_card < maxKmerCount)

But with proper filtering this condition should always be true.

We noticed this in reference-based mode because there is no such condition when adding a node to the graph from pseudo-reads.

Proposed Fix:

The fix is simply to use the right KMC flag, -cx instead of -cs. Here is the relevant excerpt from the KMC help:

>  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
>  -cs<value> - maximal value of a counter (default: 255)
>  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)

It should also fix the compression logic, since k-mers are supposed to be chosen based on their count as well as their hash.
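To make the difference between the flags concrete, here is a toy model of the three options quoted above. It is only a sketch based on the help text, not KMC's actual implementation; `CounterOptions`, `KmerCounts` and `apply_counter_options` are hypothetical names, and the values 4 and 80 in the note below are the L and H mentioned in the paper.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model of the KMC counter options quoted above (hypothetical names,
// not KMC code). Defaults follow the quoted help text.
struct CounterOptions {
    std::uint64_t ci = 2;           // -ci: exclude k-mers occurring less than ci times
    std::uint64_t cs = 255;         // -cs: maximal value of a stored counter
    std::uint64_t cx = 1000000000;  // -cx: exclude k-mers occurring more than cx times
};

using KmerCounts = std::map<std::string, std::uint64_t>;

// Returns the k-mers that survive the options, with their stored counters.
KmerCounts apply_counter_options(const KmerCounts& raw, const CounterOptions& opt) {
    assert(opt.ci <= opt.cx);
    KmerCounts kept;
    for (const auto& entry : raw) {
        const std::uint64_t count = entry.second;
        if (count < opt.ci || count > opt.cx) {
            continue;  // excluded entirely
        }
        // -cs never removes a k-mer; it only clamps the stored counter.
        kept[entry.first] = std::min(count, opt.cs);
    }
    return kept;
}
```

In this model, a k-mer seen 500 times survives `ci = 4, cs = 80` with a stored counter of 80, whereas `ci = 4, cx = 80` drops it altogether.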

Testing:

We have performed thorough testing, including specific scenarios with short reads and long references. Feel free to test it yourself.

Acknowledgments:

A special thanks to @iam28th for their assistance in tracing back to the kmc flag.

We hope this change will be helpful and improve results.

Best regards,
Alexey

marekkokot · 2024-03-13T08:18:52Z

Hi @oxygen311

It's been a while since I last worked with CoLoRd code, so I must think carefully about this.
Thank you for your deep analysis and for digging into our code (also, thank you, @iam28th!).
I recall that there was a reason to use -cs instead of -cx for KMC, but now I am unsure what that reason was :(.

Oh, I found in an internal conversation of our group from the end of 2020 that we initially had -cx, and we intentionally changed it to -cs and applied filtering when adding to the graph. I guess it was introduced before reference mode, so maybe this is indeed some detail we missed.

I guess you are referring to this part of a paper:
"The initial algorithm step is filtering k-mers of the input reads. For this purpose KMC package [28] is executed with k automatically adjusted to the data set. K-mers with less than L = 4 (possibly, sequencing errors), or more than H = 80 occurrences (non-informative) in the entire read set are discarded"

I think this statement is an oversimplification, and the H filtering is performed later.

The initial idea remains the same, i.e., we don't want to have non-informative (frequent) k-mers.
We faced a problem in the case of high-quality reads (PacBio HiFi).
In such a case, frequent k-mers are informative, and discarding them at the KMC stage leads to worse reference-reads finding (as far as I remember :)). On the other hand, keeping them all increases the size of the graph, so we still do some kind of filtering, i.e., adding them to the graph only up to the value of the -H parameter.

I wonder what your data is. Could you share it?
I guess your change makes CoLoRd require less RAM and maybe run faster. What about the compression ratio?
Maybe we should introduce a new special case for a reference genome instead of using -cx, but I am not sure yet how it should look.
The problem is that we also performed quite broad testing to arrive at the final form of the code, which works well across different kinds of data (at least, we hoped it would), and the change you propose may be risky. Have you tried this change for PacBio HiFi?

oxygen311 · 2024-03-25T23:09:17Z

Hello @marekkokot,

Sorry for the late response!

> I think this statement is oversimplification, and the H filtering is performed later.

I think it's a different approach rather than a simplification.
The implemented logic is different: k-mers are not filtered by frequency but capped. Once a k-mer hits the maximum frequency limit, new reads are no longer added to the similarity graph under that k-mer's node, but the reads already added remain.
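A minimal sketch of what "limited, not filtered" means in practice. This is not CoLoRd's code: `CappedKmerIndex` and its methods are hypothetical, and only the shape of the guard is taken from the `kmer_card < maxKmerCount` line cited in the PR description. The unguarded path models the pseudo-read case described there, under the assumption that nothing else bounds it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the similarity graph: k-mer -> ids of reads containing it.
class CappedKmerIndex {
public:
    explicit CappedKmerIndex(std::size_t max_kmer_count)
        : max_kmer_count_(max_kmer_count) {
        assert(max_kmer_count_ > 0);
    }

    // Guarded insertion: once a node is full, later reads are rejected,
    // but the reads already stored under that k-mer stay in the graph.
    bool add(std::uint64_t kmer, std::uint32_t read_id) {
        std::vector<std::uint32_t>& reads = nodes_[kmer];
        if (reads.size() >= max_kmer_count_) {
            return false;
        }
        reads.push_back(read_id);
        return true;
    }

    // Insertion without the guard (assumed behaviour for pseudo-reads):
    // the node grows without bound unless the k-mer was excluded earlier.
    void add_unguarded(std::uint64_t kmer, std::uint32_t read_id) {
        nodes_[kmer].push_back(read_id);
    }

    // Number of reads currently attached to the k-mer's node.
    std::size_t cardinality(std::uint64_t kmer) const {
        auto it = nodes_.find(kmer);
        return it == nodes_.end() ? 0 : it->second.size();
    }

private:
    std::size_t max_kmer_count_;
    std::unordered_map<std::uint64_t, std::vector<std::uint32_t>> nodes_;
};
```

With a cap of 3, five guarded insertions for the same k-mer leave a node of size 3, while five unguarded insertions leave a node of size 5. A frequent k-mer is therefore still present in the graph either way; the cap only bounds how large its node gets on the guarded path.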

The problem we faced can be reproduced with the following data: a long reference genome (e.g. human) and relatively short reads (e.g. Illumina or ONT adaptive sampling). I understand that this is not the typical use case, but the slowdown is extreme in that case, up to 100x. And it may matter in some other applications as well.

> Have you tried this change for PacBio HiFi?

No, I didn't, nor for ONT simplex/duplex high-accuracy reads.

I understand this change may require additional testing, but I think it's crucial for the algorithm to work correctly.
If you need help with testing, feel free to ask.

Best regards,
Alexey

fixing maximum kmer count kmc flag

bc2153a

oxygen311 mentioned this pull request Mar 12, 2024

maxKmerCount not working #11

Open

oxygen311 wants to merge 1 commit into refresh-bio:master from oxygen311:master
