Implement memory-efficient, optionally pair-aware deduplication in Rust

To this day, we still haven't been able to find a deduplication solution that checks all our boxes:

- it is fast
- it supports pair-aware Illumina deduplication: it knows when reads are paired and, when both mates are found, does not treat the two reads as independent during deduplication
- it also works with single-end Illumina reads and other single-read platforms
- it has controls to nudge it toward clustering for noisier reads
- it is multithreaded
- its memory usage is optimized enough to be laptop-friendly for arbitrarily large FASTQ files (this is the hardest box to check). We had high hopes for pair-aware deduplication in `fastp`, but it still appears to read the entire input FASTQ's bytes into memory.
- nice to have: it uses SIMD to compare reads
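To make the pair-aware and memory requirements above concrete, here is a minimal sketch of one possible approach: stream reads through a set of 64-bit fingerprints rather than holding the sequences themselves. The `ReadUnit` and `Deduplicator` names are hypothetical and not part of any existing oneroof code. Memory still grows with the number of unique reads, but at roughly one `u64` (plus set overhead) per unique read or pair. A 64-bit hash can collide, so a real implementation would need to decide whether a rare false duplicate is acceptable.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical unit of deduplication: a single read, or a read pair
/// when `mate` is present.
pub struct ReadUnit {
    pub seq: Vec<u8>,
    pub mate: Option<Vec<u8>>,
}

/// Hash the unit as a whole, so a pair only counts as a duplicate
/// when BOTH mates match a previously seen pair.
fn fingerprint(unit: &ReadUnit) -> u64 {
    let mut hasher = DefaultHasher::new();
    unit.seq.hash(&mut hasher);
    unit.mate.hash(&mut hasher);
    hasher.finish()
}

/// Keeps only fingerprints of reads seen so far, never the sequences.
pub struct Deduplicator {
    seen: HashSet<u64>,
}

impl Deduplicator {
    pub fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Returns `true` the first time a read (or pair) is seen and
    /// `false` for every later exact duplicate.
    pub fn is_new(&mut self, unit: &ReadUnit) -> bool {
        self.seen.insert(fingerprint(unit))
    }
}
```

In a streaming setting, each record (or pair of records) parsed from the FASTQ input would be passed to `is_new` and written to the output only when it returns `true`, so the full input never has to sit in memory.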

Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance deduplication logic. In Rust we could check all the above boxes and more. Together with sequence-based sorting, paired-read merging, and amplicon-finding, deduplication would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation.
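As one illustration of a box from the list above, the control for nudging toward clustering on noisier reads could be a mismatch tolerance. The sketch below is an assumption about how such a knob might look, not a settled design: `hamming`, `is_near_duplicate`, and `max_mismatches` are hypothetical names. It uses a plain byte-wise loop, which the compiler may or may not auto-vectorize; explicit SIMD would be a later optimization.

```rust
/// Count mismatching positions between two reads of equal length.
/// Returns `None` when the lengths differ, since Hamming distance
/// is only defined for equal-length sequences.
pub fn hamming(a: &[u8], b: &[u8]) -> Option<usize> {
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b).filter(|(x, y)| x != y).count())
}

/// Treat two reads as duplicates when they differ at no more than
/// `max_mismatches` positions. Setting it to 0 gives exact matching;
/// larger values cluster noisier reads more aggressively.
pub fn is_near_duplicate(a: &[u8], b: &[u8], max_mismatches: usize) -> bool {
    matches!(hamming(a, b), Some(d) if d <= max_mismatches)
}
```

Note that tolerant matching cannot be done with a plain hash set lookup, so it would need a different candidate-finding strategy than exact deduplication.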
