Open
Description
To this day, we still haven't been able to find a deduplication solution that checks all our boxes:
- is fast
- supports pair-aware Illumina deduplication, which is to say: it knows when reads are paired, and if they are and both mates are found, it does not treat reads as independent during deduplication
- it can also work with single-end Illumina reads or other single-read platforms
- it has controls to nudge it more in the direction of clustering for noisier reads
- it's not single-threaded
- its memory usage is optimized enough to make it laptop friendly for arbitrarily large FASTQ files (this is the hardest box to check). We had high hopes for pair-aware deduplication for fastp, but it still appears to read the entire input FASTQ's bytes into memory.
- nice to have: it uses SIMD to compare reads
Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance deduplication logic. In Rust we could check all the above boxes and more. Together with sequence-based sorting, paired-read merging, and amplicon-finding, deduplication would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation.
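To make the pair-aware and memory-friendly requirements a bit more concrete, here is a minimal sketch of one possible starting point, not a proposed final design. It assumes exact-match deduplication only (no clustering, no SIMD, no threading), and the `Record` type and the `pair_key` and `dedup_indices` functions are hypothetical names invented for illustration. The idea is to hash each read, or each read together with its mate, into a single 64-bit key, and to keep only those keys in memory rather than the sequences themselves.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical record type: a single read, or a read plus its mate.
pub struct Record<'a> {
    pub r1: &'a [u8],
    pub r2: Option<&'a [u8]>,
}

/// Hash the record as one unit, so that mates are never treated as
/// independent reads. A single-end read (r2 == None) hashes differently
/// from a pair whose first mate has the same sequence.
fn pair_key(rec: &Record) -> u64 {
    let mut hasher = DefaultHasher::new();
    rec.r1.hash(&mut hasher);
    rec.r2.hash(&mut hasher);
    hasher.finish()
}

/// Stream over records and return the indices of first occurrences.
/// Only one u64 per unique record is retained, not the sequence bytes.
/// Assumption: a 64-bit hash collision would wrongly drop a distinct
/// record; a real implementation would need to decide how to handle that.
pub fn dedup_indices<'a, I>(records: I) -> Vec<usize>
where
    I: IntoIterator<Item = Record<'a>>,
{
    let mut seen: HashSet<u64> = HashSet::new();
    let mut kept = Vec::new();
    for (i, rec) in records.into_iter().enumerate() {
        // `insert` returns true only the first time a key is seen.
        if seen.insert(pair_key(&rec)) {
            kept.push(i);
        }
    }
    kept
}
```

Because the function consumes an iterator, it could in principle sit behind a streaming FASTQ parser so the whole file never has to be resident in memory. The clustering controls for noisier reads and the multithreading requirement would need a different keying strategy than exact hashing and are not addressed here.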