
Implement memory-efficient, optionally pair-aware deduplication in Rust #66

@nrminor

To this day, we still haven't been able to find a deduplication solution that checks all our boxes:

  • is fast
  • supports pair-aware Illumina deduplication: it knows when reads are paired and, when both mates are found, deduplicates them as a pair rather than as independent reads
  • also works with single-end Illumina reads and other single-read platforms
  • has controls to nudge it toward clustering for noisier reads
  • is multi-threaded
  • has memory usage optimized enough to be laptop-friendly for arbitrarily large FASTQ files (this is the hardest box to check). We had high hopes for fastp's pair-aware deduplication, but it still appears to read the entire input FASTQ's bytes into memory.
  • nice to have: uses SIMD to compare reads

Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance deduplication logic. In Rust we could check all the above boxes and more. Together with sequence-based sorting, paired-read merging, and amplicon-finding, deduplication would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation.
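
To make the memory and pair-awareness requirements concrete, here is a minimal sketch of one possible approach, not a proposed final design. It keeps only a 64-bit fingerprint per unique read (or read pair) instead of the sequence bytes, so records can be streamed and dropped as soon as they are checked. The names `Deduplicator`, `fingerprint`, and `is_first_occurrence` are hypothetical, exact matching stands in for the clustering controls, and threading and SIMD are left out. It also assumes that a rare hash collision (two different reads treated as duplicates) is acceptable.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash a read together with its optional mate so that a pair is treated
/// as one unit. Hashing the `Option` also keeps a single-end read distinct
/// from a paired read that shares the same first sequence.
fn fingerprint(seq: &[u8], mate: Option<&[u8]>) -> u64 {
    let mut hasher = DefaultHasher::new();
    seq.hash(&mut hasher);
    mate.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical streaming deduplicator: stores fingerprints only, never
/// the sequence bytes themselves.
pub struct Deduplicator {
    seen: HashSet<u64>,
}

impl Deduplicator {
    pub fn new() -> Self {
        Deduplicator {
            seen: HashSet::new(),
        }
    }

    /// Returns `true` the first time a read (or read pair) is seen and
    /// `false` for every later exact duplicate. Assumption: hash collisions
    /// are tolerated, so a distinct read may very rarely be reported as a
    /// duplicate.
    pub fn is_first_occurrence(&mut self, seq: &[u8], mate: Option<&[u8]>) -> bool {
        self.seen.insert(fingerprint(seq, mate))
    }

    /// Number of unique reads or pairs recorded so far.
    pub fn unique_count(&self) -> usize {
        self.seen.len()
    }
}
```

A caller would parse records one at a time (in lockstep from R1 and R2 for paired input), call `is_first_occurrence(r1_seq, Some(r2_seq))` or `is_first_occurrence(seq, None)`, and write the record out only when it returns `true`. Under these assumptions memory grows with the number of unique fingerprints rather than with the size of the FASTQ file.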
