Implement memory-efficient sequencing read sorting for improved compression in Rust #67

@nrminor

Sorting input FASTQ or FASTA files by sequence homology, together with compression, is a surprisingly effective means of reducing disk footprint, especially when paired with quality score binning. And yet it doesn't seem to be particularly well-served by the current bioinformatics ecosystem. Every solution we've seen has some combination of the following downsides:

  • bringing entire FASTQ files into memory
  • requiring CPU-intensive phylogenetic inference just to sort a FASTA
  • cutting corners by sorting only on a "prefix kmer" meant to represent the whole read's sequence, which becomes effectively useless for long reads
  • ignoring quality scores
  • not supporting quality score binning
  • not supporting multiple compression codecs (for example, zstd --long appears to have the best compression ratio for big FASTAs)
  • becoming unusable in terms of wall time and memory expenditure on datasets above a certain size
  • no incremental or hierarchical sorting
  • relatedly, limited use of parallelism while sorting
  • inconsistent spilling of temporary batches of sorted reads to disk
  • little or no usage of SIMD for read comparisons
  • limited pair-awareness for paired reads
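
To make two of these points concrete, here is a minimal sketch of a sort key that looks at the whole read rather than a prefix kmer, plus a simple quality score binning pass. Everything in it is an assumption for illustration: the function names, the four-bin scheme, and the use of a plain lexicographic minimizer (the smallest k-mer anywhere in the read) as a cheap stand-in for sequence homology. It is not how any existing tool or oneroof does it.

```rust
use std::cmp::Ordering;

/// Smallest k-mer (lexicographically) found anywhere in the read.
/// Reads no longer than k, or k == 0, fall back to the whole read.
fn minimizer(seq: &[u8], k: usize) -> &[u8] {
    if k == 0 || seq.len() <= k {
        return seq;
    }
    seq.windows(k).min().unwrap_or(seq)
}

/// Sort reads by their whole-read minimizer, breaking ties on the full sequence.
/// A prefix-kmer sort would only ever compare the first k bases.
fn sort_by_minimizer(reads: &mut [Vec<u8>], k: usize) {
    reads.sort_by(|a, b| -> Ordering {
        minimizer(a, k)
            .cmp(minimizer(b, k))
            .then_with(|| a.cmp(b))
    });
}

/// Collapse a Phred+33 quality byte into one of four bins (hypothetical scheme).
fn bin_quality(q: u8) -> u8 {
    // saturating_sub guards against bytes below '!' (33).
    let phred = q.saturating_sub(33);
    let binned = match phred {
        0..=9 => 6,
        10..=19 => 15,
        20..=29 => 25,
        _ => 37,
    };
    binned + 33
}

/// Apply the binning to a whole quality string.
fn bin_qualities(qual: &[u8]) -> Vec<u8> {
    qual.iter().map(|&q| bin_quality(q)).collect()
}
```

With k = 4, the reads TTTTACGA and CCACGATT share the minimizer ACGA and end up adjacent, even though their prefixes have nothing in common. Binning shrinks the quality alphabet to four symbols, which is what gives the downstream compressor longer repeated runs to work with.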

Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance read sorting logic. In Rust we could check all the above boxes and more. Together with pair-aware deduplication, paired-read merging, and amplicon-finding, read sorting would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation. Additionally, the methods developed in the rust-scripts here could inform standalone implementations that could contribute to publishable tools.
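
As a rough starting point for the memory-bounded part, the usual shape is to sort fixed-size batches independently and then k-way merge the sorted runs. The sketch below does this in memory with the standard library only. The function names and the batch-size parameter are hypothetical, and a real implementation would presumably write each sorted run to a temporary file and merge from file readers instead of from in-memory iterators.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Merge already-sorted runs into one sorted sequence.
/// The heap holds at most one pending item per run, so the working set
/// scales with the number of runs rather than the number of reads.
fn merge_sorted_runs<I>(mut runs: Vec<I>) -> Vec<Vec<u8>>
where
    I: Iterator<Item = Vec<u8>>,
{
    // Reverse turns the max-heap into a min-heap keyed on (read, run index).
    let mut heap = BinaryHeap::new();
    for (idx, run) in runs.iter_mut().enumerate() {
        if let Some(item) = run.next() {
            heap.push(Reverse((item, idx)));
        }
    }

    // Collected here for simplicity; a streaming version would write
    // each item straight to the output instead.
    let mut merged = Vec::new();
    while let Some(Reverse((item, idx))) = heap.pop() {
        merged.push(item);
        // Refill from the run the popped item came from.
        if let Some(next) = runs[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    merged
}

/// Sort reads in batches of `batch_size`, then merge the sorted batches.
fn sort_in_batches(reads: Vec<Vec<u8>>, batch_size: usize) -> Vec<Vec<u8>> {
    let runs: Vec<_> = reads
        .chunks(batch_size.max(1)) // chunks(0) would panic
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort();
            run.into_iter()
        })
        .collect();
    merge_sorted_runs(runs)
}
```

Each batch is independent until the merge, so the per-batch sorts are also the natural place to add parallelism later.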
