Implement memory-efficient sequencing read sorting for improved compression in Rust #67

Sorting input FASTQ or FASTA files by sequence homology before compressing them is a surprisingly effective means of reducing disk footprint, especially when paired with quality score binning. Even so, it doesn't seem to be particularly well served by the current bioinformatics ecosystem. Every solution we've seen has some combination of the following downsides:

- bringing entire FASTQ files into memory
- requiring CPU-intensive phylogenetic inference just to sort a FASTA
- cutting corners by sorting only on a "prefix k-mer" meant to represent the whole read's sequence, which becomes effectively useless for long reads
- ignoring quality scores
- not supporting quality score binning
- not supporting multiple compression codecs (for example, `zstd --long` appears to have the best compression ratio for big FASTAs)
- becoming unusable in terms of wall time and memory expenditure above a certain dataset size
- no incremental or hierarchical sorting
- relatedly, limited usage of parallelism while sorting
- inconsistent usage of dumping temporary batches of sorted reads to disk
- little or no usage of SIMD for read comparisons
- limited pair-awareness for paired reads
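
To make the memory and temporary-batch points concrete, here is a minimal sketch of an external merge sort using only the standard library. It is not an existing implementation: `external_sort`, `sort_and_spill`, and the one-line record layout (`sequence<TAB>quality<TAB>id`) are hypothetical, and plain lexicographic order on the sequence stands in for whatever homology-aware key we end up using. At most `batch_size` records are held in memory at once; each sorted batch is spilled to disk, and the batches are then merged with a min-heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::File;
use std::io::{self, BufRead, BufReader, BufWriter, Lines, Write};
use std::path::{Path, PathBuf};

/// Sort one in-memory batch and write it to a temporary file (hypothetical helper).
/// Each record is a single line with the sequence first, so line order is sequence order.
fn sort_and_spill(batch: &mut Vec<String>, dir: &Path, n: usize) -> io::Result<PathBuf> {
    batch.sort_unstable();
    let path = dir.join(format!("batch_{n}.tsv"));
    let mut out = BufWriter::new(File::create(&path)?);
    for line in batch.iter() {
        writeln!(out, "{line}")?;
    }
    out.flush()?;
    batch.clear();
    Ok(path)
}

/// Sort `records` into `out` while holding at most `batch_size` records in memory.
/// Assumption: lexicographic order is only a placeholder for a homology-aware key.
pub fn external_sort<I, W>(
    records: I,
    batch_size: usize,
    tmp_dir: &Path,
    mut out: W,
) -> io::Result<()>
where
    I: Iterator<Item = String>,
    W: Write,
{
    // Phase 1: fill a batch, sort it, spill it to disk, repeat.
    let mut batch: Vec<String> = Vec::with_capacity(batch_size);
    let mut paths: Vec<PathBuf> = Vec::new();
    for rec in records {
        batch.push(rec);
        if batch.len() >= batch_size {
            let n = paths.len();
            paths.push(sort_and_spill(&mut batch, tmp_dir, n)?);
        }
    }
    if !batch.is_empty() {
        let n = paths.len();
        paths.push(sort_and_spill(&mut batch, tmp_dir, n)?);
    }

    // Phase 2: k-way merge, keeping one pending line per batch in a min-heap.
    let mut readers: Vec<Lines<BufReader<File>>> = Vec::new();
    for p in &paths {
        readers.push(BufReader::new(File::open(p)?).lines());
    }
    let mut heap: BinaryHeap<Reverse<(String, usize)>> = BinaryHeap::new();
    for (i, reader) in readers.iter_mut().enumerate() {
        if let Some(line) = reader.next() {
            heap.push(Reverse((line?, i)));
        }
    }
    while let Some(Reverse((line, i))) = heap.pop() {
        writeln!(out, "{line}")?;
        if let Some(next) = readers[i].next() {
            heap.push(Reverse((next?, i)));
        }
    }

    // Close the batch files before deleting them.
    drop(readers);
    for p in &paths {
        std::fs::remove_file(p)?;
    }
    out.flush()
}
```

In this sketch the input iterator would come from a streaming FASTQ parser and `out` would be a compression encoder; both are left out here. The sorting of individual batches is also the natural place to add parallelism, since batches are independent until the merge.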

Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance read sorting logic. In Rust we could check all the above boxes and more. Together with pair-aware deduplication, paired-read merging, and amplicon-finding, read sorting would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation. Additionally, the methods developed in the rust-scripts here could inform standalone implementations that could contribute to publishable tools.
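
Quality score binning, listed above as a companion to sorting, is simple enough to sketch as well. The bin boundaries and representative scores below are illustrative assumptions, not a published scheme, and `bin_score` and `bin_quality` are hypothetical names. The only fixed fact relied on is the Phred+33 encoding used by modern FASTQ files.

```rust
/// ASCII offset of Phred+33 quality encoding.
const PHRED_OFFSET: u8 = 33;

/// Illustrative bins (an assumption for this sketch):
/// (inclusive upper bound on the Phred score, representative score for the bin).
const BINS: [(u8, u8); 4] = [(9, 5), (19, 15), (29, 25), (u8::MAX, 37)];

/// Map a raw Phred score to its bin's representative score.
fn bin_score(score: u8) -> u8 {
    for (upper, representative) in BINS {
        if score <= upper {
            return representative;
        }
    }
    score
}

/// Rewrite a Phred+33 quality string in place so it uses only the bin representatives.
/// Fewer distinct symbols means longer repeats for the compressor to exploit.
pub fn bin_quality(qual: &mut [u8]) {
    for q in qual.iter_mut() {
        let score = q.saturating_sub(PHRED_OFFSET);
        *q = bin_score(score) + PHRED_OFFSET;
    }
}
```

Because binning is lossy, it would presumably need to be opt-in, with the bin table configurable rather than hard-coded as it is here.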
