Description
Sorting input FASTQ or FASTA files by sequence homology, together with compression, is a surprisingly effective means of reducing disk footprint, especially when paired with quality score binning. Yet it doesn't seem to be particularly well served by the current bioinformatics ecosystem. All solutions we've seen have some number of the following downsides:
- bringing entire FASTQ files into memory
- requiring CPU-intensive phylogenetic inference just to sort a FASTA
- cutting corners by only sorting with a "prefix kmer" meant to represent the whole read's sequence, which becomes effectively useless for long reads.
- ignoring quality scores
- not supporting quality score binning
- not supporting multiple compression codecs. For example, `zstd --long` appears to have the best compression ratio for big FASTAs
- becomes unusable in terms of walltime and memory expenditure above datasets of a certain size
- no incremental or hierarchical sorting
- relatedly, limited usage of parallelism while sorting
- inconsistent usage of dumping temporary batches of sorted reads to disk
- little or no usage of SIMD for read comparisons
- limited pair-awareness for paired reads
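To make two of these points concrete, here is a minimal sketch of what a whole-read sort key and quality score binning could look like. Everything in it is an assumption for illustration, not existing oneroof code: the `Record` type and the function names `min_kmer_key`, `sort_by_min_kmer`, and `bin_quality` are hypothetical, qualities are assumed to be Phred+33, and the bin boundaries are modeled on the commonly cited 8-level Illumina-style scheme. The sort key is the smallest 2-bit-packed k-mer found anywhere in the read rather than a prefix k-mer, so reads that share sequence but start at different offsets can still land next to each other. It sorts in memory only and does not address batching to disk, parallelism, SIMD, or pairing.

```rust
/// A single FASTQ record held in memory (hypothetical type for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: String,
    pub seq: Vec<u8>,
    pub qual: Vec<u8>,
}

/// Encode a base as 2 bits; returns None for anything other than A/C/G/T.
fn encode_base(b: u8) -> Option<u64> {
    match b.to_ascii_uppercase() {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

/// Smallest 2-bit-packed k-mer found anywhere in the read (1 <= k <= 32).
/// Returns None if the read contains no k-mer made only of A/C/G/T.
pub fn min_kmer_key(seq: &[u8], k: usize) -> Option<u64> {
    assert!(k >= 1 && k <= 32, "k must fit in a u64 at 2 bits per base");
    let mask = if k == 32 { u64::MAX } else { (1u64 << (2 * k)) - 1 };
    let mut kmer = 0u64; // rolling k-mer
    let mut valid = 0usize; // consecutive unambiguous bases seen so far
    let mut best: Option<u64> = None;
    for &b in seq {
        match encode_base(b) {
            Some(code) => {
                kmer = ((kmer << 2) | code) & mask;
                valid += 1;
                if valid >= k {
                    best = Some(best.map_or(kmer, |m| m.min(kmer)));
                }
            }
            // An ambiguous base (e.g. N) resets the rolling window.
            None => {
                valid = 0;
                kmer = 0;
            }
        }
    }
    best
}

/// Stable in-memory sort by the whole-read key; reads without a key go last.
pub fn sort_by_min_kmer(records: &mut [Record], k: usize) {
    records.sort_by_cached_key(|r| min_kmer_key(&r.seq, k).unwrap_or(u64::MAX));
}

/// Collapse Phred+33 qualities into a small number of bins, in place.
/// Bin boundaries are an assumption modeled on an 8-level Illumina-style scheme.
pub fn bin_quality(qual: &mut [u8]) {
    for q in qual.iter_mut() {
        let phred = q.saturating_sub(33);
        let binned = match phred {
            0..=1 => phred,
            2..=9 => 6,
            10..=19 => 15,
            20..=24 => 22,
            25..=29 => 27,
            30..=34 => 33,
            35..=39 => 37,
            _ => 40,
        };
        *q = binned + 33;
    }
}
```

A real implementation would likely hash the k-mers before taking the minimum (a plain lexicographic minimum is biased toward poly-A stretches), consider reverse complements, and use several minimizers per read for long reads. The point here is only that the key is derived from the whole read and that binning is a cheap per-byte transform that can run before sorting and compression.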
Following the precedent set by our amplicon-finding Rust script, I think it would be worth at least trying to roll our own high-performance read sorting logic. In Rust we could check all the above boxes and more. Together with pair-aware deduplication, paired-read merging, and amplicon-finding, read sorting would be part of the compute-intensive core of oneroof's operations that would benefit from a native implementation. Additionally, the methods developed in the rust-scripts here could inform standalone implementations that could become publishable tools.