subsample: Add per-tile binning with skip-ahead sampling and threading#44
jdidion wants to merge 11 commits into stjude-rust-labs:master
Add `--record-count-per-tile` option to the subsample command. This bins reads by their lane and tile from the Illumina read header, discards bins with fewer than the specified number of records, and randomly subsamples exactly that many records from each retained bin. This normalizes FASTQ files based on physical sequencing location to mitigate batch effects from uneven coverage or flow cell edge effects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
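The binning step described in this commit can be sketched as follows. This is a minimal illustration, not the code from this branch: the names `BinKey`, `parse_bin_key`, and `retained_bins` are hypothetical, and it assumes headers follow the `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>` layout, with unparseable headers simply skipped.

```rust
use std::collections::BTreeMap;

/// Hypothetical bin key: (lane, tile) taken from an Illumina-style read name.
pub type BinKey = (u32, u32);

/// Extracts lane and tile from a header such as
/// `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y> <description>`.
/// Returns `None` when the name does not have that layout.
pub fn parse_bin_key(header: &str) -> Option<BinKey> {
    // Drop the leading '@' (if any) and ignore the description after whitespace.
    let name = header.strip_prefix('@').unwrap_or(header);
    let name = name.split_whitespace().next()?;
    let mut fields = name.split(':');
    // Field index 3 is the lane; the field right after it is the tile.
    let lane = fields.nth(3)?.parse().ok()?;
    let tile = fields.next()?.parse().ok()?;
    Some((lane, tile))
}

/// Counts records per bin and returns the bins that hold at least
/// `per_tile` records, in sorted order. Bins below the threshold are dropped.
pub fn retained_bins<'a, I>(headers: I, per_tile: usize) -> Vec<BinKey>
where
    I: IntoIterator<Item = &'a str>,
{
    let mut counts: BTreeMap<BinKey, usize> = BTreeMap::new();
    for key in headers.into_iter().filter_map(parse_bin_key) {
        *counts.entry(key).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .filter(|&(_, n)| n >= per_tile)
        .map(|(key, _)| key)
        .collect()
}
```

Each retained bin would then be subsampled down to exactly `per_tile` records; that step is omitted here.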
When --bin-by-tile is combined with --record-count or --probability, the per-tile count is computed automatically:
- --record-count N: per_tile = N / num_bins
- --probability P: per_tile = floor(P * total_records / num_bins)

The --record-count-per-tile option remains for explicit per-tile counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
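The two formulas in this commit message can be written out as a small sketch. The function names are hypothetical, and two details are assumptions rather than facts from the branch: `N / num_bins` is treated as integer division, and a zero bin count or an out-of-range probability yields `None`.

```rust
/// Per-tile count for `--record-count N --bin-by-tile`: N / num_bins.
/// Integer (truncating) division is an assumption of this sketch.
pub fn per_tile_from_record_count(n: u64, num_bins: u64) -> Option<u64> {
    if num_bins == 0 {
        None
    } else {
        Some(n / num_bins)
    }
}

/// Per-tile count for `--probability P --bin-by-tile`:
/// floor(P * total_records / num_bins).
pub fn per_tile_from_probability(p: f64, total_records: u64, num_bins: u64) -> Option<u64> {
    // Rejects NaN as well, because NaN is not contained in any range.
    if num_bins == 0 || !(0.0..=1.0).contains(&p) {
        return None;
    }
    Some((p * total_records as f64 / num_bins as f64).floor() as u64)
}
```

For example, 1000 records over 96 bins gives 10 per tile with `--record-count`, and a probability of 0.25 over the same input gives 2 per tile.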
Major refactor of per-tile binning to use temporary files instead of in-memory bitmaps. This reduces memory usage from O(total_records) to O(records_per_tile * num_tiles_sampled).

Changes:
- First pass writes records to per-tile temp files
- Skip-ahead sampling (default): uses exponential-distributed byte jumps through temp files for O(n_sampled) I/O
- Exact indexed sampling (--exact): builds record offset index for precise seeking
- --sampling-threads: parallel tile sampling via std::thread::scope
- --compression-threads: parallel gzip output via concatenated streams
- --record-count without --exact on uncompressed input also uses skip-ahead
- Non-tile --record-count on gzipped input falls back to exact mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
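The idea behind skip-ahead sampling can be illustrated with a simplified sketch. Note the assumptions: it jumps over record indices rather than bytes in a temp file, it uses a small SplitMix64 generator so the example needs only the standard library, and the names `SplitMix64` and `skip_ahead_indices` are hypothetical. Instead of testing every record against probability `p`, it draws the gap to the next selected record from a geometric distribution (the discrete counterpart of the exponential jumps mentioned above), so the work is proportional to the number of records sampled.

```rust
/// Tiny deterministic generator (SplitMix64), used only to keep the sketch
/// dependency-free. It is not the RNG used by the branch.
pub struct SplitMix64(pub u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in (0, 1], so that `ln` is always finite.
    pub fn next_open01(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Returns the indices of the selected records, in increasing order.
/// Each record is selected with probability `p`, so the number of
/// selected records is approximate, not exact.
pub fn skip_ahead_indices(n_records: usize, p: f64, seed: u64) -> Vec<usize> {
    let mut picked = Vec::new();
    if n_records == 0 || !(p > 0.0) {
        return picked;
    }
    if p >= 1.0 {
        return (0..n_records).collect();
    }
    let mut rng = SplitMix64(seed);
    let log_q = (-p).ln_1p(); // ln(1 - p), negative and accurate for small p
    let mut i = 0usize;
    loop {
        // Number of records to skip before the next selected one.
        let gap = (rng.next_open01().ln() / log_q).floor();
        if gap >= (n_records - i) as f64 {
            break; // the jump lands past the end of the input
        }
        i += gap as usize;
        picked.push(i);
        i += 1;
        if i >= n_records {
            break;
        }
    }
    picked
}
```

Because the selected count varies from run to run, this style of sampling trades exactness for speed, which is the distinction the exact indexed mode addresses.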
When --in-memory is used with tile binning, records are kept in RAM organized by tile instead of writing to temporary files. This trades memory for disk I/O and avoids temporary file creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change default behavior to exact sampling (backward compatible). The --fast flag opts into skip-ahead approximate sampling. Add --temp-dir option to specify where tile temp files are written (defaults to system temp directory).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip-ahead sampling cannot reliably pair R1 and R2 records because it maps byte positions proportionally, which breaks when R1 and R2 records have different sizes. Paired-end now always uses exact indexed sampling regardless of the --fast flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ccdd43a to cd6e340
- Empty input now produces empty output instead of crashing with "invalid uniform range" from Uniform::new(0, 0)
- .fai index counts are cross-checked against actual line counts for uncompressed files (cheap); gzipped files trust the index since decompressing to cross-check would negate the performance benefit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the collect-then-concatenate pattern with a bounded mpsc channel. Sampling threads send per-tile results to a writer thread that writes directly to the output file(s), avoiding holding all sampled data in memory at once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
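A rough sketch of the bounded-channel pattern, using only the standard library. The function `write_tiles` and its parameters are hypothetical, and the "sampling" step is a stand-in that just copies each tile's bytes. The bounded `sync_channel` makes producers block once `capacity` results are queued, so at most a few per-tile buffers are held in memory at a time.

```rust
use std::io::{self, Write};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends per-tile results from `workers` threads to a single writer thread
/// through a bounded channel, and returns the writer once all tiles are done.
pub fn write_tiles<W: Write + Send>(
    tiles: &[Vec<u8>],
    workers: usize,
    capacity: usize,
    out: W,
) -> io::Result<W> {
    let workers = workers.max(1);
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    thread::scope(|s| {
        // Writer thread: drains the channel and writes each chunk as it arrives.
        let writer = s.spawn(move || -> io::Result<W> {
            let mut out = out;
            for chunk in rx {
                out.write_all(&chunk)?;
            }
            out.flush()?;
            Ok(out)
        });

        // Sampling threads: worker `w` handles tiles w, w + workers, ...
        for w in 0..workers {
            let tx = tx.clone();
            s.spawn(move || {
                for tile in tiles.iter().skip(w).step_by(workers) {
                    let sampled = tile.clone(); // stand-in for real sampling
                    if tx.send(sampled).is_err() {
                        break; // writer stopped early, e.g. on an I/O error
                    }
                }
            });
        }

        // Drop the original sender so the writer sees the channel close.
        drop(tx);
        writer.join().expect("writer thread panicked")
    })
}
```

With more than one worker, chunks reach the writer in whatever order the threads finish, so this sketch does not by itself give a deterministic output order.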
- Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz
- Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex)
- Temp file creation errors propagate as SubsampleError instead of panicking via unwrap()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
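The deterministic ordering and per-tile seeding can be sketched like this. The names `tile_seed` and `seeded_tiles` are hypothetical, representing `bin_key` as a `u64` is an assumption, and wrapping addition is used so that `base_seed + bin_key` cannot overflow. Because each tile gets its own seed, threads never need to share one RNG behind a mutex.

```rust
/// Derives a per-tile seed from the base seed and the tile's bin key.
pub fn tile_seed(base_seed: u64, bin_key: u64) -> u64 {
    base_seed.wrapping_add(bin_key)
}

/// Sorts tiles by bin key and pairs each with its own seed, so the result
/// does not depend on the order in which tiles were discovered.
pub fn seeded_tiles(base_seed: u64, mut bin_keys: Vec<u64>) -> Vec<(u64, u64)> {
    bin_keys.sort_unstable();
    bin_keys
        .into_iter()
        .map(|key| (key, tile_seed(base_seed, key)))
        .collect()
}
```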
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Hi, I suspect this feature may be a bit too niche to include in fq. I am curious, however, about the motivation behind per-tile binning. With a sufficiently large number of reads, is there a measurable difference in downstream quality compared to uniform sampling?
Hi @zaeleus - this paper was the original motivation: https://www.biorxiv.org/content/10.64898/2026.03.08.710357v1. They describe the per-tile binning method, but do not publish any tool to replicate it. It seemed a straightforward and logical thing to add to a tool like
fq only has one downsampling method, i.e., uniform random sampling. Choosing between approximate and exact affects the number of records that are emitted. I don't think an experimental and untested methodology is a good fit here. Have you tested the practicality of the output, and is there a meaningful improvement over random sampling?
Yes, I've been using the per-tile subsampled output (using the build from this branch) in an active R&D project. I'll note that it's not straightforward to select the value for the per-tile target number of reads, so if this PR is accepted, it would be good to have a companion tool that counts the number of reads per tile and outputs the distribution. It also seems that, at least for NovaSeq X, BCL collates the reads by tile (i.e. all the reads from each tile are consecutive in the FASTQ). So an indexing procedure that notes the offset of the first read in each tile during the per-tile counting pass would enable parallelization of the per-tile downsampling.
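As a rough illustration of the indexing idea in the comment above, the following sketch scans an uncompressed FASTQ buffer and records the byte offset of the first read of each run of consecutive reads from the same tile. It is not part of this PR: the names `lane_tile` and `first_offsets` are hypothetical, and it assumes strict four-line records and Illumina-style read names.

```rust
/// Extracts (lane, tile) from `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`.
fn lane_tile(header: &str) -> Option<(u32, u32)> {
    let name = header.strip_prefix('@').unwrap_or(header);
    let name = name.split_whitespace().next()?;
    let mut fields = name.split(':');
    let lane = fields.nth(3)?.parse().ok()?;
    let tile = fields.next()?.parse().ok()?;
    Some((lane, tile))
}

/// Returns `((lane, tile), byte_offset)` for the first record of each run of
/// consecutive records that share a tile. Assumes four lines per record.
pub fn first_offsets(fastq: &[u8]) -> Vec<((u32, u32), usize)> {
    let mut index = Vec::new();
    let mut last = None;
    let mut offset = 0usize;
    for (line_no, line) in fastq.split_inclusive(|&b| b == b'\n').enumerate() {
        // Every fourth line, starting at line 0, is a record header.
        if line_no % 4 == 0 {
            let key = std::str::from_utf8(line).ok().and_then(lane_tile);
            if let Some(k) = key {
                if last != Some(k) {
                    index.push((k, offset));
                    last = Some(k);
                }
            }
        }
        offset += line.len();
    }
    index
}
```

If reads really are collated by tile, each tile appears once in the index, and a worker can seek straight to its offset.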
Summary
Adds per-tile binning to the `subsample` command for normalizing FASTQ files based on physical sequencing location, mitigating batch effects from uneven coverage or flow cell edge effects.
Illumina read header format: `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`
New CLI options
- `--record-count-per-tile N` (… `-n`/`-p`)
- `--bin-by-tile`: with `-n` or `-p` (auto-computes per-tile count)
- `--fast`
- `--in-memory`
- `--temp-dir DIR`
- `--sampling-threads N`
- `--compression-threads N`
Three tile count modes
- `--record-count-per-tile N`: explicit per-tile count
- `-n N --bin-by-tile`: per-tile = `N / num_bins`
- `-p P --bin-by-tile`: per-tile = `floor(P * total / num_bins)`
Three storage/sampling strategies
- `--fast`
- `--in-memory`
Usage examples
Dependencies
- `tempfile` crate for temp directory management
Test plan
- … `--record-count`
🤖 Generated with Claude Code