subsample: Add per-tile binning with skip-ahead sampling and threading by jdidion · Pull Request #44 · stjude-rust-labs/fq

jdidion · 2026-03-13T14:52:09Z

Summary

Adds per-tile binning to the subsample command for normalizing FASTQ files based on physical sequencing location, mitigating batch effects from uneven coverage or flow cell edge effects.

Reads are binned by lane and tile parsed from Illumina read headers (@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>)
Bins with fewer than the target number of records are discarded
Records are randomly sampled from each retained bin
Supports single-end and paired-end reads
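As a rough illustration of the binning key described above, here is a minimal sketch of pulling the lane and tile out of a read name. `BinKey` and `parse_bin_key` are hypothetical names, not taken from this PR, and the handling of the trailing comment field is an assumption.

```rust
/// Lane and tile identifying one physical region of the flow cell.
/// (Hypothetical type for illustration; not the type used in the PR.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct BinKey {
    pub lane: u32,
    pub tile: u32,
}

/// Parses lane and tile from a read name of the form
/// `<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`.
/// The leading `@` is optional, and anything after the first whitespace
/// (the comment field) is ignored. Returns `None` for any other layout.
pub fn parse_bin_key(name: &str) -> Option<BinKey> {
    let name = name.strip_prefix('@').unwrap_or(name);
    let id = name.split_whitespace().next()?;
    let mut fields = id.split(':');
    // `nth(3)` skips instrument, run, and flowcell, then yields the lane.
    let lane = fields.nth(3)?.parse().ok()?;
    let tile = fields.next()?.parse().ok()?;
    // Require the x and y fields to be present, without interpreting them.
    fields.next()?;
    fields.next()?;
    Some(BinKey { lane, tile })
}
```

Reads whose names do not follow this layout yield `None`; whether such reads should be dropped or reported as an error is a policy decision this sketch leaves to the caller.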

New CLI options

Option	Description
`--record-count-per-tile N`	Explicit per-tile count (exclusive with `-n`/`-p`)
`--bin-by-tile`	Enable tile binning with `-n` or `-p` (auto-computes per-tile count)
`--fast`	Use faster skip-ahead sampling instead of default exact method
`--in-memory`	Keep tiles in RAM instead of writing temp files
`--temp-dir DIR`	Directory for temp tile files (default: system temp)
`--sampling-threads N`	Parallel tile processing (default: 1)
`--compression-threads N`	Parallel gzip output (default: 1)

Three tile count modes

--record-count-per-tile N — explicit per-tile count
-n N --bin-by-tile — per-tile = N / num_bins
-p P --bin-by-tile — per-tile = floor(P * total / num_bins)
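The three modes above reduce to a small amount of arithmetic. The sketch below assumes integer (truncating) division for the `-n` case and leaves open which bins count toward `num_bins`; the `Target` enum and `per_tile_count` function are hypothetical names for illustration.

```rust
/// How the per-tile target was requested (hypothetical type for illustration).
pub enum Target {
    /// `--record-count-per-tile N`
    PerTile(u64),
    /// `-n N --bin-by-tile`
    RecordCount(u64),
    /// `-p P --bin-by-tile`
    Probability(f64),
}

/// Computes the number of records to draw from each tile.
/// Returns `None` when there are no bins or the probability is outside [0, 1].
pub fn per_tile_count(target: &Target, total_records: u64, num_bins: u64) -> Option<u64> {
    if num_bins == 0 {
        return None;
    }
    match *target {
        Target::PerTile(n) => Some(n),
        // Assumption: truncating integer division.
        Target::RecordCount(n) => Some(n / num_bins),
        Target::Probability(p) => {
            // A NaN probability also fails this range check.
            if !(0.0..=1.0).contains(&p) {
                return None;
            }
            Some((p * total_records as f64 / num_bins as f64).floor() as u64)
        }
    }
}
```

For example, `-n 100000` over 96 bins gives 1041 records per tile under these assumptions, so the total emitted can fall slightly short of the requested count.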

Three storage/sampling strategies

Mode	Storage	Sampling	Exact counts?
Default	temp files + offset index	indexed seek	yes
`--fast`	temp files	skip-ahead (exponential byte jumps)	approximate
`--in-memory`	RAM	random index selection	yes
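To show why skip-ahead sampling yields only approximate counts, here is a minimal sketch of the general technique. It is not the PR's implementation: the PR describes exponentially distributed byte jumps through temp files, whereas this version jumps over record indices with geometric gaps, which is the discrete analogue. `SplitMix64` and `skip_ahead_indices` are illustrative names, and the small generator is included only to keep the example free of external crates.

```rust
/// Minimal SplitMix64 generator, included only so the sketch needs no crates.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        Self(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform value in (0, 1], so taking `ln` never produces negative infinity.
    pub fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Selects each of `n` records independently with probability `p` by jumping
/// directly from one selected record to the next. The number of records
/// skipped between selections is geometric: floor(ln(U) / ln(1 - p)).
pub fn skip_ahead_indices(n: usize, p: f64, rng: &mut SplitMix64) -> Vec<usize> {
    let mut picked = Vec::new();
    if n == 0 || p.is_nan() || p <= 0.0 {
        return picked;
    }
    if p >= 1.0 {
        return (0..n).collect();
    }
    // ln(1 - p), computed with ln_1p so that very small p stays accurate.
    let log_q = (-p).ln_1p();
    let mut i = 0usize;
    while i < n {
        let gap = (rng.next_unit().ln() / log_q).floor();
        if gap >= (n - i) as f64 {
            break; // the next selection would land past the last record
        }
        i += gap as usize;
        picked.push(i);
        i += 1;
    }
    picked
}
```

The work is proportional to the number of records selected rather than the number scanned, but the output size is binomially distributed around `n * p` instead of being fixed, which matches the "approximate" entry in the table above.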

Usage examples

# Exact 10,000 reads per tile (paper's method)
fq subsample --record-count-per-tile 10000 --seed 42 \
  --r1-dst out.R1.fq.gz --r2-dst out.R2.fq.gz in.R1.fq.gz in.R2.fq.gz

# Auto-compute per-tile from target count, parallel sampling
fq subsample -n 100000 --bin-by-tile --sampling-threads 4 --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz

# Fast approximate mode with custom temp dir
fq subsample --record-count-per-tile 10000 --fast --temp-dir /scratch --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz

# In-memory mode (no temp files, higher RAM usage)
fq subsample --record-count-per-tile 10000 --in-memory --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz

Dependencies

Added tempfile crate for temp directory management

Test plan

🤖 Generated with Claude Code

Add `--record-count-per-tile` option to the subsample command. This bins reads by their lane and tile from the Illumina read header, discards bins with fewer than the specified number of records, and randomly subsamples exactly that many records from each retained bin. This normalizes FASTQ files based on physical sequencing location to mitigate batch effects from uneven coverage or flow cell edge effects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When --bin-by-tile is combined with --record-count or --probability, the per-tile count is computed automatically:
- --record-count N: per_tile = N / num_bins
- --probability P: per_tile = floor(P * total_records / num_bins)
The --record-count-per-tile option remains for explicit per-tile counts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Major refactor of per-tile binning to use temporary files instead of in-memory bitmaps. This reduces memory usage from O(total_records) to O(records_per_tile * num_tiles_sampled).
Changes:
- First pass writes records to per-tile temp files
- Skip-ahead sampling (default): uses exponential-distributed byte jumps through temp files for O(n_sampled) I/O
- Exact indexed sampling (--exact): builds record offset index for precise seeking
- --sampling-threads: parallel tile sampling via std::thread::scope
- --compression-threads: parallel gzip output via concatenated streams
- --record-count without --exact on uncompressed input also uses skip-ahead
- Non-tile --record-count on gzipped input falls back to exact mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When --in-memory is used with tile binning, records are kept in RAM organized by tile instead of writing to temporary files. This trades memory for disk I/O and avoids temporary file creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change default behavior to exact sampling (backward compatible). The --fast flag opts into skip-ahead approximate sampling. Add --temp-dir option to specify where tile temp files are written (defaults to system temp directory). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Skip-ahead sampling cannot reliably pair R1 and R2 records because it maps byte positions proportionally, which breaks when R1 and R2 records have different sizes. Paired-end now always uses exact indexed sampling regardless of the --fast flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
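The reason index-based sampling keeps mates together can be shown with a short sketch: the record indices are chosen once and then applied to both files, so byte sizes never enter into it. This is an illustration under assumptions, not the PR's code. `choose_indices` and `sample_pairs` are hypothetical names, `rand_below` stands in for whatever random source the caller uses, and the O(n) index pool is chosen for brevity rather than memory efficiency.

```rust
/// Picks `k` distinct record indices out of `n`, returned in ascending order.
/// `rand_below(m)` is a caller-supplied source that must return a value in `0..m`.
/// If `k > n`, all `n` indices are returned; if `n == 0`, the result is empty
/// and `rand_below` is never called.
pub fn choose_indices(
    n: usize,
    k: usize,
    mut rand_below: impl FnMut(usize) -> usize,
) -> Vec<usize> {
    let k = k.min(n);
    let mut pool: Vec<usize> = (0..n).collect();
    // Partial Fisher-Yates shuffle: position i receives a uniformly chosen
    // element from the not-yet-fixed tail pool[i..].
    for i in 0..k {
        let j = i + rand_below(n - i);
        pool.swap(i, j);
    }
    pool.truncate(k);
    pool.sort_unstable();
    pool
}

/// Applies one set of record indices to both mates, so R1 and R2 stay in sync
/// even when their records have different byte lengths.
pub fn sample_pairs<'a>(
    r1: &[&'a str],
    r2: &[&'a str],
    indices: &[usize],
) -> Vec<(&'a str, &'a str)> {
    indices.iter().map(|&i| (r1[i], r2[i])).collect()
}
```

In an indexed implementation the `&str` slices would be replaced by per-file record offset tables, with the same index list used to seek in each file.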

- Empty input now produces empty output instead of crashing with "invalid uniform range" from Uniform::new(0, 0)
- .fai index counts are cross-checked against actual line counts for uncompressed files (cheap); gzipped files trust the index since decompressing to cross-check would negate the performance benefit
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the collect-then-concatenate pattern with a bounded mpsc channel. Sampling threads send per-tile results to a writer thread that writes directly to the output file(s), avoiding holding all sampled data in memory at once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
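The pattern described above can be sketched with the standard library alone. This is a simplified illustration, not the PR's code: `sample_tile` is a hypothetical placeholder for the real per-tile sampling, the calling thread plays the writer role instead of a dedicated thread, and tiles are assigned to workers round-robin.

```rust
use std::io::{self, Write};
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for per-tile sampling: returns the bytes to emit.
fn sample_tile(tile: &[u8]) -> Vec<u8> {
    tile.to_vec()
}

/// Samples tiles on `workers` threads and streams each result to `out`
/// through a bounded channel holding at most `capacity` pending results.
pub fn write_sampled_tiles<W: Write>(
    tiles: &[Vec<u8>],
    workers: usize,
    capacity: usize,
    out: &mut W,
) -> io::Result<()> {
    let workers = workers.max(1);
    // `sync_channel` blocks senders once `capacity` results are queued,
    // which bounds how much sampled data is held in memory at once.
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);

    thread::scope(|scope| {
        for w in 0..workers {
            let tx = tx.clone();
            scope.spawn(move || {
                // Worker `w` handles tiles w, w + workers, w + 2 * workers, ...
                for tile in tiles.iter().skip(w).step_by(workers) {
                    // A send error means the receiver is gone (for example
                    // after a write error), so there is nothing left to do.
                    if tx.send(sample_tile(tile)).is_err() {
                        break;
                    }
                }
            });
        }
        // Drop the original sender so the receive loop ends when workers finish.
        drop(tx);

        for chunk in rx {
            out.write_all(&chunk)?;
        }
        Ok(())
    })
}
```

With more than one worker, chunks are written in arrival order, which is not deterministic in this sketch; an implementation that needs reproducible output order would have to tag each chunk with its tile position and reorder before writing.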

- Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz
- Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex)
- Temp file creation errors propagate as SubsampleError instead of panicking via unwrap()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
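A minimal sketch of the deterministic ordering and seeding idea follows. The packing of lane and tile into a single `u64` is an assumption made for illustration, as are the function names; the commit message only states that tiles are sorted by `bin_key` and that each tile's RNG is derived from `base_seed + bin_key`.

```rust
/// Packs lane and tile into one sortable integer (hypothetical encoding).
pub fn bin_key(lane: u32, tile: u32) -> u64 {
    (u64::from(lane) << 32) | u64::from(tile)
}

/// Derives a per-tile seed from the base seed, wrapping on overflow.
pub fn tile_seed(base_seed: u64, bin_key: u64) -> u64 {
    base_seed.wrapping_add(bin_key)
}

/// Returns `(bin_key, seed)` pairs in ascending key order, independent of the
/// order in which tiles were first seen in the input.
pub fn processing_plan(base_seed: u64, tiles: &[(u32, u32)]) -> Vec<(u64, u64)> {
    let mut keys: Vec<u64> = tiles
        .iter()
        .map(|&(lane, tile)| bin_key(lane, tile))
        .collect();
    keys.sort_unstable();
    keys.dedup();
    keys.into_iter()
        .map(|key| (key, tile_seed(base_seed, key)))
        .collect()
}
```

Because each tile owns its generator, worker threads need no shared RNG state, and the sample drawn from a tile does not depend on thread scheduling. Whether adjacent seeds are adequately decorrelated depends on the generator being seeded.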

zaeleus · 2026-03-18T21:41:56Z

Hi,

I suspect this feature may be a bit too niche to include in fq.

I am curious, however, about the motivation behind per-tile binning. With a sufficiently large number of reads, is there a measurable difference in downstream quality compared with uniform sampling?

jdidion · 2026-03-20T21:20:51Z

Hi @zaeleus - this paper was the original motivation: https://www.biorxiv.org/content/10.64898/2026.03.08.710357v1. They describe the per-tile binning method but do not publish a tool to replicate it. It seemed a straightforward and logical thing to add to a tool like fq that already supports multiple downsampling methods.

zaeleus · 2026-03-24T20:18:36Z

fq only has one downsampling method, i.e., uniform random sampling. Choosing between approximate and exact affects the number of records that are emitted.

I don't think an experimental and untested methodology is a good fit here. Have you tested the practicality of the output, and is there a meaningful improvement over random sampling?

jdidion · 2026-03-24T20:25:05Z

Yes, I've been using the per-tile subsampled output (using the build from this branch) in an active R&D project.

I'll note that it's not straightforward to select the per-tile target number of reads, so if this PR is accepted, it would be good to have a companion tool that counts the number of reads per tile and outputs the distribution.

It also seems that, at least for NovaSeq X, BCL collates the reads by tile (i.e., all the reads from each tile are consecutive in the FASTQ). So an indexing procedure that notes the offset of the first read in each tile during the per-tile counting pass would enable parallelization of the per-tile downsampling.
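To make the two ideas above concrete, here is a minimal sketch of a single counting pass that records both the number of reads per tile and the byte offset of the first read seen for each tile. It is not part of this PR. It assumes uncompressed input with strict four-line records, and `TileStats`, `index_tiles`, and `lane_and_tile` are hypothetical names.

```rust
use std::collections::BTreeMap;
use std::io::{self, BufRead};

/// Per-tile summary gathered during one pass (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TileStats {
    /// Number of records whose header names this tile.
    pub records: u64,
    /// Byte offset of the first record seen for this tile.
    pub first_offset: u64,
}

/// Extracts `(lane, tile)` from a header line such as
/// `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`.
fn lane_and_tile(header: &str) -> Option<(u32, u32)> {
    let id = header.strip_prefix('@')?.split_whitespace().next()?;
    let mut fields = id.split(':');
    let lane = fields.nth(3)?.parse().ok()?;
    let tile = fields.next()?.parse().ok()?;
    Some((lane, tile))
}

/// Scans four-line FASTQ records, counting reads per tile and noting where
/// each tile first appears. Headers that do not parse are skipped.
pub fn index_tiles<R: BufRead>(mut reader: R) -> io::Result<BTreeMap<(u32, u32), TileStats>> {
    let mut stats = BTreeMap::new();
    let mut line = String::new();
    let mut offset = 0u64;
    let mut line_no = 0u64;
    loop {
        line.clear();
        let n = reader.read_line(&mut line)?;
        if n == 0 {
            break; // end of input
        }
        // Every fourth line, starting with the first, is a record header.
        if line_no % 4 == 0 {
            if let Some(key) = lane_and_tile(&line) {
                let entry = stats.entry(key).or_insert(TileStats {
                    records: 0,
                    first_offset: offset,
                });
                entry.records += 1;
            }
        }
        offset += n as u64;
        line_no += 1;
    }
    Ok(stats)
}
```

The `records` values give the per-tile distribution needed to pick a target count. The `first_offset` values are only useful as seek targets when each tile's reads are contiguous and the file is seekable, which does not hold for a plain gzip stream.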

jdidion and others added 5 commits March 13, 2026 07:51

Remove accidentally committed .omc state files

0b1d326

jdidion changed the title ~~subsample: Add per-tile binning mode~~ subsample: Add per-tile binning with skip-ahead sampling and threading Mar 13, 2026

jdidion mentioned this pull request Mar 14, 2026

subsample: Use FASTQ index for fast record counting #46

Closed

4 tasks

jdidion force-pushed the feat/subsample-by-tile branch from ccdd43a to cd6e340 Compare March 14, 2026 13:36

jdidion and others added 2 commits March 14, 2026 06:41

jdidion mentioned this pull request Mar 14, 2026

fastq: Add RecordIndex and IndexedReader for O(1) random access #56

Open

9 tasks

jdidion and others added 2 commits March 15, 2026 08:30

Add review response documenting dispositions for all findings

34f7e78

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jdidion wants to merge 11 commits into stjude-rust-labs:master from jdidion:feat/subsample-by-tile
