
subsample: Add per-tile binning with skip-ahead sampling and threading#44

Open
jdidion wants to merge 11 commits into stjude-rust-labs:master from jdidion:feat/subsample-by-tile

Conversation


@jdidion jdidion commented Mar 13, 2026

Summary

Adds per-tile binning to the subsample command for normalizing FASTQ files based on physical sequencing location, mitigating batch effects from uneven coverage or flow cell edge effects.

  • Reads are binned by lane and tile parsed from Illumina read headers (@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>)
  • Bins with fewer than the target number of records are discarded
  • Records are randomly sampled from each retained bin
  • Supports single-end and paired-end reads
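As a sketch of how the binning key could be extracted, here is a std-only Rust snippet that parses lane and tile from the `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>` layout. The function name `parse_lane_tile` is illustrative, not the actual fq internals:

```rust
// Illustrative sketch: extract (lane, tile) from an Illumina read name.
// Field positions follow `@<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>`.
fn parse_lane_tile(name: &str) -> Option<(u32, u32)> {
    // Strip the leading '@' and any trailing description after whitespace.
    let head = name.strip_prefix('@').unwrap_or(name);
    let head = head.split_whitespace().next()?;
    let mut fields = head.split(':');
    let lane: u32 = fields.nth(3)?.parse().ok()?; // 4th colon-separated field
    let tile: u32 = fields.next()?.parse().ok()?; // 5th field
    Some((lane, tile))
}

fn main() {
    let name = "@M00001:42:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG";
    assert_eq!(parse_lane_tile(name), Some((1, 1101)));
    // Headers without enough fields yield None rather than panicking.
    assert_eq!(parse_lane_tile("@bad-header"), None);
    println!("ok");
}
```

The `(lane, tile)` pair would then serve as the bin key for the per-tile grouping described above.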

New CLI options

| Option | Description |
| --- | --- |
| `--record-count-per-tile N` | Explicit per-tile count (mutually exclusive with `-n`/`-p`) |
| `--bin-by-tile` | Enable tile binning with `-n` or `-p` (auto-computes the per-tile count) |
| `--fast` | Use faster skip-ahead sampling instead of the default exact method |
| `--in-memory` | Keep tiles in RAM instead of writing temp files |
| `--temp-dir DIR` | Directory for temp tile files (default: system temp) |
| `--sampling-threads N` | Parallel tile processing (default: 1) |
| `--compression-threads N` | Parallel gzip output (default: 1) |
Three tile count modes

  • --record-count-per-tile N — explicit per-tile count
  • -n N --bin-by-tile — per-tile = N / num_bins
  • -p P --bin-by-tile — per-tile = floor(P * total / num_bins)
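The arithmetic for the three modes can be sketched as follows (a minimal std-only illustration; `per_tile_count` and its parameters are hypothetical names, not the actual fq code):

```rust
// Illustrative sketch of the three tile-count modes.
// `explicit` ~ --record-count-per-tile, `record_count` ~ -n, `probability` ~ -p.
fn per_tile_count(
    explicit: Option<u64>,
    record_count: Option<u64>,
    probability: Option<f64>,
    total_records: u64,
    num_bins: u64,
) -> Option<u64> {
    match (explicit, record_count, probability) {
        // --record-count-per-tile N: use N directly.
        (Some(n), _, _) => Some(n),
        // -n N --bin-by-tile: per-tile = N / num_bins.
        (None, Some(n), _) => Some(n / num_bins),
        // -p P --bin-by-tile: per-tile = floor(P * total / num_bins).
        (None, None, Some(p)) => {
            Some((p * total_records as f64 / num_bins as f64).floor() as u64)
        }
        _ => None,
    }
}

fn main() {
    assert_eq!(per_tile_count(Some(10_000), None, None, 0, 8), Some(10_000));
    assert_eq!(per_tile_count(None, Some(100_000), None, 0, 8), Some(12_500));
    assert_eq!(per_tile_count(None, None, Some(0.1), 1_000_000, 8), Some(12_500));
    println!("ok");
}
```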

Three storage/sampling strategies

| Mode | Storage | Sampling | Exact counts? |
| --- | --- | --- | --- |
| Default | temp files + offset index | indexed seek | yes |
| `--fast` | temp files | skip-ahead (exponential byte jumps) | approximate |
| `--in-memory` | RAM | random index selection | yes |
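To make the `--fast` row concrete, here is a rough std-only sketch of skip-ahead sampling with exponentially distributed byte jumps. The tiny `XorShift64` generator and the function names are stand-ins for illustration, not the actual implementation; in practice the caller would scan forward from each offset to the next record boundary:

```rust
// Minimal xorshift RNG so the sketch has no external dependencies.
struct XorShift64(u64);
impl XorShift64 {
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Map the top 53 bits into (0, 1].
        ((x >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
}

/// Draw approximate byte offsets to seek to. Jumps are exponential with mean
/// file_len / n_samples, so on average ~n_samples offsets fit in the file,
/// but the count is only approximate (hence "approximate" in the table).
fn skip_ahead_offsets(file_len: u64, n_samples: u64, rng: &mut XorShift64) -> Vec<u64> {
    let mut offsets = Vec::new();
    let mut pos = 0.0;
    let mean_gap = file_len as f64 / n_samples as f64;
    for _ in 0..n_samples {
        // Exponential jump: -mean * ln(u), u in (0, 1].
        pos += -mean_gap * rng.next_f64().ln();
        if pos >= file_len as f64 {
            break; // may emit slightly fewer than n_samples
        }
        offsets.push(pos as u64);
    }
    offsets
}

fn main() {
    let mut rng = XorShift64(42);
    let offsets = skip_ahead_offsets(1_000_000, 100, &mut rng);
    assert!(offsets.len() <= 100);
    assert!(offsets.windows(2).all(|w| w[0] <= w[1])); // monotonically increasing
    println!("sampled {} offsets", offsets.len());
}
```

Each jump is O(1), so the whole pass costs O(n_sampled) I/O instead of reading every record.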

Usage examples

```shell
# Exact 10,000 reads per tile (paper's method)
fq subsample --record-count-per-tile 10000 --seed 42 \
  --r1-dst out.R1.fq.gz --r2-dst out.R2.fq.gz in.R1.fq.gz in.R2.fq.gz

# Auto-compute per-tile count from target count, parallel sampling
fq subsample -n 100000 --bin-by-tile --sampling-threads 4 --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz

# Fast approximate mode with custom temp dir
fq subsample --record-count-per-tile 10000 --fast --temp-dir /scratch --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz

# In-memory mode (no temp files, higher RAM usage)
fq subsample --record-count-per-tile 10000 --in-memory --seed 42 \
  --r1-dst out.R1.fq.gz in.R1.fq.gz
```

Dependencies

  • Added tempfile crate for temp directory management

Test plan

  • Illumina header parsing (standard, with description, edge cases)
  • Exponential distribution sampling
  • FASTQ record boundary detection after seeking
  • Record offset index building
  • Exact tile sampling (single-end and paired-end)
  • Skip-ahead tile sampling (single-end)
  • In-memory tile sampling (paired-end)
  • Auto per-tile count from --record-count
  • Empty bins edge case
  • All 73 tests pass (63 original + 10 new)
  • Test with real Illumina FASTQ data

🤖 Generated with Claude Code

jdidion and others added 5 commits March 13, 2026 07:51
Add `--record-count-per-tile` option to the subsample command. This bins
reads by their lane and tile from the Illumina read header, discards bins
with fewer than the specified number of records, and randomly subsamples
exactly that many records from each retained bin.

This normalizes FASTQ files based on physical sequencing location to
mitigate batch effects from uneven coverage or flow cell edge effects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --bin-by-tile is combined with --record-count or --probability,
the per-tile count is computed automatically:
  - --record-count N: per_tile = N / num_bins
  - --probability P: per_tile = floor(P * total_records / num_bins)

The --record-count-per-tile option remains for explicit per-tile counts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major refactor of per-tile binning to use temporary files instead of
in-memory bitmaps. This reduces memory usage from O(total_records) to
O(records_per_tile * num_tiles_sampled).

Changes:
- First pass writes records to per-tile temp files
- Skip-ahead sampling (default): uses exponential-distributed byte
  jumps through temp files for O(n_sampled) I/O
- Exact indexed sampling (--exact): builds record offset index for
  precise seeking
- --sampling-threads: parallel tile sampling via std::thread::scope
- --compression-threads: parallel gzip output via concatenated streams
- --record-count without --exact on uncompressed input also uses
  skip-ahead
- Non-tile --record-count on gzipped input falls back to exact mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --in-memory is used with tile binning, records are kept in RAM
organized by tile instead of writing to temporary files. This trades
memory for disk I/O and avoids temporary file creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jdidion jdidion changed the title subsample: Add per-tile binning mode subsample: Add per-tile binning with skip-ahead sampling and threading Mar 13, 2026
Change default behavior to exact sampling (backward compatible).
The --fast flag opts into skip-ahead approximate sampling.

Add --temp-dir option to specify where tile temp files are written
(defaults to system temp directory).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip-ahead sampling cannot reliably pair R1 and R2 records because
it maps byte positions proportionally, which breaks when R1 and R2
records have different sizes. Paired-end now always uses exact
indexed sampling regardless of the --fast flag.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@jdidion jdidion force-pushed the feat/subsample-by-tile branch from ccdd43a to cd6e340 on March 14, 2026 13:36
jdidion and others added 2 commits March 14, 2026 06:41
- Empty input now produces empty output instead of crashing with
  "invalid uniform range" from Uniform::new(0, 0)
- .fai index counts are cross-checked against actual line counts for
  uncompressed files (cheap); gzipped files trust the index since
  decompressing to cross-check would negate the performance benefit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the collect-then-concatenate pattern with a bounded
mpsc channel. Sampling threads send per-tile results to a writer
thread that writes directly to the output file(s), avoiding holding
all sampled data in memory at once.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
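The bounded-channel pattern described in this commit might look roughly like the following std-only sketch. `run_pipeline` and the per-tile payloads are illustrative placeholders, not the actual fq code; the point is that a `sync_channel` bound applies backpressure so sampled data is never all held in memory at once:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Illustrative sketch: sampling threads send per-tile results through a
// bounded channel to a single writer thread, instead of collecting
// everything and concatenating at the end.
fn run_pipeline(n_tiles: u32) -> usize {
    // Bounded to 4 in-flight results: senders block when the writer lags.
    let (tx, rx) = sync_channel::<(u32, Vec<String>)>(4);

    // Writer thread drains the channel; a real writer would append each
    // batch to the output FASTQ file(s) here.
    let writer = thread::spawn(move || {
        let mut total = 0;
        for (_tile, records) in rx {
            total += records.len();
        }
        total
    });

    // Scoped sampler threads, one per tile (all joined at scope end).
    thread::scope(|s| {
        for tile in 0..n_tiles {
            let tx = tx.clone();
            s.spawn(move || {
                let sampled = vec![format!("@read_from_tile_{tile}")];
                tx.send((tile, sampled)).unwrap();
            });
        }
    });
    drop(tx); // close the channel so the writer loop terminates

    writer.join().unwrap()
}

fn main() {
    assert_eq!(run_pipeline(8), 8);
    println!("ok");
}
```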
jdidion and others added 2 commits March 15, 2026 08:30
- Remove .bgz from is_gzipped() to match fastq::fs which only
  recognizes .gz
- Tile processing order is deterministic: tiles sorted by bin_key,
  per-tile RNG derived from base_seed + bin_key (no shared mutex)
- Temp file creation errors propagate as SubsampleError instead of
  panicking via unwrap()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
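The deterministic seeding described above (per-tile RNG derived from `base_seed + bin_key`, no shared mutex) can be sketched like this; `tile_seed` is a hypothetical name and `DefaultHasher` merely stands in for whatever mixing function the actual implementation uses:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative sketch: derive each tile's RNG seed from the base seed and
// its bin key, so parallel workers produce identical output regardless of
// thread scheduling, with no shared RNG state.
fn tile_seed(base_seed: u64, bin_key: u64) -> u64 {
    // DefaultHasher::new() uses fixed keys, so this is deterministic
    // within and across runs of the same build.
    let mut h = DefaultHasher::new();
    (base_seed, bin_key).hash(&mut h);
    h.finish()
}

fn main() {
    // Same inputs always give the same seed.
    assert_eq!(tile_seed(42, 1101), tile_seed(42, 1101));
    // Different tiles get different streams.
    assert_ne!(tile_seed(42, 1101), tile_seed(42, 1102));
    println!("ok");
}
```

Combined with sorting tiles by `bin_key`, this makes the whole sampling pass reproducible under any `--sampling-threads` value.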
Collaborator

zaeleus commented Mar 18, 2026

Hi,

I suspect this feature may be a bit too niche to include in fq.

I am curious, however, about the motivation behind per-tile binning. With a sufficiently large number of reads, is there a measurable difference in downstream quality compared to uniform sampling?

Author

jdidion commented Mar 20, 2026

Hi @zaeleus - this paper was the original motivation: https://www.biorxiv.org/content/10.64898/2026.03.08.710357v1. They describe the per-tile binning method, but do not publish any tool to replicate it. It seemed a straightforward and logical thing to add to a tool like fq that already supports multiple different downsampling methods.

Collaborator

zaeleus commented Mar 24, 2026

fq only has one downsampling method, i.e., uniform random sampling. Choosing between approximate and exact affects the number of records that are emitted.

I don't think an experimental and untested methodology is a good fit here. Have you tested the practicality of the output, and is there a meaningful improvement over random sampling?

Author

jdidion commented Mar 24, 2026

Yes, I've been using the per-tile subsampled output (using the build from this branch) in an active R&D project.

I'll note that it's not straightforward to select the value for the per-tile target number of reads, so if this PR is accepted, it would be good to have a companion tool that counts the number of reads per tile and outputs the distribution.

It also seems that, at least for NovaSeq X, BCL collates the reads by tile (i.e., all the reads from each tile are consecutive in the FASTQ). So an indexing procedure that notes the offset of the first read in each tile during the per-tile counting pass would enable parallelization of the per-tile downsampling.

