fastq: Add RecordIndex and IndexedReader for O(1) random access by jdidion · Pull Request #56 · stjude-rust-labs/fq

jdidion · 2026-03-14T14:39:54Z

Depends on PR #44 (per-tile binning). Merge #44 first.

Summary

Introduces RecordIndex and IndexedReader in src/fastq/io/, replacing several ad-hoc functions in the subsample command with a unified indexed FASTQ reader.

`RecordIndex`

Built by a single scan of an uncompressed FASTQ file. Stores the byte offset of every record start. Provides:

record_count() — O(1)
avg_record_size() — O(1)
offset(i) — O(1) lookup
from_fai(path) — load record count from a .fai sidecar
build_from_path(path) — scan and index a file on disk

`IndexedReader`

Wraps a seekable reader + RecordIndex. Provides:

read_record_at(i) — seek to known offset, decode record
read_records_at(indices, callback) — batch reads at sorted indices
skip_ahead_sample(target, rng, callback) — geometric jumps in record index space (not byte space), eliminating boundary detection

What this replaces

Old function	Replacement
`count_lines` + `/ 4`	`RecordIndex::build_from_path().record_count()`
`build_record_index`	`RecordIndex::build_from_path()`
`estimate_avg_record_size`	`RecordIndex::build_from_path().avg_record_size()`
`count_records_from_index`	`RecordIndex::from_fai()`
`find_record_after` (byte-level boundary scan)	`IndexedReader::read_record_at()` (direct seek)
`skip_ahead_sample_file` (byte jumps + boundary scan)	`IndexedReader::skip_ahead_sample()` (index jumps)
`read_record_at` (raw byte reads)	`IndexedReader::read_record_at()` (returns `Record`)

Why this matters

Skip-ahead is now correct: jumps happen in record-index space, not byte space. No more heuristic boundary detection that could misidentify quality-score @ as record starts.
Single scan: one pass builds record count + offsets + avg size, instead of separate passes for each.
Reusable: RecordIndex and IndexedReader are in fastq::io, available to any command.

count_lines and open_maybe_gz are retained for the gzipped-input fallback path in subsample_exact, since gzipped files can't be seeked.

References

The per-tile binning method implemented in PR #44 (which this PR supports) is described in:

Volkov H, Raitses-Gurevich M, Grad M, et al., "Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome," bioRxiv 2026.03.08.710357v1, Methods p. 18. https://doi.org/10.64898/2026.03.08.710357

Test plan

RecordIndex::build — basic, empty, single record, variable length
RecordIndex::build_from_path — file on disk
RecordIndex::from_fai — found and missing cases
IndexedReader::read_record_at — random access by index
IndexedReader::read_records_at — batch reads
IndexedReader::skip_ahead_sample — approximate target count
IndexedReader::open — from file path
All existing subsample tests still pass
81 tests total (63 original + 18 new)

🤖 Generated with Claude Code

Add `--record-count-per-tile` option to the subsample command. This bins reads by their lane and tile from the Illumina read header, discards bins with fewer than the specified number of records, and randomly subsamples exactly that many records from each retained bin. This normalizes FASTQ files based on physical sequencing location to mitigate batch effects from uneven coverage or flow cell edge effects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When --bin-by-tile is combined with --record-count or --probability, the per-tile count is computed automatically: - --record-count N: per_tile = N / num_bins - --probability P: per_tile = floor(P * total_records / num_bins) The --record-count-per-tile option remains for explicit per-tile counts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Major refactor of per-tile binning to use temporary files instead of in-memory bitmaps. This reduces memory usage from O(total_records) to O(records_per_tile * num_tiles_sampled). Changes: - First pass writes records to per-tile temp files - Skip-ahead sampling (default): uses exponential-distributed byte jumps through temp files for O(n_sampled) I/O - Exact indexed sampling (--exact): builds record offset index for precise seeking - --sampling-threads: parallel tile sampling via std::thread::scope - --compression-threads: parallel gzip output via concatenated streams - --record-count without --exact on uncompressed input also uses skip-ahead - Non-tile --record-count on gzipped input falls back to exact mode Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

When --in-memory is used with tile binning, records are kept in RAM organized by tile instead of writing to temporary files. This trades memory for disk I/O and avoids temporary file creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Change default behavior to exact sampling (backward compatible). The --fast flag opts into skip-ahead approximate sampling. Add --temp-dir option to specify where tile temp files are written (defaults to system temp directory). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Skip-ahead sampling cannot reliably pair R1 and R2 records because it maps byte positions proportionally, which breaks when R1 and R2 records have different sizes. Paired-end now always uses exact indexed sampling regardless of the --fast flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Empty input now produces empty output instead of crashing with "invalid uniform range" from Uniform::new(0, 0) - .fai index counts are cross-checked against actual line counts for uncompressed files (cheap); gzipped files trust the index since decompressing to cross-check would negate the performance benefit Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Replace the collect-then-concatenate pattern with a bounded mpsc channel. Sampling threads send per-tile results to a writer thread that writes directly to the output file(s), avoiding holding all sampled data in memory at once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz - Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex) - Temp file creation errors propagate as SubsampleError instead of panicking via unwrap() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

RecordIndex builds an offset index in a single scan, replacing the separate count_lines, build_record_index, and estimate_avg_record_size functions. IndexedReader wraps a seekable reader + index and provides: - read_record_at(i): seek to known offset, no boundary scanning - read_records_at(indices): batch sorted reads - skip_ahead_sample(target, rng): geometric jumps in record space This eliminates the byte-level boundary detection (find_record_after) that was unreliable for paired-end data. Skip-ahead now jumps between index entries instead of byte positions. Removed from subsample.rs: skip_ahead_sample_file, build_record_index, read_record_at, find_record_after, exponential_sample, estimate_avg_record_size, count_records_from_index. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Skip-ahead no longer always starts at record 0; the first jump is drawn before selecting any record - Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex) - Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz - Temp file creation errors propagate as SubsampleError instead of panicking via unwrap() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jdidion and others added 13 commits March 13, 2026 07:51

Remove accidentally committed .omc state files

0b1d326

Add review response documenting dispositions for all findings

34f7e78

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

jdidion force-pushed the feat/indexed-reader branch from 34b771b to 01825ae Compare March 15, 2026 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fastq: Add RecordIndex and IndexedReader for O(1) random access#56

fastq: Add RecordIndex and IndexedReader for O(1) random access#56
jdidion wants to merge 13 commits intostjude-rust-labs:masterfrom
jdidion:feat/indexed-reader

jdidion commented Mar 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jdidion commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

RecordIndex

IndexedReader

What this replaces

Why this matters

References

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jdidion commented Mar 14, 2026 •

edited

Loading

`RecordIndex`

`IndexedReader`