
fastq: Add RecordIndex and IndexedReader for O(1) random access #56

Open
jdidion wants to merge 13 commits into stjude-rust-labs:master from jdidion:feat/indexed-reader

Conversation


jdidion commented Mar 14, 2026

Depends on PR #44 (per-tile binning). Merge #44 first.

Summary

Introduces RecordIndex and IndexedReader in src/fastq/io/, replacing several ad-hoc functions in the subsample command with a unified indexed FASTQ reader.

RecordIndex

Built by a single scan of an uncompressed FASTQ file. Stores the byte offset of every record start. Provides:

  • record_count() — O(1)
  • avg_record_size() — O(1)
  • offset(i) — O(1) lookup
  • from_fai(path) — load record count from a .fai sidecar
  • build_from_path(path) — scan and index a file on disk
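The single-scan build can be sketched as follows. This is an illustrative, std-only sketch under the assumptions that records are well-formed 4-line FASTQ entries and that the struct stores one offset per record; the field and method names mirror the API listed above but are not taken from the actual source.

```rust
use std::io::{BufRead, BufReader, Read};

/// Sketch of a record index: the byte offset of every record start,
/// built in one pass over an uncompressed FASTQ stream.
struct RecordIndex {
    offsets: Vec<u64>,
    total_bytes: u64,
}

impl RecordIndex {
    /// Scan the stream once, recording the offset of each 4-line
    /// record's first byte (the '@' header line).
    fn build<R: Read>(reader: R) -> std::io::Result<RecordIndex> {
        let mut reader = BufReader::new(reader);
        let mut offsets = Vec::new();
        let mut pos: u64 = 0;
        let mut line_no: u64 = 0;
        let mut line = String::new();
        loop {
            line.clear();
            let n = reader.read_line(&mut line)?;
            if n == 0 {
                break; // EOF
            }
            if line_no % 4 == 0 {
                offsets.push(pos); // header line starts a record
            }
            pos += n as u64;
            line_no += 1;
        }
        Ok(RecordIndex { offsets, total_bytes: pos })
    }

    fn record_count(&self) -> usize {
        self.offsets.len()
    }

    fn offset(&self, i: usize) -> u64 {
        self.offsets[i]
    }

    fn avg_record_size(&self) -> u64 {
        if self.offsets.is_empty() {
            0
        } else {
            self.total_bytes / self.offsets.len() as u64
        }
    }
}

fn main() {
    // Two records of different sizes: the count, offsets, and average
    // size all come from the same single scan.
    let fastq = b"@r1\nACGT\n+\nIIII\n@r2\nAC\n+\nII\n";
    let idx = RecordIndex::build(&fastq[..]).unwrap();
    println!("{} {} {}", idx.record_count(), idx.offset(1), idx.avg_record_size());
}
```

Because the offsets are exact, no heuristic is ever needed to find a record boundary after a seek.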

IndexedReader

Wraps a seekable reader + RecordIndex. Provides:

  • read_record_at(i) — seek to known offset, decode record
  • read_records_at(indices, callback) — batch reads at sorted indices
  • skip_ahead_sample(target, rng, callback) — geometric jumps in record index space (not byte space), eliminating boundary detection
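The skip-ahead idea can be sketched in record-index space. The sketch below draws geometric gaps between selected record *indices* (per-record inclusion probability `target / record_count`), so every selected position is a known record start. A tiny inline LCG stands in for a real RNG to keep the example std-only; the actual crate presumably takes an `rng` parameter as listed above, and all names here are illustrative.

```rust
/// Sketch of skip-ahead sampling in record-index space: jump forward by
/// geometrically distributed gaps between record indices instead of
/// byte positions, so no boundary detection is needed.
fn skip_ahead_indices(record_count: u64, target: u64, mut state: u64) -> Vec<u64> {
    let p = target as f64 / record_count as f64; // per-record inclusion probability
    let mut picked = Vec::new();
    let mut i: i64 = -1; // the first jump is drawn before selecting any record
    loop {
        // LCG step -> uniform u in (0, 1]; stands in for a real RNG.
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = ((state >> 11) as f64 + 1.0) / (1u64 << 53) as f64;
        // Geometric gap: number of record indices to jump forward (>= 1).
        let gap = (u.ln() / (1.0 - p).ln()).floor() as i64 + 1;
        i += gap;
        if i as u64 >= record_count {
            break;
        }
        picked.push(i as u64); // read_record_at(i) would decode this record
    }
    picked
}

fn main() {
    // Sample roughly 1000 of 1,000,000 records; indices are strictly
    // increasing and each one is a valid record start.
    let picked = skip_ahead_indices(1_000_000, 1000, 42);
    println!("{} {}", picked.len(), picked.windows(2).all(|w| w[0] < w[1]));
}
```

The yield is approximately `target` (it is a per-record Bernoulli process, not an exact draw), which matches the "approximate target count" behavior exercised in the test plan below.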

What this replaces

| Old function | Replacement |
| --- | --- |
| count_lines + / 4 | RecordIndex::build_from_path().record_count() |
| build_record_index | RecordIndex::build_from_path() |
| estimate_avg_record_size | RecordIndex::build_from_path().avg_record_size() |
| count_records_from_index | RecordIndex::from_fai() |
| find_record_after (byte-level boundary scan) | IndexedReader::read_record_at() (direct seek) |
| skip_ahead_sample_file (byte jumps + boundary scan) | IndexedReader::skip_ahead_sample() (index jumps) |
| read_record_at (raw byte reads) | IndexedReader::read_record_at() (returns Record) |

Why this matters

  • Skip-ahead is now correct: jumps happen in record-index space, not byte space. No more heuristic boundary detection that could misidentify a quality-score @ as a record start.
  • Single scan: one pass builds record count + offsets + avg size, instead of separate passes for each.
  • Reusable: RecordIndex and IndexedReader are in fastq::io, available to any command.

count_lines and open_maybe_gz are retained for the gzipped-input fallback path in subsample_exact, since gzipped files do not support random seeking.

References

The per-tile binning method implemented in PR #44 (which this PR supports) is described in:

Volkov H, Raitses-Gurevich M, Grad M, et al., "Exploring per-base quality scores as a surrogate marker of cell-free DNA fragmentome," bioRxiv 2026.03.08.710357v1, Methods p. 18. https://doi.org/10.64898/2026.03.08.710357

Test plan

  • RecordIndex::build — basic, empty, single record, variable length
  • RecordIndex::build_from_path — file on disk
  • RecordIndex::from_fai — found and missing cases
  • IndexedReader::read_record_at — random access by index
  • IndexedReader::read_records_at — batch reads
  • IndexedReader::skip_ahead_sample — approximate target count
  • IndexedReader::open — from file path
  • All existing subsample tests still pass
  • 81 tests total (63 original + 18 new)

🤖 Generated with Claude Code

jdidion and others added 13 commits March 13, 2026 07:51
Add `--record-count-per-tile` option to the subsample command. This bins
reads by their lane and tile from the Illumina read header, discards bins
with fewer than the specified number of records, and randomly subsamples
exactly that many records from each retained bin.

This normalizes FASTQ files based on physical sequencing location to
mitigate batch effects from uneven coverage or flow cell edge effects.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --bin-by-tile is combined with --record-count or --probability,
the per-tile count is computed automatically:
  - --record-count N: per_tile = N / num_bins
  - --probability P: per_tile = floor(P * total_records / num_bins)

The --record-count-per-tile option remains for explicit per-tile counts.
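The two derivations above amount to simple arithmetic; a sketch with illustrative names (the real option handling presumably lives in the subsample command's argument parsing):

```rust
/// Sketch of the automatic per-tile count: --record-count takes
/// precedence, --probability is scaled by the total record count, and
/// both are divided across the tile bins.
fn per_tile_count(
    record_count: Option<u64>,
    probability: Option<f64>,
    total_records: u64,
    num_bins: u64,
) -> Option<u64> {
    match (record_count, probability) {
        (Some(n), _) => Some(n / num_bins),
        (None, Some(p)) => Some((p * total_records as f64 / num_bins as f64).floor() as u64),
        (None, None) => None,
    }
}

fn main() {
    // --record-count 10000 across 25 tile bins -> 400 records per tile
    println!("{:?}", per_tile_count(Some(10_000), None, 0, 25));
    // --probability 0.1 of 1,000,000 records across 32 bins -> 3125
    println!("{:?}", per_tile_count(None, Some(0.1), 1_000_000, 32));
}
```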

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major refactor of per-tile binning to use temporary files instead of
in-memory bitmaps. This reduces memory usage from O(total_records) to
O(records_per_tile * num_tiles_sampled).

Changes:
- First pass writes records to per-tile temp files
- Skip-ahead sampling (default): uses exponentially distributed byte
  jumps through temp files for O(n_sampled) I/O
- Exact indexed sampling (--exact): builds record offset index for
  precise seeking
- --sampling-threads: parallel tile sampling via std::thread::scope
- --compression-threads: parallel gzip output via concatenated streams
- --record-count without --exact on uncompressed input also uses
  skip-ahead
- Non-tile --record-count on gzipped input falls back to exact mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --in-memory is used with tile binning, records are kept in RAM
organized by tile instead of writing to temporary files. This trades
memory for disk I/O and avoids temporary file creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change default behavior to exact sampling (backward compatible).
The --fast flag opts into skip-ahead approximate sampling.

Add --temp-dir option to specify where tile temp files are written
(defaults to system temp directory).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip-ahead sampling cannot reliably pair R1 and R2 records because
it maps byte positions proportionally, which breaks when R1 and R2
records have different sizes. Paired-end now always uses exact
indexed sampling regardless of the --fast flag.
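The failure mode described here can be shown with made-up offsets: proportional byte mapping lands the R2 read inside the wrong record whenever record sizes differ between mates. The offsets and helper below are purely illustrative.

```rust
/// Find which record a byte offset falls in (last offset <= byte).
fn record_at(offsets: &[u64], byte: u64) -> usize {
    match offsets.binary_search(&byte) {
        Ok(i) => i,
        Err(i) => i - 1,
    }
}

fn main() {
    // Both files are 80 bytes total, but R2's first record is longer.
    let r1_offsets = [0u64, 20, 40, 60]; // uniform 20-byte R1 records
    let r2_offsets = [0u64, 50, 60, 70]; // variable-size R2 records
    let r1_byte = r1_offsets[2]; // byte offset of R1 record 2
    let r2_byte = r1_byte * 80 / 80; // proportional byte mapping -> byte 40
    // R1 record 2 maps to R2 record 0: the mates are no longer paired.
    println!("{} vs {}", record_at(&r1_offsets, r1_byte), record_at(&r2_offsets, r2_byte));
}
```

Mapping in record-index space (record i in R1 pairs with record i in R2) avoids this entirely, which is why paired-end input forces exact indexed sampling.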

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Empty input now produces empty output instead of crashing with
  "invalid uniform range" from Uniform::new(0, 0)
- .fai index counts are cross-checked against actual line counts for
  uncompressed files (cheap); gzipped files trust the index since
  decompressing to cross-check would negate the performance benefit

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the collect-then-concatenate pattern with a bounded
mpsc channel. Sampling threads send per-tile results to a writer
thread that writes directly to the output file(s), avoiding holding
all sampled data in memory at once.
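The bounded-channel pattern described here can be sketched with std primitives only. Tile counts, channel capacity, and function names are illustrative; the real code presumably sends sampled record data rather than strings.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sketch of the writer-thread pattern: per-tile sampling threads send
/// results through a bounded channel to one writer, so at most the
/// channel capacity's worth of tile results is in flight at once.
fn run_pipeline(num_tiles: u64, per_tile: usize) -> usize {
    let (tx, rx) = sync_channel::<(u64, Vec<String>)>(4); // bounded: capacity 4
    let writer = thread::spawn(move || {
        let mut written = 0;
        for (_bin_key, records) in rx {
            written += records.len(); // a real writer appends to the output file(s)
        }
        written
    });
    thread::scope(|s| {
        for bin_key in 0..num_tiles {
            let tx = tx.clone();
            s.spawn(move || {
                // Stand-in for sampling this tile's temp file.
                let sampled = vec![format!("@tile{bin_key}"); per_tile];
                tx.send((bin_key, sampled)).unwrap(); // blocks if channel is full
            });
        }
    });
    drop(tx); // close the channel so the writer loop ends
    writer.join().unwrap()
}

fn main() {
    println!("{}", run_pipeline(8, 10));
}
```

Because `sync_channel` blocks senders when full, memory stays bounded even if the writer (disk I/O) is slower than the samplers.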

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove .bgz from is_gzipped() to match fastq::fs which only
  recognizes .gz
- Tile processing order is deterministic: tiles sorted by bin_key,
  per-tile RNG derived from base_seed + bin_key (no shared mutex)
- Temp file creation errors propagate as SubsampleError instead of
  panicking via unwrap()
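Deterministic per-tile seeding as described above can be sketched like this. The mixing function (splitmix64) is an assumption for illustration; the actual code presumably seeds a rand-crate RNG, but the property shown — same inputs give the same stream, no shared mutex needed — is the point.

```rust
/// Sketch of per-tile seed derivation: base_seed + bin_key, then a
/// splitmix64 mix so that adjacent bin keys do not yield correlated
/// RNG streams. Each tile thread can seed its own RNG independently.
fn tile_seed(base_seed: u64, bin_key: u64) -> u64 {
    let mut z = base_seed
        .wrapping_add(bin_key)
        .wrapping_add(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

fn main() {
    // Deterministic: the same (seed, tile) always yields the same value,
    // and different tiles diverge.
    println!("{}", tile_seed(7, 1101) == tile_seed(7, 1101));
    println!("{}", tile_seed(7, 1101) != tile_seed(7, 1102));
}
```

Combined with sorting tiles by bin_key, this makes the whole sampling pass reproducible regardless of thread scheduling.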

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RecordIndex builds an offset index in a single scan, replacing the
separate count_lines, build_record_index, and estimate_avg_record_size
functions. IndexedReader wraps a seekable reader + index and provides:
  - read_record_at(i): seek to known offset, no boundary scanning
  - read_records_at(indices): batch sorted reads
  - skip_ahead_sample(target, rng): geometric jumps in record space

This eliminates the byte-level boundary detection (find_record_after)
that was unreliable for paired-end data. Skip-ahead now jumps between
index entries instead of byte positions.

Removed from subsample.rs: skip_ahead_sample_file, build_record_index,
read_record_at, find_record_after, exponential_sample,
estimate_avg_record_size, count_records_from_index.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip-ahead no longer always starts at record 0; the first jump is
  drawn before selecting any record
- Tile processing order is deterministic: tiles sorted by bin_key,
  per-tile RNG derived from base_seed + bin_key (no shared mutex)
- Remove .bgz from is_gzipped() to match fastq::fs which only
  recognizes .gz
- Temp file creation errors propagate as SubsampleError instead of
  panicking via unwrap()

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
jdidion force-pushed the feat/indexed-reader branch from 34b771b to 01825ae on March 15, 2026 at 16:03