fastq: Add RecordIndex and IndexedReader for O(1) random access#56
Open
jdidion wants to merge 13 commits intostjude-rust-labs:masterfrom
Open
fastq: Add RecordIndex and IndexedReader for O(1) random access#56jdidion wants to merge 13 commits intostjude-rust-labs:masterfrom
jdidion wants to merge 13 commits intostjude-rust-labs:masterfrom
Conversation
Add `--record-count-per-tile` option to the subsample command. This bins reads by their lane and tile from the Illumina read header, discards bins with fewer than the specified number of records, and randomly subsamples exactly that many records from each retained bin. This normalizes FASTQ files based on physical sequencing location to mitigate batch effects from uneven coverage or flow cell edge effects. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --bin-by-tile is combined with --record-count or --probability, the per-tile count is computed automatically: - --record-count N: per_tile = N / num_bins - --probability P: per_tile = floor(P * total_records / num_bins) The --record-count-per-tile option remains for explicit per-tile counts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Major refactor of per-tile binning to use temporary files instead of in-memory bitmaps. This reduces memory usage from O(total_records) to O(records_per_tile * num_tiles_sampled). Changes: - First pass writes records to per-tile temp files - Skip-ahead sampling (default): uses exponential-distributed byte jumps through temp files for O(n_sampled) I/O - Exact indexed sampling (--exact): builds record offset index for precise seeking - --sampling-threads: parallel tile sampling via std::thread::scope - --compression-threads: parallel gzip output via concatenated streams - --record-count without --exact on uncompressed input also uses skip-ahead - Non-tile --record-count on gzipped input falls back to exact mode Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When --in-memory is used with tile binning, records are kept in RAM organized by tile instead of writing to temporary files. This trades memory for disk I/O and avoids temporary file creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Change default behavior to exact sampling (backward compatible). The --fast flag opts into skip-ahead approximate sampling. Add --temp-dir option to specify where tile temp files are written (defaults to system temp directory). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skip-ahead sampling cannot reliably pair R1 and R2 records because it maps byte positions proportionally, which breaks when R1 and R2 records have different sizes. Paired-end now always uses exact indexed sampling regardless of the --fast flag. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Empty input now produces empty output instead of crashing with "invalid uniform range" from Uniform::new(0, 0) - .fai index counts are cross-checked against actual line counts for uncompressed files (cheap); gzipped files trust the index since decompressing to cross-check would negate the performance benefit Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the collect-then-concatenate pattern with a bounded mpsc channel. Sampling threads send per-tile results to a writer thread that writes directly to the output file(s), avoiding holding all sampled data in memory at once. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz - Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex) - Temp file creation errors propagate as SubsampleError instead of panicking via unwrap() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
RecordIndex builds an offset index in a single scan, replacing the separate count_lines, build_record_index, and estimate_avg_record_size functions. IndexedReader wraps a seekable reader + index and provides: - read_record_at(i): seek to known offset, no boundary scanning - read_records_at(indices): batch sorted reads - skip_ahead_sample(target, rng): geometric jumps in record space This eliminates the byte-level boundary detection (find_record_after) that was unreliable for paired-end data. Skip-ahead now jumps between index entries instead of byte positions. Removed from subsample.rs: skip_ahead_sample_file, build_record_index, read_record_at, find_record_after, exponential_sample, estimate_avg_record_size, count_records_from_index. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip-ahead no longer always starts at record 0; the first jump is drawn before selecting any record - Tile processing order is deterministic: tiles sorted by bin_key, per-tile RNG derived from base_seed + bin_key (no shared mutex) - Remove .bgz from is_gzipped() to match fastq::fs which only recognizes .gz - Temp file creation errors propagate as SubsampleError instead of panicking via unwrap() Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
34b771b to
01825ae
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Introduces
RecordIndexandIndexedReaderinsrc/fastq/io/, replacing several ad-hoc functions in the subsample command with a unified indexed FASTQ reader.RecordIndexBuilt by a single scan of an uncompressed FASTQ file. Stores the byte offset of every record start. Provides:
record_count()— O(1)avg_record_size()— O(1)offset(i)— O(1) lookupfrom_fai(path)— load record count from a.faisidecarbuild_from_path(path)— scan and index a file on diskIndexedReaderWraps a seekable reader +
RecordIndex. Provides:read_record_at(i)— seek to known offset, decode recordread_records_at(indices, callback)— batch reads at sorted indicesskip_ahead_sample(target, rng, callback)— geometric jumps in record index space (not byte space), eliminating boundary detectionWhat this replaces
count_lines+/ 4RecordIndex::build_from_path().record_count()build_record_indexRecordIndex::build_from_path()estimate_avg_record_sizeRecordIndex::build_from_path().avg_record_size()count_records_from_indexRecordIndex::from_fai()find_record_after(byte-level boundary scan)IndexedReader::read_record_at()(direct seek)skip_ahead_sample_file(byte jumps + boundary scan)IndexedReader::skip_ahead_sample()(index jumps)read_record_at(raw byte reads)IndexedReader::read_record_at()(returnsRecord)Why this matters
@as record starts.RecordIndexandIndexedReaderare infastq::io, available to any command.count_linesandopen_maybe_gzare retained for the gzipped-input fallback path insubsample_exact, since gzipped files can't be seeked.References
The per-tile binning method implemented in PR #44 (which this PR supports) is described in:
Test plan
RecordIndex::build— basic, empty, single record, variable lengthRecordIndex::build_from_path— file on diskRecordIndex::from_fai— found and missing casesIndexedReader::read_record_at— random access by indexIndexedReader::read_records_at— batch readsIndexedReader::skip_ahead_sample— approximate target countIndexedReader::open— from file path🤖 Generated with Claude Code