Replace the greedy bin-packing compute_groups with the upstream linesweep + topological sort approach. Overlapping same-time ancestors are now placed in the same group so they don't match against each other.
- Add tsinfer/grouping.py with merge_overlapping_ancestors, run_linesweep (@numba.njit), find_groups (@numba.njit), and group_ancestors_by_linesweep
- Rewrite compute_groups in matching.py to delegate to the linesweep; remove _intervals_overlap and _split_by_overlap
- Add tests/test_grouping.py with merging, grouping (fixed + 500 random seeds), and cycle-detection tests
- Update TestComputeGroups assertions for the new same-group semantics
- Add numba to pyproject.toml dependencies
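The core linesweep idea behind merge_overlapping_ancestors can be sketched without the numba machinery: sort intervals by start, then extend the current run while the next interval overlaps it. This is a simplified illustration, not the actual tsinfer implementation; the function name and types here are made up for the example.

```python
def merge_overlapping(intervals):
    """Linesweep merge: sort by start position, then fold each interval
    into the current run if it overlaps, or start a new run otherwise."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the current run: extend its right edge.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# (0,5) and (3,8) overlap and collapse into one run; (10,12) stands alone.
print(merge_overlapping([(0, 5), (3, 8), (10, 12)]))
# → [(0, 8), (10, 12)]
```

Ancestors whose intervals fall in the same merged run at the same time would then land in the same group, which is what prevents them from matching against each other.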
Wires together _collect_haplotypes and compute_groups into a new compute_groups_json() pipeline function and exposes it as the `compute-groups` CLI command, which outputs the group structure as JSON to stdout. Includes info-level logging with timing for each step (loading, grouping, serialisation). This serves as both an inspection tool and a stepping stone toward the distributed match-init workflow.
Move compute_groups from matching.py into grouping.py and add a lightweight metadata loader (collect_haplotype_metadata) that reads only 1-D arrays — no call_genotype data — making compute_groups_json much cheaper. The match() pipeline now derives its haplotype ordering from groups (computed internally or loaded from a JSON file via the new cfg.match.groups field), replacing the old _order_haplotypes helper. CLI changes: compute-groups gains --output/-o to write to a file; match gains --groups to accept a pre-computed groups JSON.
Replace the nested group-centric JSON format with a flat list of per-haplotype MatchJob records (source, sample_id, ploidy_index, time, positions, group). Remove redundant is_ancestor field from MatchJob since it's derivable from source == "ancestors".
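A flat per-haplotype record like the one described could look as follows. The field names come from the commit message; the concrete types and the dataclass shape are assumptions for illustration.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class MatchJob:
    # Field names per the flat record format above; types are assumed.
    source: str          # e.g. "ancestors" or "samples"
    sample_id: int
    ploidy_index: int
    time: float
    positions: list
    group: int

    @property
    def is_ancestor(self) -> bool:
        # Derivable from source, so it need not be stored in the record.
        return self.source == "ancestors"

job = MatchJob("ancestors", 0, 0, 2.5, [100, 200], 1)
print(json.dumps(asdict(job)))
```

Dropping the stored is_ancestor field keeps the JSON free of redundancy: a reader reconstructs it from `source` alone.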
Replace the eager _collect_haplotypes() (which loaded all genotype data into a single in-memory matrix) with a HaplotypeReader that lazily reads and encodes one haplotype at a time. match() now iterates MatchJob objects directly, eliminating the large upfront allocation.
- vcz.py: Add HaplotypeReader class with _SourceContext caching
- pipeline.py: Rewrite match() around MatchJob + HaplotypeReader; delete _HaplotypeData, _collect_haplotypes, _load_groups
- test_vcz.py: Add TestHaplotypeReader (7 tests)
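The lazy-reading pattern can be sketched in a few lines. This is not the real HaplotypeReader (which reads from a zarr store and caches per-source context); a plain dict stands in for the genotype source, and the method name `read` is an assumption.

```python
import numpy as np

class HaplotypeReader:
    """Sketch: decode one haplotype per request instead of materialising
    an (n_haplotypes, n_sites) matrix up front."""

    def __init__(self, store):
        # store maps sample_id -> per-ploidy genotype rows (stand-in for zarr).
        self.store = store

    def read(self, sample_id, ploidy_index):
        # Only this one row is decoded and held in memory.
        return np.asarray(self.store[sample_id][ploidy_index], dtype=np.int8)

store = {0: [[0, 1, 1], [1, 0, 0]]}
reader = HaplotypeReader(store)
print(reader.read(0, 1))
# → [1 0 0]
```

The matching loop can then walk a list of MatchJob records and call the reader per job, so peak memory is one haplotype rather than the whole matrix.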
Write contig_id, contig_length, and variant_contig arrays in AncestorWriter and write_empty_ancestor_vcz so that vcztools view can produce valid VCF output from ancestor stores. The contig info is read from the source store in infer_ancestors and forwarded through.
Codecov Report
❌ Patch coverage is

@@           Coverage Diff            @@
##           dev-1.0    #1126   +/-  ##
========================================
  Coverage         ?   84.49%
========================================
  Files            ?       17
  Lines            ?     4483
  Branches         ?      713
========================================
  Hits             ?     3788
  Misses           ?      560
  Partials         ?      135
Variant-dimensioned arrays (variant_position, variant_allele, variant_contig) used zarr's default chunk size while call_genotype used the configured variants_chunk_size, causing vcztools view to fail with "Array chrom must have first dimension equal to number of variants" when iterating chunks.
Split _build_one_ancestor into _call_make_ancestor (worker thread, GIL-releasing C only) and _finish_ancestor (main thread, Python post-processing). Array allocation and focal_sites conversion also move to the main thread before submission. This reduces GIL contention as ancestor work gets smaller later in the build.
The main thread was spending ~46s in synchronous zarr appends per interval, blocking submission of new make_ancestor work. Move the zarr I/O to a dedicated writer thread fed by a bounded queue (maxsize=2). The main thread now hands off assembled numpy arrays and immediately returns to draining/submitting futures. Also adds per-interval timing instrumentation (wait, finish, write, submit, make_ancestor stats) logged at INFO level to diagnose throughput bottlenecks.
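The writer-thread handoff described above is the standard bounded-queue producer/consumer pattern. A minimal sketch, with a list append standing in for the zarr write and `None` used as a shutdown sentinel:

```python
import queue
import threading

write_q = queue.Queue(maxsize=2)  # bounded: producer blocks only when writer lags
results = []

def writer():
    # Dedicated thread doing the slow appends (zarr I/O in the real code).
    while True:
        item = write_q.get()
        if item is None:  # sentinel: shut the writer down
            break
        results.append(item)  # stand-in for a zarr append
        write_q.task_done()

t = threading.Thread(target=writer)
t.start()
for i in range(5):
    write_q.put(i)  # main thread hands off and immediately returns to submitting
write_q.put(None)
t.join()
print(results)
# → [0, 1, 2, 3, 4]
```

The maxsize=2 bound keeps memory in check: at most two assembled intervals wait for the writer before the main thread is forced to pause.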
Eliminates per-ancestor np.full allocation (~172K allocations of n_ab_sites arrays) by maintaining a pool of max_queued reusable arrays. Arrays are returned to the pool after _finish_ancestor copies out the active fragment, and reset with memset on reuse.
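The pooling pattern can be sketched like this; the class and method names are invented for illustration, not the actual tsinfer API:

```python
import numpy as np

class ArrayPool:
    """Keep max_queued preallocated arrays and recycle them, instead of
    paying an np.full allocation per ancestor."""

    def __init__(self, max_queued, shape, fill=-1, dtype=np.int8):
        self.fill = fill
        self.free = [np.full(shape, fill, dtype=dtype) for _ in range(max_queued)]

    def acquire(self):
        return self.free.pop()

    def release(self, arr):
        arr.fill(self.fill)  # reset in place (memset-style) before reuse
        self.free.append(arr)

pool = ArrayPool(2, shape=(4,))
a = pool.acquire()
a[:2] = 1
pool.release(a)
b = pool.acquire()  # same buffer back, reset to the fill value
print(b)
# → [-1 -1 -1 -1]
```

Since the consumer copies out the active fragment before release, reusing the backing buffer is safe and the steady-state allocation count drops from one per ancestor to max_queued total.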
Make the blosc compressor (cname and clevel) configurable via AncestorsConfig, defaulting to zstd with bitshuffle at level 7. Add total wall-clock time per interval to the ancestor build timing log.
Merge _call_make_ancestor and _finish_ancestor into a single _build_ancestor worker function. The lightweight numpy slicing and Ancestor construction now runs in worker threads alongside the GIL-released C make_ancestor call, freeing the main thread to focus on dispatching and writing. Also remove the unnecessary memset and focal_sites array conversion from the submit path.
Progress on computing match groups across different data sources via the existing linesweep approach.