Groups#1126

Open
jeromekelleher wants to merge 16 commits into tskit-dev:dev-1.0 from jeromekelleher:groups

Conversation

@jeromekelleher jeromekelleher commented Mar 16, 2026

Progress on computing match groups across different data sources via the existing linesweep approach.

Replace the greedy bin-packing compute_groups with the upstream
linesweep + topological sort approach. Overlapping same-time ancestors
are now placed in the same group so they don't match against each other.

- Add tsinfer/grouping.py with merge_overlapping_ancestors,
  run_linesweep (@numba.njit), find_groups (@numba.njit), and
  group_ancestors_by_linesweep
- Rewrite compute_groups in matching.py to delegate to the linesweep;
  remove _intervals_overlap and _split_by_overlap
- Add tests/test_grouping.py with merging, grouping (fixed + 500
  random seeds), and cycle-detection tests
- Update TestComputeGroups assertions for the new same-group semantics
- Add numba to pyproject.toml dependencies
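A minimal pure-Python sketch of the grouping idea described above, assuming ancestors are `(start, end, time)` intervals. The real code uses numba-jitted `run_linesweep`/`find_groups` plus a topological sort; the function names and simplifications here are illustrative only:

```python
# Hypothetical sketch: overlapping same-time ancestors must land in the
# same group so they are never matched against each other. This is a
# simplified stand-in, not the actual tsinfer/grouping.py implementation.

def merge_overlapping(intervals):
    """Merge overlapping [start, end) intervals via a classic linesweep."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous run: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def group_same_time_ancestors(ancestors):
    """Map ancestor index -> group id; overlapping same-time ancestors share a group."""
    by_time = {}
    for idx, (start, end, time) in enumerate(ancestors):
        by_time.setdefault(time, []).append((start, end, idx))
    groups = {}
    group_id = 0
    for time in sorted(by_time, reverse=True):  # oldest ancestors first
        current_end = None
        for start, end, idx in sorted(by_time[time]):
            if current_end is None or start >= current_end:
                # No overlap with the running cluster: start a new group.
                if current_end is not None:
                    group_id += 1
                current_end = end
            else:
                current_end = max(current_end, end)
            groups[idx] = group_id
        group_id += 1
    return groups
```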

Wires together _collect_haplotypes and compute_groups into a new
compute_groups_json() pipeline function and exposes it as the
`compute-groups` CLI command, which outputs the group structure as
JSON to stdout. Includes info-level logging with timing for each step
(loading, grouping, serialisation). This serves as both an inspection
tool and a stepping stone toward the distributed match-init workflow.

Move compute_groups from matching.py into grouping.py and add a
lightweight metadata loader (collect_haplotype_metadata) that reads
only 1-D arrays — no call_genotype data — making compute_groups_json
much cheaper. The match() pipeline now derives its haplotype ordering
from groups (computed internally or loaded from a JSON file via the
new cfg.match.groups field), replacing the old _order_haplotypes
helper.

CLI changes: compute-groups gains --output/-o to write to a file;
match gains --groups to accept a pre-computed groups JSON.
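The serialisation half of the pipeline might look roughly like the sketch below, assuming a `{group_id: [ancestor_ids]}` mapping; the function signature, field layout, and log messages are illustrative, not the actual tsinfer API:

```python
import json
import logging
import sys
import time

logger = logging.getLogger(__name__)

def compute_groups_json(groups, output=None):
    """Serialise a {group_id: [ancestor_ids]} mapping to JSON.

    Hypothetical stand-in for the real compute_groups_json pipeline
    function: JSON goes to stdout by default, or to `output` when the
    CLI passes --output/-o.
    """
    t0 = time.perf_counter()
    text = json.dumps({str(g): ids for g, ids in sorted(groups.items())}, indent=2)
    # Info-level timing, mirroring the per-step logging described above.
    logger.info("serialisation took %.3fs", time.perf_counter() - t0)
    if output is None:
        sys.stdout.write(text + "\n")
    else:
        with open(output, "w") as f:
            f.write(text)
    return text
```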

Replace the nested group-centric JSON format with a flat list of
per-haplotype MatchJob records (source, sample_id, ploidy_index, time,
positions, group). Remove redundant is_ancestor field from MatchJob
since it's derivable from source == "ancestors".
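A sketch of the flat record shape, with field names taken from the commit message; the dataclass itself is a hypothetical stand-in for the real MatchJob:

```python
from dataclasses import asdict, dataclass

@dataclass
class MatchJob:
    """One per-haplotype match record (illustrative, not the real class)."""
    source: str          # e.g. "ancestors" or a sample store name
    sample_id: int
    ploidy_index: int
    time: float
    positions: list
    group: int

    @property
    def is_ancestor(self):
        # No stored flag: derivable from the source name, as described above.
        return self.source == "ancestors"
```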

Replace eager _collect_haplotypes() (which loaded all genotype data into
a single in-memory matrix) with a HaplotypeReader that lazily reads and
encodes one haplotype at a time. match() now iterates MatchJob objects
directly, eliminating the large upfront allocation.

- vcz.py: Add HaplotypeReader class with _SourceContext caching
- pipeline.py: Rewrite match() around MatchJob + HaplotypeReader,
  delete _HaplotypeData, _collect_haplotypes, _load_groups
- test_vcz.py: Add TestHaplotypeReader (7 tests)
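The lazy-reading pattern can be sketched as below, assuming one cached store per source (playing the _SourceContext role) and a MatchJob-like mapping; all names here are illustrative, not the actual vcz.py API:

```python
class HaplotypeReader:
    """Lazily read and encode one haplotype at a time (illustrative sketch)."""

    def __init__(self, load_source):
        self._load_source = load_source  # source name -> array-like store
        self._contexts = {}              # cached per-source state

    def _context(self, source):
        # Open each source once and reuse it across jobs.
        if source not in self._contexts:
            self._contexts[source] = self._load_source(source)
        return self._contexts[source]

    def haplotype(self, job):
        """Read and encode a single haplotype for one MatchJob-like record."""
        store = self._context(job["source"])
        row = store[job["sample_id"]]
        # Encode on the fly instead of building one big upfront matrix;
        # here negative calls are normalised to -1 as a placeholder rule.
        return [allele if allele >= 0 else -1 for allele in row]
```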

Write contig_id, contig_length, and variant_contig arrays in
AncestorWriter and write_empty_ancestor_vcz so that vcztools view
can produce valid VCF output from ancestor stores. The contig info
is read from the source store in infer_ancestors and forwarded through.

@codecov codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 82.26351% with 105 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dev-1.0@45e5494).

Additional details and impacted files
@@            Coverage Diff             @@
##             dev-1.0    #1126   +/-   ##
==========================================
  Coverage           ?   84.49%           
==========================================
  Files              ?       17           
  Lines              ?     4483           
  Branches           ?      713           
==========================================
  Hits               ?     3788           
  Misses             ?      560           
  Partials           ?      135           
Flag Coverage Δ
C 81.57% <ø> (?)
c-python 58.31% <ø> (?)
python-tests 89.00% <82.26%> (?)

Flags with carried forward coverage won't be shown.

Components Coverage Δ
Python API 89.00% <0.00%> (?)
Python C interface 66.66% <0.00%> (?)
C library 88.68% <0.00%> (?)

Variant-dimensioned arrays (variant_position, variant_allele,
variant_contig) used zarr's default chunk size while call_genotype
used the configured variants_chunk_size, causing vcztools view to
fail with "Array chrom must have first dimension equal to number of
variants" when iterating chunks. Write these arrays with the
configured variants_chunk_size so their chunk boundaries agree.

Split _build_one_ancestor into _call_make_ancestor (worker thread,
GIL-releasing C only) and _finish_ancestor (main thread, Python
post-processing). Array allocation and focal_sites conversion also
move to the main thread before submission. This reduces GIL
contention as ancestor work gets smaller later in the build.

The main thread was spending ~46s in synchronous zarr appends per
interval, blocking submission of new make_ancestor work. Move the
zarr I/O to a dedicated writer thread fed by a bounded queue
(maxsize=2). The main thread now hands off assembled numpy arrays
and immediately returns to draining/submitting futures.

Also adds per-interval timing instrumentation (wait, finish, write,
submit, make_ancestor stats) logged at INFO level to diagnose
throughput bottlenecks.
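The dedicated-writer pattern described above can be sketched as follows; the function and parameter names are illustrative, not the actual tsinfer internals:

```python
import queue
import threading

def run_with_writer(batches, append, maxsize=2):
    """Hand batches to a writer thread through a bounded queue.

    The bounded queue (maxsize=2, as in the commit) applies back-pressure:
    the producer blocks only when the writer falls behind, and otherwise
    returns immediately to draining/submitting work.
    """
    q = queue.Queue(maxsize=maxsize)
    SENTINEL = object()

    def writer():
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            append(item)  # slow zarr-style I/O, off the main thread

    t = threading.Thread(target=writer)
    t.start()
    for batch in batches:
        q.put(batch)      # blocks only when the queue is full
    q.put(SENTINEL)       # signal the writer to drain and exit
    t.join()
```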

Eliminates per-ancestor np.full allocation (~172K allocations of
n_ab_sites arrays) by maintaining a pool of max_queued reusable
arrays. Arrays are returned to the pool after _finish_ancestor
copies out the active fragment, and reset with memset on reuse.
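The pool idea can be sketched as below (the class name and the in-place `fill` reset stand in for the memset described above; this is not the actual implementation):

```python
import numpy as np

class ArrayPool:
    """Fixed pool of reusable arrays, reset on reuse (illustrative sketch)."""

    def __init__(self, max_queued, n_sites, fill=-1):
        self._fill = fill
        # One allocation up front instead of one np.full per ancestor.
        self._free = [np.full(n_sites, fill, dtype=np.int8) for _ in range(max_queued)]

    def acquire(self):
        arr = self._free.pop()   # in the real code, pool exhaustion is bounded by max_queued
        arr.fill(self._fill)     # reset in place rather than reallocating
        return arr

    def release(self, arr):
        # Return the array once its active fragment has been copied out.
        self._free.append(arr)
```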

Make the blosc compressor (cname and clevel) configurable via
AncestorsConfig, defaulting to zstd with bitshuffle at level 7.
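The configurable fields might look roughly like this sketch; the field names follow the commit message and the defaults follow its description, but the actual AncestorsConfig shape is assumed:

```python
from dataclasses import dataclass

@dataclass
class AncestorsConfig:
    """Compressor settings for the ancestor store (illustrative sketch)."""
    cname: str = "zstd"         # blosc codec name
    clevel: int = 7             # compression level
    shuffle: str = "bitshuffle" # byte/bit shuffle filter
```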

Add total wall-clock time per interval to the ancestor build timing log.

Merge _call_make_ancestor and _finish_ancestor into a single
_build_ancestor worker function. The lightweight numpy slicing and
Ancestor construction now runs in worker threads alongside the
GIL-released C make_ancestor call, freeing the main thread to focus
on dispatching and writing. Also remove the unnecessary memset and
focal_sites array conversion from the submit path.