Groups#1126

Open
jeromekelleher wants to merge 16 commits into tskit-dev:dev-1.0 from jeromekelleher:groups

Conversation

@jeromekelleher jeromekelleher commented Mar 16, 2026

Progress on computing match groups across different data sources via the existing linesweep approach.

Replace the greedy bin-packing compute_groups with the upstream
linesweep + topological sort approach. Overlapping same-time ancestors
are now placed in the same group so they don't match against each other.

- Add tsinfer/grouping.py with merge_overlapping_ancestors,
  run_linesweep (@numba.njit), find_groups (@numba.njit), and
  group_ancestors_by_linesweep
- Rewrite compute_groups in matching.py to delegate to the linesweep;
  remove _intervals_overlap and _split_by_overlap
- Add tests/test_grouping.py with merging, grouping (fixed + 500
  random seeds), and cycle-detection tests
- Update TestComputeGroups assertions for the new same-group semantics
- Add numba to pyproject.toml dependencies
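A minimal pure-Python sketch of the grouping idea described above, assuming ancestors are `(start, end, time)` intervals. The real code uses numba-jitted `run_linesweep`/`find_groups` plus a topological sort; the function names and simplifications here are illustrative only:

```python
# Hypothetical sketch: overlapping same-time ancestors must land in the
# same group so they are never matched against each other. This is a
# simplified stand-in, not the actual tsinfer/grouping.py implementation.

def merge_overlapping(intervals):
    """Merge overlapping [start, end) intervals via a classic linesweep."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous run: extend it in place.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def group_same_time_ancestors(ancestors):
    """Map ancestor index -> group id; overlapping same-time ancestors share a group."""
    by_time = {}
    for idx, (start, end, time) in enumerate(ancestors):
        by_time.setdefault(time, []).append((start, end, idx))
    groups = {}
    group_id = 0
    for time in sorted(by_time, reverse=True):  # oldest ancestors first
        current_end = None
        for start, end, idx in sorted(by_time[time]):
            if current_end is None or start >= current_end:
                # No overlap with the running cluster: start a new group.
                if current_end is not None:
                    group_id += 1
                current_end = end
            else:
                current_end = max(current_end, end)
            groups[idx] = group_id
        group_id += 1
    return groups
```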

Wires together _collect_haplotypes and compute_groups into a new
compute_groups_json() pipeline function and exposes it as the
`compute-groups` CLI command, which outputs the group structure as
JSON to stdout. Includes info-level logging with timing for each step
(loading, grouping, serialisation). This serves as both an inspection
tool and a stepping stone toward the distributed match-init workflow.

Move compute_groups from matching.py into grouping.py and add a
lightweight metadata loader (collect_haplotype_metadata) that reads
only 1-D arrays — no call_genotype data — making compute_groups_json
much cheaper. The match() pipeline now derives its haplotype ordering
from groups (computed internally or loaded from a JSON file via the
new cfg.match.groups field), replacing the old _order_haplotypes
helper.

CLI changes: compute-groups gains --output/-o to write to a file;
match gains --groups to accept a pre-computed groups JSON.
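The serialisation half of the pipeline might look roughly like the sketch below, assuming a `{group_id: [ancestor_ids]}` mapping; the function signature, field layout, and log messages are illustrative, not the actual tsinfer API:

```python
import json
import logging
import sys
import time

logger = logging.getLogger(__name__)

def compute_groups_json(groups, output=None):
    """Serialise a {group_id: [ancestor_ids]} mapping to JSON.

    Hypothetical stand-in for the real compute_groups_json pipeline
    function: JSON goes to stdout by default, or to `output` when the
    CLI passes --output/-o.
    """
    t0 = time.perf_counter()
    text = json.dumps({str(g): ids for g, ids in sorted(groups.items())}, indent=2)
    # Info-level timing, mirroring the per-step logging described above.
    logger.info("serialisation took %.3fs", time.perf_counter() - t0)
    if output is None:
        sys.stdout.write(text + "\n")
    else:
        with open(output, "w") as f:
            f.write(text)
    return text
```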

Replace the nested group-centric JSON format with a flat list of
per-haplotype MatchJob records (source, sample_id, ploidy_index, time,
positions, group). Remove redundant is_ancestor field from MatchJob
since it's derivable from source == "ancestors".
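A sketch of the flat record shape, with field names taken from the commit message; the dataclass itself is a hypothetical stand-in for the real MatchJob:

```python
from dataclasses import asdict, dataclass

@dataclass
class MatchJob:
    """One per-haplotype match record (illustrative, not the real class)."""
    source: str          # e.g. "ancestors" or a sample store name
    sample_id: int
    ploidy_index: int
    time: float
    positions: list
    group: int

    @property
    def is_ancestor(self):
        # No stored flag: derivable from the source name, as described above.
        return self.source == "ancestors"
```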

Replace eager _collect_haplotypes() (which loaded all genotype data into
a single in-memory matrix) with a HaplotypeReader that lazily reads and
encodes one haplotype at a time. match() now iterates MatchJob objects
directly, eliminating the large upfront allocation.

- vcz.py: Add HaplotypeReader class with _SourceContext caching
- pipeline.py: Rewrite match() around MatchJob + HaplotypeReader,
  delete _HaplotypeData, _collect_haplotypes, _load_groups
- test_vcz.py: Add TestHaplotypeReader (7 tests)
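The lazy-reading pattern can be sketched as below, assuming one cached store per source (playing the _SourceContext role) and a MatchJob-like mapping; all names here are illustrative, not the actual vcz.py API:

```python
class HaplotypeReader:
    """Lazily read and encode one haplotype at a time (illustrative sketch)."""

    def __init__(self, load_source):
        self._load_source = load_source  # source name -> array-like store
        self._contexts = {}              # cached per-source state

    def _context(self, source):
        # Open each source once and reuse it across jobs.
        if source not in self._contexts:
            self._contexts[source] = self._load_source(source)
        return self._contexts[source]

    def haplotype(self, job):
        """Read and encode a single haplotype for one MatchJob-like record."""
        store = self._context(job["source"])
        row = store[job["sample_id"]]
        # Encode on the fly instead of building one big upfront matrix;
        # here negative calls are normalised to -1 as a placeholder rule.
        return [allele if allele >= 0 else -1 for allele in row]
```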

Write contig_id, contig_length, and variant_contig arrays in
AncestorWriter and write_empty_ancestor_vcz so that vcztools view
can produce valid VCF output from ancestor stores. The contig info
is read from the source store in infer_ancestors and forwarded through.

@codecov codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 82.26351% with 105 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (dev-1.0@45e5494).

Additional details and impacted files
@@            Coverage Diff             @@
##             dev-1.0    #1126   +/-   ##
==========================================
  Coverage           ?   84.49%           
==========================================
  Files              ?       17           
  Lines              ?     4483           
  Branches           ?      713           
==========================================
  Hits               ?     3788           
  Misses             ?      560           
  Partials           ?      135           
Flag Coverage Δ
C 81.57% <ø> (?)
c-python 58.31% <ø> (?)
python-tests 89.00% <82.26%> (?)

Flags with carried forward coverage won't be shown.

Components Coverage Δ
Python API 89.00% <0.00%> (?)
Python C interface 66.66% <0.00%> (?)
C library 88.68% <0.00%> (?)

Variant-dimensioned arrays (variant_position, variant_allele,
variant_contig) used zarr's default chunk size while call_genotype
used the configured variants_chunk_size, causing vcztools view to
fail with "Array chrom must have first dimension equal to number of
variants" when iterating chunks. Write these arrays with the
configured variants_chunk_size so their chunk boundaries agree.

Split _build_one_ancestor into _call_make_ancestor (worker thread,
GIL-releasing C only) and _finish_ancestor (main thread, Python
post-processing). Array allocation and focal_sites conversion also
move to the main thread before submission. This reduces GIL
contention as ancestor work gets smaller later in the build.

The main thread was spending ~46s in synchronous zarr appends per
interval, blocking submission of new make_ancestor work. Move the
zarr I/O to a dedicated writer thread fed by a bounded queue
(maxsize=2). The main thread now hands off assembled numpy arrays
and immediately returns to draining/submitting futures.

Also adds per-interval timing instrumentation (wait, finish, write,
submit, make_ancestor stats) logged at INFO level to diagnose
throughput bottlenecks.
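The dedicated-writer pattern described above can be sketched as follows; the function and parameter names are illustrative, not the actual tsinfer internals:

```python
import queue
import threading

def run_with_writer(batches, append, maxsize=2):
    """Hand batches to a writer thread through a bounded queue.

    The bounded queue (maxsize=2, as in the commit) applies back-pressure:
    the producer blocks only when the writer falls behind, and otherwise
    returns immediately to draining/submitting work.
    """
    q = queue.Queue(maxsize=maxsize)
    SENTINEL = object()

    def writer():
        while True:
            item = q.get()
            if item is SENTINEL:
                break
            append(item)  # slow zarr-style I/O, off the main thread

    t = threading.Thread(target=writer)
    t.start()
    for batch in batches:
        q.put(batch)      # blocks only when the queue is full
    q.put(SENTINEL)       # signal the writer to drain and exit
    t.join()
```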

Eliminates per-ancestor np.full allocation (~172K allocations of
n_ab_sites arrays) by maintaining a pool of max_queued reusable
arrays. Arrays are returned to the pool after _finish_ancestor
copies out the active fragment, and reset with memset on reuse.
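The pool idea can be sketched as below (the class name and the in-place `fill` reset stand in for the memset described above; this is not the actual implementation):

```python
import numpy as np

class ArrayPool:
    """Fixed pool of reusable arrays, reset on reuse (illustrative sketch)."""

    def __init__(self, max_queued, n_sites, fill=-1):
        self._fill = fill
        # One allocation up front instead of one np.full per ancestor.
        self._free = [np.full(n_sites, fill, dtype=np.int8) for _ in range(max_queued)]

    def acquire(self):
        arr = self._free.pop()   # in the real code, pool exhaustion is bounded by max_queued
        arr.fill(self._fill)     # reset in place rather than reallocating
        return arr

    def release(self, arr):
        # Return the array once its active fragment has been copied out.
        self._free.append(arr)
```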

Make the blosc compressor (cname and clevel) configurable via
AncestorsConfig, defaulting to zstd with bitshuffle at level 7.
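The configurable fields might look roughly like this sketch; the field names follow the commit message and the defaults follow its description, but the actual AncestorsConfig shape is assumed:

```python
from dataclasses import dataclass

@dataclass
class AncestorsConfig:
    """Compressor settings for the ancestor store (illustrative sketch)."""
    cname: str = "zstd"         # blosc codec name
    clevel: int = 7             # compression level
    shuffle: str = "bitshuffle" # byte/bit shuffle filter
```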

Add total wall-clock time per interval to the ancestor build timing log.

Merge _call_make_ancestor and _finish_ancestor into a single
_build_ancestor worker function. The lightweight numpy slicing and
Ancestor construction now runs in worker threads alongside the
GIL-released C make_ancestor call, freeing the main thread to focus
on dispatching and writing. Also remove the unnecessary memset and
focal_sites array conversion from the submit path.