0.4.0

benjeffery released this 06 Mar 15:01

fac1fec

[0.4.0] - 2024-04-06

Changelog is relative to the last full release, 0.3.3.

Breaking Changes

tsinfer 0.4.0 infers tree sequences from on-disk or in-memory vcf-zarr datasets, allowing users to leverage optimized and parallel VCF parsing via the bio2zarr package. The SampleData file format and class are now deprecated.
If a mismatch ratio is provided to the infer command, it applies only during the match_samples phase. (#980, #981, @hyanwong)

Features

Add batch ancestor and sample matching APIs for splitting work across many independent jobs.
(#954, #917, @benjeffery)

Performance improvements

Reduce memory usage when running match_samples against large cohorts containing sequences with substantial amounts of error. (#761, @jeromekelleher)
truncate_ancestors no longer requires loading all the ancestors into RAM. (#811, @benjeffery)
Increase parallelisation of match_ancestors by generating parallel groups from their implied dependency graph. (#828, #147, @benjeffery)
Reduce memory requirements of the generate_ancestors function by providing the genotype_encoding (#809) and mmap_temp_dir (#808) options. (@jeromekelleher)
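
The match_ancestors item above refers to deriving parallel groups from a dependency graph. As a rough illustration of that general idea only, the sketch below layers a dependency graph so that every item in a group depends solely on items in earlier groups, which means each group could be processed in parallel. It is not tsinfer's implementation: the function name parallel_groups and the dictionary input format are hypothetical, chosen for this example.

```python
def parallel_groups(dependencies):
    """Split a dependency graph into groups that can each run in parallel.

    `dependencies` maps each item to the set of items it depends on
    (a hypothetical input format for this sketch). Every item in a
    returned group depends only on items in earlier groups.
    """
    # Copy the graph and make sure items that appear only as a
    # dependency are present with no dependencies of their own.
    remaining = {node: set(deps) for node, deps in dependencies.items()}
    for deps in dependencies.values():
        for dep in deps:
            remaining.setdefault(dep, set())

    groups = []
    done = set()
    while remaining:
        # An item is ready once all of its dependencies are finished.
        ready = sorted(n for n, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        groups.append(ready)
        done.update(ready)
        for node in ready:
            del remaining[node]
    return groups


# Items 0 and 1 have no dependencies; 2 and 3 depend on them; 4 depends on 2 and 3.
example = {0: set(), 1: set(), 2: {0}, 3: {0, 1}, 4: {2, 3}}
print(parallel_groups(example))  # [[0, 1], [2, 3], [4]]
```

Larger groups mean more work can be dispatched at once, so the achievable parallelism depends on how deep the dependency chains are relative to the number of items.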

Other Breaking Changes

Removed the uuid field from SampleData; equality is now based purely on the data.
A permissive JSON schema is now set on node table metadata.

Fixes

Properly account for "N" as an unknown ancestral state, and ban "" from being
set as an ancestral state (#963, @hyanwong)
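
To make the rule in this fix concrete, here is a minimal sketch of the kind of check it describes: "N" is treated as an unknown ancestral state and the empty string is rejected. The function normalise_ancestral_state is hypothetical and is not part of the tsinfer API. Using None as the "unknown" marker and raising ValueError for the empty string are also assumptions made for illustration.

```python
from typing import Optional


def normalise_ancestral_state(state: str) -> Optional[str]:
    """Return the ancestral state, or None if it is unknown.

    Hypothetical helper for illustration; not tsinfer's own code.
    """
    if state == "":
        # Assumption: a banned value is reported by raising an error.
        raise ValueError("the empty string is not a valid ancestral state")
    if state == "N":
        # Assumption: None stands in for "unknown" in this sketch.
        return None
    return state


print(normalise_ancestral_state("A"))  # A
print(normalise_ancestral_state("N"))  # None
```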
