Skip to content

Conversation

jeromekelleher
Copy link
Contributor

@jeromekelleher jeromekelleher commented May 21, 2025

Closes #232
Fixes #377

@coveralls
Copy link
Collaborator

coveralls commented May 22, 2025

Coverage Status

coverage: 98.286% (+0.07%) from 98.212%
when pulling bf317ff on jeromekelleher:finalise-tskit
into 6eb88ed on sgkit-dev:main.

@jeromekelleher jeromekelleher marked this pull request as ready for review May 22, 2025 14:51
@jeromekelleher
Copy link
Contributor Author

I think this is nearly ready to go. There's some details about how we deal with coordinate systems between tskit and VCZ which I'm going to punt on, but I think we should be able to handle the vast majority of inputs well with this.

I've added an initial framework for the Python API, just documenting a single function bio2zarr.tskit.convert. I think this is a good start (and will add the same for vcf when I get a chance). I had to skip some of the intersphinx linking for now, as the latest version of the docs don't seem to be live for tskit.

@jeromekelleher jeromekelleher requested a review from benjeffery May 22, 2025 14:56
Copy link
Contributor

@benjeffery benjeffery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks, good. One nit in the docs.

One concern I have is that a key argument to convert is a tskit VcfModelMapping (or something that duck-types to a VCFModelMapping) that is currently undocumented in tskit. We should add docs in tskit and make its constructor part of the public API.

contig_id=None,
isolated_as_missing=False,
):
import tskit

self._path = ts_path
self.ts = tskit.load(ts_path)
self._path = None # Not sure what we're using this for?
Copy link
Contributor

@benjeffery benjeffery May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To answer the comment question: The base class defines Source.path which is used when we process out-of-core to put the path in metadata. This isn't used for tskit (at least for now) so I guess TskitFormat.path could just raise NotImplemented and the path not be stored.

@jeromekelleher jeromekelleher enabled auto-merge May 23, 2025 11:17
@jeromekelleher jeromekelleher added this pull request to the merge queue May 23, 2025
Merged via the queue into sgkit-dev:main with commit 479ec84 May 23, 2025
15 checks passed
@hyanwong
Copy link

Can .convert() (or perhaps another function) return a handle to the vcz object, for use in tsinfer? So that we can do

tsinfer.infer(tsinfer.VariantData(returned_handle, ancestral_state="ancestral_allele"))

@jeromekelleher
Copy link
Contributor Author

No, that's out of scope because would greatly complicate the semantics (should the returned object be a Zarr hierarchy, or an Xarray dataset? Should it be in read-only mode?). Bio2zarr just deals with the on-disk formats, not providing python APIs for working with them. Bio2zarr will always create an on-disk representation, so returning a handle would just be abstracting the details of where the temporary files are kept.

I think tsinfer should probably write a wrapper which creates a temporary directory and does the conversion so that you can use it like this:

with tsinfer.VariantData.from_tree_sequence(ts) as data:
    tsinfer.infer(data)

That'll take care of a lot of the messiness.

@jeromekelleher jeromekelleher deleted the finalise-tskit branch May 23, 2025 14:23
@hyanwong
Copy link

Ah, OK. I had gathered that bio2zarr might also create in-memory representations, which I use quite a lot. But yes, a wrapper would be fine, I guess, assuming temporary directories etc are available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove tskit github pin New tool: tskit2zarr
4 participants