Initial tskit conversion code #346

benjeffery · 2025-04-07T00:18:00Z

No description provided.

jeromekelleher

Looks great! Big validation of your general VCZ code.

jeromekelleher · 2025-04-07T08:54:27Z

bio2zarr/ts.py

+        self.num_records = self.ts.num_sites
+        self.positions = self.ts.sites_position
+
+    def _make_sample_mapping(self, ploidy):


This feels like it belongs in tskit. Let's file an issue there to support once we have something working committed here.

jeromekelleher · 2025-04-07T08:55:36Z

bio2zarr/ts.py

+
+    def iter_field(self, field_name, shape, start, stop):
+        if field_name == "position":
+            for pos in self.ts.tables.sites.position[start:stop]:


self.ts.sites_position - you're creating a copy of the tables every time here!

jeromekelleher · 2025-04-07T08:56:51Z

bio2zarr/ts.py

+
+        self.samples = [vcz.Sample(id=f"tsk_{j}") for j in range(self.num_samples)]
+
+    def iter_alleles(self, start, stop, num_alleles):


Ah yes, it's a bit unfortunate we have to do this. I can see ways to work around though.

coveralls · 2025-04-08T15:41:18Z

coverage: 98.142% (+0.07%) from 98.074%
when pulling 95c5cf3 on benjeffery:tskit
into e3c31e5 on sgkit-dev:main.

jeromekelleher

Looks good, I think we just need an update and a bit more round-trip testing for the initial version. We can make the interface more flexible later, but this is the core of it.

jeromekelleher · 2025-04-30T07:59:58Z

bio2zarr/ts.py

@@ -0,0 +1,287 @@
+import logging


Thinking about the long-term API, maybe this file should be called tskit.py? Then, clients can do stuff like

bio2zarr.tskit.convert()

I don't think the name-clash of the submodule is a big enough problem to make it worth muddying the interface for.

Renamed in e0dc5f6

benjeffery · 2025-04-30T10:48:51Z

I've added a load more testing - found some issues around mixed ploidy and sample nodes that aren't at the start of the nodes table. Would appreciate a check on those bits.

jeromekelleher · 2025-04-30T11:02:31Z

Looks plausible to me, but I'd have to really dig in to make sure it's watertight.

I think a good way to go might be to identify the API that we want from tskit to do this murky data-model migration stuff, and then make it tskit's responsibility to implement it? So for the initial version we just raise errors for mixed ploidies and stuff, keeping this code reasonably simple, and then try to fix a long-term stable API where tskit becomes responsible for mapping stuff into the VCF-Zarr model.

Ultimately, we just want tskit to give us a (n, max_ploidy) array which defines the sample nodes for individuals and punt all the complexity over there, right?

jeromekelleher · 2025-04-30T11:11:54Z

I think we can make the API here something like

ind_nodes = ts.individual_nodes()  # (n, ploidy) array, padded with -1s if mixed ploidy
sample_ids = ["x{j}" for j in range(len(ind_nodes))] # Or could look up metadata if you please
bio2zarr.tskit.convert(ts, ind_nodes, sample_ids=sample_ids)

The default in bio2zarr could be to call ts.individual_nodes() with no parameters if not specified and have the standard sample_ids, but the user then has the flexibility to call ts.individual_nodes() directly with various parameters if necessary, which would be responsible for doing all the quirky model translation.

For now, we can do the individual_nodes for the simple case where individuals are defined, and just error out for anything else while we get ts.individual_nodes() implemented and released in tskit (which should be easy). How does that sound?

jeromekelleher

Tthis is going in the right direction, but we need to tackle the efficiency issue up front. We need to

Generate a flattened list of nodes that corresponds to the non-missing elements of the individual_nodes array
A mapping from this array back into the output genotypes so that we can do something like

for variant in ts.variants(samples=node_map_flattened):
    for ploid in range(max_ploidy):
             genotypes[non_missing[ploid], ploid] = variant.genotypes[non_missing_map[ploid]]

jeromekelleher · 2025-05-02T13:27:00Z

bio2zarr/tskit.py

+            self.node_ids_array = individual_nodes(ts)
+            self._num_samples = ts.num_individuals
+            self.max_ploidy = self.node_ids_array.shape[1]
+


Scope of this is too wide, you're just covering the individual_nodes call

Also, might as well make this general straight away:

self._num_samples = self.node_ids.shape[0]

jeromekelleher · 2025-05-02T13:29:30Z

bio2zarr/tskit.py

-                        gt[i, j] = genotypes[sample_index + j]
-                sample_index += ploidy
+            # For each individual, get genotypes for their nodes
+            for i in range(self.num_samples):


There must be some numpy-way of doings this without iterating over samples - otherwise this is going to be super slow

jeromekelleher · 2025-05-02T13:31:15Z

bio2zarr/tskit.py

        phased = np.ones(shape[:-1], dtype=bool)

        for variant in self.ts.variants(
-            samples=self.sample_ids,


Have to pass samples somehow, otherwise we can't get data for non-sample nodes out.

benjeffery · 2025-05-02T16:20:12Z

Sorry I pushed a bit early as I was moving individual_nodes to the caller. Good idea on the numpy assignment though, I have implemented that.

benjeffery · 2025-05-07T12:34:32Z

Not sure if this one slipped the net - it's ready for another look over.

jeromekelleher

LGTM. Spotted a few small things

jeromekelleher · 2025-05-07T13:43:45Z

bio2zarr/tskit.py

+
+        # Determine max number of alleles
+        max_alleles = 0
+        for variant in self.ts.variants():


This feels like overkill - how about we count the maximum number of distinct states? Like

max_alleles = 0 for site in ts.sites(): states = {site.ancestral_state} for mut in site.mutations: states.add(mut.derived_state) max_alleles = max(len(states), max_alleles)

Could probably be done quicker with numpy, but this'll be fine.

It's a safe maximum, and it'll be correct in the vast majority of cases right?

Fixed in 4d5a9b4

This seems right - it can higher if there are back mutations, but I can't see how it could be smaller.

jeromekelleher · 2025-05-07T13:45:29Z

bio2zarr/tskit.py

+            vcz.ZarrArraySpec(
+                source="position",
+                name="variant_position",
+                dtype="i4",


We could potentially have position values that won't fit into int32, we should check for this.

Fixed in 304a85d

I've made it use i8 if larger positions are found.

jeromekelleher · 2025-05-07T13:46:48Z

bio2zarr/tskit.py

+            vcz.ZarrArraySpec(
+                source=None,
+                name="call_genotype",
+                dtype="i1",


Should check that max alleles will fit into this type. There's a function like min_dtype or something for this somewhere in this repo

Fixed in 6eb4caf

jeromekelleher · 2025-05-07T13:47:42Z

bio2zarr/tskit.py

+    def __init__(
+        self,
+        ts_path,
+        individual_nodes,


Best make this individuals_nodes for consistency

Fixed in 727e47d

jeromekelleher · 2025-05-07T16:34:04Z

bio2zarr/tskit.py

+        max_position = 0
+        if self.ts.num_sites > 0:
+            max_position = np.max(self.ts.sites_position)
+        position_dtype = "i4" if max_position <= np.iinfo(np.int32).max else "i8"


Why not use min_int_dtype here also?

Good point! Done in 27a79ec

jeromekelleher

OK. looks like ready to squash and merge?

benjeffery · 2025-05-08T08:20:22Z

Squashed! Will merge when CI happy.

benjeffery · 2025-05-08T08:20:54Z

On that note, shall I implement the github auto merge here? Seems to work well over at tskit.

jeromekelleher · 2025-05-08T08:35:18Z

On that note, shall I implement the github auto merge here? Seems to work well over at tskit.

yes please!

Can you do vcztools also?

benjeffery · 2025-05-08T11:06:29Z

Added to the auto queue, which you can see here: https://github.com/sgkit-dev/bio2zarr/queue/main

jeromekelleher reviewed Apr 7, 2025

View reviewed changes

benjeffery force-pushed the tskit branch from 9f5ae9a to 7ab49d4 Compare April 8, 2025 09:21

benjeffery force-pushed the tskit branch 2 times, most recently from 868bc9d to 0c6fd5e Compare April 30, 2025 00:16

jeromekelleher reviewed Apr 30, 2025

View reviewed changes

jeromekelleher reviewed May 2, 2025

View reviewed changes

benjeffery force-pushed the tskit branch from e3585ff to 99130dd Compare May 2, 2025 17:59

benjeffery marked this pull request as ready for review May 2, 2025 18:55

jeromekelleher reviewed May 7, 2025

View reviewed changes

jeromekelleher approved these changes May 8, 2025

View reviewed changes

Add tskit conversion module

95c5cf3

benjeffery force-pushed the tskit branch from 27a79ec to 95c5cf3 Compare May 8, 2025 08:19

benjeffery added this pull request to the merge queue May 8, 2025

github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 8, 2025

benjeffery added this pull request to the merge queue May 8, 2025

Merged via the queue into sgkit-dev:main with commit 33c66a3 May 8, 2025
16 checks passed

benjeffery deleted the tskit branch May 8, 2025 12:23

jeromekelleher mentioned this pull request May 16, 2025

Modelling of sex chromosomes, haplodiploids and the VCF data model tskit-dev/tskit#3166

Closed


		self.samples = [vcz.Sample(id=f"tsk_{j}") for j in range(self.num_samples)]

		def iter_alleles(self, start, stop, num_alleles):

Initial tskit conversion code #346

Initial tskit conversion code #346

Uh oh!

Conversation

benjeffery commented Apr 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coveralls commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benjeffery commented Apr 30, 2025

Uh oh!

jeromekelleher commented Apr 30, 2025

Uh oh!

jeromekelleher commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

benjeffery commented May 2, 2025

Uh oh!

benjeffery commented May 7, 2025

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jeromekelleher left a comment

Choose a reason for hiding this comment

Uh oh!

benjeffery commented May 8, 2025

Uh oh!

benjeffery commented May 8, 2025

Uh oh!

jeromekelleher commented May 8, 2025

Uh oh!

benjeffery commented Apr 7, 2025 •

edited

Loading

coveralls commented Apr 8, 2025 •

edited

Loading

jeromekelleher commented Apr 30, 2025 •

edited

Loading