-
Notifications
You must be signed in to change notification settings - Fork 10
Initial tskit conversion code #346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Big validation of your general VCZ code.
bio2zarr/ts.py
Outdated
self.num_records = self.ts.num_sites | ||
self.positions = self.ts.sites_position | ||
|
||
def _make_sample_mapping(self, ploidy): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it belongs in tskit. Let's file an issue there to support once we have something working committed here.
bio2zarr/ts.py
Outdated
|
||
def iter_field(self, field_name, shape, start, stop): | ||
if field_name == "position": | ||
for pos in self.ts.tables.sites.position[start:stop]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.ts.sites_position
- you're creating a copy of the tables every time here!
bio2zarr/ts.py
Outdated
|
||
self.samples = [vcz.Sample(id=f"tsk_{j}") for j in range(self.num_samples)] | ||
|
||
def iter_alleles(self, start, stop, num_alleles): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, it's a bit unfortunate we have to do this. I can see ways to work around though.
868bc9d
to
0c6fd5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I think we just need an update and a bit more round-trip testing for the initial version. We can make the interface more flexible later, but this is the core of it.
bio2zarr/ts.py
Outdated
@@ -0,0 +1,287 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about the long-term API, maybe this file should be called tskit.py
? Then, clients can do stuff like
bio2zarr.tskit.convert()
I don't think the name-clash of the submodule is a big enough problem to make it worth muddying the interface for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed in e0dc5f6
I've added a load more testing - found some issues around mixed ploidy and sample nodes that aren't at the start of the nodes table. Would appreciate a check on those bits. |
Looks plausible to me, but I'd have to really dig in to make sure it's watertight. I think a good way to go might be to identify the API that we want from tskit to do this murky data-model migration stuff, and then make it tskit's responsibility to implement it? So for the initial version we just raise errors for mixed ploidies and stuff, keeping this code reasonably simple, and then try to fix a long-term stable API where tskit becomes responsible for mapping stuff into the VCF-Zarr model. Ultimately, we just want tskit to give us a |
I think we can make the API here something like ind_nodes = ts.individual_nodes() # (n, ploidy) array, padded with -1s if mixed ploidy
sample_ids = ["x{j}" for j in range(len(ind_nodes))] # Or could look up metadata if you please
bio2zarr.tskit.convert(ts, ind_nodes, sample_ids=sample_ids) The default in bio2zarr could be to call For now, we can do the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tthis is going in the right direction, but we need to tackle the efficiency issue up front. We need to
- Generate a flattened list of nodes that corresponds to the non-missing elements of the
individual_nodes
array - A mapping from this array back into the output genotypes so that we can do something like
for variant in ts.variants(samples=node_map_flattened):
for ploid in range(max_ploidy):
genotypes[non_missing[ploid], ploid] = variant.genotypes[non_missing_map[ploid]]
bio2zarr/tskit.py
Outdated
self.node_ids_array = individual_nodes(ts) | ||
self._num_samples = ts.num_individuals | ||
self.max_ploidy = self.node_ids_array.shape[1] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope of this is too wide, you're just covering the individual_nodes call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, might as well make this general straight away:
self._num_samples = self.node_ids.shape[0]
bio2zarr/tskit.py
Outdated
gt[i, j] = genotypes[sample_index + j] | ||
sample_index += ploidy | ||
# For each individual, get genotypes for their nodes | ||
for i in range(self.num_samples): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There must be some numpy-way of doings this without iterating over samples - otherwise this is going to be super slow
bio2zarr/tskit.py
Outdated
phased = np.ones(shape[:-1], dtype=bool) | ||
|
||
for variant in self.ts.variants( | ||
samples=self.sample_ids, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have to pass samples somehow, otherwise we can't get data for non-sample nodes out.
Sorry I pushed a bit early as I was moving |
Not sure if this one slipped the net - it's ready for another look over. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Spotted a few small things
bio2zarr/tskit.py
Outdated
|
||
# Determine max number of alleles | ||
max_alleles = 0 | ||
for variant in self.ts.variants(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like overkill - how about we count the maximum number of distinct states? Like
max_alleles = 0
for site in ts.sites():
states = {site.ancestral_state}
for mut in site.mutations:
states.add(mut.derived_state)
max_alleles = max(len(states), max_alleles)
Could probably be done quicker with numpy, but this'll be fine.
It's a safe maximum, and it'll be correct in the vast majority of cases right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 4d5a9b4
This seems right - it can higher if there are back mutations, but I can't see how it could be smaller.
bio2zarr/tskit.py
Outdated
vcz.ZarrArraySpec( | ||
source="position", | ||
name="variant_position", | ||
dtype="i4", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could potentially have position values that won't fit into int32, we should check for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 304a85d
I've made it use i8
if larger positions are found.
bio2zarr/tskit.py
Outdated
vcz.ZarrArraySpec( | ||
source=None, | ||
name="call_genotype", | ||
dtype="i1", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should check that max alleles will fit into this type. There's a function like min_dtype
or something for this somewhere in this repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 6eb4caf
bio2zarr/tskit.py
Outdated
def __init__( | ||
self, | ||
ts_path, | ||
individual_nodes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Best make this individuals_nodes
for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 727e47d
bio2zarr/tskit.py
Outdated
max_position = 0 | ||
if self.ts.num_sites > 0: | ||
max_position = np.max(self.ts.sites_position) | ||
position_dtype = "i4" if max_position <= np.iinfo(np.int32).max else "i8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use min_int_dtype here also?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Done in 27a79ec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. looks like ready to squash and merge?
Squashed! Will merge when CI happy. |
On that note, shall I implement the github auto merge here? Seems to work well over at tskit. |
yes please! Can you do vcztools also? |
Added to the auto queue, which you can see here: https://github.com/sgkit-dev/bio2zarr/queue/main |
No description provided.