-
Notifications
You must be signed in to change notification settings - Fork 77
Description
We're trying to nail down the precise semantics of the mapping between the tskit data model and VCF in #3163 so that we can define as complete a mapping as we can in for ts2zarr (initial PR here: sgkit-dev/bio2zarr#346).
For simplicity, assume that we have an individual table defined and each individual maps > 0 nodes. Nearly always we're going to be interested in generating VCF for samples. The mapping to the model is defined by: a list of node IDs grouped by individual, and a corresponding list of strings denoting the IDs of those individuals in the VCF data model. So, for (say) two diploids the output is ``[[0, 1], [2, 3]], ["tsk_0", "tsk_1"]. (Let's drop the strings from here as it's obvious.)
Mixed ploidy is straightforward. Suppose we have two individuals, one diploid and one haploid. Then, we get [[0, 1], [2]]
. In (simplified) VCF we would see a variant as
CHROM POS FORMAT tsk_0 tsk_1
1 1 GT 0|0 0
This will be correctly interpreted as a mixed ploidy case by VCF Zarr, which will represent the genotypes here as
[0, 0], [0, -2]
where -2 is the dimension padding sentinel (indicating that this inidividual is haploid at this variant). The VCF spec itself is silent on the topic of samples with mixed ploidy, but this representation is entirely consistent with it, unambiguous and fully standards compliant.
Things get more fun when we have individuals that are both haploid and diploid in the same dataset (sex chrs) but from a VCF/Zarr perspective it's unequivocal:
CHROM POS FORMAT tsk_0 tsk_1
1 1 GT 0|0 0
2 1 GT 0|0 0|0
Here, individual tsk_1 is haploid on chr 1 and diploid in chr 2. This will be faithfully and unambiguously rendered in VCF Zarr, and is fully standards compliant.
The tricky question here is, how do we represent this in tskit? In practise, we'll have two tree sequences one each for chr 1 and chr 2, which will have defined the same set of individuals. Suppose we had the following mapping of nodes, [[0, 1], [2, 3]]. I think what happens in a SLiM simulation now is that all the [0, 1, 2, 3] are samples (as all the node tables are identical during the simulation), but in chr 1, node 3 will be isolated (i.e. not be in any trees). I think the proposal in pyslim is to detect the situation and to unmark 3 as a sample for the purposes of analysis (and therefore all the usual stats should take care of themselves).
The correct thing for the VCF model conversion though in this case is to ignore the non-sample node in a mixed sample/non-sample case. So, the question is, should we do this by default? Or all the time? Or add a flag, like include_non_sample_nodes=False
, which would always exclude non-sample nodes from the mapping in both the default and non-default cases (which is probably what you want , nearly always)?
Moving on to individuals that are sometimes "zero-ploid" (females on Y chr, e.g.), I think the same idea extends, and works fine in VCF Zarr. So, suppose tsk_1 is zero ploid on chr 1 this time. Then, we'd have the mapping of [[0, 1], []]
, and in VCF Zarr we'd get genotypes of [[0, 0], [-2, -2]]
. As explored in this issue (sgkit-dev/vcztools#212) this works perfectly well for VCF Zarr, but hits some issues with people's implementations of the VCF text specification.
My take would be that tskit's view is that these individuals are zero-ploid, i.e. have zero sample nodes associated with them.