-
Notifications
You must be signed in to change notification settings - Fork 6
Description
VCF Zarr handles ploidy by assigning a maximum ploidy across all individuals, and then letting each call have it's own ploidy by padding the call_genotype array with dimension padding sentinels (-2) as appropriate. This allows us to handle individuals with different ploidy naturally (e.g a haplodiploid species where one sex is haploid and the other diploid), and even cases where the ploidy of an individual varies along the genome. For example, we could model X chromosomes as diploid in females, and haploid in males. As we're handling ploidy on a per-call basis there is no problem with this.
An interesting extension of this is to consider a ploidy of zero, in cases when individuals do not have a particular chromosome at all --- for example, Y chromosomes in females. If we're modelling the genome as one big Zarr, then the natural thing to do is to have [-2, -2] values in the call_genotype
array for all females in the Y chromosome.
This works perfectly well for VCF Zarr, but presents a challenge when we try to map back to the VCF text format. We get something like this
##fileformat=VCFv4.3
##contig=<ID=chr1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT x0 x1
chr1 1 . A T 0 PASS . GT 0/0
chr1 2 . A T 0 PASS . GT 0/0
Here, x0 is 0-ploid at variant 0 and x1 is 0-ploid at variant 2. As a natural extension (i.e., with no changes to our code) we can handle this by outputting nothing in at the GT space for these zero-ploid samples - there is no text in between the "tab" delimeters. However, this isn't something the VCF spec discusses, and it seems that htslib doesn't like it either:
$ bcftools view zero-ploidy.vcf
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at chr1:1
Error: VCF parse error
I think this doesn't really arise in practise with VCFs because people always output per chromosome files, and they just omit zero-ploid samples from these. However, it's something we'll have to deal with in vcztools if we do combine all the data together into One Big Zarr.
Perhaps we should add an option "zero-ploidy-as-missing" or something, which at least puts in a "."? That would let people get data out in a usable format for a start (even if it's slightly misleading)