Skip to content

Handling Zero-ploidy individuals in the output #212

@jeromekelleher

Description

@jeromekelleher

VCF Zarr handles ploidy by assigning a maximum ploidy across all individuals, and then letting each call have it's own ploidy by padding the call_genotype array with dimension padding sentinels (-2) as appropriate. This allows us to handle individuals with different ploidy naturally (e.g a haplodiploid species where one sex is haploid and the other diploid), and even cases where the ploidy of an individual varies along the genome. For example, we could model X chromosomes as diploid in females, and haploid in males. As we're handling ploidy on a per-call basis there is no problem with this.

An interesting extension of this is to consider a ploidy of zero, in cases when individuals do not have a particular chromosome at all --- for example, Y chromosomes in females. If we're modelling the genome as one big Zarr, then the natural thing to do is to have [-2, -2] values in the call_genotype array for all females in the Y chromosome.

This works perfectly well for VCF Zarr, but presents a challenge when we try to map back to the VCF text format. We get something like this

##fileformat=VCFv4.3
##contig=<ID=chr1>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  x0      x1
chr1    1       .       A       T       0       PASS    .       GT              0/0
chr1    2       .       A       T       0       PASS    .       GT      0/0

Here, x0 is 0-ploid at variant 0 and x1 is 0-ploid at variant 2. As a natural extension (i.e., with no changes to our code) we can handle this by outputting nothing in at the GT space for these zero-ploid samples - there is no text in between the "tab" delimeters. However, this isn't something the VCF spec discusses, and it seems that htslib doesn't like it either:

$ bcftools view zero-ploidy.vcf
[E::vcf_parse_format] Couldn't read GT data: value not a number or '.' at chr1:1
Error: VCF parse error

I think this doesn't really arise in practise with VCFs because people always output per chromosome files, and they just omit zero-ploid samples from these. However, it's something we'll have to deal with in vcztools if we do combine all the data together into One Big Zarr.

Perhaps we should add an option "zero-ploidy-as-missing" or something, which at least puts in a "."? That would let people get data out in a usable format for a start (even if it's slightly misleading)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions