Commit b92ec7c

Document local alleles
Closes #298
1 parent 0f50d0c commit b92ec7c

2 files changed, 52 additions and 4 deletions

docs/vcf2zarr/overview.md

Lines changed: 50 additions & 4 deletions
@@ -9,6 +9,8 @@ See the {ref}`sec-vcf2zarr-tutorial` for a step-by-step introduction
 and the {ref}`sec-vcf2zarr-cli-ref` detailed documentation on
 command line options.
 
+See the [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.06.11.598241) for
+further details.
 
 ## Quickstart
 
@@ -43,13 +45,18 @@ vcf2zarr inspect sample.vcz
 ### What next?
 
 VCF Zarr is a starting point in what we hope will become a diverse ecosystem
-of packages that efficiently process VCF data in Zarr format. However, this
-ecosytem does not exist yet, and there isn't much software available
-for working with the format. As such, VCF Zarr isn't suitable for end users
-who just want to get their work done for the moment.
+of packages that efficiently process VCF data in Zarr format. This
+ecosystem is in its infancy and there isn't much software available
+for performing off-the-shelf bioinformatics tasks
+with the format. As such, VCF Zarr isn't suitable for end users
+who just want to get their work done for the moment, and is currently
+aimed at methods developers and early adopters.
 
 Having said that, you can:
 
+- Use [vcztools](https://github.com/sgkit-dev/vcztools/) as a drop-in replacement
+for bcftools, transparently using Zarr on local storage or cloud stores as the
+backend.
 - Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
 to see how data is mapped from VCF to Zarr
 - Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
@@ -59,6 +66,9 @@ your data.
 sister project to analyse the data. Note that sgkit is under active development,
 however, and the documentation may not be fully in-sync with this project.
 
+For more information, please see our
+bioRxiv preprint [Analysis-ready VCF at Biobank scale using Zarr](
+https://www.biorxiv.org/content/10.1101/2024.06.11.598241).
 
 
 ## How does it work?
@@ -83,6 +93,42 @@ across cores on a single machine (via the ``--worker-processes`` argument)
 or distributed across a cluster by the three-part ``init``, ``partition``
 and ``finalise`` commands.
 
+## Local alleles
+
+As discussed in our [preprint](
+https://www.biorxiv.org/content/10.1101/2024.06.11.598241)
+vcf2zarr has an experimental implementation of the local alleles data
+reduction technique. This essentially reduces the inner dimension of
+large fields such as AD by storing information relevant only to the alleles
+involved in a particular variant call, rather than information
+for all alleles. This can make a substantial difference when there is a large
+number of alleles.
+
+To use local alleles, you must generate a storage schema (see the
+{ref}`sec-vcf2zarr-tutorial-medium-dataset` section of the tutorial)
+using the {ref}`mkschema<cmd-vcf2zarr-mkschema>` command with the
+``--local-alleles`` option. This will generate the ``call_LA`` field
+which lists the alleles observed for each genotype call, and
+translate supported fields from their global alleles to local
+alleles representation.
+
+:::{warning}
+Support for local alleles is preliminary and may be subject to change,
+including the details of how alleles for a particular call are chosen and how the
+number of alleles retained is determined. Please open an issue on
+[GitHub](https://github.com/sgkit-dev/bio2zarr/issues/) if you would like to
+help improve Bio2zarr's local alleles implementation.
+:::
+
+:::{note}
+Only the PL and AD fields are currently supported for local alleles
+data reduction. Please comment on our
+[local alleles fields tracking issue](https://github.com/sgkit-dev/bio2zarr/issues/315)
+if you would like to see other fields supported, or to help out with
+implementing more.
+:::
+
+
 ## Copying to object stores
 
 :::{todo}

docs/vcf2zarr/tutorial.md

Lines changed: 2 additions & 0 deletions
@@ -83,6 +83,8 @@ chunks are *larger* than the actual arrays. This is because it's
 a tiny example, with only 9 variants and 3 samples (see the ``shape``
 column), so, for example ``call_genotype`` is only 54 bytes.
 
+
+(sec-vcf2zarr-tutorial-medium-dataset)=
 ## Medium dataset
 
 Conversion in ``vcf2zarr`` is a two step process. First we convert the VCF(s) to

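Note on the "Local alleles" section added in docs/vcf2zarr/overview.md: the idea of reducing the inner dimension of a field such as AD can be illustrated with a minimal pure-Python sketch. This is not bio2zarr's implementation. The function name `localize_ad`, the `max_local` limit, the `-2` fill value, and the rule of keeping the sorted distinct alleles of the genotype are all assumptions made for illustration; the documentation itself warns that how alleles are chosen and how many are retained may change.

```python
def localize_ad(genotype, ad, max_local=2, fill=-2):
    """Hypothetical helper: reduce a per-allele AD vector to local alleles.

    genotype: global allele indices for one call (negative means missing)
    ad:       allele depths indexed by global allele (REF first)
    Returns (la, lad): the alleles kept for this call, and AD restricted
    to those alleles. Unused slots hold the assumed fill value.
    """
    # Distinct, non-missing alleles observed in this call, in sorted order.
    observed = sorted({a for a in genotype if a >= 0})
    la = [fill] * max_local
    lad = [fill] * max_local
    for j, allele in enumerate(observed[:max_local]):
        la[j] = allele        # local index j maps to this global allele
        lad[j] = ad[allele]   # keep only the depth for that allele
    return la, lad


# A site with six alleles: the call 0/3 only needs two AD entries.
la, lad = localize_ad([0, 3], [12, 0, 0, 9, 0, 0])
print(la, lad)
```

In this sketch the six-entry AD vector shrinks to two entries per call, and `la` plays the role that the documentation ascribes to the ``call_LA`` field: it records which global alleles the local entries refer to.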