@@ -9,6 +9,8 @@ See the {ref}`sec-vcf2zarr-tutorial` for a step-by-step introduction
99and the {ref}`sec-vcf2zarr-cli-ref` for detailed documentation on
1010command line options.
1111
12+ See the [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.06.11.598241) for
13+ further details.
1214
1315## Quickstart
1416
@@ -43,13 +45,18 @@ vcf2zarr inspect sample.vcz
4345### What next?
4446
4547VCF Zarr is a starting point in what we hope will become a diverse ecosystem
46- of packages that efficiently process VCF data in Zarr format. However, this
47- ecosytem does not exist yet, and there isn't much software available
48- for working with the format. As such, VCF Zarr isn't suitable for end users
49- who just want to get their work done for the moment.
48+ of packages that efficiently process VCF data in Zarr format. This
49+ ecosystem is in its infancy, and there isn't much software available
50+ for performing off-the-shelf bioinformatics tasks
51+ with the format. As such, VCF Zarr isn't yet suitable for end users
52+ who just want to get their work done, and is currently
53+ aimed at methods developers and early adopters.
5054
5155Having said that, you can:
5256
57+ - Use [vcztools](https://github.com/sgkit-dev/vcztools/) as a drop-in replacement
58+ for bcftools, transparently using Zarr on local storage or cloud stores as the
59+ backend.
5360- Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
5461 to see how data is mapped from VCF to Zarr
5562- Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
@@ -59,6 +66,9 @@ your data.
5966sister project to analyse the data. Note that sgkit is under active development,
6067however, and the documentation may not be fully in-sync with this project.
6168
69+ For more information, please see our
70+ bioRxiv preprint [Analysis-ready VCF at Biobank scale using Zarr](
71+ https://www.biorxiv.org/content/10.1101/2024.06.11.598241).
6272
6373
6474## How does it work?
@@ -83,6 +93,42 @@ across cores on a single machine (via the ``--worker-processes`` argument)
8393or distributed across a cluster by the three-part ``init``, ``partition``
8494and ``finalise`` commands.
8595
96+ ## Local alleles
97+
98+ As discussed in our [preprint](
99+ https://www.biorxiv.org/content/10.1101/2024.06.11.598241),
100+ vcf2zarr has an experimental implementation of the local alleles data
101+ reduction technique. This essentially reduces the inner dimension of
102+ large fields such as AD by storing information relevant only to the alleles
103+ involved in a particular variant call, rather than information
104+ for all alleles. This can make a substantial difference when there is a large
105+ number of alleles.
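
The idea can be illustrated with a minimal sketch for a single genotype call. This is not bio2zarr's implementation: the function name `localize_call`, the choice to always keep the reference allele, the output width of ploidy + 1, and the padding value of -2 are all assumptions made for illustration.

```python
def localize_call(genotype, ad, fill=-2):
    """Reduce one call's per-allele depths to its local alleles.

    genotype: allele indexes for the call (negative values mean missing).
    ad: depths for ALL alleles at the site (the "global" representation).
    Returns (local_alleles, local_ad), both of length ploidy + 1.
    """
    # Assumption: the local alleles are the reference (index 0) plus
    # every allele that appears in the genotype.
    alleles = sorted({0, *(a for a in genotype if a >= 0)})
    # Assumption: pad to ploidy + 1 entries so every call has the same width.
    pad = [fill] * (len(genotype) + 1 - len(alleles))
    local_alleles = alleles + pad
    local_ad = [ad[a] for a in alleles] + pad
    return local_alleles, local_ad


# A site with five alleles: a heterozygous 0/3 call needs only two depths.
la, lad = localize_call([0, 3], [10, 0, 0, 7, 0])
# A homozygous 4/4 call keeps the reference depth and the depth for allele 4.
la2, lad2 = localize_call([4, 4], [1, 0, 0, 0, 20])
```

In this sketch the inner dimension shrinks from the number of alleles at the site (five) to ploidy + 1 (three), while the list of local alleles records which global allele each retained depth refers to.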
106+
107+ To use local alleles, you must generate a storage schema (see the
108+ {ref}`sec-vcf2zarr-tutorial-medium-dataset` section of the tutorial)
109+ using the {ref}`mkschema<cmd-vcf2zarr-mkschema>` command with the
110+ ``--local-alleles`` option. This will generate the ``call_LA`` field,
111+ which lists the alleles observed for each genotype call, and
112+ translate supported fields from their global-alleles to their
113+ local-alleles representation.
114+
115+ :::{warning}
116+ Support for local alleles is preliminary and may be subject to change
117+ as we refine how the alleles for a particular call are chosen and how the
118+ number of alleles retained is determined. Please open an issue on
119+ [GitHub](https://github.com/sgkit-dev/bio2zarr/issues/) if you would like to
120+ help improve bio2zarr's local alleles implementation.
121+ :::
122+
123+ :::{note}
124+ Only the PL and AD fields are currently supported for local alleles
125+ data reduction. Please comment on our
126+ [local alleles fields tracking issue](https://github.com/sgkit-dev/bio2zarr/issues/315)
127+ if you would like to see other fields supported, or to help out with
128+ implementing more.
129+ :::
130+
131+
86132## Copying to object stores
87133
88134:::{todo}