Document local alleles

jeromekelleher · jeromekelleher · commit b92ec7ccb8b3 · 2025-02-05T10:08:50.000Z
Closes #298
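
For reviewers unfamiliar with the technique, here is a minimal sketch of the idea behind local alleles, applied to an AD-like field for a single call. It is not bio2zarr's implementation: the function name `localize_ad`, the `max_local` and `fill` parameters, and the rule of keeping only the alleles that appear in the genotype are all assumptions made for illustration.

```python
import numpy as np


def localize_ad(genotype, ad, max_local=2, fill=-1):
    """Hypothetical helper: reduce a per-allele field to the called alleles.

    genotype: allele indices for one call, e.g. [0, 3]; negatives mean missing.
    ad: one value per global allele (REF followed by every ALT).
    Returns (la, lad), where la lists the alleles kept for this call and
    lad holds the matching entries of ad. Both are padded with `fill`.
    """
    # Assumption: the local alleles are the distinct alleles in the genotype.
    observed = np.array(
        sorted({a for a in genotype if a >= 0})[:max_local], dtype=np.int64
    )
    la = np.full(max_local, fill, dtype=np.int64)
    lad = np.full(max_local, fill, dtype=np.int64)
    la[: len(observed)] = observed
    lad[: len(observed)] = np.asarray(ad)[observed]
    return la, lad


# A site with 8 alleles: the global field stores 8 values per call,
# while the local form stores only max_local values.
global_ad = np.array([12, 0, 0, 9, 0, 0, 0, 0])
la_het, lad_het = localize_ad([0, 3], global_ad)
la_hom, lad_hom = localize_ad([3, 3], global_ad)
```

In this sketch the inner dimension of the field shrinks from the number of alleles at the site to `max_local`, which is where the saving comes from when a site has many alleles.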
diff --git a/docs/vcf2zarr/overview.md b/docs/vcf2zarr/overview.md
@@ -9,6 +9,8 @@ See the {ref}`sec-vcf2zarr-tutorial` for a step-by-step introduction
 and the {ref}`sec-vcf2zarr-cli-ref` detailed documentation on
 command line options.
 
+See the [bioRxiv preprint](https://www.biorxiv.org/content/10.1101/2024.06.11.598241) for 
+further details.
 
 ## Quickstart
 
@@ -43,13 +45,18 @@ vcf2zarr inspect sample.vcz
 ### What next?
 
 VCF Zarr is a starting point in what we hope will become a diverse ecosytem
-of packages that efficiently process VCF data in Zarr format. However, this
-ecosytem does not exist yet, and there isn't much software available
-for working with the format. As such, VCF Zarr isn't suitable for end users
-who just want to get their work done for the moment.
+of packages that efficiently process VCF data in Zarr format. This
+ecosystem is in its infancy, and there isn't much software available
+for performing off-the-shelf bioinformatics tasks
+with the format. As such, VCF Zarr isn't yet suitable for end users
+who just want to get their work done, and is currently
+aimed at methods developers and early adopters.
 
 Having said that, you can:
 
+- Use [vcztools](https://github.com/sgkit-dev/vcztools/) as a drop-in replacement
+  for bcftools, transparently using Zarr on local storage or cloud stores as the 
+  backend.
 - Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
   to see how data is mapped from VCF to Zarr
 - Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
@@ -59,6 +66,9 @@ your data.
 sister project to analyse the data. Note that sgkit is under active development,
 however, and the documentation may not be fully in-sync with this project.
 
+For more information, please see our 
+bioRxiv preprint [Analysis-ready VCF at Biobank scale using Zarr](
+https://www.biorxiv.org/content/10.1101/2024.06.11.598241).
 
 
 ## How does it work?
@@ -83,6 +93,42 @@ across cores on a single machine (via the ``--worker-processes`` argument)
 or distributed across a cluster by the three-part ``init``, ``partition``
 and ``finalise`` commands.
 
+## Local alleles
+
+As discussed in our [preprint](
+https://www.biorxiv.org/content/10.1101/2024.06.11.598241),
+vcf2zarr has an experimental implementation of the local alleles data
+reduction technique. This reduces the inner dimension of
+large fields such as AD by storing information only for the alleles
+involved in a particular variant call, rather than information
+for all alleles. This can make a substantial difference when there is a large
+number of alleles.
+
+To use local alleles, you must generate a storage schema (see the
+{ref}`sec-vcf2zarr-tutorial-medium-dataset` section of the tutorial)
+using the {ref}`mkschema<cmd-vcf2zarr-mkschema>` command with the 
+``--local-alleles`` option. This will generate the ``call_LA`` field
+which lists the alleles observed for each genotype call, and 
+translate supported fields from their global alleles to their local
+alleles representation.
+
+:::{warning}
+Support for local alleles is preliminary and may change as we refine
+how the alleles for a particular call are chosen and how the
+number of alleles retained is determined. Please open an issue on
+[GitHub](https://github.com/sgkit-dev/bio2zarr/issues/) if you would like to 
+help improve Bio2zarr's local alleles implementation.
+:::
+
+:::{note}
+Only the PL and AD fields are currently supported for local alleles
+data reduction. Please comment on our 
+[local alleles fields tracking issue](https://github.com/sgkit-dev/bio2zarr/issues/315)
+if you would like to see other fields supported, or to help out with 
+implementing more.
+:::
+
+
 ## Copying to object stores
 
 :::{todo}
diff --git a/docs/vcf2zarr/tutorial.md b/docs/vcf2zarr/tutorial.md
@@ -83,6 +83,8 @@ chunks are *larger* than the actual arrays. This is because it's
 a tiny example, with only 9 variants and 3 samples (see the ``shape``
 column), so, for example ``call_genotype`` is only 54 bytes.
 
+
+(sec-vcf2zarr-tutorial-medium-dataset)=
 ## Medium dataset
 
 Conversion in ``vcf2zarr`` is a two step process. First we convert the VCF(s) to