Merge pull request #236 from jeromekelleher/some-more-docs

jeromekelleher · web-flow · commit 164c9233b23c · 2024-06-07T15:46:06.000+01:00
Some more docs
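
For reviewers who want to try the new quickstart in `docs/vcf2zarr/overview.md` from a script, here is a minimal Python sketch that builds the three command lines it documents (`explode`, `encode`, `inspect`). The helper name `quickstart_commands` and its `use_module_form` parameter are hypothetical and not part of bio2zarr. The sketch deliberately avoids importing the `bio2zarr` package, since the docs in this PR state that its Python APIs are internal for now. It only prints the commands, so it runs without `vcf2zarr` installed.

```python
import shlex
import sys


def quickstart_commands(vcf_path, icf_path, vcz_path, use_module_form=False):
    """Return the explode, encode and inspect command lines as argument lists.

    Hypothetical helper for illustration only; it is not part of bio2zarr.
    """
    if use_module_form:
        # Fallback form from the quickstart tip, for when the ``vcf2zarr``
        # executable is not on PATH. sys.executable stands in for ``python``.
        prefix = [sys.executable, "-m", "bio2zarr", "vcf2zarr"]
    else:
        prefix = ["vcf2zarr"]
    return [
        prefix + ["explode", vcf_path, icf_path],  # VCF -> ICF
        prefix + ["encode", icf_path, vcz_path],   # ICF -> VCF Zarr
        prefix + ["inspect", vcz_path],            # summarise the result
    ]


if __name__ == "__main__":
    for cmd in quickstart_commands("sample.vcf.gz", "sample.icf", "sample.vcz"):
        # Print a shell-safe rendering of each command instead of running it.
        print(shlex.join(cmd))
```

To actually run the conversion, each list could be passed to `subprocess.run(cmd, check=True)`, assuming bio2zarr is installed and the sample VCF and its index are in the working directory.
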
diff --git a/docs/Makefile b/docs/Makefile
@@ -20,17 +20,23 @@ dist: ${CASTS}
 
 clean:
 	rm -fR $(BUILDDIR)
+	rm -f _static/*.cast
 
 
 sample.vcf.gz:
 	cp ../tests/data/vcf/sample.vcf.gz ./
 	cp ../tests/data/vcf/sample.vcf.gz.tbi ./
+	# FIXME we should really be running the casts out of the
+	# vcf2zarr directory, but let's get this working for now.
+	cp sample.vcf.gz* vcf2zarr
 
 _static/vcf2zarr_convert.cast: sample.vcf.gz
-	rm -f sample.vcz
+	rm -fR sample.vcz
 	asciinema-automation cast_scripts/vcf2zarr_convert.sh $@
+	cp -R sample.vcz vcf2zarr
 
 _static/vcf2zarr_explode.cast: sample.vcf.gz
 	rm -Rf sample.icf
 	asciinema-automation cast_scripts/vcf2zarr_explode.sh $@
+	cp -R sample.icf vcf2zarr
 
diff --git a/docs/_config.yml b/docs/_config.yml
@@ -4,6 +4,7 @@
 title: bio2zarr Documentation
 author: sgkit developers
 logo: logo.png
+copyright: "2024"
 
 # Force re-execution of notebooks on each build.
 # See https://jupyterbook.org/content/execute.html
diff --git a/docs/installation.md b/docs/installation.md
@@ -22,12 +22,11 @@ vcf2zarr <args>
 and will always work.
 
 :::{note}
-The ``python3 -m bio2zarr vcf2zarr`` for may be replaced with
+The ``python3 -m bio2zarr vcf2zarr`` form may be replaced with
 ``python3 -m bio2zarr.vcf2zarr`` in the near future.
 See GitHub issue [203](https://github.com/sgkit-dev/bio2zarr/issues/203).
 :::
 
-
 :::{warning}
 Windows is not currently supported. Please comment on
 [this issue](https://github.com/sgkit-dev/bio2zarr/issues/174) if you would
diff --git a/docs/intro.md b/docs/intro.md
@@ -8,20 +8,29 @@
 - {ref}`sec-vcf2zarr` converts VCF data to
   [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/) format.
 
-- {ref}`sec-vcfpartition` is a utility to split an input (set of)
-  VCFs into a given number of partitions. This is useful for
-  parallel processing.
+- {ref}`sec-vcfpartition` is a utility to split an input
+  VCF into a given number of partitions. This is useful for
+  parallel processing of VCF data.
 
 ## Development status
 
 `bio2zarr` is in development, contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
 
 Support for converting PLINK data to VCF Zarr is partially implemented,
-and adding BGEN support is also planned. If you would like to see
+and adding BGEN and [tskit](https://tskit.dev/) support is also planned.
+If you would like to see
 support for other formats (or are interested in helping with implementing),
 please open an [issue on Github](https://github.com/sgkit-dev/bio2zarr/issues)
 to discuss!
 
+
 The package is currently focused on command line interfaces, but a
 Python API is also planned.
+
+:::{warning}
+Although it is possible to import the bio2zarr Python package,
+the APIs are purely internal for the moment and will change
+in arbitrary ways. Please don't use them (or open issues about
+them on GitHub).
+:::
diff --git a/docs/vcf2zarr/cli_ref.md b/docs/vcf2zarr/cli_ref.md
@@ -10,11 +10,6 @@
 
 ```{eval-rst}
 
-.. _cmd-vcf2zarr:
-.. click:: bio2zarr.cli:vcf2zarr_main
-   :prog: vcf2zarr
-   :nested: short
-
 .. _cmd-vcf2zarr-convert:
 .. click:: bio2zarr.cli:convert_vcf
    :prog: vcf2zarr convert
diff --git a/docs/vcf2zarr/overview.md b/docs/vcf2zarr/overview.md
@@ -12,117 +12,76 @@ command line options.
 
 ## Quickstart
 
-First {ref}`install bio2zarr<sec-installation>`.
+- First {ref}`install bio2zarr<sec-installation>`.
 
 
-:::{note}
-FINISH ME
-:::
-
-
-
-## How does it work?
-The conversion of VCF data to Zarr is a two-step process:
-
-1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
-    Intermediate Columnar Format (ICF)
-2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr
-
-This two-step process allows `vcf2zarr` to determine the correct
-dimension of Zarr arrays corresponding to each VCF field, and
-to keep memory usage tightly bounded while writing the arrays.
-
-:::{important}
-The intermediate columnar format is not intended for any use
-other than a temporary storage while converting VCF to Zarr.
-The format may change between versions of `bio2zarr`.
-:::
-
-
-## Common options
+- Get some indexed VCF data:
 
 ```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
+curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz
+curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz.tbi
 ```
 
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
+- Convert to VCF Zarr in two steps:
 
-Then, (optionally) inspect this representation to get a feel for your dataset
 ```
-vcf2zarr inspect tmp/sample.exploded
+vcf2zarr explode sample.vcf.gz sample.icf
+vcf2zarr encode sample.icf sample.vcz
 ```
 
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
+:::{tip}
+If the ``vcf2zarr`` executable doesn't work, try ``python -m bio2zarr vcf2zarr``
+instead.
+:::
 
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
+- Have a look at the results:
 
-Finally, encode to Zarr:
 ```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+vcf2zarr inspect sample.vcz
 ```
 
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-## To be merged with above
-
-The simplest usage is:
-
-```
-$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-```
+### What next?
 
+VCF Zarr is a starting point in what we hope will become a diverse ecosystem
+of packages that efficiently process VCF data in Zarr format. However, this
+ecosystem does not exist yet, and there isn't much software available
+for working with the format. As such, VCF Zarr isn't yet suitable for end users
+who just want to get their work done.
 
-This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-step. As this writes the intermediate columnar format to a temporary directory,
-we only recommend this approach for small files (< 1GB, say).
+Having said that, you can:
 
-The recommended approach is to run the conversion in two passes, and
-to keep the intermediate columnar format ("exploded") around to facilitate
-experimentation with chunk sizes and compression settings:
+- Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
+  to see how data is mapped from VCF to Zarr
+- Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
+  one of the other [Zarr implementations](https://zarr.dev/implementations/) to access
+  your data.
+- Use the many functions in our [sgkit](https://sgkit-dev.github.io/sgkit/latest/)
+  sister project to analyse the data. Note that sgkit is under active development,
+  however, and the documentation may not be fully in sync with this project.
 
-```
-$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
-```
 
-The inspect command provides a way to view contents of an exploded ICF
-or Zarr:
 
-```
-$ vcf2zarr inspect [PATH]
-```
+## How does it work?
+The conversion of VCF data to Zarr is a two-step process:
 
-This is useful when tweaking chunk sizes and compression settings to suit
-your dataset, using the mkschema command and --schema option to encode:
+1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
+    Intermediate Columnar Format (ICF)
+2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr
 
-```
-$ vcf2zarr mkschema [ICF_PATH] > schema.json
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-```
+This two-step process allows `vcf2zarr` to determine the correct
+dimension of Zarr arrays corresponding to each VCF field, and
+to keep memory usage tightly bounded while writing the arrays.
 
-By editing the schema.json file you can drop columns that are not of interest
-and edit column specific compression settings. The --max-variant-chunks option
-to encode allows you to try out these options on small subsets, hopefully
-arriving at settings with the desired balance of compression and query
-performance.
+:::{important}
+The intermediate columnar format is not intended for any use
+other than temporary storage while converting VCF to Zarr.
+The format may change between versions of `bio2zarr`.
+:::
 
+Both ``explode`` and ``encode`` can be performed in parallel
+across cores on a single machine (via the ``--worker-processes`` argument)
+or distributed across a cluster by the three-part ``init``, ``partition``
+and ``finalise`` commands.
 
 ## Copying to object stores
 
diff --git a/docs/vcf2zarr/tutorial.md b/docs/vcf2zarr/tutorial.md