sgkit-dev
diff --git a/docs/_toc.yml b/docs/_toc.yml
Lines changed: 7 additions & 2 deletions
diff --git a/docs/vcf2zarr.md b/docs/vcf2zarr.md
Lines changed: 0 additions & 210 deletions
diff --git a/docs/vcf2zarr/cli_ref.md b/docs/vcf2zarr/cli_ref.md
Lines changed: 79 additions & 0 deletions
diff --git a/docs/vcf2zarr/overview.md b/docs/vcf2zarr/overview.md
Lines changed: 87 additions & 0 deletions
@@ -2,5 +2,10 @@ format: jb-book
 root: intro
 chapters:
 - file: installation
-- file: vcf2zarr
-- file: vcfpartition
+- file: vcf2zarr/overview
+  sections:
+  - file: vcf2zarr/tutorial
+  - file: vcf2zarr/cli_ref
+- file: vcfpartition/overview
+  sections:
+  - file: vcfpartition/cli_ref
@@ -0,0 +1,79 @@
+# CLI Reference
+
+% A note on cross references... There's some weird long-standing problem with
+% cross referencing program values in Sphinx, which means that we can't use
+% the built-in labels generated by sphinx-click. We can make our own explicit
+% targets, but these have to have slightly weird names to avoid conflicting
+% with what sphinx-click is doing. So, hence the cmd- prefix.
+% Based on: https://github.com/skypilot-org/skypilot/pull/2834
+
+```{eval-rst}
+
+.. _cmd-vcf2zarr:
+.. click:: bio2zarr.cli:vcf2zarr_main
+   :prog: vcf2zarr
+   :nested: short
+
+.. _cmd-vcf2zarr-convert:
+.. click:: bio2zarr.cli:convert_vcf
+   :prog: vcf2zarr convert
+   :nested: full
+
+.. _cmd-vcf2zarr-inspect:
+.. click:: bio2zarr.cli:inspect
+   :prog: vcf2zarr inspect
+   :nested: full
+
+.. _cmd-vcf2zarr-mkschema:
+.. click:: bio2zarr.cli:mkschema
+   :prog: vcf2zarr mkschema
+   :nested: full
+```
+
+## Explode
+
+```{eval-rst}
+.. _cmd-vcf2zarr-explode:
+.. click:: bio2zarr.cli:explode
+   :prog: vcf2zarr explode
+   :nested: full
+
+.. _cmd-vcf2zarr-dexplode-init:
+.. click:: bio2zarr.cli:dexplode_init
+   :prog: vcf2zarr dexplode-init
+   :nested: full
+
+.. _cmd-vcf2zarr-dexplode-partition:
+.. click:: bio2zarr.cli:dexplode_partition
+   :prog: vcf2zarr dexplode-partition
+   :nested: full
+
+.. _cmd-vcf2zarr-dexplode-finalise:
+.. click:: bio2zarr.cli:dexplode_finalise
+   :prog: vcf2zarr dexplode-finalise
+   :nested: full
+```
+
+## Encode
+
+```{eval-rst}
+.. click:: bio2zarr.cli:encode
+   :prog: vcf2zarr encode
+   :nested: full
+
+.. _cmd-vcf2zarr-dencode-init:
+.. click:: bio2zarr.cli:dencode_init
+   :prog: vcf2zarr dencode-init
+   :nested: full
+
+.. _cmd-vcf2zarr-dencode-partition:
+.. click:: bio2zarr.cli:dencode_partition
+   :prog: vcf2zarr dencode-partition
+   :nested: full
+
+.. _cmd-vcf2zarr-dencode-finalise:
+.. click:: bio2zarr.cli:dencode_finalise
+   :prog: vcf2zarr dencode-finalise
+   :nested: full
+```
+
@@ -0,0 +1,87 @@
+# vcf2zarr
+
+
+Convert one or more VCF files to Zarr format in a single step:
+
+```
+$ vcf2zarr convert <VCF1> <VCF2> <zarr>
+```
+
+This runs the whole conversion, from VCF to Zarr, in one command.
+
+**Do not use this for anything but the smallest files.**
+
+For larger files, the recommended approach is a multi-stage conversion.
+
+First, convert the VCF into the intermediate columnar format (ICF):
+
+```
+vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+```
+
+Then, optionally, inspect this representation to get a feel for your dataset:
+```
+vcf2zarr inspect tmp/sample.exploded
+```
+
+Next, optionally, generate a conversion schema that describes the corresponding
+Zarr arrays:
+
+```
+vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+```
+
+View and edit the schema, deleting any columns you don't want, or tweaking
+dtypes and compression settings to your taste.
+
+Finally, encode to Zarr:
+```
+vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+```
+
+Use the `-p, --worker-processes` option to control the number of workers used
+in the `explode` and `encode` phases.
+
+## To be merged with above
+
+The simplest usage is:
+
+```
+$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
+```
+
+
+This converts the indexed VCF (or BCF) into the vcfzarr format in a single
+step. Because it writes the intermediate columnar format to a temporary
+directory, we recommend this approach only for small files (< 1GB, say).
+
+The recommended approach is to run the conversion in two passes, and
+to keep the intermediate columnar format ("exploded") around to facilitate
+experimentation with chunk sizes and compression settings:
+
+```
+$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
+```
+
+The `inspect` command shows the contents of an exploded ICF or a Zarr
+store:
+
+```
+$ vcf2zarr inspect [PATH]
+```
+
+This is useful when tweaking chunk sizes and compression settings to suit your
+dataset, using the `mkschema` command and the `--schema` option to `encode`:
+
+```
+$ vcf2zarr mkschema [ICF_PATH] > schema.json
+$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
+```
+
+By editing the `schema.json` file you can drop columns that are not of interest
+and edit column-specific compression settings. The `--max-variant-chunks` option
+to `encode` lets you try out these settings on small subsets of the data, so
+that you can arrive at the desired balance of compression and query
+performance.
+
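The multi-stage workflow described in the overview can be sketched as a small dry-run helper that assembles the commands in order without running them. This is an illustration only: `build_pipeline` and its parameters are hypothetical names, not part of bio2zarr, and the only subcommands and options used are those shown in the text above (`explode`, `inspect`, `mkschema`, `encode`, `-p` and `--schema`).

```python
import shlex


def build_pipeline(vcfs, icf_path, zarr_path, schema_path=None, workers=None):
    """Return the vcf2zarr commands for a multi-stage conversion, in run order.

    Hypothetical helper: it only builds argument lists and never runs them.
    """
    # -p / --worker-processes applies to the explode and encode phases.
    worker_args = [] if workers is None else ["-p", str(workers)]

    steps = [
        # Stage 1: explode the VCF(s) into the intermediate columnar format.
        ["vcf2zarr", "explode", *vcfs, icf_path, *worker_args],
        # Optional: look at the exploded data before choosing settings.
        ["vcf2zarr", "inspect", icf_path],
    ]

    encode = ["vcf2zarr", "encode", icf_path, zarr_path, *worker_args]
    if schema_path is not None:
        # mkschema writes the schema to stdout; redirect it to schema_path
        # when running the command, then edit the file before encoding.
        steps.append(["vcf2zarr", "mkschema", icf_path])
        encode += ["--schema", schema_path]

    # Final stage: encode the intermediate format to Zarr.
    steps.append(encode)
    return steps


if __name__ == "__main__":
    for cmd in build_pipeline(
        ["tests/data/vcf/sample.vcf.gz"],
        "tmp/sample.exploded",
        "tmp/sample.zarr",
        schema_path="sample.schema.json",
        workers=4,
    ):
        print(shlex.join(cmd))
```

Printing the commands rather than executing them keeps the sketch independent of whether `vcf2zarr` is installed. To run a step for real, each list could be passed to `subprocess.run`, with the `mkschema` output redirected to the schema file.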