Merge pull request #200 from jeromekelleher/move-readme-to-docs

jeromekelleher · web-flow · commit 976b96eafaa6 · 2024-05-14T09:59:19.000+01:00
Move readme to docs
diff --git a/README.md b/README.md
@@ -1,124 +1,9 @@
 [![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
+[![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
+![PyPI](https://img.shields.io/pypi/v/PACKAGE?label=pypi%20bio2zarr)
+![PyPI - Downloads](https://img.shields.io/pypi/dm/bio2zarr)
 
 # bio2zarr
 Convert bioinformatics file formats to Zarr
 
-Initially supports converting VCF to the
-[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
-
-**This is early alpha-status code: everything is subject to change,
-and it has not been thoroughly tested**
-
-## Install
-
-```
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the 
-executables directly.
-
-Alternatively, calling 
-```
-$ python3 -m bio2zarr vcf2zarr <args>
-```
-is equivalent to 
-
-```
-$ vcf2zarr <args>
-```
-and will always work.
-
-
-## vcf2zarr
-
-
-Convert a VCF to zarr format:
-
-```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
-```
-
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```
-vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking 
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-### Shell completion
-
-To enable shell completion for a particular session in Bash do:
-
-```
-eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)" 
-```
-
-If you add this to your ``.bashrc`` vcf2zarr shell completion should available
-in all new shell sessions.
-
-See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
-for instructions on how to enable completion in other shells.
-a
-
-## plink2zarr
-
-Convert a plink ``.bed`` file to zarr format. **This is incomplete**
-
-## vcf_partition
-
-Partition a given VCF file into (approximately) a give number of regions:
-
-```
-vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
-```
-gives
-```
-chr20:1-6799360
-chr20:6799361-14319616
-chr20:14319617-21790720
-chr20:21790721-28770304
-chr20:28770305-31096832
-chr20:31096833-38043648
-chr20:38043649-45580288
-chr20:45580289-52117504
-chr20:52117505-58834944
-chr20:58834945-
-```
-
-These reqion strings can then be used to split computation of the VCF 
-into chunks for parallelisation.
-
-**TODO give a nice example here using xargs**
-
-**WARNING that this does not take into account that indels may overlap 
-partitions and you may count variants twice or more if they do**
+See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details.
diff --git a/bio2zarr/cli.py b/bio2zarr/cli.py
@@ -459,51 +459,8 @@ def vcf2zarr_main():
     """
     Convert VCF file(s) to the vcfzarr format.
 
-    The simplest usage is:
-
-    $ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-
-    This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-    step. As this writes the intermediate columnar format to a temporary directory,
-    we only recommend this approach for small files (< 1GB, say).
-
-    The recommended approach is to run the conversion in two passes, and
-    to keep the intermediate columnar format ("exploded") around to facilitate
-    experimentation with chunk sizes and compression settings:
-
-    \b
-    $ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-    $ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
-
-    The inspect command provides a way to view contents of an exploded ICF
-    or Zarr:
-
-    $ vcf2zarr inspect [PATH]
-
-    This is useful when tweaking chunk sizes and compression settings to suit
-    your dataset, using the mkschema command and --schema option to encode:
-
-    \b
-    $ vcf2zarr mkschema [ICF_PATH] > schema.json
-    $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-
-    By editing the schema.json file you can drop columns that are not of interest
-    and edit column specific compression settings. The --max-variant-chunks option
-    to encode allows you to try out these options on small subsets, hopefully
-    arriving at settings with the desired balance of compression and query
-    performance.
-
-    ADVANCED USAGE
-
-    For very large datasets (terabyte scale) it may be necessary to distribute the
-    explode and encode steps across a cluster:
-
-    \b
-    $ vcf2zarr dexplode-init [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH] [NUM_PARTITIONS]
-    $ vcf2zarr dexplode-partition [ICF_PATH] [PARTITION_INDEX]
-    $ vcf2zarr dexplode-finalise [ICF_PATH]
-
-    See the online documentation at [FIXME] for more details on distributed explode.
+    See the online documentation at https://sgkit-dev.github.io/bio2zarr/
+    for more information.
     """
 
 
diff --git a/docs/_toc.yml b/docs/_toc.yml
@@ -1,5 +1,7 @@
 format: jb-book
 root: intro
 chapters:
-- file: vcf2zarr_tutorial
+- file: installation
+- file: vcf2zarr
+- file: vcfpartition
 - file: cli
diff --git a/docs/cli.md b/docs/cli.md
@@ -1,4 +1,4 @@
-# Command Line Interface
+# CLI Reference
 
 % A note on cross references... There's some weird long-standing problem with
 % cross referencing program values in Sphinx, which means that we can't use
diff --git a/docs/installation.md b/docs/installation.md
@@ -0,0 +1,36 @@
+# Installation
+
+
+```
+$ python3 -m pip install bio2zarr
+```
+
+This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+into your local Python path. You may need to update your $PATH to call the
+executables directly.
+
+Alternatively, calling
+```
+$ python3 -m bio2zarr vcf2zarr <args>
+```
+is equivalent to
+
+```
+$ vcf2zarr <args>
+```
+and will always work.
+
+
+## Shell completion
+
+To enable shell completion for a particular session in Bash do:
+
+```
+eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
+```
+
+If you add this to your ``.bashrc`` vcf2zarr shell completion should available
+in all new shell sessions.
+
+See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
+for instructions on how to enable completion in other shells.
diff --git a/docs/intro.md b/docs/intro.md
@@ -1,76 +1,9 @@
-# bio2zarr Documentation
+# bio2zarr
 
-`bio2zarr` efficiently converts common bioinformatics formats to 
-[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting 
-VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/).
+`bio2zarr` efficiently converts common bioinformatics formats to
+[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
+VCF to the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/).
 
-`bio2zarr` is in early alpha development, contributions, feedback and issues are welcome
+`bio2zarr` is in development, contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
 
-## Installation
-`bio2zarr` can be installed from PyPI using pip:
-
-```bash
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the 
-executables directly.
-
-Alternatively, calling 
-```
-$ python3 -m bio2zarr vcf2zarr <args>
-```
-is equivalent to 
-
-```
-$ vcf2zarr <args>
-```
-and will always work.
-
-## Basic vcf2zarr usage
-For modest VCF files (up to a few GB), a single command can be used to convert a VCF file
-(or set of VCF files) using the {ref}`convert<cmd-vcf2zarr-convert>` command:
-
-```bash
-$ vcf2zarr convert <VCF1> <VCF2> ... <VCFN> <zarr>
-```
-
-For larger files a multi-step process is recommended. 
-
-
-First, convert the VCF into the intermediate format:
-
-```bash
-$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```bash
-$ vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```bash
-$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking 
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```bash
-$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-
-
-
-```{tableofcontents}
-```
diff --git a/docs/vcf2zarr.md b/docs/vcf2zarr.md
diff --git a/docs/vcf2zarr_tutorial.md b/docs/vcf2zarr_tutorial.md
diff --git a/docs/vcfpartition.md b/docs/vcfpartition.md
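
Note on the region strings relocated from the README to `docs/vcfpartition.md`: the sample output in the removed README text uses the form `contig:start-end`, with the final region left open-ended (`chr20:58834945-`). A minimal sketch of how such strings might be parsed before being handed to a parallel job runner is given below. The names `Region` and `parse_region` are hypothetical and are not part of bio2zarr; the 1-based inclusive interpretation of the coordinates is an assumption based on the usual VCF region convention.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    """Hypothetical container for one region string (not a bio2zarr type)."""

    contig: str
    start: int  # assumed 1-based, inclusive
    end: Optional[int]  # None means "to the end of the contig"


def parse_region(text: str) -> Region:
    """Parse 'contig:start-end' or the open-ended form 'contig:start-'."""
    # Split on the last ':' so contig names that contain ':' stay intact.
    contig, sep, span = text.rpartition(":")
    if not sep or not contig:
        raise ValueError(f"expected 'contig:start-end', got {text!r}")
    start, _, end = span.partition("-")
    return Region(contig, int(start), int(end) if end else None)


# Region strings copied from the sample output in the removed README text.
regions = [
    parse_region(s)
    for s in ["chr20:1-6799360", "chr20:6799361-14319616", "chr20:58834945-"]
]
```

As the removed README warned, partitions of this kind do not account for indels that overlap a boundary, so per-region results may count some variants more than once.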
