Merge pull request #69 from jeromekelleher/pre-release-stuff

jeromekelleher · web-flow · commit 35c62b0115a1 · 2024-03-06T16:15:47.000Z
Pre release stuff
diff --git a/README.md b/README.md
@@ -4,15 +4,37 @@ Convert bioinformatics file formats to Zarr
 Initially supports converting VCF to the
 [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
 
-**This is early alpha-status code: everything is subject to change, a
+**This is early alpha-status code: everything is subject to change,
 and it has not been thoroughly tested**
 
-## Usage
+## Install
+
+```
+$ python3 -m pip install bio2zarr
+```
+
+This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
+into your local Python path. You may need to update your $PATH to call the 
+executables directly.
+
+Alternatively, calling 
+```
+$ python3 -m bio2zarr vcf2zarr <args>
+```
+is equivalent to 
+
+```
+$ vcf2zarr <args>
+```
+and will always work.
+
+
+## vcf2zarr
 
 Convert a VCF to zarr format:
 
 ```
-python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>
+$ vcf2zarr convert <VCF1> <VCF2> <zarr>
 ```
 
 Converts the VCF to zarr format.
@@ -21,33 +43,64 @@ Converts the VCF to zarr format.
 
 The recommended approach is to use a multi-stage conversion
 
-First, convert the VCF into an intermediate columnar format:
+First, convert the VCF into the intermediate format:
 
 ```
-python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
+vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
 ```
 
 Then, (optionally) inspect this representation to get a feel for your dataset
 ```
-python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded
+vcf2zarr inspect tmp/sample.exploded
 ```
 
 Then, (optionally) generate a conversion schema to describe the corresponding
 Zarr arrays:
 
 ```
-python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
+vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
 ```
 
-View and edit the schema, deleting any columns you don't want.
-
-Finally, convert to Zarr
+View and edit the schema, deleting any columns you don't want, or tweaking 
+dtypes and compression settings to your taste.
 
+Finally, encode to Zarr:
 ```
-python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
 ```
 
 Use the ``-p, --worker-processes`` argument to control the number of workers used
-to do zarr encoding.
+in the ``explode`` and ``encode`` phases.
+
+## plink2zarr
+
+Convert a plink ``.bed`` file to zarr format. **This is incomplete**
+
+## vcf_partition
+
+Partition a given VCF file into (approximately) a given number of regions:
+
+```
+vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
+```
+gives
+```
+chr20:1-6799360
+chr20:6799361-14319616
+chr20:14319617-21790720
+chr20:21790721-28770304
+chr20:28770305-31096832
+chr20:31096833-38043648
+chr20:38043649-45580288
+chr20:45580289-52117504
+chr20:52117505-58834944
+chr20:58834945-
+```
+
+These region strings can then be used to split computation of the VCF
+into chunks for parallelisation.
 
+**TODO give a nice example here using xargs**
 
+**WARNING that this does not take into account that indels may overlap 
+partitions and you may count variants twice or more if they do**
diff --git a/bio2zarr/cli.py b/bio2zarr/cli.py
@@ -31,7 +31,7 @@
     help="Chunk size in the samples dimension",
 )
 
-version = click.version_option(version=provenance.__version__)
+version = click.version_option(version=f"bio2zarr {provenance.__version__}")
 
 
 # Note: logging hasn't been implemented in the code at all, this is just
diff --git a/setup.cfg b/setup.cfg
@@ -3,10 +3,11 @@ name = bio2zarr
 author = sgkit Developers
 author_email = project@pystatgen.org
 license = Apache
-description = FIXME
+description = Convert bioinformatics data to Zarr 
 long_description_content_type=text/x-rst
 long_description =
-    FIXME
+    This is an early alpha release for testing and development.
+    **Do not use in production**
 url = https://github.com/pystatgen/bio2zarr
 classifiers =
     Development Status :: 3 - Alpha
@@ -15,7 +16,6 @@ classifiers =
     Intended Audience :: Science/Research
     Programming Language :: Python
     Programming Language :: Python :: 3
-    Programming Language :: Python :: 3.8
     Programming Language :: Python :: 3.9
     Programming Language :: Python :: 3.10
     Programming Language :: Python :: 3.11
@@ -25,7 +25,7 @@ classifiers =
 packages = bio2zarr
 zip_safe = False  # https://mypy.readthedocs.io/en/latest/installed_packages.html
 include_package_data = True
-python_requires = >=3.8
+python_requires = >=3.9
 install_requires =
     numpy
     zarr >= 2.10.0, != 2.11.0, != 2.11.1, != 2.11.2
@@ -45,6 +45,8 @@ setup_requires =
 console_scripts = 
     vcf2zarr = bio2zarr.cli:vcf2zarr
     plink2zarr = bio2zarr.cli:plink2zarr
+    # TODO I don't like this name, anything better?
+    vcf_partition = bio2zarr.cli:vcf_partition
 
 [flake8]
 ignore =
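
Note on the `vcf_partition` output added to the README: the region strings look like `contig:start-end`, with the last one left open-ended (`chr20:58834945-`). As a rough sketch of how a downstream script might consume them before fanning work out, the snippet below parses the strings and checks that consecutive regions abut. The names `Region`, `parse_region` and `is_contiguous` are hypothetical and not part of bio2zarr, and the 1-based inclusive interpretation of the coordinates is an assumption based on the sample output.

```python
from typing import NamedTuple, Optional, Sequence


class Region(NamedTuple):
    """Hypothetical container for one region string (not a bio2zarr type)."""

    contig: str
    start: int           # assumed 1-based and inclusive
    end: Optional[int]   # assumed inclusive; None means "to the end of the contig"


def parse_region(text: str) -> Region:
    """Parse 'contig:start-end' or the open-ended form 'contig:start-'."""
    # Split on the last colon so contig names containing ':' still work.
    contig, _, span = text.rpartition(":")
    start, _, end = span.partition("-")
    return Region(contig, int(start), int(end) if end else None)


def is_contiguous(regions: Sequence[Region]) -> bool:
    """True if each region starts one base after the previous one ends."""
    return all(
        a.contig == b.contig and a.end is not None and b.start == a.end + 1
        for a, b in zip(regions, regions[1:])
    )


# A few of the region strings shown in the README example.
regions = [
    parse_region(s)
    for s in ("chr20:1-6799360", "chr20:6799361-14319616", "chr20:58834945-")
]
```

A check like this only confirms that the coordinate ranges tile the contig. It does not address the README's warning that indels spanning a boundary may be counted in more than one partition.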
