Commit 976b96e

Merge pull request #200 from jeromekelleher/move-readme-to-docs
Move readme to docs
2 parents 7760c57 + b297246 commit 976b96e

9 files changed

+210
-275
lines changed

README.md

Lines changed: 4 additions & 119 deletions
Original file line number | Diff line number | Diff line change
@@ -1,124 +1,9 @@
11
[![CI](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml/badge.svg?branch=main)](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
2+
[![Coverage Status](https://coveralls.io/repos/github/sgkit-dev/bio2zarr/badge.svg)](https://coveralls.io/github/sgkit-dev/bio2zarr)
3+
![PyPI](https://img.shields.io/pypi/v/PACKAGE?label=pypi%20bio2zarr)
4+
![PyPI - Downloads](https://img.shields.io/pypi/dm/bio2zarr)
25

36
# bio2zarr
47
Convert bioinformatics file formats to Zarr
58

6-
Initially supports converting VCF to the
7-
[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
8-
9-
**This is early alpha-status code: everything is subject to change,
10-
and it has not been thoroughly tested**
11-
12-
## Install
13-
14-
```
15-
$ python3 -m pip install bio2zarr
16-
```
17-
18-
This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
19-
into your local Python path. You may need to update your $PATH to call the
20-
executables directly.
21-
22-
Alternatively, calling
23-
```
24-
$ python3 -m bio2zarr vcf2zarr <args>
25-
```
26-
is equivalent to
27-
28-
```
29-
$ vcf2zarr <args>
30-
```
31-
and will always work.
32-
33-
34-
## vcf2zarr
35-
36-
37-
Convert a VCF to zarr format:
38-
39-
```
40-
$ vcf2zarr convert <VCF1> <VCF2> <zarr>
41-
```
42-
43-
Converts the VCF to zarr format.
44-
45-
**Do not use this for anything but the smallest files**
46-
47-
The recommended approach is to use a multi-stage conversion
48-
49-
First, convert the VCF into the intermediate format:
50-
51-
```
52-
vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
53-
```
54-
55-
Then, (optionally) inspect this representation to get a feel for your dataset
56-
```
57-
vcf2zarr inspect tmp/sample.exploded
58-
```
59-
60-
Then, (optionally) generate a conversion schema to describe the corresponding
61-
Zarr arrays:
62-
63-
```
64-
vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
65-
```
66-
67-
View and edit the schema, deleting any columns you don't want, or tweaking
68-
dtypes and compression settings to your taste.
69-
70-
Finally, encode to Zarr:
71-
```
72-
vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
73-
```
74-
75-
Use the ``-p, --worker-processes`` argument to control the number of workers used
76-
in the ``explode`` and ``encode`` phases.
77-
78-
### Shell completion
79-
80-
To enable shell completion for a particular session in Bash do:
81-
82-
```
83-
eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
84-
```
85-
86-
If you add this to your ``.bashrc`` vcf2zarr shell completion should be available
87-
in all new shell sessions.
88-
89-
See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
90-
for instructions on how to enable completion in other shells.
91-
a
92-
93-
## plink2zarr
94-
95-
Convert a plink ``.bed`` file to zarr format. **This is incomplete**
96-
97-
## vcf_partition
98-
99-
Partition a given VCF file into (approximately) a given number of regions:
100-
101-
```
102-
vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
103-
```
104-
gives
105-
```
106-
chr20:1-6799360
107-
chr20:6799361-14319616
108-
chr20:14319617-21790720
109-
chr20:21790721-28770304
110-
chr20:28770305-31096832
111-
chr20:31096833-38043648
112-
chr20:38043649-45580288
113-
chr20:45580289-52117504
114-
chr20:52117505-58834944
115-
chr20:58834945-
116-
```
117-
118-
These region strings can then be used to split computation of the VCF
119-
into chunks for parallelisation.
120-
121-
**TODO give a nice example here using xargs**
122-
123-
**WARNING that this does not take into account that indels may overlap
124-
partitions and you may count variants twice or more if they do**
9+
See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details.
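
Note on the removed `vcf_partition` section: the region strings it listed (for example `chr20:1-6799360`, with an open-ended final region `chr20:58834945-`) would usually need to be parsed before being passed to per-region jobs. The following is a minimal Python sketch of that parsing step, not code from bio2zarr. `parse_region` is a hypothetical helper, and it assumes the last colon in the string separates the contig name from the interval.

```python
from typing import Optional, Tuple


def parse_region(region: str) -> Tuple[str, int, Optional[int]]:
    """Split 'contig:start-end' into its parts.

    The end coordinate is None for an open-ended region such as
    'chr20:58834945-'. Hypothetical helper, not part of bio2zarr.
    """
    # Split on the last colon so the interval is always the final field.
    contig, _, interval = region.rpartition(":")
    if not contig:
        raise ValueError(f"missing contig in region string: {region!r}")
    start_text, sep, end_text = interval.partition("-")
    if not sep:
        raise ValueError(f"missing '-' in region string: {region!r}")
    start = int(start_text)
    end = int(end_text) if end_text else None
    return contig, start, end


# Region strings copied from the example output in the removed README text.
regions = ["chr20:1-6799360", "chr20:6799361-14319616", "chr20:58834945-"]
parsed = [parse_region(r) for r in regions]
```

As the removed README text warned, partitions built this way do not account for indels that overlap a boundary, so a per-region job may count such variants more than once.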

bio2zarr/cli.py

Lines changed: 2 additions & 45 deletions
@@ -459,51 +459,8 @@ def vcf2zarr_main():
459459
"""
460460
Convert VCF file(s) to the vcfzarr format.
461461
462-
The simplest usage is:
463-
464-
$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
465-
466-
This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
467-
step. As this writes the intermediate columnar format to a temporary directory,
468-
we only recommend this approach for small files (< 1GB, say).
469-
470-
The recommended approach is to run the conversion in two passes, and
471-
to keep the intermediate columnar format ("exploded") around to facilitate
472-
experimentation with chunk sizes and compression settings:
473-
474-
\b
475-
$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
476-
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
477-
478-
The inspect command provides a way to view contents of an exploded ICF
479-
or Zarr:
480-
481-
$ vcf2zarr inspect [PATH]
482-
483-
This is useful when tweaking chunk sizes and compression settings to suit
484-
your dataset, using the mkschema command and --schema option to encode:
485-
486-
\b
487-
$ vcf2zarr mkschema [ICF_PATH] > schema.json
488-
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
489-
490-
By editing the schema.json file you can drop columns that are not of interest
491-
and edit column specific compression settings. The --max-variant-chunks option
492-
to encode allows you to try out these options on small subsets, hopefully
493-
arriving at settings with the desired balance of compression and query
494-
performance.
495-
496-
ADVANCED USAGE
497-
498-
For very large datasets (terabyte scale) it may be necessary to distribute the
499-
explode and encode steps across a cluster:
500-
501-
\b
502-
$ vcf2zarr dexplode-init [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH] [NUM_PARTITIONS]
503-
$ vcf2zarr dexplode-partition [ICF_PATH] [PARTITION_INDEX]
504-
$ vcf2zarr dexplode-finalise [ICF_PATH]
505-
506-
See the online documentation at [FIXME] for more details on distributed explode.
462+
See the online documentation at https://sgkit-dev.github.io/bio2zarr/
463+
for more information.
507464
"""
508465

509466
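
Note on the removed docstring: the two-pass workflow it described (`explode` followed by `encode`, optionally with `--schema`) can be scripted. The sketch below only builds the argument lists and does not run anything. `two_pass_commands` and the file names are hypothetical; the subcommands and the `--schema` option are taken from the docstring text removed above.

```python
from typing import List, Optional, Sequence


def two_pass_commands(
    vcf_paths: Sequence[str],
    icf_path: str,
    zarr_path: str,
    schema_path: Optional[str] = None,
) -> List[List[str]]:
    """Build argv lists for the explode and encode passes.

    Hypothetical helper: it mirrors the command shapes shown in the
    removed docstring and does not call vcf2zarr itself.
    """
    # Pass 1: one or more VCF files into the intermediate columnar format.
    explode = ["vcf2zarr", "explode", *vcf_paths, icf_path]
    # Pass 2: intermediate columnar format into Zarr.
    encode = ["vcf2zarr", "encode", icf_path, zarr_path]
    if schema_path is not None:
        encode += ["--schema", schema_path]
    return [explode, encode]


# Placeholder file names, chosen for illustration only.
commands = two_pass_commands(
    ["a.vcf.gz", "b.vcf.gz"], "data.icf", "data.zarr", schema_path="schema.json"
)
for argv in commands:
    print(" ".join(argv))
```

On a machine with bio2zarr installed, each list could be passed to `subprocess.run(argv, check=True)`. The `mkschema` step is left out because it writes the schema to standard output and would need a redirect.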

docs/_toc.yml

Lines changed: 3 additions & 1 deletion
@@ -1,5 +1,7 @@
11
format: jb-book
22
root: intro
33
chapters:
4-
- file: vcf2zarr_tutorial
4+
- file: installation
5+
- file: vcf2zarr
6+
- file: vcfpartition
57
- file: cli

docs/cli.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
# Command Line Interface
1+
# CLI Reference
22

33
% A note on cross references... There's some weird long-standing problem with
44
% cross referencing program values in Sphinx, which means that we can't use

docs/installation.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
# Installation
2+
3+
4+
```
5+
$ python3 -m pip install bio2zarr
6+
```
7+
8+
This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
9+
into your local Python path. You may need to update your $PATH to call the
10+
executables directly.
11+
12+
Alternatively, calling
13+
```
14+
$ python3 -m bio2zarr vcf2zarr <args>
15+
```
16+
is equivalent to
17+
18+
```
19+
$ vcf2zarr <args>
20+
```
21+
and will always work.
22+
23+
24+
## Shell completion
25+
26+
To enable shell completion for a particular session in Bash do:
27+
28+
```
29+
eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
30+
```
31+
32+
If you add this to your ``.bashrc`` vcf2zarr shell completion should be available
33+
in all new shell sessions.
34+
35+
See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
36+
for instructions on how to enable completion in other shells.

docs/intro.md

Lines changed: 5 additions & 72 deletions
@@ -1,76 +1,9 @@
1-
# bio2zarr Documentation
1+
# bio2zarr
22

3-
`bio2zarr` efficiently converts common bioinformatics formats to
4-
[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
5-
VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/).
3+
`bio2zarr` efficiently converts common bioinformatics formats to
4+
[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
5+
VCF to the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/).
66

7-
`bio2zarr` is in early alpha development, contributions, feedback and issues are welcome
7+
`bio2zarr` is in development; contributions, feedback and issues are welcome
88
at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
99

10-
## Installation
11-
`bio2zarr` can be installed from PyPI using pip:
12-
13-
```bash
14-
$ python3 -m pip install bio2zarr
15-
```
16-
17-
This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
18-
into your local Python path. You may need to update your $PATH to call the
19-
executables directly.
20-
21-
Alternatively, calling
22-
```
23-
$ python3 -m bio2zarr vcf2zarr <args>
24-
```
25-
is equivalent to
26-
27-
```
28-
$ vcf2zarr <args>
29-
```
30-
and will always work.
31-
32-
## Basic vcf2zarr usage
33-
For modest VCF files (up to a few GB), a single command can be used to convert a VCF file
34-
(or set of VCF files) using the {ref}`convert<cmd-vcf2zarr-convert>` command:
35-
36-
```bash
37-
$ vcf2zarr convert <VCF1> <VCF2> ... <VCFN> <zarr>
38-
```
39-
40-
For larger files a multi-step process is recommended.
41-
42-
43-
First, convert the VCF into the intermediate format:
44-
45-
```bash
46-
$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
47-
```
48-
49-
Then, (optionally) inspect this representation to get a feel for your dataset
50-
```bash
51-
$ vcf2zarr inspect tmp/sample.exploded
52-
```
53-
54-
Then, (optionally) generate a conversion schema to describe the corresponding
55-
Zarr arrays:
56-
57-
```bash
58-
$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
59-
```
60-
61-
View and edit the schema, deleting any columns you don't want, or tweaking
62-
dtypes and compression settings to your taste.
63-
64-
Finally, encode to Zarr:
65-
```bash
66-
$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
67-
```
68-
69-
Use the ``-p, --worker-processes`` argument to control the number of workers used
70-
in the ``explode`` and ``encode`` phases.
71-
72-
73-
74-
75-
```{tableofcontents}
76-
```
