Commit 8fa7a2a

Rearrange doc structure
1 parent 09bb8fb commit 8fa7a2a

7 files changed: +220 -225 lines changed

docs/_toc.yml

Lines changed: 7 additions & 2 deletions
@@ -2,5 +2,10 @@ format: jb-book
 root: intro
 chapters:
 - file: installation
-- file: vcf2zarr
-- file: vcfpartition
+- file: vcf2zarr/overview
+  sections:
+  - file: vcf2zarr/tutorial
+  - file: vcf2zarr/cli_ref
+- file: vcfpartition/overview
+  sections:
+  - file: vcfpartition/cli_ref

docs/vcf2zarr.md

Lines changed: 0 additions & 210 deletions
This file was deleted.

docs/vcf2zarr/cli_ref.md

Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
# CLI Reference

% A note on cross references... There's some weird long-standing problem with
% cross referencing program values in Sphinx, which means that we can't use
% the built-in labels generated by sphinx-click. We can make our own explicit
% targets, but these have to have slightly weird names to avoid conflicting
% with what sphinx-click is doing. So, hence the cmd- prefix.
% Based on: https://github.com/skypilot-org/skypilot/pull/2834

```{eval-rst}

.. _cmd-vcf2zarr:
.. click:: bio2zarr.cli:vcf2zarr_main
   :prog: vcf2zarr
   :nested: short

.. _cmd-vcf2zarr-convert:
.. click:: bio2zarr.cli:convert_vcf
   :prog: vcf2zarr convert
   :nested: full

.. _cmd-vcf2zarr-inspect:
.. click:: bio2zarr.cli:inspect
   :prog: vcf2zarr inspect
   :nested: full

.. _cmd-vcf2zarr-mkschema:
.. click:: bio2zarr.cli:mkschema
   :prog: vcf2zarr mkschema
   :nested: full
```

## Explode

```{eval-rst}
.. _cmd-vcf2zarr-explode:
.. click:: bio2zarr.cli:explode
   :prog: vcf2zarr explode
   :nested: full

.. _cmd-vcf2zarr-dexplode-init:
.. click:: bio2zarr.cli:dexplode_init
   :prog: vcf2zarr dexplode-init
   :nested: full

.. _cmd-vcf2zarr-dexplode-partition:
.. click:: bio2zarr.cli:dexplode_partition
   :prog: vcf2zarr dexplode-partition
   :nested: full

.. _cmd-vcf2zarr-dexplode-finalise:
.. click:: bio2zarr.cli:dexplode_finalise
   :prog: vcf2zarr dexplode-finalise
   :nested: full
```

## Encode

```{eval-rst}
.. click:: bio2zarr.cli:encode
   :prog: vcf2zarr encode
   :nested: full

.. _cmd-vcf2zarr-dencode-init:
.. click:: bio2zarr.cli:dencode_init
   :prog: vcf2zarr dencode-init
   :nested: full

.. _cmd-vcf2zarr-dencode-partition:
.. click:: bio2zarr.cli:dencode_partition
   :prog: vcf2zarr dencode-partition
   :nested: full

.. _cmd-vcf2zarr-dencode-finalise:
.. click:: bio2zarr.cli:dencode_finalise
   :prog: vcf2zarr dencode-finalise
   :nested: full
```

docs/vcf2zarr/overview.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
# vcf2zarr


Convert one or more VCFs to Zarr format:

```
$ vcf2zarr convert <VCF1> <VCF2> <zarr>
```

This runs the whole conversion in a single step.

**Do not use this for anything but the smallest files.**

The recommended approach is to use a multi-stage conversion.

First, convert the VCF into the intermediate format:

```
vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
```

Then, (optionally) inspect this representation to get a feel for your dataset:
```
vcf2zarr inspect tmp/sample.exploded
```

Then, (optionally) generate a conversion schema to describe the corresponding
Zarr arrays:

```
vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
```

View and edit the schema, deleting any columns you don't want, or tweaking
dtypes and compression settings to your taste.

Finally, encode to Zarr:
```
vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
```

Use the ``-p, --worker-processes`` argument to control the number of workers used
in the ``explode`` and ``encode`` phases.
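
The multi-stage conversion above can also be scripted. The following is a minimal sketch, not part of bio2zarr: it only assembles the `explode` and `encode` command lines from the subcommands and options named in this section (`-p`, `-s`) and prints them. The helper name `build_pipeline` and the worker count of 4 are assumptions made for illustration.

```python
import shlex


def build_pipeline(vcfs, icf_path, zarr_path, schema_path=None, workers=None):
    """Return the explode and encode command lines as argv lists (nothing is run)."""
    # -p / --worker-processes is the worker-count option named in the text.
    worker_args = [] if workers is None else ["-p", str(workers)]
    explode = ["vcf2zarr", "explode", *worker_args, *vcfs, icf_path]
    encode = ["vcf2zarr", "encode", *worker_args, icf_path, zarr_path]
    if schema_path is not None:
        # -s passes an edited schema, e.g. one written by `vcf2zarr mkschema`.
        encode += ["-s", schema_path]
    return [explode, encode]


# Hypothetical invocation reusing the example paths from the text.
commands = build_pipeline(
    ["tests/data/vcf/sample.vcf.gz"],
    "tmp/sample.exploded",
    "tmp/sample.zarr",
    schema_path="sample.schema.json",
    workers=4,
)
for argv in commands:
    print(shlex.join(argv))
```

To actually run the steps, each argv list could be passed to `subprocess.run(argv, check=True)`, so that a failed `explode` stops the script before `encode` starts.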

## To be merged with above

The simplest usage is:

```
$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
```


This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
step. As this writes the intermediate columnar format to a temporary directory,
we only recommend this approach for small files (< 1 GB, say).

The recommended approach is to run the conversion in two passes, and
to keep the intermediate columnar format ("exploded") around to facilitate
experimentation with chunk sizes and compression settings:

```
$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
```

The inspect command provides a way to view the contents of an exploded ICF
or Zarr:

```
$ vcf2zarr inspect [PATH]
```

This is useful when tweaking chunk sizes and compression settings to suit
your dataset, using the mkschema command and --schema option to encode:

```
$ vcf2zarr mkschema [ICF_PATH] > schema.json
$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
```

By editing the schema.json file you can drop columns that are not of interest
and edit column-specific compression settings. The --max-variant-chunks option
to encode allows you to try out these options on small subsets, hopefully
arriving at settings with the desired balance of compression and query
performance.

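Dropping columns from the schema can be done by hand or with a short script. The sketch below is illustrative only: the layout of schema.json is not described in this section, so the top-level `"columns"` mapping, the column names, and the `dtype` values are all assumptions. Check the real output of `vcf2zarr mkschema` before adapting it.

```python
import json


def drop_columns(schema, unwanted):
    """Return a copy of the schema without the named columns."""
    # ASSUMPTION: columns live in a top-level "columns" mapping keyed by name.
    kept = {
        name: spec
        for name, spec in schema["columns"].items()
        if name not in unwanted
    }
    return {**schema, "columns": kept}


# Hypothetical stand-in for the parsed contents of schema.json.
schema = {
    "columns": {
        "variant_position": {"dtype": "i4"},
        "call_genotype": {"dtype": "i1"},
        "variant_DP": {"dtype": "i4"},
    }
}

trimmed = drop_columns(schema, {"variant_DP"})
print(json.dumps(trimmed, indent=2))
```

The helper returns a new dictionary rather than modifying its input, so the original schema stays available for comparison while experimenting with different settings.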
