Commit 164c923

Merge pull request #236 from jeromekelleher/some-more-docs
Some more docs
2 parents d3155f1 + 5d26a5b commit 164c923

7 files changed: +198 -114 lines changed

docs/Makefile

Lines changed: 7 additions & 1 deletion
@@ -20,17 +20,23 @@ dist: ${CASTS}
 
 clean:
 	rm -fR $(BUILDDIR)
+	rm -f _static/*.cast
 
 
 sample.vcf.gz:
 	cp ../tests/data/vcf/sample.vcf.gz ./
 	cp ../tests/data/vcf/sample.vcf.gz.tbi ./
+	# FIXME we should really running the casts out of the
+	# vcf2zarr directory, but let's get this working for now.
+	cp sample.vcf.gz* vcf2zarr
 
 _static/vcf2zarr_convert.cast: sample.vcf.gz
-	rm -f sample.vcz
+	rm -fR sample.vcz
 	asciinema-automation cast_scripts/vcf2zarr_convert.sh $@
+	cp -R sample.vcz vcf2zarr
 
 _static/vcf2zarr_explode.cast: sample.vcf.gz
 	rm -Rf sample.icf
 	asciinema-automation cast_scripts/vcf2zarr_explode.sh $@
+	cp -R sample.icf vcf2zarr
 

docs/_config.yml

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 title: bio2zarr Documentation
 author: sgkit developers
 logo: logo.png
+copyright: "2024"
 
 # Force re-execution of notebooks on each build.
 # See https://jupyterbook.org/content/execute.html

docs/installation.md

Lines changed: 1 addition & 2 deletions
@@ -22,12 +22,11 @@ vcf2zarr <args>
 and will always work.
 
 :::{note}
-The ``python3 -m bio2zarr vcf2zarr`` for may be replaced with
+The ``python3 -m bio2zarr vcf2zarr`` form may be replaced with
 ``python3 -m bio2zarr.vcf2zarr`` in the near future.
 See GitHub issue [203](https://github.com/sgkit-dev/bio2zarr/issues/203).
 :::
 
-
 :::{warning}
 Windows is not currently supported. Please comment on
 [this issue](https://github.com/sgkit-dev/bio2zarr/issues/174) if you would

docs/intro.md

Lines changed: 13 additions & 4 deletions
@@ -8,20 +8,29 @@
 - {ref}`sec-vcf2zarr` converts VCF data to
 [VCF Zarr](https://github.com/sgkit-dev/vcf-zarr-spec/) format.
 
-- {ref}`sec-vcfpartition` is a utility to split an input (set of)
-VCFs into a given number of partitions. This is useful for
-parallel processing.
+- {ref}`sec-vcfpartition` is a utility to split an input
+VCF into a given number of partitions. This is useful for
+parallel processing of VCF data.
 
 ## Development status
 
 `bio2zarr` is in development, contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).
 
 Support for converting PLINK data to VCF Zarr is partially implemented,
-and adding BGEN support is also planned. If you would like to see
+and adding BGEN and [tskit](https://tskit.dev/) support is also planned.
+If you would like to see
 support for other formats (or an interested in helping with implementing),
 please open an [issue on Github](https://github.com/sgkit-dev/bio2zarr/issues)
 to discuss!
 
+
 The package is currently focused on command line interfaces, but a
 Python API is also planned.
+
+:::{warning}
+Although it is possible to import the bio2zarr Python package
+the APIs are purely internal for the moment and will change
+in arbitrary ways. Please don't use them (or open issues about
+them on GitHub).
+:::

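Because the warning added to `docs/intro.md` asks users not to import the package, a script that needs to automate conversion can shell out to the command line instead. The following is a minimal sketch of that approach, not part of this commit. It assumes the `explode`, `encode` and `inspect` subcommands and the `python -m bio2zarr vcf2zarr` fallback exactly as they appear in the overview quickstart changed by this commit; the helper names `vcf2zarr_prefix`, `quickstart_commands` and `run_quickstart` are hypothetical.

```python
import shutil
import subprocess
import sys


def vcf2zarr_prefix():
    """Return the argv prefix used to invoke vcf2zarr.

    Prefers the ``vcf2zarr`` executable and falls back to the
    ``python -m bio2zarr vcf2zarr`` form if it is not on the PATH.
    """
    if shutil.which("vcf2zarr") is not None:
        return ["vcf2zarr"]
    return [sys.executable, "-m", "bio2zarr", "vcf2zarr"]


def quickstart_commands(vcf, icf, vcz, prefix=("vcf2zarr",)):
    """Build the explode, encode and inspect command lines as argv lists."""
    prefix = list(prefix)
    return [
        prefix + ["explode", vcf, icf],
        prefix + ["encode", icf, vcz],
        prefix + ["inspect", vcz],
    ]


def run_quickstart(vcf, icf, vcz):
    """Run each step in order, stopping at the first failure."""
    for argv in quickstart_commands(vcf, icf, vcz, prefix=vcf2zarr_prefix()):
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only print the commands here; call run_quickstart() to execute them.
    for argv in quickstart_commands("sample.vcf.gz", "sample.icf", "sample.vcz"):
        print(" ".join(argv))
```

Building the argument lists separately from running them keeps the sketch checkable without bio2zarr installed, and passing lists rather than a shell string avoids quoting problems with unusual file names.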
docs/vcf2zarr/cli_ref.md

Lines changed: 0 additions & 5 deletions
@@ -10,11 +10,6 @@
 
 ```{eval-rst}
 
-.. _cmd-vcf2zarr:
-.. click:: bio2zarr.cli:vcf2zarr_main
-   :prog: vcf2zarr
-   :nested: short
-
 .. _cmd-vcf2zarr-convert:
 .. click:: bio2zarr.cli:convert_vcf
    :prog: vcf2zarr convert

docs/vcf2zarr/overview.md

Lines changed: 45 additions & 86 deletions
@@ -12,117 +12,76 @@ command line options.
 
 ## Quickstart
 
-First {ref}`install bio2zarr<sec-installation>`.
+- First {ref}`install bio2zarr<sec-installation>`.
 
 
-:::{note}
-FINISH ME
-:::
-
-
-
-## How does it work?
-The conversion of VCF data to Zarr is a two-step process:
-
-1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
-Intermediate Columnar Format (ICF)
-2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr
-
-This two-step process allows `vcf2zarr` to determine the correct
-dimension of Zarr arrays corresponding to each VCF field, and
-to keep memory usage tightly bounded while writing the arrays.
-
-:::{important}
-The intermediate columnar format is not intended for any use
-other than a temporary storage while converting VCF to Zarr.
-The format may change between versions of `bio2zarr`.
-:::
-
-
-## Common options
+- Get some indexed VCF data:
 
 ```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
+curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz
+curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz.tbi
 ```
 
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
+- Convert to VCF Zarr in two steps:
 
-Then, (optionally) inspect this representation to get a feel for your dataset
 ```
-vcf2zarr inspect tmp/sample.exploded
+vcf2zarr explode sample.vcf.gz sample.icf
+vcf2zarr encode sample.icf sample.vcz
 ```
 
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
+:::{tip}
+If the ``vcf2zarr`` executable doesn't work, try ``python -m bio2zarr vcf2zarr``
+instead.
+:::
 
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
+- Have a look at the results:
 
-Finally, encode to Zarr:
 ```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+vcf2zarr inspect sample.vcz
 ```
 
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-## To be merged with above
-
-The simplest usage is:
-
-```
-$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-```
+### What next?
 
+VCF Zarr is a starting point in what we hope will become a diverse ecosytem
+of packages that efficiently process VCF data in Zarr format. However, this
+ecosytem does not exist yet, and there isn't much software available
+for working with the format. As such, VCF Zarr isn't suitable for end users
+who just want to get their work done for the moment.
 
-This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-step. As this writes the intermediate columnar format to a temporary directory,
-we only recommend this approach for small files (< 1GB, say).
+Having said that, you can:
 
-The recommended approach is to run the conversion in two passes, and
-to keep the intermediate columnar format ("exploded") around to facilitate
-experimentation with chunk sizes and compression settings:
+- Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
+to see how data is mapped from VCF to Zarr
+- Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
+one of the other [Zarr implementations](https://zarr.dev/implementations/) to access
+your data.
+- Use the many functions in our [sgkit](https://sgkit-dev.github.io/sgkit/latest/)
+sister project to analyse the data. Note that sgkit is under active development,
+however, and the documentation may not be fully in-sync with this project.
 
-```
-$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
-```
 
-The inspect command provides a way to view contents of an exploded ICF
-or Zarr:
 
-```
-$ vcf2zarr inspect [PATH]
-```
+## How does it work?
+The conversion of VCF data to Zarr is a two-step process:
 
-This is useful when tweaking chunk sizes and compression settings to suit
-your dataset, using the mkschema command and --schema option to encode:
+1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
+Intermediate Columnar Format (ICF)
+2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr
 
-```
-$ vcf2zarr mkschema [ICF_PATH] > schema.json
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-```
+This two-step process allows `vcf2zarr` to determine the correct
+dimension of Zarr arrays corresponding to each VCF field, and
+to keep memory usage tightly bounded while writing the arrays.
 
-By editing the schema.json file you can drop columns that are not of interest
-and edit column specific compression settings. The --max-variant-chunks option
-to encode allows you to try out these options on small subsets, hopefully
-arriving at settings with the desired balance of compression and query
-performance.
+:::{important}
+The intermediate columnar format is not intended for any use
+other than a temporary storage while converting VCF to Zarr.
+The format may change between versions of `bio2zarr`.
+:::
 
+Both ``explode`` and ``encode`` can be performed in parallel
+across cores on a single machine (via the ``--worker-processes`` argument)
+or distributed across a cluster by the three-part ``init``, ``partition``
+and ``finalise`` commands.
 
 ## Copying to object stores
 

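The "How does it work?" text in `overview.md` says the two-step process lets `vcf2zarr` determine the correct array dimensions for each field before writing. The toy sketch below illustrates that idea only; it is not bio2zarr's implementation. It assumes records are plain dictionaries held in memory (the real ICF is stored on disk), and the field names, the `explode`/`encode` function names and the padding value `-2` are all chosen for illustration.

```python
def explode(records, fields):
    """Pass 1: split row-oriented records into one column per field.

    Also records the largest number of values seen for each field,
    which is only known once every record has been read.
    """
    columns = {name: [] for name in fields}
    max_len = {name: 0 for name in fields}
    for record in records:
        for name in fields:
            values = list(record.get(name, []))
            columns[name].append(values)
            max_len[name] = max(max_len[name], len(values))
    return columns, max_len


def encode(columns, max_len, fill=-2):
    """Pass 2: yield each column as a fixed-width 2D array, one field at a time."""
    for name, rows in columns.items():
        width = max_len[name]
        # Pad short rows so every row in this field has the same width.
        yield name, [row + [fill] * (width - len(row)) for row in rows]


# Illustrative records: each field holds a variable number of values.
records = [
    {"AC": [3], "AD": [10, 2]},
    {"AC": [1, 4], "AD": [7, 0, 5]},
    {"AC": [], "AD": [9]},
]
columns, max_len = explode(records, ["AC", "AD"])
arrays = dict(encode(columns, max_len))
print(max_len)
print(arrays["AC"])
```

The width of each array is fixed before any of it is written, which a single pass over row-oriented input cannot guarantee. Because `encode` is a generator, only one field's padded array is built at a time.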