-# bio2zarr Documentation
+# bio2zarr

-`bio2zarr` efficiently converts common bioinformatics formats to
-[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
-VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/).
+`bio2zarr` efficiently converts common bioinformatics formats to
+[Zarr](https://zarr.readthedocs.io/en/stable/) format. Initially supporting converting
+VCF to the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/).

-`bio2zarr` is in early alpha development, contributions, feedback and issues are welcome
+`bio2zarr` is in development, contributions, feedback and issues are welcome
 at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr).

-## Installation
-`bio2zarr` can be installed from PyPI using pip:
-
-```bash
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the
-executables directly.
-
-Alternatively, calling
-```
-$ python3 -m bio2zarr vcf2zarr <args>
-```
-is equivalent to
-
-```
-$ vcf2zarr <args>
-```
-and will always work.
-
-## Basic vcf2zarr usage
-For modest VCF files (up to a few GB), a single command can be used to convert a VCF file
-(or set of VCF files) using the {ref}`convert<cmd-vcf2zarr-convert>` command:
-
-```bash
-$ vcf2zarr convert <VCF1> <VCF2> ... <VCFN> <zarr>
-```
-
-For larger files a multi-step process is recommended.
-
-
-First, convert the VCF into the intermediate format:
-
-```bash
-$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```bash
-$ vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```bash
-$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```bash
-$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-
-
-
-```{tableofcontents}
-```