| 1 | +# bio2zarr Documentation |
| 2 | + |
| 3 | +`bio2zarr` efficiently converts common bioinformatics formats to |
| 4 | +[Zarr](https://zarr.readthedocs.io/en/stable/) format. It initially supports converting |
| 5 | +VCF to the [sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/). |
| 6 | + |
| 7 | +`bio2zarr` is in early alpha development; contributions, feedback and issues are welcome |
| 8 | +at the [GitHub repository](https://github.com/sgkit-dev/bio2zarr). |
| 9 | + |
| 10 | +## Installation |
| 11 | +`bio2zarr` can be installed from PyPI using pip: |
| 12 | + |
| 13 | +```bash |
| 14 | +$ python3 -m pip install bio2zarr |
| 15 | +``` |
| 16 | + |
| 17 | +This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition`` |
| 18 | +into your Python environment's scripts directory. You may need to add that directory |
| 19 | +to your ``$PATH`` to call the executables directly. |
| 20 | + |
| 21 | +Alternatively, calling |
| 22 | +```bash |
| 23 | +$ python3 -m bio2zarr vcf2zarr <args> |
| 24 | +``` |
| 25 | +is equivalent to |
| 26 | + |
| 27 | +```bash |
| 28 | +$ vcf2zarr <args> |
| 29 | +``` |
| 30 | +and works even when the executables are not on your ``$PATH``. |
| 31 | + |
| 32 | +## Basic vcf2zarr usage |
| 33 | +For modest VCF files (up to a few GB), a single command can be used to convert a VCF file |
| 34 | +(or set of VCF files) to Zarr: |
| 35 | + |
| 36 | +```bash |
| 37 | +$ vcf2zarr convert <VCF1> <VCF2> ... <VCFN> <zarr> |
| 38 | +``` |
| 39 | + |
| 40 | +For larger files, a multi-step process is recommended. |
| 41 | + |
| 42 | + |
| 43 | +First, convert the VCF into the intermediate format: |
| 44 | + |
| 45 | +```bash |
| 46 | +$ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded |
| 47 | +``` |
| 48 | + |
| 49 | +Then, optionally inspect this representation to get a feel for your dataset: |
| 50 | +```bash |
| 51 | +$ vcf2zarr inspect tmp/sample.exploded |
| 52 | +``` |
| 53 | + |
| 54 | +Next, optionally generate a conversion schema to describe the corresponding |
| 55 | +Zarr arrays: |
| 56 | + |
| 57 | +```bash |
| 58 | +$ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json |
| 59 | +``` |
| 60 | + |
| 61 | +View and edit the schema, deleting any columns you don't want, or tweaking |
| 62 | +dtypes and compression settings to your taste. |
| 63 | + |
| 64 | +Finally, encode to Zarr: |
| 65 | +```bash |
| 66 | +$ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json |
| 67 | +``` |
| 68 | + |
| 69 | +Use the ``-p, --worker-processes`` argument to control the number of workers used |
| 70 | +in the ``explode`` and ``encode`` phases. |
| 71 | + |
| 72 | + |
| 73 | + |
| 74 | + |
| 75 | +```{tableofcontents} |
| 76 | +``` |
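
The multi-step workflow documented above (`explode`, `inspect`, `mkschema`, `encode`) could be chained in a small wrapper script. The following is only a sketch: the `run_step` helper, the `DRY_RUN` switch, and the environment-variable names are hypothetical and not part of `bio2zarr`; the subcommands, the `-s` option, and the `-p` option are taken from the documentation text. By default the script only prints the commands, so the schema can still be reviewed and edited by hand before encoding.

```bash
#!/usr/bin/env bash
# Sketch of the multi-step vcf2zarr conversion described in the docs.
# run_step, DRY_RUN and the variable names below are illustrative assumptions.
set -euo pipefail

VCF="${VCF:-tests/data/vcf/sample.vcf.gz}"   # input VCF (path from the docs)
EXPLODED="${EXPLODED:-tmp/sample.exploded}"  # intermediate format
SCHEMA="${SCHEMA:-sample.schema.json}"       # conversion schema
ZARR="${ZARR:-tmp/sample.zarr}"              # final Zarr output
WORKERS="${WORKERS:-4}"                      # value passed to -p
DRY_RUN="${DRY_RUN:-1}"                      # 1 = print only, 0 = execute

# Print a command, and execute it only when DRY_RUN=0.
run_step() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

# Step 1: convert the VCF into the intermediate format.
run_step vcf2zarr explode "$VCF" "$EXPLODED" -p "$WORKERS"

# Step 2 (optional): inspect the intermediate representation.
run_step vcf2zarr inspect "$EXPLODED"

# Step 3 (optional): write the schema; mkschema prints to stdout,
# so the redirect is handled outside run_step.
if [ "$DRY_RUN" = "0" ]; then
    vcf2zarr mkschema "$EXPLODED" > "$SCHEMA"
else
    echo "+ vcf2zarr mkschema $EXPLODED > $SCHEMA"
fi

# Step 4: encode to Zarr using the (possibly hand-edited) schema.
run_step vcf2zarr encode "$EXPLODED" "$ZARR" -s "$SCHEMA" -p "$WORKERS"
```

Running the script as-is prints the four commands; setting `DRY_RUN=0` executes them. In practice it may be preferable to stop after step 3, edit the schema, and then run step 4 separately.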