1 | 1 | [](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml) |
| 2 | +[](https://coveralls.io/github/sgkit-dev/bio2zarr) |
| 3 | + |
| 4 | + |
2 | 5 |
3 | 6 | # bio2zarr |
4 | 7 | Convert bioinformatics file formats to Zarr |
5 | 8 |
6 | | -Initially supports converting VCF to the |
7 | | -[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/) |
8 | | - |
9 | | -**This is early alpha-status code: everything is subject to change, |
10 | | -and it has not been thoroughly tested** |
11 | | - |
12 | | -## Install |
13 | | - |
14 | | -``` |
15 | | -$ python3 -m pip install bio2zarr |
16 | | -``` |
17 | | - |
18 | | -This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition`` |
19 | | -into your local Python path. You may need to update your $PATH to call the |
20 | | -executables directly. |
21 | | - |
22 | | -Alternatively, calling |
23 | | -``` |
24 | | -$ python3 -m bio2zarr vcf2zarr <args> |
25 | | -``` |
26 | | -is equivalent to |
27 | | - |
28 | | -``` |
29 | | -$ vcf2zarr <args> |
30 | | -``` |
31 | | -and will always work. |
32 | | - |
33 | | - |
34 | | -## vcf2zarr |
35 | | - |
36 | | - |
37 | | -Convert a VCF to zarr format: |
38 | | - |
39 | | -``` |
40 | | -$ vcf2zarr convert <VCF1> <VCF2> <zarr> |
41 | | -``` |
42 | | - |
43 | | -Converts the VCF to zarr format. |
44 | | - |
45 | | -**Do not use this for anything but the smallest files** |
46 | | - |
47 | | -The recommended approach is to use a multi-stage conversion.
48 | | - |
49 | | -First, convert the VCF into the intermediate format: |
50 | | - |
51 | | -``` |
52 | | -vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded |
53 | | -``` |
54 | | - |
55 | | -Then, (optionally) inspect this representation to get a feel for your dataset:
56 | | -``` |
57 | | -vcf2zarr inspect tmp/sample.exploded |
58 | | -``` |
59 | | - |
60 | | -Then, (optionally) generate a conversion schema to describe the corresponding |
61 | | -Zarr arrays: |
62 | | - |
63 | | -``` |
64 | | -vcf2zarr mkschema tmp/sample.exploded > sample.schema.json |
65 | | -``` |
66 | | - |
67 | | -View and edit the schema, deleting any columns you don't want, or tweaking |
68 | | -dtypes and compression settings to your taste. |
69 | | - |
70 | | -Finally, encode to Zarr: |
71 | | -``` |
72 | | -vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json |
73 | | -``` |
74 | | - |
75 | | -Use the ``-p, --worker-processes`` argument to control the number of workers used |
76 | | -in the ``explode`` and ``encode`` phases. |
77 | | - |
78 | | -### Shell completion |
79 | | - |
80 | | -To enable shell completion for a particular session in Bash do: |
81 | | - |
82 | | -``` |
83 | | -eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)" |
84 | | -``` |
85 | | - |
86 | | -If you add this to your ``.bashrc``, vcf2zarr shell completion should be available
87 | | -in all new shell sessions.
88 | | - |
89 | | -See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion) |
90 | | -for instructions on how to enable completion in other shells. |
91 | | -
92 | | - |
93 | | -## plink2zarr |
94 | | - |
95 | | -Convert a plink ``.bed`` file to zarr format. **This is incomplete** |
96 | | - |
97 | | -## vcf_partition |
98 | | - |
99 | | -Partition a given VCF file into (approximately) a given number of regions:
100 | | - |
101 | | -``` |
102 | | -vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10 |
103 | | -``` |
104 | | -gives |
105 | | -``` |
106 | | -chr20:1-6799360 |
107 | | -chr20:6799361-14319616 |
108 | | -chr20:14319617-21790720 |
109 | | -chr20:21790721-28770304 |
110 | | -chr20:28770305-31096832 |
111 | | -chr20:31096833-38043648 |
112 | | -chr20:38043649-45580288 |
113 | | -chr20:45580289-52117504 |
114 | | -chr20:52117505-58834944 |
115 | | -chr20:58834945- |
116 | | -``` |
117 | | - |
118 | | -These region strings can then be used to split computation of the VCF
119 | | -into chunks for parallelisation. |
120 | | - |
121 | | -**TODO give a nice example here using xargs** |
122 | | - |
123 | | -**WARNING: this does not take into account that indels may overlap
124 | | -partition boundaries, so such variants may be counted twice or more**
| 9 | +See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details. |
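
The region strings printed by ``vcf_partition`` in the removed section above follow a ``contig:start-end`` pattern, with the final region left open-ended. As a minimal sketch of how a script might consume them before handing them to parallel workers, the snippet below parses the ten strings from the example output and checks that they tile the contig without gaps. The names ``Region``, ``parse_region`` and ``is_contiguous`` are illustrative and not part of bio2zarr, and the 1-based inclusive coordinate convention is an assumption that the README does not state.

```python
from typing import NamedTuple, Optional


class Region(NamedTuple):
    contig: str
    start: int           # assumed 1-based, inclusive
    end: Optional[int]   # inclusive; None means "to the end of the contig"


def parse_region(text: str) -> Region:
    """Parse 'contig:start-end' or the open-ended 'contig:start-'."""
    contig, _, span = text.strip().rpartition(":")
    start, _, end = span.partition("-")
    return Region(contig, int(start), int(end) if end else None)


def is_contiguous(regions: list) -> bool:
    """True if every region starts one base after the previous one ends."""
    return all(
        a.contig == b.contig and a.end is not None and b.start == a.end + 1
        for a, b in zip(regions, regions[1:])
    )


# Region strings copied from the vcf_partition example output.
lines = """\
chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-""".splitlines()

regions = [parse_region(line) for line in lines]
print(len(regions), is_contiguous(regions))
```

A contiguity check like this only confirms that the partition boundaries line up. It does not address the caveat in the removed warning about indels that span a boundary.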