 [](https://github.com/sgkit-dev/bio2zarr/actions/workflows/ci.yml)
+[](https://coveralls.io/github/sgkit-dev/bio2zarr)
+
+
 
 # bio2zarr
 Convert bioinformatics file formats to Zarr
 
-Initially supports converting VCF to the
-[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
-
-**This is early alpha-status code: everything is subject to change,
-and it has not been thoroughly tested**
-
-## Install
-
-```
-$ python3 -m pip install bio2zarr
-```
-
-This will install the programs ``vcf2zarr``, ``plink2zarr`` and ``vcf_partition``
-into your local Python path. You may need to update your $PATH to call the
-executables directly.
-
-Alternatively, calling
-```
-$ python3 -m bio2zarr vcf2zarr <args>
-```
-is equivalent to
-
-```
-$ vcf2zarr <args>
-```
-and will always work.
-
-
-## vcf2zarr
-
-
-Convert a VCF to zarr format:
-
-```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
-```
-
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```
-vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-### Shell completion
-
-To enable shell completion for a particular session in Bash do:
-
-```
-eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"
-```
-
-If you add this to your ``.bashrc`` vcf2zarr shell completion should available
-in all new shell sessions.
-
-See the [Click documentation](https://click.palletsprojects.com/en/8.1.x/shell-completion/#enabling-completion)
-for instructions on how to enable completion in other shells.
-a
-
-## plink2zarr
-
-Convert a plink ``.bed`` file to zarr format. **This is incomplete**
-
-## vcf_partition
-
-Partition a given VCF file into (approximately) a give number of regions:
-
-```
-vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
-```
-gives
-```
-chr20:1-6799360
-chr20:6799361-14319616
-chr20:14319617-21790720
-chr20:21790721-28770304
-chr20:28770305-31096832
-chr20:31096833-38043648
-chr20:38043649-45580288
-chr20:45580289-52117504
-chr20:52117505-58834944
-chr20:58834945-
-```
-
-These reqion strings can then be used to split computation of the VCF
-into chunks for parallelisation.
-
-**TODO give a nice example here using xargs**
-
-**WARNING that this does not take into account that indels may overlap
-partitions and you may count variants twice or more if they do**
+See the [documentation](https://sgkit-dev.github.io/bio2zarr/) for details.