@@ -4,15 +4,37 @@ Convert bioinformatics file formats to Zarr
44Initially supports converting VCF to the
55[sgkit vcf-zarr specification](https://github.com/pystatgen/vcf-zarr-spec/)
66
7- **This is early alpha-status code: everything is subject to change, a
7+ **This is early alpha-status code: everything is subject to change,
88and it has not been thoroughly tested**
99
10- ## Usage
10+ ## Install
11+
12+ ```
13+ $ python3 -m pip install bio2zarr
14+ ```
15+
16+ This will install the programs `vcf2zarr`, `plink2zarr` and `vcf_partition`
17+ into your local Python path. You may need to update your $PATH to call the
18+ executables directly.
19+
20+ Alternatively, calling
21+ ```
22+ $ python3 -m bio2zarr vcf2zarr <args>
23+ ```
24+ is equivalent to
25+
26+ ```
27+ $ vcf2zarr <args>
28+ ```
29+ and will always work.
30+
31+
32+ ## vcf2zarr
1133
1234Convert a VCF to zarr format:
1335
1436```
15- python3 -m bio2zarr vcf2zarr convert <VCF> <zarr>
37+ $ vcf2zarr convert <VCF1> <VCF2> <zarr>
1638```
1739
1840Converts the VCF to zarr format.
@@ -21,33 +43,64 @@ Converts the VCF to zarr format.
2143
2244The recommended approach is to use a multi-stage conversion
2345
24- First, convert the VCF into an intermediate columnar format:
46+ First, convert the VCF into the intermediate format:
2547
2648```
27- python3 -m bio2zarr vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
49+ vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
2850```
2951
3052Then, (optionally) inspect this representation to get a feel for your dataset
3153```
32- python3 -m bio2zarr vcf2zarr inspec tmp/sample.exploded
54+ vcf2zarr inspect tmp/sample.exploded
3355```
3456
3557Then, (optionally) generate a conversion schema to describe the corresponding
3658Zarr arrays:
3759
3860```
39- python3 -m bio2zarr vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
61+ vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
4062```
4163
42- View and edit the schema, deleting any columns you don't want.
43-
44- Finally, convert to Zarr
64+ View and edit the schema, deleting any columns you don't want, or tweaking
65+ dtypes and compression settings to your taste.
4566
67+ Finally, encode to Zarr:
4668```
47- python3 -m bio2zarr vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
69+ vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
4870```
4971
5072Use the `-p, --worker-processes` argument to control the number of workers used
51- to do zarr encoding.
73+ in the `explode` and `encode` phases.
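
As a rough illustration of the multi-stage conversion above, the following Python sketch builds the three command lines and can run them in order. It uses the `python3 -m bio2zarr vcf2zarr` form described in the Install section. The helper names (`vcf2zarr_cmd`, `pipeline_commands`, `run_pipeline`) are hypothetical and not part of bio2zarr, and placing `-p` after the positional arguments is an assumption.

```python
import shlex
import subprocess
import sys


def vcf2zarr_cmd(*args):
    """Build an argv list equivalent to `python3 -m bio2zarr vcf2zarr <args>`."""
    return [sys.executable, "-m", "bio2zarr", "vcf2zarr", *map(str, args)]


def pipeline_commands(vcf, exploded, zarr, schema, workers=None):
    """Return the explode, mkschema and encode commands, in that order."""
    # Assumption: -p may follow the positional arguments.
    extra = [] if workers is None else ["-p", workers]
    return {
        "explode": vcf2zarr_cmd("explode", vcf, exploded, *extra),
        "mkschema": vcf2zarr_cmd("mkschema", exploded),
        "encode": vcf2zarr_cmd("encode", exploded, zarr, "-s", schema, *extra),
    }


def run_pipeline(cmds, schema):
    """Run the three stages; mkschema's stdout is redirected to the schema file."""
    subprocess.run(cmds["explode"], check=True)
    with open(schema, "w") as out:
        subprocess.run(cmds["mkschema"], check=True, stdout=out)
    # Edit the schema file here if you want to drop columns or change dtypes.
    subprocess.run(cmds["encode"], check=True)


example = pipeline_commands(
    "tests/data/vcf/sample.vcf.gz",
    "tmp/sample.exploded",
    "tmp/sample.zarr",
    "sample.schema.json",
    workers=4,
)
for stage, argv in example.items():
    print(stage, "->", shlex.join(argv))
```

Calling `run_pipeline(example, "sample.schema.json")` would execute the stages; it is left out of the sketch so that the commands can be inspected first.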
74+
75+ ## plink2zarr
76+
77+ Convert a plink `.bed` file to zarr format. **This is incomplete**
78+
79+ ## vcf_partition
80+
81+ Partition a given VCF file into (approximately) a given number of regions:
82+
83+ ```
84+ vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10
85+ ```
86+ gives
87+ ```
88+ chr20:1-6799360
89+ chr20:6799361-14319616
90+ chr20:14319617-21790720
91+ chr20:21790721-28770304
92+ chr20:28770305-31096832
93+ chr20:31096833-38043648
94+ chr20:38043649-45580288
95+ chr20:45580289-52117504
96+ chr20:52117505-58834944
97+ chr20:58834945-
98+ ```
99+
100+ These region strings can then be used to split computation of the VCF
101+ into chunks for parallelisation.
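
One possible way to consume these region strings from Python is sketched below. It parses each `contig:start-end` string (the last region has an open end) and maps a per-region function over them with a thread pool. The function names `parse_region`, `is_contiguous` and `map_regions` are hypothetical helpers written for this illustration, not part of bio2zarr.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_region(region):
    """Split 'contig:start-end' into (contig, start, end); end is None if open."""
    contig, _, span = region.rpartition(":")
    start, _, end = span.partition("-")
    return contig, int(start), int(end) if end else None


def is_contiguous(regions):
    """Check that each region starts one base after the previous one ends."""
    parsed = [parse_region(r) for r in regions]
    return all(
        a[0] == b[0] and a[2] is not None and a[2] + 1 == b[1]
        for a, b in zip(parsed, parsed[1:])
    )


def map_regions(func, regions, max_workers=4):
    """Apply func to every region string in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, regions))
```

Here `func` would be whatever per-region work you want to run, for example a function that queries the VCF for that region with your preferred reader.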
52102
103+ **TODO: give a nice example here using xargs**
53104
105+ **WARNING: this does not take into account that indels may overlap
106+ partition boundaries, so you may count such variants twice or more**
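
To make the double-counting concern concrete, here is a small sketch under the assumption that a region query returns every record overlapping the region (as indexed VCF readers typically do). A variant whose REF allele extends past a partition end would then be returned for two partitions. One common workaround is to count a variant only in the partition that contains its POS. Both helper names are hypothetical.

```python
def spans_boundary(pos, ref, end):
    """True if a variant at 1-based pos with REF allele ref extends past end."""
    last_base = pos + len(ref) - 1
    return end is not None and pos <= end < last_base


def owns_variant(pos, start, end=None):
    """True if the variant's POS lies inside [start, end]; end=None means open."""
    return pos >= start and (end is None or pos <= end)


# A 4-base REF starting one base before the end of the first partition.
pos, ref = 6799359, "ACGT"
print(spans_boundary(pos, ref, 6799360))       # crosses into the next partition
print(owns_variant(pos, 1, 6799360))           # counted in the first partition
print(owns_variant(pos, 6799361, 14319616))    # skipped in the second partition
```

Filtering each partition's records with a check like `owns_variant` means every variant is counted exactly once, regardless of how far its REF allele extends.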