@@ -12,117 +12,76 @@ command line options.

## Quickstart

- First {ref}`install bio2zarr<sec-installation>`.
+ - First {ref}`install bio2zarr<sec-installation>`.


- :::{note}
- FINISH ME
- :::
-
-
-
- ## How does it work?
- The conversion of VCF data to Zarr is a two-step process:
-
- 1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
- Intermediate Columnar Format (ICF)
- 2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr
-
- This two-step process allows `vcf2zarr` to determine the correct
- dimension of Zarr arrays corresponding to each VCF field, and
- to keep memory usage tightly bounded while writing the arrays.
-
- :::{important}
- The intermediate columnar format is not intended for any use
- other than a temporary storage while converting VCF to Zarr.
- The format may change between versions of `bio2zarr`.
- :::
-
-
- ## Common options
+ - Get some indexed VCF data:

```
- $ vcf2zarr convert <VCF1> <VCF2> <zarr>
+ curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz
+ curl -O https://raw.githubusercontent.com/sgkit-dev/bio2zarr/main/tests/data/vcf/sample.vcf.gz.tbi
```

- Converts the VCF to zarr format.
-
- **Do not use this for anything but the smallest files**
-
- The recommended approach is to use a multi-stage conversion
-
- First, convert the VCF into the intermediate format:
-
- ```
- vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
- ```
+ - Convert to VCF Zarr in two steps:

- Then, (optionally) inspect this representation to get a feel for your dataset
```
- vcf2zarr inspect tmp/sample.exploded
+ vcf2zarr explode sample.vcf.gz sample.icf
+ vcf2zarr encode sample.icf sample.vcz
```

- Then, (optionally) generate a conversion schema to describe the corresponding
- Zarr arrays:
-
- ```
- vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
- ```
+ :::{tip}
+ If the ``vcf2zarr`` executable doesn't work, try ``python -m bio2zarr vcf2zarr``
+ instead.
+ :::

- View and edit the schema, deleting any columns you don't want, or tweaking
- dtypes and compression settings to your taste.
+ - Have a look at the results:

- Finally, encode to Zarr:
```
- vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
+ vcf2zarr inspect sample.vcz
```

- Use the ``-p, --worker-processes`` argument to control the number of workers used
- in the ``explode`` and ``encode`` phases.
-
- ## To be merged with above
-
- The simplest usage is:
-
- ```
- $ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
- ```
+ ### What next?

+ VCF Zarr is a starting point in what we hope will become a diverse ecosystem
+ of packages that efficiently process VCF data in Zarr format. However, this
+ ecosystem does not exist yet, and there isn't much software available
+ for working with the format. As such, VCF Zarr isn't yet suitable for end users
+ who just want to get their work done.

- This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
- step. As this writes the intermediate columnar format to a temporary directory,
- we only recommend this approach for small files (< 1GB, say).
+ Having said that, you can:

- The recommended approach is to run the conversion in two passes, and
- to keep the intermediate columnar format ("exploded") around to facilitate
- experimentation with chunk sizes and compression settings:
+ - Look at the [VCF Zarr specification](https://github.com/sgkit-dev/vcf-zarr-spec/)
+ to see how data is mapped from VCF to Zarr.
+ - Use the mature [Zarr Python](https://zarr.readthedocs.io/en/stable/) package or
+ one of the other [Zarr implementations](https://zarr.dev/implementations/) to access
+ your data.
+ - Use the many functions in our [sgkit](https://sgkit-dev.github.io/sgkit/latest/)
+ sister project to analyse the data. Note, however, that sgkit is under active
+ development, and its documentation may not be fully in sync with this project.
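
As a rough illustration of the Zarr Python route mentioned in the list above, the sketch below opens a converted store read-only and lists the arrays it contains. It assumes the `zarr` package is installed and that `sample.vcz` is the output of the Quickstart. The helper name `list_vcz_arrays` is hypothetical and not part of bio2zarr, and the array names you see depend on the VCF Zarr specification and on your input data.

```python
from pathlib import Path


def list_vcz_arrays(vcz_path):
    """Return the sorted array names in a VCF Zarr store, or None if it is missing.

    Hypothetical helper for illustration only; not part of bio2zarr.
    """
    path = Path(vcz_path)
    if not path.exists():
        # Nothing to open: the conversion has not been run for this path.
        return None
    import zarr  # imported lazily so the missing-path case needs no dependency

    root = zarr.open(str(path), mode="r")  # open the top-level group read-only
    return sorted(root.array_keys())


if __name__ == "__main__":
    names = list_vcz_arrays("sample.vcz")  # path assumed from the Quickstart
    if names is None:
        print("sample.vcz not found; run the conversion first")
    else:
        for name in names:
            print(name)
```

Individual arrays can then be read through the returned names with ordinary Zarr indexing, which loads only the chunks that a slice touches.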

- ```
- $ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
- $ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
- ```

- The inspect command provides a way to view contents of an exploded ICF
- or Zarr:

- ```
- $ vcf2zarr inspect [PATH]
- ```
+ ## How does it work?
+ The conversion of VCF data to Zarr is a two-step process:

- This is useful when tweaking chunk sizes and compression settings to suit
- your dataset, using the mkschema command and --schema option to encode:
+ 1. Convert ({ref}`explode<cmd-vcf2zarr-explode>`) VCF file(s) to
+ Intermediate Columnar Format (ICF)
+ 2. Convert ({ref}`encode<cmd-vcf2zarr-encode>`) ICF to Zarr

- ```
- $ vcf2zarr mkschema [ICF_PATH] > schema.json
- $ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
- ```
+ This two-step process allows `vcf2zarr` to determine the correct
+ dimension of Zarr arrays corresponding to each VCF field, and
+ to keep memory usage tightly bounded while writing the arrays.

- By editing the schema.json file you can drop columns that are not of interest
- and edit column specific compression settings. The --max-variant-chunks option
- to encode allows you to try out these options on small subsets, hopefully
- arriving at settings with the desired balance of compression and query
- performance.
+ :::{important}
+ The intermediate columnar format is not intended for any use
+ other than temporary storage while converting VCF to Zarr.
+ The format may change between versions of `bio2zarr`.
+ :::

+ Both ``explode`` and ``encode`` can be performed in parallel
+ across cores on a single machine (via the ``--worker-processes`` argument)
+ or distributed across a cluster by the three-part ``init``, ``partition``
+ and ``finalise`` commands.
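
The two-step conversion on a single machine can also be scripted. The following is a minimal sketch, not part of bio2zarr: it builds the `explode` and `encode` command lines shown in the Quickstart and passes `--worker-processes` to each. The helper names `build_commands` and `convert`, the placement of the option directly after the subcommand, and the file names are assumptions made for illustration.

```python
import shutil
import subprocess


def build_commands(vcf_path, icf_path, vcz_path, workers=1):
    """Build the explode and encode command lines as argument lists.

    Hypothetical helper; the option placement is an assumption.
    """
    opts = ["--worker-processes", str(workers)]
    explode = ["vcf2zarr", "explode", *opts, vcf_path, icf_path]
    encode = ["vcf2zarr", "encode", *opts, icf_path, vcz_path]
    return [explode, encode]


def convert(vcf_path, icf_path, vcz_path, workers=1):
    """Run explode then encode, stopping at the first failing step."""
    if shutil.which("vcf2zarr") is None:
        raise RuntimeError("vcf2zarr executable not found on PATH")
    for cmd in build_commands(vcf_path, icf_path, vcz_path, workers):
        # check=True raises CalledProcessError if a step exits non-zero,
        # so encode never runs against an incomplete ICF.
        subprocess.run(cmd, check=True)
```

Calling `convert("sample.vcf.gz", "sample.icf", "sample.vcz", workers=4)` would run both steps with four worker processes each. The cluster-oriented `init`, `partition` and `finalise` commands are not covered by this sketch.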

## Copying to object stores