@@ -121,7 +121,7 @@ head -n 20 sample.schema.json
 ```
 
 We've displayed the first 20 lines here so you can get a feel for the JSON format.
-The [jq](https://jqlang.github.io/jq/) provides a useful way of manipulating
+The [jq](https://jqlang.github.io/jq/) tool provides a useful way of manipulating
 these schemas. Let's look at the schema for just the `call_genotype`
 field, for example:
 
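The jq selection described above can also be mimicked in plain Python, which may help readers without jq installed. A minimal sketch, assuming a *hypothetical* schema layout with a top-level `fields` list of `{name, dtype}` entries; the real `vcf2zarr mkschema` output may be structured differently:

```python
import json

# A toy stand-in for sample.schema.json (hypothetical layout, not the
# exact structure vcf2zarr emits).
schema_text = """
{
  "format_version": "0.1",
  "fields": [
    {"name": "variant_position", "dtype": "i4"},
    {"name": "call_genotype", "dtype": "i1"},
    {"name": "call_HQ", "dtype": "i2"}
  ]
}
"""

schema = json.loads(schema_text)

# Select just the call_genotype field's sub-document, jq-style.
call_genotype = [f for f in schema["fields"] if f["name"] == "call_genotype"]
print(json.dumps(call_genotype, indent=2))
```

Under that same assumed layout, the equivalent jq filter would be something like `.fields[] | select(.name == "call_genotype")`.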
@@ -158,6 +158,15 @@ vcf2zarr mkschema sample.icf \
 ```
 Then we can use the updated schema as input to `encode`:
 
+
+<!-- FIXME shouldn't need to do this, but currently the execution model is very -->
+<!-- fragile. -->
+<!-- https://github.com/sgkit-dev/bio2zarr/issues/238 -->
+```{code-cell}
+:tags: [remove-cell]
+rm -fR sample_noHQ.vcz
+```
+
 ```{code-cell}
 vcf2zarr encode sample.icf -s sample_noHQ.schema.json sample_noHQ.vcz
 ```
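Editing the schema by hand works for small changes, but dropping a field can also be scripted. A sketch of producing `sample_noHQ.schema.json` programmatically, again assuming a hypothetical `fields`-list layout rather than the exact structure `vcf2zarr mkschema` emits:

```python
import json

# Hypothetical miniature schema standing in for the real mkschema output.
schema = {
    "format_version": "0.1",
    "fields": [
        {"name": "call_genotype", "dtype": "i1"},
        {"name": "call_HQ", "dtype": "i2"},
    ],
}

# Remove the call_HQ entry, mirroring the hand-edit described in the text.
schema["fields"] = [f for f in schema["fields"] if f["name"] != "call_HQ"]

# Write the trimmed schema for use with `vcf2zarr encode -s`.
with open("sample_noHQ.schema.json", "w") as out:
    json.dump(schema, out, indent=2)
```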
@@ -167,95 +176,9 @@ We can then `inspect` to see that there is no `call_HQ` array in the output:
 vcf2zarr inspect sample_noHQ.vcz
 ```
 
-
 ## Large
 
 
 
-## Parallel encode/explode
-
-
-## Common options
-
-```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
-```
-
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```
-vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the `-p, --worker-processes` argument to control the number of workers used
-in the `explode` and `encode` phases.
-
-## To be merged with above
-
-The simplest usage is:
-
-```
-$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-```
-
-
-This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-step. As this writes the intermediate columnar format to a temporary directory,
-we only recommend this approach for small files (< 1GB, say).
-
-The recommended approach is to run the conversion in two passes, and
-to keep the intermediate columnar format ("exploded") around to facilitate
-experimentation with chunk sizes and compression settings:
-
-```
-$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
-```
-
-The inspect command provides a way to view contents of an exploded ICF
-or Zarr:
-
-```
-$ vcf2zarr inspect [PATH]
-```
-
-This is useful when tweaking chunk sizes and compression settings to suit
-your dataset, using the mkschema command and --schema option to encode:
-
-```
-$ vcf2zarr mkschema [ICF_PATH] > schema.json
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-```
 
-By editing the schema.json file you can drop columns that are not of interest
-and edit column specific compression settings. The --max-variant-chunks option
-to encode allows you to try out these options on small subsets, hopefully
-arriving at settings with the desired balance of compression and query
-performance.
 