@@ -121,7 +121,7 @@ head -n 20 sample.schema.json
```

We've displayed the first 20 lines here so you can get a feel for the JSON format.
-The [jq](https://jqlang.github.io/jq/) provides a useful way of manipulating
+The [jq](https://jqlang.github.io/jq/) tool provides a useful way of manipulating
these schemas. Let's look at the schema for just the ``call_genotype``
field, for example:

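+A minimal sketch of such a query (a hypothetical command: the top-level
+``fields`` list and the ``name`` key are assumptions about this schema
+format, so check them against the ``head`` output above):
+
+```
+# Assumption: per-array specifications live in a top-level "fields" list.
+$ jq '.fields[] | select(.name == "call_genotype")' sample.schema.json
+```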
@@ -158,6 +158,15 @@ vcf2zarr mkschema sample.icf \
```
Then we can use the updated schema as input to ``encode``:

+
+<!-- FIXME shouldn't need to do this, but currently the execution model is very -->
+<!-- fragile. -->
+<!-- https://github.com/sgkit-dev/bio2zarr/issues/238 -->
+```{code-cell}
+:tags: [remove-cell]
+rm -fR sample_noHQ.vcz
+```
+
```{code-cell}
vcf2zarr encode sample.icf -s sample_noHQ.schema.json sample_noHQ.vcz
```
@@ -167,95 +176,60 @@ We can then ``inspect`` to see that there is no ``call_HQ`` array in the output:
vcf2zarr inspect sample_noHQ.vcz
```

+:::{tip}
+Use the ``max-variant-chunks`` option to encode the first few chunks of your
+dataset while doing these kinds of schema tuning operations!
+:::
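+
+For example, a trial encode of just the first couple of variant chunks might
+look like the following sketch (``--max-variant-chunks`` follows the option
+name mentioned elsewhere in this document; the trial output path is
+hypothetical):
+
+```
+# Encode only the first 2 variant chunks into a throwaway trial store.
+$ vcf2zarr encode sample.icf -s sample_noHQ.schema.json \
+    --max-variant-chunks 2 sample_noHQ_trial.vcz
+```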

-## Large
-
-
-
-## Parallel encode/explode
-
-
-## Common options
-
-```
-$ vcf2zarr convert <VCF1> <VCF2> <zarr>
-```
-
-Converts the VCF to zarr format.
-
-**Do not use this for anything but the smallest files**
-
-The recommended approach is to use a multi-stage conversion
-
-First, convert the VCF into the intermediate format:
-
-```
-vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded
-```
-
-Then, (optionally) inspect this representation to get a feel for your dataset
-```
-vcf2zarr inspect tmp/sample.exploded
-```
-
-Then, (optionally) generate a conversion schema to describe the corresponding
-Zarr arrays:
-
-```
-vcf2zarr mkschema tmp/sample.exploded > sample.schema.json
-```
-
-View and edit the schema, deleting any columns you don't want, or tweaking
-dtypes and compression settings to your taste.
-
-Finally, encode to Zarr:
-```
-vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json
-```
-
-Use the ``-p, --worker-processes`` argument to control the number of workers used
-in the ``explode`` and ``encode`` phases.
-
-## To be merged with above
+## Large dataset

-The simplest usage is:
+The {ref}`explode<cmd-vcf2zarr-explode>`
+and {ref}`encode<cmd-vcf2zarr-encode>` commands have powerful features for
+conversion on a single machine, and can take full advantage of large servers
+with many cores. Current biobank-scale datasets, however, are so large that
+we must go a step further and *distribute* computations over a cluster.
+Vcf2zarr provides some low-level utilities to do this, which should be
+compatible with any cluster scheduler.

-```
-$ vcf2zarr convert [VCF_FILE] [ZARR_PATH]
-```
+The distributed commands are split into three phases:

+- **init <num_partitions>**: Initialise the computation, setting up the data structures needed
+  for the bulk computation to be split into ``num_partitions`` independent partitions
+- **partition <j>**: Perform the computation of partition ``j``
+- **finalise**: Complete the full process.

-This will convert the indexed VCF (or BCF) into the vcfzarr format in a single
-step. As this writes the intermediate columnar format to a temporary directory,
-we only recommend this approach for small files (< 1GB, say).
+When performing large-scale computations like this on a cluster, errors and job
+failures are essentially inevitable, and the commands are resilient to various
+failure modes.

-The recommended approach is to run the conversion in two passes, and
-to keep the intermediate columnar format ("exploded") around to facilitate
-experimentation with chunk sizes and compression settings:
+Let's go through the example above using the distributed commands. First, we
+use {ref}`dexplode-init<cmd-vcf2zarr-dexplode-init>` to create an ICF directory:

+```{code-cell}
+:tags: [remove-cell]
+rm -fR sample-dist.icf
```
-$ vcf2zarr explode [VCF_FILE_1] ... [VCF_FILE_N] [ICF_PATH]
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH]
+```{code-cell}
+vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5
```

-The inspect command provides a way to view contents of an exploded ICF
-or Zarr:
+Here we asked ``dexplode-init`` to set up an ICF store in which the data
+is split into 5 partitions. The number of partitions determines the level
+of parallelism, so we would usually set this to the number of
+parallel jobs we would like to use. The output of ``dexplode-init`` is
+important though, as it tells us the **actual** number of partitions that
+we have (partitioning is based on the VCF indexes, which have a limited
+granularity). You should be careful to use this value in your scripts
+(the format is designed to be machine readable using e.g. ``cut`` and
+``grep``). In this case there are only 3 possible partitions.
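+
+For instance, a wrapper script might capture the reported partition count
+along these lines. This is a sketch only: the ``num_partitions`` label and
+the tab-separated layout are assumptions, so check the real output first:
+
+```
+# Hypothetical parsing: adjust the grep pattern and the cut field to
+# match what dexplode-init actually prints.
+$ NUM_PARTITIONS=$(vcf2zarr dexplode-init sample.vcf.gz sample-dist.icf 5 \
+    | grep num_partitions | cut -f 2)
+```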

-```
-$ vcf2zarr inspect [PATH]
-```
-
-This is useful when tweaking chunk sizes and compression settings to suit
-your dataset, using the mkschema command and --schema option to encode:

-```
-$ vcf2zarr mkschema [ICF_PATH] > schema.json
-$ vcf2zarr encode [ICF_PATH] [ZARR_PATH] --schema schema.json
-```
+Once ``dexplode-init`` is done and we know how many partitions we have,
+we need to call ``dexplode-partition`` this number of times.

-By editing the schema.json file you can drop columns that are not of interest
-and edit column specific compression settings. The --max-variant-chunks option
-to encode allows you to try out these options on small subsets, hopefully
-arriving at settings with the desired balance of compression and query
-performance.
+<!-- ```{code-cell} -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 0 -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 1 -->
+<!-- vcf2zarr dexplode-partition sample-dist.icf 2 -->
+<!-- ``` -->
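+
+On a cluster each partition would normally run as a separate job; a minimal
+local sketch just loops over the 3 partitions reported above. The
+``dexplode-finalise`` spelling is an assumption here, following the
+init/partition/finalise phases listed earlier:
+
+```
+# One dexplode-partition invocation per partition (on a cluster,
+# submit these as independent jobs instead of looping).
+$ for j in 0 1 2; do vcf2zarr dexplode-partition sample-dist.icf $j; done
+# Assumed command name, matching the dexplode-* pattern.
+$ vcf2zarr dexplode-finalise sample-dist.icf
+```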