
Commit a732d6c

Updated docs

1 parent 5c757a4 commit a732d6c

3 files changed: +55 -9 lines changed

CHANGES.md

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@
   local file path or URI of type `str` or `FileObj`.
   Dropped concept of _slice factories_ entirely. [#78]
 
+* Chunk sizes can now be `null` for a given dimension. In this case the actual
+  chunk size used is the size of the array's shape in that dimension. [#77]
+
 * Internal refactoring: Extracted `Config` class out of `Context` and
   made available via new `Context.config: Config` property.
   The change concerns any usages of the `ctx: Context` argument passed to
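The new `null` chunk-size behavior in the changelog entry above can be sketched in Python. Note that `resolve_chunks` is a hypothetical helper for illustration, not part of the package's API:

```python
def resolve_chunks(shape, chunks):
    """Replace each None (JSON null) chunk size by the size of the
    array's shape in that dimension, as described in the changelog."""
    return tuple(
        dim_size if chunk is None else chunk
        for dim_size, chunk in zip(shape, chunks)
    )

# A null chunk size falls back to the full dimension size:
print(resolve_chunks((365, 1000, 2000), (1, None, None)))
# -> (1, 1000, 2000)
```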

docs/config.md

Lines changed: 2 additions & 2 deletions
@@ -84,13 +84,13 @@ Variable metadata.
   Must be one of the following:
 
   * Type _array_.
-    Chunk sizes in the order of the dimensions.
+    Chunk sizes for each dimension of the variable.
     The items of the array must be one of the following:
 
     * Type _integer_.
      Dimension is chunked using given size.
 
-    * No chunking in this dimension.
+    * Disable chunking in this dimension.
      Its value is `null`.
 
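A small sketch of how a `chunks` value could be checked against the schema above. `is_valid_chunks` is a hypothetical helper written for illustration, not the library's actual validation:

```python
def is_valid_chunks(chunks):
    """Check a `chunks` value against the schema: either None
    (no chunking at all) or an array whose items are each an
    integer (chunk with given size) or None (no chunking in
    that dimension)."""
    if chunks is None:
        return True
    return isinstance(chunks, list) and all(
        item is None or isinstance(item, int) for item in chunks
    )

print(is_valid_chunks([1, None, None]))  # -> True
print(is_valid_chunks(None))             # -> True
print(is_valid_chunks(["auto"]))         # -> False
```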

docs/guide.md

Lines changed: 50 additions & 7 deletions
@@ -323,18 +323,39 @@ multiple variables the wildcard variable name `*` can often be of help.
 
 #### Chunking
 
+Chunking refers to the subdivision of multidimensional data arrays into
+smaller multidimensional blocks. Using the Zarr format, such blocks become
+individual data files after optional [data packing](#data-packing)
+and [compression](#compression). The chunk sizes of the dimensions of the
+multidimensional blocks therefore determine the number of blocks used per
+data array and also their size. Hence, chunk sizes have a very large impact
+on the I/O performance of datasets, especially if they are persisted in
+remote filesystems such as S3. The chunk sizes are specified using the
+`chunks` setting in the encoding of each variable.
+The value of `chunks` can also be `null`, which means no chunking is
+desired and the variable's data array will be persisted as one block.
+
 By default, the chunking of the coordinate variable corresponding to the append
-dimension will be its dimension in the first slice dataset. Often, this will be one or
-a small number. Since `xarray` loads coordinates eagerly when opening a dataset, this
-can lead to performance issues if the target dataset is served from object storage such
-as S3. This is because, a separate HTTP request is required for every single chunk. It
-is therefore very advisable to set the chunks of that variable to a larger number using
-the `chunks` setting. For other variables, the chunking within the append dimension may
-stay small if desired:
+dimension will be its dimension size in the first slice dataset. Often, the size
+will be `1` or another small number. Since `xarray` loads coordinates eagerly
+when opening a dataset, this can lead to performance issues if the target
+dataset is served from object storage such as S3. The reason for this is that a
+separate HTTP request is required for every single chunk. It is therefore
+advisable to set the chunks of that variable to a larger number using the
+`chunks` setting. For other variables, you could still use a small chunk size
+in the append dimension.
+
+Here is a typical chunking configuration for the append dimension `"time"`:
 
 ```json
 {
+  "append_dim": "time",
   "variables": {
+    "*": {
+      "encoding": {
+        "chunks": null
+      }
+    },
     "time": {
       "dims": ["time"],
       "encoding": {
@@ -351,6 +372,28 @@ stay small if desired:
     }
 }
 ```
 
+Sometimes you may explicitly wish to not chunk a given dimension of a variable.
+If you know the size of that dimension in advance, you can use its size as the
+chunk size. But there are situations where the final dimension size depends
+on some processing parameters. For example, you could define your own
+[slice source](#slice-sources) that takes a geodetic bounding box `bbox`
+parameter to spatially crop your variables in the `x` and `y` dimensions.
+If you want such dimensions to not be chunked, you can set their chunk sizes
+to `null` (`None` in Python):
+
+```json
+{
+  "variables": {
+    "chl": {
+      "dims": ["time", "y", "x"],
+      "encoding": {
+        "chunks": [1, null, null]
+      }
+    }
+  }
+}
+```
+
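To see why the default chunking of the append-dimension coordinate hurts when reading from object storage, the following sketch counts the requests needed to load that coordinate eagerly. The function and the sizes are illustrative, not part of the package:

```python
from math import ceil

def num_coordinate_requests(dim_size, chunk_size):
    """One HTTP request per chunk: the number of chunks of the
    append-dimension coordinate when it is loaded eagerly."""
    return ceil(dim_size / chunk_size)

# With the default (chunk size taken from the first slice, often 1),
# a ten-year daily time coordinate needs one request per step:
print(num_coordinate_requests(3650, 1))    # -> 3650
# With a larger chunk size, as recommended in the guide:
print(num_coordinate_requests(3650, 365))  # -> 10
```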
 #### Missing Data
 
 To indicate missing data in a variable data array, a dedicated no-data or missing value
