@@ -323,18 +323,39 @@ multiple variables the wildcard variable name `*` can often be of help.
 
 #### Chunking
 
+Chunking refers to the subdivision of multidimensional data arrays into
+smaller multidimensional blocks. With the Zarr format, such blocks become
+individual data files after optional [data packing](#data-packing) and
+[compression](#compression). The chunk sizes of a block's dimensions
+therefore determine the number of blocks used per data array as well as
+their size. Hence, chunk sizes have a very large impact on the I/O
+performance of a dataset, especially if it is persisted in a remote
+filesystem such as S3. Chunk sizes are specified using the `chunks`
+setting in the encoding of each variable. The value of `chunks` can also
+be `null`, which means no chunking is desired and the variable's data
+array will be persisted as a single block.
+
 By default, the chunking of the coordinate variable corresponding to the append
-dimension will be its dimension in the first slice dataset. Often, this will be one or
-a small number. Since `xarray` loads coordinates eagerly when opening a dataset, this
-can lead to performance issues if the target dataset is served from object storage such
-as S3. This is because, a separate HTTP request is required for every single chunk. It
-is therefore very advisable to set the chunks of that variable to a larger number using
-the `chunks` setting. For other variables, the chunking within the append dimension may
-stay small if desired:
+dimension will be its dimension size in the first slice dataset. Often, this
+size will be `1` or another small number. Since `xarray` loads coordinates
+eagerly when opening a dataset, this can lead to performance issues if the
+target dataset is served from object storage such as S3, because a separate
+HTTP request is required for every single chunk. It is therefore advisable
+to set the chunk size of that variable to a larger number using the `chunks`
+setting. For other variables, you can still use a small chunk size in the
+append dimension.
+
+Here is a typical chunking configuration for the append dimension `"time"`:
 
 ```json
 {
+  "append_dim": "time",
   "variables": {
+    "*": {
+      "encoding": {
+        "chunks": null
+      }
+    },
     "time": {
       "dims": ["time"],
       "encoding": {
@@ -351,6 +372,28 @@ stay small if desired:
 }
 ```
 
+Sometimes you may explicitly wish not to chunk a given dimension of a
+variable. If you know the size of that dimension in advance, you can use
+it as the chunk size. But there are situations where the final dimension
+size depends on some processing parameters. For example, you could define
+your own [slice source](#slice-sources) that takes a geodetic bounding box
+`bbox` parameter to spatially crop your variables in the `x` and `y`
+dimensions. If you want such dimensions to stay unchunked, you can set
+their chunk sizes to `null` (`None` in Python):
+
+```json
+{
+  "variables": {
+    "chl": {
+      "dims": ["time", "y", "x"],
+      "encoding": {
+        "chunks": [1, null, null]
+      }
+    }
+  }
+}
+```
+
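+The impact on object storage can be estimated with a quick
+back-of-the-envelope calculation: per array, the number of stored Zarr
+objects is at most the product over all dimensions of
+`ceil(size / chunk_size)`. An illustrative sketch (the sizes are made-up
+example values, not defaults of this tool):
+
+```python
+import math
+
+# 1-D "time" coordinate with 10,000 values:
+math.ceil(10_000 / 1)     # chunk size 1      -> 10,000 objects/requests
+math.ceil(10_000 / 2048)  # "chunks": [2048]  -> 5 objects/requests
+```
+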
 #### Missing Data
 
 To indicate missing data in a variable data array, a dedicated no-data or missing value