Commit f67c6eb

Merge pull request #79 from bcdev/forman-77-null_chunk_sizes
Null chunk sizes
2 parents e7c3ce3 + 903e91f commit f67c6eb

10 files changed: +268 −92 lines changed

CHANGES.md

Lines changed: 3 additions & 0 deletions
@@ -10,6 +10,9 @@
   local file path or URI of type `str` or `FileObj`.
   Dropped concept of _slice factories_ entirely. [#78]
 
+* Chunk sizes can now be `null` for a given dimension. In this case the actual
+  chunk size used is the size of the array's shape in that dimension. [#77]
+
 * Internal refactoring: Extracted `Config` class out of `Context` and
   made available via new `Context.config: Config` property.
   The change concerns any usages of the `ctx: Context` argument passed to
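
The new rule stated in this changelog entry is easy to express in code. Below is a minimal illustrative sketch of the resolution behavior; `resolve_chunks` is a hypothetical helper written for this note, not zappend's internal API:

```python
# Sketch of the rule above: a `null` (None) entry among the configured chunk
# sizes falls back to the array's size in that dimension, and `chunks: null`
# as a whole leaves the array unchunked.
# NOTE: `resolve_chunks` is a hypothetical helper, not zappend's actual API.

def resolve_chunks(
    chunks: list[int | None] | None, shape: tuple[int, ...]
) -> tuple[int, ...]:
    if chunks is None:
        return shape  # no chunking: one block covering the whole array
    return tuple(s if c is None else c for c, s in zip(chunks, shape))


assert resolve_chunks([1, None, None], (3, 1000, 2000)) == (1, 1000, 2000)
assert resolve_chunks(None, (3, 1000, 2000)) == (3, 1000, 2000)
```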

docs/config.md

Lines changed: 11 additions & 4 deletions
@@ -84,10 +84,17 @@ Variable metadata.
 Must be one of the following:
 
 * Type _array_.
-  Chunk sizes in the order of the dimensions.
-  The items of the array are of type _integer_.
+  Chunk sizes for each dimension of the variable.
+  The items of the array must be one of the following:
+
+  * Type _integer_.
+    Dimension is chunked using given size.
+
+  * Disable chunking in this dimension.
+    Its value is `null`.
 
-* Disable chunking.
+* Disable chunking in all dimensions.
   Its value is `null`.
 
 
@@ -243,7 +250,7 @@ Options for the filesystem given by the protocol of `temp_dir`.
 ## `force_new`
 
 Type _boolean_.
-Force creation of a new target dataset. An existing target dataset (and its lock) will be permanently deleted before appending of slice datasets begins. WARNING: the deletion cannot be rolled back.
+Force creation of a new target dataset. An existing target dataset (and its lock) will be permanently deleted before appending of slice datasets begins. WARNING: the deletion cannot be rolled back.
 Defaults to `false`.
 
 ## `disable_rollback`
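
For illustration, the `chunks` setting documented above can also be passed through zappend's Python API, where JSON `null` becomes Python `None`. A hedged sketch (the `zappend.api` import follows the project's tests; the `chl` variable and the data are made up):

```python
import numpy as np
import xarray as xr
from zappend.api import zappend

# Two small slice datasets with a "chl" variable over (time, y, x).
slices = [
    xr.Dataset(
        {"chl": (("time", "y", "x"), np.zeros((1, 50, 100)))},
        coords={"time": [i]},
    )
    for i in range(2)
]

# Chunk "chl" with size 1 along "time"; `None` (JSON `null`) disables
# chunking in "y" and "x", so their chunk size becomes the dimension size.
zappend(
    slices,
    target_dir="target.zarr",
    variables={"chl": {"encoding": {"chunks": [1, None, None]}}},
)
```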

docs/guide.md

Lines changed: 50 additions & 7 deletions
@@ -323,18 +323,39 @@ multiple variables the wildcard variable name `*` can often be of help.
 
 #### Chunking
 
+Chunking refers to the subdivision of multidimensional data arrays into
+smaller multidimensional blocks. Using the Zarr format, such blocks become
+individual data files after optional [data packing](#data-packing)
+and [compression](#compression). The chunk sizes of the
+dimensions of the multidimensional blocks therefore determine the number of
+blocks used per data array and also their size. Hence, chunk sizes have
+a very large impact on the I/O performance of datasets, especially if they are
+persisted in remote filesystems such as S3. The chunk sizes are specified
+using the `chunks` setting in the encoding of each variable.
+The value of `chunks` can also be `null`, which means no chunking is
+desired and the variable's data array will be persisted as one block.
+
 By default, the chunking of the coordinate variable corresponding to the append
-dimension will be its dimension in the first slice dataset. Often, this will be one or
-a small number. Since `xarray` loads coordinates eagerly when opening a dataset, this
-can lead to performance issues if the target dataset is served from object storage such
-as S3. This is because, a separate HTTP request is required for every single chunk. It
-is therefore very advisable to set the chunks of that variable to a larger number using
-the `chunks` setting. For other variables, the chunking within the append dimension may
-stay small if desired:
+dimension will be its dimension size in the first slice dataset. Often, the size
+will be `1` or another small number. Since `xarray` loads coordinates eagerly
+when opening a dataset, this can lead to performance issues if the target
+dataset is served from object storage such as S3. The reason for this is that a
+separate HTTP request is required for every single chunk. It is therefore highly
+advisable to set the chunks of that variable to a larger number using the
+`chunks` setting. For other variables, you could still use a small chunk size
+in the append dimension.
+
+Here is a typical chunking configuration for the append dimension `"time"`:
 
 ```json
 {
+  "append_dim": "time",
   "variables": {
+    "*": {
+      "encoding": {
+        "chunks": null
+      }
+    },
     "time": {
       "dims": ["time"],
       "encoding": {

@@ -351,6 +372,28 @@ stay small if desired:
 }
 ```
 
+Sometimes, you may explicitly wish to not chunk a given dimension of a variable.
+If you know the size of that dimension in advance, you can then use its size as
+the chunk size. But there are situations where the final dimension size depends
+on some processing parameters. For example, you could define your own
+[slice source](#slice-sources) that takes a geodetic bounding box `bbox`
+parameter to spatially crop your variables in the `x` and `y` dimensions.
+If you want such dimensions to not be chunked, you can set their chunk sizes
+to `null` (`None` in Python):
+
+```json
+{
+  "variables": {
+    "chl": {
+      "dims": ["time", "y", "x"],
+      "encoding": {
+        "chunks": [1, null, null]
+      }
+    }
+  }
+}
+```
+
 #### Missing Data
 
 To indicate missing data in a variable data array, a dedicated no-data or missing value
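
The bounding-box scenario described in the new guide text can be sketched as follows. This is an illustrative example modeled on the tests added in this PR: the index-based margin crop stands in for real `bbox` handling, and the input data is synthetic:

```python
import numpy as np
import xarray as xr
from zappend.api import zappend

def crop_ds(slice_ds: xr.Dataset) -> xr.Dataset:
    # Stand-in for real bbox cropping: cut a 5-cell margin in x and y,
    # so the final x and y sizes are not known before processing.
    w, h = slice_ds.x.size, slice_ds.y.size
    return slice_ds.isel(x=slice(5, w - 5), y=slice(5, h - 5))

slices = [
    xr.Dataset(
        {"chl": (("time", "y", "x"), np.zeros((1, 50, 100)))},
        coords={"time": [i]},
    )
    for i in range(2)
]

# `None` in the "y" and "x" positions means: accept whatever size the
# cropped dimensions end up with, i.e. do not chunk them.
zappend(
    slices,
    target_dir="cropped.zarr",
    slice_source=crop_ds,
    variables={"chl": {"encoding": {"chunks": [1, None, None]}}},
)
```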

tests/config/test_normalize.py

Lines changed: 12 additions & 32 deletions
@@ -93,6 +93,14 @@ def test_it_raises_if_config_is_not_object(self):
             normalize_config(file_obj)
 
     def test_normalize_sequence(self):
+        data_var_spec = {
+            "dims": ("time", "y", "x"),
+            "encoding": {
+                "dtype": "float32",
+                "chunks": (1, 20, 30),
+                "fill_value": None,
+            },
+        }
         configs = (
             {
                 "version": 1,

@@ -129,22 +137,8 @@
             },
             {
                 "variables": {
-                    "chl": {
-                        "dims": ("time", "y", "x"),
-                        "encoding": {
-                            "dtype": "float32",
-                            "chunks": (1, 20, 30),
-                            "fill_value": None,
-                        },
-                    },
-                    "tsm": {
-                        "dims": ("time", "y", "x"),
-                        "encoding": {
-                            "dtype": "float32",
-                            "chunks": (1, 20, 30),
-                            "fill_value": None,
-                        },
-                    },
+                    "chl": data_var_spec,
+                    "tsm": data_var_spec,
                 }
             },
         )

@@ -170,22 +164,8 @@
                     "dims": "time",
                     "encoding": {"dtype": "uint64"},
                 },
-                "chl": {
-                    "dims": ("time", "y", "x"),
-                    "encoding": {
-                        "dtype": "float32",
-                        "chunks": (1, 20, 30),
-                        "fill_value": None,
-                    },
-                },
-                "tsm": {
-                    "dims": ("time", "y", "x"),
-                    "encoding": {
-                        "dtype": "float32",
-                        "chunks": (1, 20, 30),
-                        "fill_value": None,
-                    },
-                },
+                "chl": data_var_spec,
+                "tsm": data_var_spec,
             },
         },
         normalize_config(configs),

tests/test_api.py

Lines changed: 78 additions & 17 deletions
@@ -142,10 +142,20 @@ def test_some_slices_local_output_to_existing_dir_force_new(self):
         zappend(slices, target_dir=target_dir, force_new=True)
         self.assertEqual(False, lock_file.exists())
 
-    def test_some_slices_with_class_slice_source(self):
+    def test_some_slices_with_slice_source_class(self):
+        class DropTsm(SliceSource):
+            def __init__(self, slice_ds):
+                self.slice_ds = slice_ds
+
+            def get_dataset(self) -> xr.Dataset:
+                return self.slice_ds.drop_vars(["tsm"])
+
+            def dispose(self):
+                pass
+
         target_dir = "memory://target.zarr"
         slices = [make_test_dataset(index=3 * i) for i in range(3)]
-        zappend(slices, target_dir=target_dir, slice_source=MySliceSource)
+        zappend(slices, target_dir=target_dir, slice_source=DropTsm)
         ds = xr.open_zarr(target_dir)
         self.assertEqual({"time": 9, "y": 50, "x": 100}, ds.sizes)
         self.assertEqual({"chl"}, set(ds.data_vars))

@@ -158,13 +168,13 @@ def test_some_slices_with_class_slice_source(self):
             ds.attrs,
         )
 
-    def test_some_slices_with_func_slice_source(self):
-        def process_slice(slice_ds: xr.Dataset) -> SliceSource:
-            return MySliceSource(slice_ds)
+    def test_some_slices_with_slice_source_func(self):
+        def drop_tsm(slice_ds: xr.Dataset) -> xr.Dataset:
+            return slice_ds.drop_vars(["tsm"])
 
         target_dir = "memory://target.zarr"
         slices = [make_test_dataset(index=3 * i) for i in range(3)]
-        zappend(slices, target_dir=target_dir, slice_source=process_slice)
+        zappend(slices, target_dir=target_dir, slice_source=drop_tsm)
         ds = xr.open_zarr(target_dir)
         self.assertEqual({"time": 9, "y": 50, "x": 100}, ds.sizes)
         self.assertEqual({"chl"}, set(ds.data_vars))

@@ -177,6 +187,68 @@ def process_slice(slice_ds: xr.Dataset) -> SliceSource:
             ds.attrs,
         )
 
+    # See https://github.com/bcdev/zappend/issues/77
+    def test_some_slices_with_cropping_slice_source_no_chunks_spec(self):
+        def crop_ds(slice_ds: xr.Dataset) -> xr.Dataset:
+            w = slice_ds.x.size
+            h = slice_ds.y.size
+            return slice_ds.isel(x=slice(5, w - 5), y=slice(5, h - 5))
+
+        target_dir = "memory://target.zarr"
+        slices = [make_test_dataset(index=3 * i) for i in range(3)]
+        zappend(slices, target_dir=target_dir, slice_source=crop_ds)
+        ds = xr.open_zarr(target_dir)
+        self.assertEqual({"time": 9, "y": 40, "x": 90}, ds.sizes)
+        self.assertEqual({"chl", "tsm"}, set(ds.data_vars))
+        self.assertEqual({"time", "y", "x"}, set(ds.coords))
+        self.assertEqual((90,), ds.x.encoding.get("chunks"))
+        self.assertEqual((40,), ds.y.encoding.get("chunks"))
+        self.assertEqual((3,), ds.time.encoding.get("chunks"))
+        # Chunk sizes are the ones of the original array, because we have not
+        # specified chunks in encoding.
+        self.assertEqual((1, 25, 45), ds.chl.encoding.get("chunks"))
+        self.assertEqual((1, 25, 45), ds.tsm.encoding.get("chunks"))
+
+    # See https://github.com/bcdev/zappend/issues/77
+    def test_some_slices_with_cropping_slice_source_with_chunks_spec(self):
+        def crop_ds(slice_ds: xr.Dataset) -> xr.Dataset:
+            w = slice_ds.x.size
+            h = slice_ds.y.size
+            return slice_ds.isel(x=slice(5, w - 5), y=slice(5, h - 5))
+
+        variables = {
+            "*": {
+                "encoding": {
+                    "chunks": None,
+                }
+            },
+            "chl": {
+                "encoding": {
+                    "chunks": [1, None, None],
+                }
+            },
+            "tsm": {
+                "encoding": {
+                    "chunks": [None, 25, 50],
+                }
+            },
+        }
+
+        target_dir = "memory://target.zarr"
+        slices = [make_test_dataset(index=3 * i) for i in range(3)]
+        zappend(
+            slices, target_dir=target_dir, slice_source=crop_ds, variables=variables
+        )
+        ds = xr.open_zarr(target_dir)
+        self.assertEqual({"time": 9, "y": 40, "x": 90}, ds.sizes)
+        self.assertEqual({"chl", "tsm"}, set(ds.data_vars))
+        self.assertEqual({"time", "y", "x"}, set(ds.coords))
+        self.assertEqual((90,), ds.x.encoding.get("chunks"))
+        self.assertEqual((40,), ds.y.encoding.get("chunks"))
+        self.assertEqual((3,), ds.time.encoding.get("chunks"))
+        self.assertEqual((1, 40, 90), ds.chl.encoding.get("chunks"))
+        self.assertEqual((3, 25, 50), ds.tsm.encoding.get("chunks"))
+
     def test_some_slices_with_inc_append_step(self):
         target_dir = "memory://target.zarr"
         slices = [make_test_dataset(index=i, shape=(1, 50, 100)) for i in range(3)]

@@ -391,14 +463,3 @@ def test_some_slices_with_profiling(self):
         finally:
             if os.path.exists("prof.out"):
                 os.remove("prof.out")
-
-
-class MySliceSource(SliceSource):
-    def __init__(self, slice_ds):
-        self.slice_ds = slice_ds
-
-    def get_dataset(self) -> xr.Dataset:
-        return self.slice_ds.drop_vars(["tsm"])
-
-    def dispose(self):
-        pass
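
A note on the expected values in the second new test above: each cropped slice has sizes `time=3`, `y=40`, `x=90`, and a `None` entry resolves to the corresponding slice dimension size. Hence `[1, None, None]` for `chl` yields chunks `(1, 40, 90)`, and `[None, 25, 50]` for `tsm` yields `(3, 25, 50)`.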

tests/test_metadata.py

Lines changed: 42 additions & 0 deletions
@@ -291,6 +291,47 @@ def test_variable_encoding_from_netcdf(self):
             ).to_dict(),
         )
 
+    def test_variable_encoding_can_deal_with_chunk_size_none(self):
+        # See https://github.com/bcdev/zappend/issues/77
+        a = xr.DataArray(np.zeros((2, 3, 4)), dims=("time", "y", "x"))
+        b = xr.DataArray(np.zeros((2, 3, 4)), dims=("time", "y", "x"))
+        self.assertEqual(
+            {
+                "attrs": {},
+                "sizes": {"time": 2, "x": 4, "y": 3},
+                "variables": {
+                    "a": {
+                        "attrs": {},
+                        "dims": ("time", "y", "x"),
+                        "encoding": {"chunks": (1, 3, 4)},
+                        "shape": (2, 3, 4),
+                    },
+                    "b": {
+                        "attrs": {},
+                        "dims": ("time", "y", "x"),
+                        "encoding": {"chunks": (2, 2, 3)},
+                        "shape": (2, 3, 4),
+                    },
+                },
+            },
+            DatasetMetadata.from_dataset(
+                xr.Dataset(
+                    {
+                        "a": a,
+                        "b": b,
+                    }
+                ),
+                make_config(
+                    {
+                        "variables": {
+                            "a": {"encoding": {"chunks": [1, None, None]}},
+                            "b": {"encoding": {"chunks": [None, 2, 3]}},
+                        },
+                    }
+                ),
+            ).to_dict(),
+        )
+
     def test_variable_encoding_normalisation(self):
         def normalize(k, v):
             metadata = DatasetMetadata.from_dataset(

@@ -363,6 +404,7 @@ def test_it_raises_on_unspecified_variable(self):
             ),
         )
 
+    # noinspection PyMethodMayBeStatic
     def test_it_raises_on_wrong_size_found_in_ds(self):
         with pytest.raises(
             ValueError,

zappend/config/config.py

Lines changed: 1 addition & 1 deletion
@@ -184,4 +184,4 @@ def logging(self) -> dict[str, Any] | str | bool | None:
     @property
     def profiling(self) -> dict[str, Any] | str | bool | None:
         """Profiling configuration."""
-        return self._config.get("profiling") or {}
+        return self._config.get("profiling")

zappend/config/schema.py

Lines changed: 15 additions & 3 deletions
@@ -73,11 +73,23 @@
         "description": "Storage chunking.",
         "oneOf": [
             {
-                "description": "Chunk sizes in the order of the dimensions.",
+                "description": "Chunk sizes for each dimension of the variable.",
                 "type": "array",
-                "items": {"type": "integer", "minimum": 1},
+                "items": {
+                    "oneOf": [
+                        {
+                            "description": "Dimension is chunked using given size.",
+                            "type": "integer",
+                            "minimum": 1,
+                        },
+                        {
+                            "description": "Disable chunking in this dimension.",
+                            "const": None,
+                        },
+                    ]
+                },
             },
-            {"description": "Disable chunking.", "const": None},
+            {"description": "Disable chunking in all dimensions.", "const": None},
         ],
     },
     fill_value={
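
To see what the revised schema accepts and rejects, here is a hedged sketch that validates a few chunk specifications against a standalone copy of the `oneOf` above, using the third-party `jsonschema` package (the copy mirrors this diff but is not zappend's actual schema object):

```python
import jsonschema

# Standalone copy of the "chunks" schema from the diff above.
chunks_schema = {
    "oneOf": [
        {
            "type": "array",
            "items": {
                "oneOf": [
                    {"type": "integer", "minimum": 1},
                    {"const": None},
                ]
            },
        },
        {"const": None},
    ]
}

# Accepted: per-dimension sizes, per-dimension nulls, or null for all.
for value in ([1, 20, 30], [1, None, None], None):
    jsonschema.validate(value, chunks_schema)

# Rejected: chunk sizes below 1.
try:
    jsonschema.validate([0, 10], chunks_schema)
except jsonschema.ValidationError as err:
    print("rejected:", err.message)
```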
