
Commit 831e687

Authored by d-v-b, jakirkham, and joshmoore
Optimize setitem with chunk equal to fill_value, round 2 (#738)
* Consolidate encode/store in _chunk_setitem_nosync. Matches how these lines are written in `_set_basic_selection_zd`.
* Clear key-value pair if chunk is just fill value. Add a simple check to see if the key-value pair is being set with a chunk equal to the fill value. If so, simply delete the key-value pair instead of storing a chunk that only contains the fill value. The Array will behave the same externally, but this cuts down on the space required to store the Array, and ensures that copying one Array to another won't dramatically affect the storage size.
* set empty chunk write behavior via array constructor
* add rudimentary tests, np.equal -> np.array_equal
* add test for chunk deletion
* add flattening function
* add kwarg for empty writes to array creators
* fix check for chunk equality to fill value
* flake8
* add None check to setitems
* add write_empty_chunks to output of __getstate__
* flake8
* add partial decompress to __getstate__
* functionalize emptiness checks and key deletion
* flake8
* add path for delitems, and add some failing tests
* flake8
* add delitems method to FSStore, and correspondingly change zarr.Array behavior
* add nested + empty writes test
* set write_empty_chunks to True by default
* rename chunk_is_empty method and clean up logic in _chunk_setitem
* rename test
* add test for flatten
* initial support for using delitems api in chunk_setitems
* flake8
* strip path separator that was breaking store.listdir
* change tests to check for empty writing behavior
* bump fsspec and s3fs versions
* delitems: only attempt to delete keys that exist
* don't pass empty collections to self.map.delitems
* flake8
* use main branch of fsspec until a new release is cut
* add support for checking if a chunk is all NaNs in _chunk_is_empty
* docstring tweaks
* clean up empty_write tests
* fix hexdigests for FSStore + empty writes, and remove redundant nchunks_initialized test
* resolve merge conflicts in favor of master
* set write_empty_chunks to True by default; put chunk emptiness checking in a function in util.py; optimize chunk emptiness checking
* remove np.typing import
* use public attribute in test_nchunks_initialized
* remove duplicated logic in _chunk_setitems, instead using _chunk_delitems; clean up 0d empty writes; add coverage exemptions
* expand 0d tests and nchunks_initialized tests to hit more parts of the write_empty_chunks logic
* remove return type annotation for all_equal that was breaking CI
* refactor write_empty_chunks tests by expanding the create_array logic in the base test class, remove standalone write_empty_chunks tests
* correctly handle merge from upstream master
* don't use os.path.join for constructing a chunk key; instead use _chunk_key method
* complete removal of os.path.join calls
* add coverage exemption to type error branch in all_equal
* remove unreachable conditionals in n5 tests
* instantiate ReadOnlyError
* add explicit delitems and setitems calls to readonly fsstore tests
* Update docstrings
* Update requirements_dev_optional
* Add changelog

Co-authored-by: John Kirkham <[email protected]>
Co-authored-by: Josh Moore <[email protected]>
Co-authored-by: jmoore <[email protected]>
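The central behavior described in this commit message — deleting a chunk's store key instead of writing a chunk that only contains the fill value — can be sketched against a plain dict acting as the store. `store_chunk` and this simplified `all_equal` are hypothetical stand-ins for zarr's `_chunk_setitem_nosync` and `zarr.util.all_equal`; real chunks are NumPy arrays and are encoded before storage.

```python
def all_equal(value, data):
    # stand-in for zarr.util.all_equal: True when every element equals value
    return all(item == value for item in data)


def store_chunk(store, key, chunk, fill_value, write_empty_chunks=True):
    """Store a chunk, or delete its key when the chunk is all fill_value."""
    if not write_empty_chunks and all_equal(fill_value, chunk):
        # delete rather than store; tolerate a missing key
        store.pop(key, None)
    else:
        store[key] = chunk


store = {}
store_chunk(store, "0.0", [1, 2, 3], fill_value=0, write_empty_chunks=False)
store_chunk(store, "0.1", [0, 0, 0], fill_value=0, write_empty_chunks=False)
# only the chunk holding non-fill-value data was stored
print(sorted(store))  # ['0.0']
```

Externally the array reads back identically, because missing chunks are materialized from the fill value on read.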
1 parent 7c31f04 commit 831e687

File tree

9 files changed: +314 −54 lines changed


docs/release.rst

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,9 @@ Enhancements
 * array indexing with [] (getitem and setitem) now supports fancy indexing.
   By :user:`Juan Nunez-Iglesias <jni>`; :issue:`725`.

+* write_empty_chunks=False deletes chunks consisting of only fill_value.
+  By :user:`Davis Bennett <d-v-b>`; :issue:`738`.
+
 .. _release_2.10.2:

 2.10.2

zarr/core.py

Lines changed: 72 additions & 13 deletions
@@ -32,6 +32,7 @@
 from zarr.meta import decode_array_metadata, encode_array_metadata
 from zarr.storage import array_meta_key, attrs_key, getsize, listdir
 from zarr.util import (
+    all_equal,
     InfoReporter,
     check_array_shape,
     human_readable_size,
@@ -75,6 +76,14 @@ class Array:
         If True and the chunk_store is an FSStore and the compression used
         is Blosc, when getting data from the array chunks will be partially
         read and decompressed when possible.
+    write_empty_chunks : bool, optional
+        If True (default), all chunks will be stored regardless of their
+        contents. If False, each chunk is compared to the array's fill
+        value prior to storing. If a chunk is uniformly equal to the fill
+        value, then that chunk is not stored, and the store entry for
+        that chunk's key is deleted. This setting enables sparser storage,
+        as only chunks with non-fill-value data are stored, at the expense
+        of overhead associated with checking the data of each chunk.

         .. versionadded:: 2.7
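The commit log also mentions "checking if a chunk is all nans". A plain equality test cannot detect that case, because `NaN != NaN`. The sketch below shows the idea with a pure-Python `all_equal`; the real helper lives in `zarr.util` and operates on NumPy arrays, so this is an illustrative assumption, not zarr's actual implementation.

```python
import math


def all_equal(value, data):
    """True when every element of data equals value, treating NaN specially."""
    if isinstance(value, float) and math.isnan(value):
        # NaN != NaN, so an all-NaN fill value needs an explicit isnan check
        return all(isinstance(x, float) and math.isnan(x) for x in data)
    return all(x == value for x in data)


nan = float("nan")
print(all_equal(nan, [nan, nan]))  # True
print(all_equal(0, [0, 0, 1]))     # False
```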
@@ -107,6 +116,7 @@ class Array:
     info
     vindex
     oindex
+    write_empty_chunks

     Methods
     -------
@@ -139,6 +149,7 @@ def __init__(
         cache_metadata=True,
         cache_attrs=True,
         partial_decompress=False,
+        write_empty_chunks=True,
     ):
         # N.B., expect at this point store is fully initialized with all
         # configuration metadata fully specified and normalized
@@ -155,6 +166,7 @@
         self._cache_metadata = cache_metadata
         self._is_view = False
         self._partial_decompress = partial_decompress
+        self._write_empty_chunks = write_empty_chunks

         # initialize metadata
         self._load_metadata()
@@ -455,6 +467,13 @@ def vindex(self):
         :func:`set_mask_selection` for documentation and examples."""
         return self._vindex

+    @property
+    def write_empty_chunks(self) -> bool:
+        """A Boolean, True if chunks composed of the array's fill value
+        will be stored. If False, such chunks will not be stored.
+        """
+        return self._write_empty_chunks
+
     def __eq__(self, other):
         return (
             isinstance(other, Array) and
@@ -1626,9 +1645,18 @@ def _set_basic_selection_zd(self, selection, value, fields=None):
         else:
             chunk[selection] = value

-        # encode and store
-        cdata = self._encode_chunk(chunk)
-        self.chunk_store[ckey] = cdata
+        # remove chunk if write_empty_chunks is false and it only contains the fill value
+        if (not self.write_empty_chunks) and all_equal(self.fill_value, chunk):
+            try:
+                del self.chunk_store[ckey]
+                return
+            except Exception:  # pragma: no cover
+                # deleting failed, fallback to overwriting
+                pass
+        else:
+            # encode and store
+            cdata = self._encode_chunk(chunk)
+            self.chunk_store[ckey] = cdata

     def _set_basic_selection_nd(self, selection, value, fields=None):
         # implementation of __setitem__ for array with at least one dimension
@@ -1896,11 +1924,38 @@ def _chunk_getitems(self, lchunk_coords, lchunk_selection, out, lout_selection,
             out[out_select] = fill_value

     def _chunk_setitems(self, lchunk_coords, lchunk_selection, values, fields=None):
-        ckeys = [self._chunk_key(co) for co in lchunk_coords]
-        cdatas = [self._process_for_setitem(key, sel, val, fields=fields)
-                  for key, sel, val in zip(ckeys, lchunk_selection, values)]
-        values = {k: v for k, v in zip(ckeys, cdatas)}
-        self.chunk_store.setitems(values)
+        ckeys = map(self._chunk_key, lchunk_coords)
+        cdatas = {key: self._process_for_setitem(key, sel, val, fields=fields)
+                  for key, sel, val in zip(ckeys, lchunk_selection, values)}
+        to_store = {}
+        if not self.write_empty_chunks:
+            empty_chunks = {k: v for k, v in cdatas.items() if all_equal(self.fill_value, v)}
+            self._chunk_delitems(empty_chunks.keys())
+            nonempty_keys = cdatas.keys() - empty_chunks.keys()
+            to_store = {k: self._encode_chunk(cdatas[k]) for k in nonempty_keys}
+        else:
+            to_store = {k: self._encode_chunk(v) for k, v in cdatas.items()}
+        self.chunk_store.setitems(to_store)
+
+    def _chunk_delitems(self, ckeys):
+        if hasattr(self.store, "delitems"):
+            self.store.delitems(ckeys)
+        else:  # pragma: no cover
+            # exempting this branch from coverage as there are no extant stores
+            # that will trigger this condition, but it's possible that they
+            # will be developed in the future.
+            tuple(map(self._chunk_delitem, ckeys))
+        return None
+
+    def _chunk_delitem(self, ckey):
+        """
+        Attempt to delete the value associated with ckey.
+        """
+        try:
+            del self.chunk_store[ckey]
+            return
+        except KeyError:
+            return

     def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
         """Replace part or whole of a chunk.
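The batched path in `_chunk_setitems` above partitions processed chunks into empty and non-empty sets, bulk-deletes the former, and bulk-stores the latter. A minimal sketch of that partitioning over a plain dict store (`set_chunks` is a hypothetical simplification; the real method also encodes chunks and delegates deletion to the store):

```python
def set_chunks(store, cdatas, fill_value, write_empty_chunks):
    """Mimic _chunk_setitems: split processed chunks into empty and non-empty."""
    def all_equal(value, data):
        return all(x == value for x in data)

    if not write_empty_chunks:
        # chunks uniformly equal to the fill value get their keys deleted
        empty_keys = {k for k, v in cdatas.items() if all_equal(fill_value, v)}
        for k in empty_keys:
            store.pop(k, None)
        to_store = {k: cdatas[k] for k in cdatas.keys() - empty_keys}
    else:
        to_store = dict(cdatas)
    store.update(to_store)  # stands in for chunk_store.setitems


store = {"0.1": [5, 5]}
set_chunks(store, {"0.0": [1, 2], "0.1": [0, 0]}, fill_value=0, write_empty_chunks=False)
print(sorted(store))  # ['0.0'] — the now-empty chunk '0.1' was deleted
```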
@@ -1931,8 +1986,12 @@ def _chunk_setitem(self, chunk_coords, chunk_selection, value, fields=None):
     def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=None):
         ckey = self._chunk_key(chunk_coords)
         cdata = self._process_for_setitem(ckey, chunk_selection, value, fields=fields)
-        # store
-        self.chunk_store[ckey] = cdata
+
+        # attempt to delete chunk if it only contains the fill value
+        if (not self.write_empty_chunks) and all_equal(self.fill_value, cdata):
+            self._chunk_delitem(ckey)
+        else:
+            self.chunk_store[ckey] = self._encode_chunk(cdata)

     def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
         if is_total_slice(chunk_selection, self._chunks) and not fields:
@@ -1988,8 +2047,7 @@ def _process_for_setitem(self, ckey, chunk_selection, value, fields=None):
         else:
             chunk[chunk_selection] = value

-        # encode chunk
-        return self._encode_chunk(chunk)
+        return chunk

     def _chunk_key(self, chunk_coords):
         return self._key_prefix + self._dimension_separator.join(map(str, chunk_coords))
@@ -2209,7 +2267,8 @@ def hexdigest(self, hashname="sha1"):

     def __getstate__(self):
         return (self._store, self._path, self._read_only, self._chunk_store,
-                self._synchronizer, self._cache_metadata, self._attrs.cache)
+                self._synchronizer, self._cache_metadata, self._attrs.cache,
+                self._partial_decompress, self._write_empty_chunks)

     def __setstate__(self, state):
         self.__init__(*state)
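Adding `_partial_decompress` and `_write_empty_chunks` to `__getstate__` matters because `__setstate__` rebuilds the array by calling `__init__(*state)`: any constructor argument missing from the state tuple silently reverts to its default after unpickling. A minimal sketch of the pattern with a toy class (`ToyArray` is an assumption for illustration, not zarr's class):

```python
import pickle


class ToyArray:
    """Minimal sketch of the __getstate__/__setstate__ pattern used by zarr.Array."""

    def __init__(self, store, partial_decompress=False, write_empty_chunks=True):
        self._store = store
        self._partial_decompress = partial_decompress
        self._write_empty_chunks = write_empty_chunks

    def __getstate__(self):
        # every constructor argument must appear here, or it is silently
        # reset to its default after unpickling
        return (self._store, self._partial_decompress, self._write_empty_chunks)

    def __setstate__(self, state):
        self.__init__(*state)


z = ToyArray({}, write_empty_chunks=False)
z2 = pickle.loads(pickle.dumps(z))
print(z2._write_empty_chunks)  # False — the flag survives a pickle round-trip
```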

zarr/creation.py

Lines changed: 22 additions & 5 deletions
@@ -21,7 +21,7 @@ def create(shape, chunks=True, dtype=None, compressor='default',
            fill_value=0, order='C', store=None, synchronizer=None,
            overwrite=False, path=None, chunk_store=None, filters=None,
            cache_metadata=True, cache_attrs=True, read_only=False,
-           object_codec=None, dimension_separator=None, **kwargs):
+           object_codec=None, dimension_separator=None, write_empty_chunks=True, **kwargs):
     """Create an array.

     Parameters
@@ -71,6 +71,15 @@ def create(shape, chunks=True, dtype=None, compressor='default',
     dimension_separator : {'.', '/'}, optional
         Separator placed between the dimensions of a chunk.
         .. versionadded:: 2.8
+    write_empty_chunks : bool, optional
+        If True (default), all chunks will be stored regardless of their
+        contents. If False, each chunk is compared to the array's fill
+        value prior to storing. If a chunk is uniformly equal to the fill
+        value, then that chunk is not stored, and the store entry for
+        that chunk's key is deleted. This setting enables sparser storage,
+        as only chunks with non-fill-value data are stored, at the expense
+        of overhead associated with checking the data of each chunk.
+

     Returns
     -------
@@ -142,7 +151,8 @@ def create(shape, chunks=True, dtype=None, compressor='default',

     # instantiate array
     z = Array(store, path=path, chunk_store=chunk_store, synchronizer=synchronizer,
-              cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only)
+              cache_metadata=cache_metadata, cache_attrs=cache_attrs, read_only=read_only,
+              write_empty_chunks=write_empty_chunks)

     return z

@@ -400,6 +410,7 @@ def open_array(
     chunk_store=None,
     storage_options=None,
     partial_decompress=False,
+    write_empty_chunks=True,
     **kwargs
 ):
     """Open an array using file-mode-like semantics.
@@ -454,8 +465,14 @@ def open_array(
         If True and the chunk_store is an FSStore and the compression used
         is Blosc, when getting data from the array chunks will be partially
         read and decompressed when possible.
-
-        .. versionadded:: 2.7
+    write_empty_chunks : bool, optional
+        If True (default), all chunks will be stored regardless of their
+        contents. If False, each chunk is compared to the array's fill
+        value prior to storing. If a chunk is uniformly equal to the fill
+        value, then that chunk is not stored, and the store entry for
+        that chunk's key is deleted. This setting enables sparser storage,
+        as only chunks with non-fill-value data are stored, at the expense
+        of overhead associated with checking the data of each chunk.

     Returns
     -------
@@ -545,7 +562,7 @@ def open_array(

     # instantiate array
     z = Array(store, read_only=read_only, synchronizer=synchronizer,
               cache_metadata=cache_metadata, cache_attrs=cache_attrs, path=path,
-              chunk_store=chunk_store)
+              chunk_store=chunk_store, write_empty_chunks=write_empty_chunks)

     return z
zarr/storage.py

Lines changed: 9 additions & 0 deletions
@@ -1154,6 +1154,15 @@ def __delitem__(self, key):
         else:
             del self.map[key]

+    def delitems(self, keys):
+        if self.mode == 'r':
+            raise ReadOnlyError()
+        # only remove the keys that exist in the store
+        nkeys = [self._normalize_key(key) for key in keys if key in self]
+        # rm errors if you pass an empty collection
+        if len(nkeys) > 0:
+            self.map.delitems(nkeys)
+
     def __contains__(self, key):
         key = self._normalize_key(key)
         return key in self.map
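The two guards in `FSStore.delitems` — filtering to keys that exist and skipping the call entirely for an empty collection — exist because the underlying fsspec mapping errors on both cases, per the commit log ("delitems: only attempt to delete keys that exist", "don't pass empty collections to self.map.delitems"). A sketch with toy classes (`ToyFSMap` and `ToyStore` are assumptions that mimic that erroring behavior):

```python
class ToyFSMap(dict):
    """Hypothetical stand-in for the fsspec mapping behind FSStore."""

    def delitems(self, keys):
        if not keys:  # mimic fsspec erroring on an empty collection
            raise ValueError("no keys given")
        for k in keys:
            del self[k]  # raises KeyError on a missing key


class ToyStore:
    def __init__(self, mapping):
        self.map = ToyFSMap(mapping)

    def delitems(self, keys):
        # only remove the keys that exist, and never pass an
        # empty collection to the underlying map
        nkeys = [k for k in keys if k in self.map]
        if len(nkeys) > 0:
            self.map.delitems(nkeys)


store = ToyStore({"a": 1, "b": 2})
store.delitems(["a", "missing"])  # missing key is ignored, no error
store.delitems([])                # empty collection short-circuits
print(sorted(store.map))  # ['b']
```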
