Skip to content

Commit 569915d

Browse files
authored
Special case chunk encoding for dict chunk store (#359)
* Add `values` method to `CustomMapping` For doing a quick pass through the values within a `MutableMapping`, it is helpful to have the `values` method. This will be needed in some of the tests that we are adding. So add a quick implementation of `values` in the `CustomMapping`. * Test stores contain bytes-like data * Add a test to check that all values are bytes Currently this is a somewhat loose requirement that not all stores enforce. However it is a useful check to have in some cases. Particularly this is a useful constraint with in-memory stores where there is a concern that the data in the store might be manipulated externally due to ndarray views. The bytes instances don't have this problem as they are immutable and own their data. While views can be take onto bytes instances, they will be read-only views and thus are not a concern. For other stores that place their data in some other storage backend (e.g. on disk), this is less of a concern. Also other stores may choose to represent their data in other ways (e.g. LMDB with `memoryview`s). Manipulating the loaded data from these stores is less of a concern since as it doesn't affect their actually contents, only an in-memory representation. In these cases, it may make sense to disable this test. * Disable bytes value test for LMDBStore w/buffers It's possible to request that `LMDBStore` return `buffer`s/`memoryview`s instead of returning `bytes` for values. This is incredibly useful as LMDB is memory-mapped. So this avoids a copy when accessing the data. However this means it will fail `test_store_has_bytes_values`. Though this is ok as noted previously since that is not a hard requirement of stores. So we just disable that test for this `LMDBStore` case. The other `LMDBStore` case returns `bytes` instances instead (copying the data); so, it passes this test without issues. * Have CustomMapping ensure it stores bytes Since people will coming looking to see how a store should be implemented, we should show them good behavior. Namely we should ensure the values provided to `__setitem__` are `bytes`-like and immutable (once stored). By coercing the data to `bytes` we ensure that the data is `bytes`-like and we ensure the data is immutable since `bytes` are immutable. * Disable bytes value test for LRUStoreCache As the point of the `LRUStoreCache` is to merely hold onto values retrieved from the underlying store and keep them around in memory unaltered, the caching layer doesn't have any control over what type of values are returned to it. Thus it doesn't make much sense to test whether the values it returns are of `bytes` type or not. Though this is fine as there is not strict requirement that values of `bytes` type be returned by stores, so simply disable `test_store_has_bytes_values` for the `LRUStoreCache` test case with a note explaining this. * Ensure `bytes` in `_encode_chunk` for `dict` In `Array` when running `_encode_chunk` and using a `dict`-based chunk store, ensure that the chunk data is of `bytes` type. This is done to convert the underlying data to an immutable value to protect against external views onto the data from changing it (as is the case with NumPy arrays). Also this is done to ensure it is possible to compare `dict`-based stores easily. While coercing to a `bytes` object can introduce a copying, this generally won't happen if a compressor is involved as it will usually have returned a `bytes` object. Though filters may not, which could cause this to introduce an extra copy if only filters and no compressors are used. However this is an unlikely case and is not as important as guaranteeing the values own their own data and are read-only. Plus this should allow us to drop the preventive copy that happens earlier when storing values as this neatly handles that case of no filters and no compressors. * Skip copying when no compressor or filter is used This copy was taken primarily to protect in-memory stores from being mutated by external views of the array. However all stores we define (including in-memory ones like `DictStore`) already perform this safeguarding themselves on the values they store. For the builtin `dict` store, we perform this safeguarding for it (since `dict`'s don't do this themselves) by ensuring we only store `bytes` objects into them. As these are immutable and own their own data, there isn't a way to mutate their content after storing them. Thus this preventive copy here is not needed and can be dropped. * Note dict stores bytes and preemptive copy dropped * Ignore fail case coverage in bytes-like value test
1 parent 37672e1 commit 569915d

File tree

3 files changed

+51
-16
lines changed

3 files changed

+51
-16
lines changed

docs/release.rst

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -56,6 +56,11 @@ Bug fixes
5656
* Always use a ``tuple`` when indexing a NumPy ``ndarray``.
5757
By :user:`John Kirkham <jakirkham>`, :issue:`376`
5858

59+
* Ensure when ``Array`` uses a ``dict``-based chunk store that it only contains
60+
``bytes`` to facilitate comparisons and protect against writes. Drop the copy
61+
for the no filter/compressor case as this handles that case.
62+
By :user:`John Kirkham <jakirkham>`, :issue:`359`
63+
5964
Maintenance
6065
~~~~~~~~~~~
6166

zarr/core.py

Lines changed: 9 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -8,7 +8,7 @@
88

99

1010
import numpy as np
11-
from numcodecs.compat import ensure_ndarray
11+
from numcodecs.compat import ensure_bytes, ensure_ndarray
1212

1313

1414
from zarr.util import (is_total_slice, human_readable_size, normalize_resize_args,
@@ -1677,21 +1677,11 @@ def _chunk_setitem_nosync(self, chunk_coords, chunk_selection, value, fields=Non
16771677

16781678
else:
16791679

1680-
if not self._compressor and not self._filters:
1681-
1682-
# https://github.com/alimanfoo/zarr/issues/79
1683-
# Ensure a copy is taken so we don't end up storing
1684-
# a view into someone else's array.
1685-
# N.B., this assumes that filters or compressor always
1686-
# take a copy and never attempt to apply encoding in-place.
1687-
chunk = np.array(value, dtype=self._dtype, order=self._order)
1688-
1680+
# ensure array is contiguous
1681+
if self._order == 'F':
1682+
chunk = np.asfortranarray(value, dtype=self._dtype)
16891683
else:
1690-
# ensure array is contiguous
1691-
if self._order == 'F':
1692-
chunk = np.asfortranarray(value, dtype=self._dtype)
1693-
else:
1694-
chunk = np.ascontiguousarray(value, dtype=self._dtype)
1684+
chunk = np.ascontiguousarray(value, dtype=self._dtype)
16951685

16961686
else:
16971687
# partially replace the contents of this chunk
@@ -1788,6 +1778,10 @@ def _encode_chunk(self, chunk):
17881778
else:
17891779
cdata = chunk
17901780

1781+
# ensure in-memory data is immutable and easy to compare
1782+
if isinstance(self.chunk_store, dict):
1783+
cdata = ensure_bytes(cdata)
1784+
17911785
return cdata
17921786

17931787
def __repr__(self):

zarr/tests/test_core.py

Lines changed: 37 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,6 +23,7 @@
2323
from zarr.util import buffer_size
2424
from numcodecs import (Delta, FixedScaleOffset, Zlib, Blosc, BZ2, MsgPack, Pickle,
2525
Categorize, JSON, VLenUTF8, VLenBytes, VLenArray)
26+
from numcodecs.compat import ensure_bytes, ensure_ndarray
2627
from numcodecs.tests.common import greetings
2728

2829

@@ -83,6 +84,31 @@ def create_array(self, read_only=False, **kwargs):
8384
return Array(store, read_only=read_only, cache_metadata=cache_metadata,
8485
cache_attrs=cache_attrs)
8586

87+
def test_store_has_binary_values(self):
88+
# Initialize array
89+
np.random.seed(42)
90+
z = self.create_array(shape=(1050,), chunks=100, dtype='f8', compressor=[])
91+
z[:] = np.random.random(z.shape)
92+
93+
for v in z.chunk_store.values():
94+
try:
95+
ensure_ndarray(v)
96+
except TypeError: # pragma: no cover
97+
pytest.fail("Non-bytes-like value: %s" % repr(v))
98+
99+
def test_store_has_bytes_values(self):
100+
# Test that many stores do hold bytes values.
101+
# Though this is not a strict requirement.
102+
# Should be disabled by any stores that fail this as needed.
103+
104+
# Initialize array
105+
np.random.seed(42)
106+
z = self.create_array(shape=(1050,), chunks=100, dtype='f8', compressor=[])
107+
z[:] = np.random.random(z.shape)
108+
109+
# Check in-memory array only contains `bytes`
110+
assert all([isinstance(v, binary_type) for v in z.chunk_store.values()])
111+
86112
def test_nbytes_stored(self):
87113

88114
# dict as store
@@ -1401,6 +1427,9 @@ def create_array(read_only=False, **kwargs):
14011427
return Array(store, read_only=read_only, cache_metadata=cache_metadata,
14021428
cache_attrs=cache_attrs)
14031429

1430+
def test_store_has_bytes_values(self):
1431+
pass # returns values as memoryviews/buffers instead of bytes
1432+
14041433
def test_nbytes_stored(self):
14051434
pass # not implemented
14061435

@@ -1725,6 +1754,9 @@ def __init__(self):
17251754
def keys(self):
17261755
return self.inner.keys()
17271756

1757+
def values(self):
1758+
return self.inner.values()
1759+
17281760
def get(self, item, default=None):
17291761
try:
17301762
return self.inner[item]
@@ -1735,7 +1767,7 @@ def __getitem__(self, item):
17351767
return self.inner[item]
17361768

17371769
def __setitem__(self, item, value):
1738-
self.inner[item] = value
1770+
self.inner[item] = ensure_bytes(value)
17391771

17401772
def __delitem__(self, key):
17411773
del self.inner[key]
@@ -1846,3 +1878,7 @@ def create_array(read_only=False, **kwargs):
18461878
init_array(store, **kwargs)
18471879
return Array(store, read_only=read_only, cache_metadata=cache_metadata,
18481880
cache_attrs=cache_attrs)
1881+
1882+
def test_store_has_bytes_values(self):
1883+
# skip as the cache has no control over how the store provides values
1884+
pass

0 commit comments

Comments
 (0)