Commit 769ae52

Update documentation for cache module
1 parent 86de289 commit 769ae52

3 files changed, 55 additions and 23 deletions

docs/releases/development.rst

Lines changed: 3 additions & 3 deletions
@@ -11,6 +11,6 @@ Next release (in development)
   but the trade off is worth the added security
   after the invalid polygons found in :pr:`154`
   (:pr:`156`).
-* Added Dataset cache key methods and associated hashing utility methods,
-  as specified in issue #153.
-  (:pr:`158`).
+* Added new :mod:`emsarray.operations.cache` module
+  for generating cache keys based on dataset geometry.
+  (:issue:`153`, :pr:`158`).

src/emsarray/conventions/_base.py

Lines changed: 0 additions & 1 deletion
@@ -1955,7 +1955,6 @@ def normalize_depth_variables(
     def hash_geometry(self, hash: "hashlib._Hash") -> None:
         """
         Updates the provided hash with all of the relevant geometry data for this dataset.
-        Note this includes the attribute data contained within each geometry.

         Parameters
         ----------

src/emsarray/operations/cache.py

Lines changed: 52 additions & 19 deletions
@@ -1,5 +1,21 @@
 """
-Operations for making caching keys for a given dataset.
+Operations for making cache keys based on dataset geometry.
+
+Some operations such as :func:`~.operations.triangulate.triangulate_dataset`
+only depend on the dataset geometry and are expensive to compute.
+For applications that need to derive data from the dataset geometry
+it would be useful if the derived data could be reused between different runs of the same application
+or between multiple time slices of the same geometry distributed across multiple files.
+This module provides :func:`.make_cache_key` to assist in this process
+by deriving a cache key from the important parts of a dataset geometry.
+Applications can use this cache key
+as part of a filename when saving derived geometry data to disk
+or as a key to an in-memory cache of derived geometry.
+
+The derived cache keys will be identical between different instances of an application,
+and between different files in multi-file datasets split over an unlimited dimension.
+
+This module does not provide an actual cache implementation.
 """
 import hashlib
 import marshal
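
The module docstring in the hunk above suggests using the cache key as part of a filename when saving derived geometry to disk, and notes that the module ships no cache implementation. A minimal sketch of what an application-side cache could look like is below. The helper `load_or_compute`, the pickle storage format, and the one-file-per-key layout are all hypothetical and not part of emsarray; the key is taken as a plain string, which in practice would presumably come from `make_cache_key(dataset)`.

```python
import pickle
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_dir: Path, cache_key: str, compute: Callable[[], Any]) -> Any:
    """Hypothetical helper: reuse derived data stored on disk under ``cache_key``.

    ``cache_key`` is assumed to be a filename-safe string derived from the
    dataset geometry. ``compute`` is only called on a cache miss.
    """
    cache_path = cache_dir / f"{cache_key}.pickle"

    # Cache hit: the same geometry was processed before, so reuse the result.
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    # Cache miss: run the expensive computation and store it for next time.
    result = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

Because the key depends only on geometry, two files holding different time slices of the same grid would map to the same cache file under this scheme. Pickle is used here only for brevity and should not be used to load cache files from untrusted sources.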
@@ -12,15 +28,21 @@

 def hash_attributes(hash: "hashlib._Hash", attributes: dict) -> None:
     """
-    Updates the provided hash with with a marshal serialised byte representation of the given attribute dictionary.
+    Adds the contents of an :attr:`attributes dictionary <xarray.DataArray.attrs>`
+    to a hash.

     Parameters
     ----------
     hash : hashlib-style hash instance
-        The hash instance to update with the given attribute dict.
+        The hash instance to add the attribute dictionary to.
         This must follow the interface defined in :mod:`hashlib`.
-    attributes: dict
-        Expects a marshal compatible dictionary.
+    attributes : dict
+        A dictionary of attributes from a :class:`~xarray.Dataset` or :class:`~xarray.DataArray`.
+
+    Notes
+    -----
+    The attribute dictionary is serialized to bytes using :func:`marshal.dumps`.
+    This is an implementation detail that may change in future releases.
     """
     # Prepend the marshal encoding version
     marshal_version = 4
@@ -36,32 +58,44 @@ def hash_attributes(hash: "hashlib._Hash", attributes: dict) -> None:

 def hash_string(hash: "hashlib._Hash", value: str) -> None:
     """
-    Updates the provided hash with with a utf-8 encoded byte representation of the provided string.
+    Adds a :class:`string <str>` to a hash.

     Parameters
     ----------
     hash : hashlib-style hash instance
-        The hash instance to update with the given attribute dict.
+        The hash instance to add the string to.
         This must follow the interface defined in :mod:`hashlib`.
-    attributes: str
-        Expects a string that can be encoded in utf-8.
+    value : str
+        Any unicode string.
+
+    Notes
+    -----
+    The string is UTF-8 encoded as part of being added to the hash.
+    This is an implementation detail that may change in future releases.
     """
-    # Prepend the str length
+    # Prepend the length of the string to the hash
+    # to prevent malicious datasets generating overlapping string hashes.
     hash_int(hash, len(value))
     hash.update(value.encode('utf-8'))


 def hash_int(hash: "hashlib._Hash", value: int) -> None:
     """
-    Updates the provided hash with an encoded byte representation of the provided int.
+    Adds an :class:`int` to a hash.

     Parameters
     ----------
     hash : hashlib-style hash instance
-        The hash instance to update with the given attribute dict.
+        The hash instance to add the integer to.
         This must follow the interface defined in :mod:`hashlib`.
-    attributes: int
-        Expects an int that can be represented in a numpy int32.
+    value : int
+        Any int representable as a :data:`numpy.int32`.
+
+    Notes
+    -----
+    The int is cast to a :data:`numpy.int32` as part of being added to the hash.
+    This is an implementation detail that may change in the future
+    if larger integers are required.
     """
     with numpy.errstate(over='raise'):
         # Manual overflow check as older numpy versions don't throw the exception
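
The new comment in `hash_string` says the length prefix prevents overlapping string hashes. The sketch below illustrates why, under stated assumptions: it does not use the emsarray functions, and `struct.pack` with a 32-bit signed integer stands in for the `numpy.int32` cast that `hash_int` is documented to perform. The function names are hypothetical.

```python
import hashlib
import struct
from typing import Callable, Iterable


def add_string_unprefixed(hash, value: str) -> None:
    # Naive approach: only the encoded bytes are fed to the hash.
    hash.update(value.encode("utf-8"))


def add_string_prefixed(hash, value: str) -> None:
    # Assumption: a little-endian 32-bit signed int stands in for hash_int.
    # As in the diff, the prefix is len(value), the number of characters.
    hash.update(struct.pack("<i", len(value)))
    hash.update(value.encode("utf-8"))


def digest(add_string: Callable, values: Iterable[str]) -> str:
    # Feed a sequence of strings into one hash and return its hex digest.
    h = hashlib.blake2b(digest_size=32)
    for value in values:
        add_string(h, value)
    return h.hexdigest()


# Without a prefix, the boundary between strings is lost:
# "ab" + "c" and "a" + "bc" feed identical bytes to the hash.
collides = digest(add_string_unprefixed, ["ab", "c"]) == digest(add_string_unprefixed, ["a", "bc"])

# With a length prefix, the two sequences produce different byte streams.
distinct = digest(add_string_prefixed, ["ab", "c"]) != digest(add_string_prefixed, ["a", "bc"])
```

Here `collides` and `distinct` both evaluate to `True`: the unprefixed variant cannot tell the two sequences apart, while the prefixed variant can.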
@@ -73,17 +107,16 @@ def hash_int(hash: "hashlib._Hash", value: int) -> None:

 def make_cache_key(dataset: xarray.Dataset, hash: "hashlib._Hash | None" = None) -> str:
     """
-    Generate a key suitable for caching data derived from the geometry of a dataset.
+    Derive a cache key from the geometry of a dataset.

     Parameters
     ----------
     dataset : xarray.Dataset
         The dataset to generate a cache key from.
-    hash : hashlib._Hash
+    hash : :mod:`hashlib`-compatible hash instance, optional
         An instance of a hashlib hash class.
-        Defaults to `hashlib.blake2b`, which is secure enough and fast enough for most purposes.
-        The hash algorithm does not need to be cryptographically secure,
-        so faster algorithms such as `xxhash` can be swapped in if desired.
+        Defaults to :func:`hashlib.blake2b` with a digest size of 32,
+        which is secure enough and fast enough for most purposes.

     Returns
     -------
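
The revised docstring describes `hash` as an optional hashlib-compatible instance that defaults to `hashlib.blake2b` with a digest size of 32. The sketch below shows that optional-parameter pattern in isolation. `toy_cache_key` is a hypothetical stand-in that hashes raw bytes rather than a dataset, and returning the hex digest is an assumption, since the `Returns` section is cut off in this diff.

```python
import hashlib


def toy_cache_key(payload: bytes, hash=None) -> str:
    """Hypothetical stand-in for a function with an optional hash argument."""
    if hash is None:
        # Mirrors the documented default: blake2b with a 32 byte digest.
        hash = hashlib.blake2b(digest_size=32)
    hash.update(payload)
    # Assumption: the key is the hex digest of the hash.
    return hash.hexdigest()


default_key = toy_cache_key(b"geometry bytes")
explicit_key = toy_cache_key(b"geometry bytes", hashlib.blake2b(digest_size=32))
sha_key = toy_cache_key(b"geometry bytes", hashlib.sha256())
```

In this sketch `default_key` and `explicit_key` are equal, while `sha_key` differs, so keys produced with different hash algorithms should not be mixed in the same cache.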
