Commit f818fde

Merge pull request #276 from d-v-b/add_old_specs
v1 and v2 specs
2 parents b9ae65b + 8052755 commit f818fde

4 files changed: +845 -2 lines changed

.gitignore

Lines changed: 6 additions & 0 deletions
@@ -6,3 +6,9 @@ docs/_build
 
 # pycharm
 .idea
+
+# virtual environments
+.venv
+
+# visual studio code
+.vscode

docs/specs.rst

Lines changed: 2 additions & 2 deletions
@@ -16,10 +16,10 @@ Specifications
    :maxdepth: 1
    :caption: v2
 
-   Zarr spec v2 <https://zarr.readthedocs.io/en/stable/spec/v2.html>
+   Zarr spec v2 <v2/v2.0.rst>
 
 .. toctree::
    :maxdepth: 1
    :caption: v1
 
-   Zarr spec v1 <https://zarr.readthedocs.io/en/stable/spec/v1.html>
+   Zarr spec v1 <v1/v1.0.rst>

docs/v1/v1.0.rst

Lines changed: 270 additions & 0 deletions
@@ -0,0 +1,270 @@
.. _spec_v1:

Zarr Storage Specification Version 1
====================================

This document provides a technical specification of the protocol and
format used for storing a Zarr array. The key words "MUST", "MUST
NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT",
"RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be
interpreted as described in `RFC 2119
<https://www.ietf.org/rfc/rfc2119.txt>`_.

Status
------

This specification is deprecated. See :ref:`spec` for the latest version.

Storage
-------

A Zarr array can be stored in any storage system that provides a
key/value interface, where a key is an ASCII string and a value is an
arbitrary sequence of bytes, and the supported operations are read
(get the sequence of bytes associated with a given key), write (set
the sequence of bytes associated with a given key) and delete (remove
a key/value pair).

For example, a directory in a file system can provide this interface,
where keys are file names, values are file contents, and files can be
read, written or deleted via the operating system. Equally, an S3
bucket can provide this interface, where keys are resource names,
values are resource contents, and resources can be read, written or
deleted via HTTP.

Below, an "array store" refers to any system implementing this
interface.
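
As an illustration of the key/value interface described above, the sketch below implements a minimal in-memory array store. The class name ``MemoryStore`` is hypothetical and not part of this specification or of any library; ``__iter__`` and ``__len__`` are included only because Python's ``MutableMapping`` requires them, not because the specification asks for them.

```python
from collections.abc import MutableMapping


class MemoryStore(MutableMapping):
    """Hypothetical in-memory array store: ASCII string keys, bytes values."""

    def __init__(self):
        self._items = {}

    def __getitem__(self, key):
        # read: get the sequence of bytes associated with a key
        return self._items[key]

    def __setitem__(self, key, value):
        # write: set the sequence of bytes associated with a key
        key.encode("ascii")  # raises UnicodeEncodeError for non-ASCII keys
        self._items[key] = bytes(value)

    def __delitem__(self, key):
        # delete: remove a key/value pair
        del self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


store = MemoryStore()
store["meta"] = b"{}"
store["0.0"] = b"\x00\x01"
del store["0.0"]
remaining = sorted(store)
```

A plain ``dict`` would satisfy the same three operations; the class form only makes the mapping from read, write and delete to Python methods explicit.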

Metadata
--------

Each array requires essential configuration metadata to be stored,
enabling correct interpretation of the stored data. This metadata is
encoded using JSON and stored as the value of the 'meta' key within an
array store.

The metadata resource is a JSON object. The following keys MUST be
present within the object:

zarr_format
    An integer defining the version of the storage specification to which the
    array store adheres.
shape
    A list of integers defining the length of each dimension of the array.
chunks
    A list of integers defining the length of each dimension of a chunk of the
    array. Note that all chunks within a Zarr array have the same shape.
dtype
    A string or list defining a valid data type for the array. See also
    the subsection below on data type encoding.
compression
    A string identifying the primary compression library used to compress
    each chunk of the array.
compression_opts
    An integer, string or dictionary providing options to the primary
    compression library.
fill_value
    A scalar value providing the default value to use for uninitialized
    portions of the array.
order
    Either 'C' or 'F', defining the layout of bytes within each chunk of the
    array. 'C' means row-major order, i.e., the last dimension varies fastest;
    'F' means column-major order, i.e., the first dimension varies fastest.

Other keys MAY be present within the metadata object; however, they MUST
NOT alter the interpretation of the required fields defined above.

For example, the JSON object below defines a 2-dimensional array of
64-bit little-endian floating point numbers with 10000 rows and 10000
columns, divided into chunks of 1000 rows and 1000 columns (so there
will be 100 chunks in total arranged in a 10 by 10 grid). Within each
chunk the data are laid out in C contiguous order, and each chunk is
compressed using the Blosc compression library::

    {
        "chunks": [
            1000,
            1000
        ],
        "compression": "blosc",
        "compression_opts": {
            "clevel": 5,
            "cname": "lz4",
            "shuffle": 1
        },
        "dtype": "<f8",
        "fill_value": null,
        "order": "C",
        "shape": [
            10000,
            10000
        ],
        "zarr_format": 1
    }
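
A reader of the 'meta' key might check the required keys before using the metadata. The following is a minimal sketch using only the standard library; the function name ``parse_meta`` is hypothetical, and the check that ``shape`` and ``chunks`` have the same length is an assumption drawn from the field descriptions above rather than an explicit requirement.

```python
import json

# The eight keys that MUST be present, as listed above.
REQUIRED_KEYS = frozenset({
    "zarr_format", "shape", "chunks", "dtype",
    "compression", "compression_opts", "fill_value", "order",
})


def parse_meta(raw):
    """Decode the value of the 'meta' key and check the required fields."""
    meta = json.loads(raw)
    if not isinstance(meta, dict):
        raise ValueError("metadata must be a JSON object")
    missing = REQUIRED_KEYS - meta.keys()
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    if meta["order"] not in ("C", "F"):
        raise ValueError("order must be 'C' or 'F'")
    # Assumption: one chunk length per array dimension.
    if len(meta["shape"]) != len(meta["chunks"]):
        raise ValueError("shape and chunks must have the same length")
    return meta  # extra keys are allowed and passed through unchanged


raw = """
{
    "chunks": [1000, 1000],
    "compression": "blosc",
    "compression_opts": {"clevel": 5, "cname": "lz4", "shuffle": 1},
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [10000, 10000],
    "zarr_format": 1
}
"""
meta = parse_meta(raw)
```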

Data type encoding
~~~~~~~~~~~~~~~~~~

Simple data types are encoded within the array metadata resource as a
string, following the `NumPy array protocol type string (typestr)
format
<https://numpy.org/doc/stable/reference/arrays.interface.html>`_. The
format consists of 3 parts: a character describing the byteorder of
the data (``<``: little-endian, ``>``: big-endian, ``|``:
not-relevant), a character code giving the basic type of the array,
and an integer providing the number of bytes the type uses. The byte
order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and
``"|S12"`` are valid data types.

Structured data types (i.e., with multiple named fields) are encoded as
a list of two-element lists, following `NumPy array protocol type
descriptions (descr)
<https://numpy.org/doc/stable/reference/arrays.interface.html>`_.
For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b",
"|u1"]]`` defines a data type composed of three single-byte unsigned
integers labelled 'r', 'g' and 'b'.
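
The three-part string format and the list form for structured types can be illustrated with a small standard-library parser. This is a simplified sketch, not NumPy's own parsing: the names ``SimpleType``, ``parse_typestr`` and ``parse_dtype`` are hypothetical, and only the byteorder/type-code/size form described above is handled.

```python
import re
from typing import NamedTuple


class SimpleType(NamedTuple):
    byteorder: str  # '<', '>' or '|'
    kind: str       # character code for the basic type, e.g. 'f', 'i', 'u', 'S'
    size: int       # integer that follows the type code


# Byte order is mandatory, so a string such as "f8" does not match.
_TYPESTR = re.compile(r"^([<>|])([A-Za-z])(\d+)$")


def parse_typestr(typestr):
    """Split a simple type string into its three parts."""
    match = _TYPESTR.match(typestr)
    if match is None:
        raise ValueError(f"not a valid type string: {typestr!r}")
    return SimpleType(match.group(1), match.group(2), int(match.group(3)))


def parse_dtype(encoded):
    """Parse either a simple type string or a list of [name, type] pairs."""
    if isinstance(encoded, str):
        return parse_typestr(encoded)
    return [(name, parse_dtype(field_type)) for name, field_type in encoded]


simple = parse_dtype("<f8")
rgb = parse_dtype([["r", "|u1"], ["g", "|u1"], ["b", "|u1"]])
rgb_bytes = sum(field.size for _, field in rgb)
```

With NumPy available, ``numpy.dtype("<f8")`` accepts the simple strings directly, while the structured form needs its inner lists converted to tuples first.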

Chunks
------

Each chunk of the array is compressed by passing the raw bytes for the
chunk through the primary compression library to obtain a new sequence
of bytes comprising the compressed chunk data. No header is added to
the compressed bytes or any other modification made. The internal
structure of the compressed bytes will depend on which primary
compressor was used. For example, the `Blosc compressor
<https://github.com/Blosc/c-blosc/blob/main/README_CHUNK_FORMAT.rst>`_
produces a sequence of bytes that begins with a 16-byte header
followed by compressed data.

The compressed sequence of bytes for each chunk is stored under a key
formed from the index of the chunk within the grid of chunks
representing the array. To form a string key for a chunk, the indices
are converted to strings and concatenated with the period character
('.') separating each index. For example, given an array with shape
(10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks
laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides
data for rows 0-999 and columns 0-999 and is stored under the key
'0.0'; the chunk with indices (2, 4) provides data for rows 2000-2999
and columns 4000-4999 and is stored under the key '2.4'; etc.
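
The key construction described above can be sketched in a few lines. The helper names ``chunk_key`` and ``chunk_bounds`` are hypothetical; the bounds are returned as inclusive (first, last) pairs to match the row and column ranges quoted in the text.

```python
def chunk_key(indices):
    """Join chunk grid indices with '.' to form the store key."""
    return ".".join(str(i) for i in indices)


def chunk_bounds(indices, chunks):
    """Inclusive (first, last) array index covered by the chunk, per dimension."""
    return tuple((i * c, (i + 1) * c - 1) for i, c in zip(indices, chunks))


key = chunk_key((2, 4))
bounds = chunk_bounds((2, 4), (1000, 1000))
```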

There is no need for all chunks to be present within an array
store. If a chunk is not present then it is considered to be in an
uninitialized state. An uninitialized chunk MUST be treated as if it
were uniformly filled with the value of the 'fill_value' field in the
array metadata. If the 'fill_value' field is ``null`` then the
contents of the chunk are undefined.
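
One way a reader might handle missing chunks is sketched below. It assumes zlib compression and the ``<i4`` data type purely for illustration, and the function name ``read_chunk`` is hypothetical. Because the contents are undefined when 'fill_value' is ``null``, this sketch chooses to raise an error in that case; other behaviours would also be consistent with the text.

```python
import struct
import zlib


def read_chunk(store, key, n_items, fill_value):
    """Return the chunk's items as a flat list of Python ints.

    Assumes zlib-compressed little-endian 32-bit integers ('<i4').
    """
    try:
        compressed = store[key]
    except KeyError:
        # Uninitialized chunk: behave as if uniformly filled with fill_value.
        if fill_value is None:
            # Contents are undefined; raising is a design choice of this sketch.
            raise ValueError(f"chunk {key!r} is uninitialized and fill_value is null")
        return [fill_value] * n_items
    raw = zlib.decompress(compressed)
    return list(struct.unpack(f"<{n_items}i", raw))


# A plain dict stands in for the array store, with one initialized chunk.
store = {"0.0": zlib.compress(struct.pack("<4i", 1, 1, 1, 1))}
present = read_chunk(store, "0.0", 4, fill_value=42)
missing = read_chunk(store, "0.1", 4, fill_value=42)
```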

Note that all chunks in an array have the same shape. If the length of
any array dimension is not exactly divisible by the length of the
corresponding chunk dimension then some chunks will overhang the edge
of the array. The contents of any chunk region falling outside the
array are undefined.
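
The overhang can be made concrete with a small calculation. In the sketch below, the helper names ``grid_shape`` and ``valid_extent`` are hypothetical, and the array shape (25, 20) is an invented example chosen so that one dimension does not divide evenly.

```python
def grid_shape(shape, chunks):
    """Number of chunks per dimension, rounding up to cover the whole array."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


def valid_extent(indices, shape, chunks):
    """Per dimension, how many items of the chunk fall inside the array."""
    return tuple(min(c, s - i * c) for i, s, c in zip(indices, shape, chunks))


shape, chunks = (25, 20), (10, 10)
grid = grid_shape(shape, chunks)             # 3 chunks along rows, 2 along columns
edge = valid_extent((2, 1), shape, chunks)   # last row of chunks overhangs
inner = valid_extent((0, 0), shape, chunks)  # interior chunk is fully inside
```

Here the chunk with indices (2, 1) still stores 10 by 10 items, but only its first 5 rows correspond to array data; the remaining rows fall outside the array and are undefined.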

Attributes
----------

Each array can also be associated with custom attributes, which are
simple key/value items with application-specific meaning. Custom
attributes are encoded as a JSON object and stored under the 'attrs'
key within an array store. Even if the attributes are empty, the
'attrs' key MUST be present within an array store.

For example, the JSON object below encodes three attributes named
'foo', 'bar' and 'baz'::

    {
        "foo": 42,
        "bar": "apples",
        "baz": [1, 2, 3, 4]
    }

Example
-------

Below is an example of storing a Zarr array, using a directory on the
local file system as storage.

Initialize the store::

    >>> import zarr
    >>> store = zarr.DirectoryStore('example.zarr')
    >>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10),
    ...                 dtype='i4', fill_value=42, compression='zlib',
    ...                 compression_opts=1, overwrite=True)

No chunks are initialized yet, so only the 'meta' and 'attrs' keys
have been set::

    >>> import os
    >>> sorted(os.listdir('example.zarr'))
    ['attrs', 'meta']

Inspect the array metadata::

    >>> print(open('example.zarr/meta').read())
    {
        "chunks": [
            10,
            10
        ],
        "compression": "zlib",
        "compression_opts": 1,
        "dtype": "<i4",
        "fill_value": 42,
        "order": "C",
        "shape": [
            20,
            20
        ],
        "zarr_format": 1
    }

Inspect the array attributes::

    >>> print(open('example.zarr/attrs').read())
    {}

Set some data::

    >>> z = zarr.Array(store)
    >>> z[0:10, 0:10] = 1
    >>> sorted(os.listdir('example.zarr'))
    ['0.0', 'attrs', 'meta']

Set some more data::

    >>> z[0:10, 10:20] = 2
    >>> z[10:20, :] = 3
    >>> sorted(os.listdir('example.zarr'))
    ['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta']

Manually decompress a single chunk for illustration::

    >>> import zlib
    >>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read())
    >>> import numpy as np
    >>> a = np.frombuffer(b, dtype='<i4')
    >>> a
    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

Modify the array attributes::

    >>> z.attrs['foo'] = 42
    >>> z.attrs['bar'] = 'apples'
    >>> z.attrs['baz'] = [1, 2, 3, 4]
    >>> print(open('example.zarr/attrs').read())
    {
        "bar": "apples",
        "baz": [
            1,
            2,
            3,
            4
        ],
        "foo": 42
    }
