|
3 | 3 | Zarr Storage Specification Version 1 |
4 | 4 | ==================================== |
5 | 5 |
|
6 | | -This document provides a technical specification of the protocol and |
7 | | -format used for storing a Zarr array. The key words "MUST", "MUST |
8 | | -NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", |
9 | | -"RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be |
10 | | -interpreted as described in `RFC 2119 |
11 | | -<https://www.ietf.org/rfc/rfc2119.txt>`_. |
12 | | - |
13 | | -Status |
14 | | ------- |
15 | | - |
16 | | -This specification is deprecated. See :ref:`spec` for the latest version. |
17 | | - |
18 | | -Storage |
19 | | -------- |
20 | | - |
21 | | -A Zarr array can be stored in any storage system that provides a |
22 | | -key/value interface, where a key is an ASCII string and a value is an |
23 | | -arbitrary sequence of bytes, and the supported operations are read |
24 | | -(get the sequence of bytes associated with a given key), write (set |
25 | | -the sequence of bytes associated with a given key) and delete (remove |
26 | | -a key/value pair). |
27 | | - |
28 | | -For example, a directory in a file system can provide this interface, |
29 | | -where keys are file names, values are file contents, and files can be |
30 | | -read, written or deleted via the operating system. Equally, an S3 |
31 | | -bucket can provide this interface, where keys are resource names, |
32 | | -values are resource contents, and resources can be read, written or |
33 | | -deleted via HTTP. |
34 | | - |
35 | | -Below an "array store" refers to any system implementing this |
36 | | -interface. |
37 | | - |
38 | | -Metadata |
39 | | --------- |
40 | | - |
41 | | -Each array requires essential configuration metadata to be stored, |
42 | | -enabling correct interpretation of the stored data. This metadata is |
43 | | -encoded using JSON and stored as the value of the 'meta' key within an |
44 | | -array store. |
45 | | - |
46 | | -The metadata resource is a JSON object. The following keys MUST be |
47 | | -present within the object: |
48 | | - |
49 | | -zarr_format |
50 | | - An integer defining the version of the storage specification to which the |
51 | | - array store adheres. |
52 | | -shape |
53 | | - A list of integers defining the length of each dimension of the array. |
54 | | -chunks |
55 | | - A list of integers defining the length of each dimension of a chunk of the |
56 | | - array. Note that all chunks within a Zarr array have the same shape. |
57 | | -dtype |
58 | | - A string or list defining a valid data type for the array. See also |
59 | | - the subsection below on data type encoding. |
60 | | -compression |
61 | | - A string identifying the primary compression library used to compress |
62 | | - each chunk of the array. |
63 | | -compression_opts |
64 | | - An integer, string or dictionary providing options to the primary |
65 | | - compression library. |
66 | | -fill_value |
67 | | - A scalar value providing the default value to use for uninitialized |
68 | | - portions of the array. |
69 | | -order |
70 | | - Either 'C' or 'F', defining the layout of bytes within each chunk of the |
71 | | - array. 'C' means row-major order, i.e., the last dimension varies fastest; |
72 | | - 'F' means column-major order, i.e., the first dimension varies fastest. |
73 | | - |
74 | | -Other keys MAY be present within the metadata object; however, they MUST |
75 | | -NOT alter the interpretation of the required fields defined above. |
76 | | - |
77 | | -For example, the JSON object below defines a 2-dimensional array of |
78 | | -64-bit little-endian floating point numbers with 10000 rows and 10000 |
79 | | -columns, divided into chunks of 1000 rows and 1000 columns (so there |
80 | | -will be 100 chunks in total arranged in a 10 by 10 grid). Within each |
81 | | -chunk the data are laid out in C contiguous order, and each chunk is |
82 | | -compressed using the Blosc compression library:: |
83 | | - |
84 | | - { |
85 | | - "chunks": [ |
86 | | - 1000, |
87 | | - 1000 |
88 | | - ], |
89 | | - "compression": "blosc", |
90 | | - "compression_opts": { |
91 | | - "clevel": 5, |
92 | | - "cname": "lz4", |
93 | | - "shuffle": 1 |
94 | | - }, |
95 | | - "dtype": "<f8", |
96 | | - "fill_value": null, |
97 | | - "order": "C", |
98 | | - "shape": [ |
99 | | - 10000, |
100 | | - 10000 |
101 | | - ], |
102 | | - "zarr_format": 1 |
103 | | - } |
104 | | - |
105 | | -Data type encoding |
106 | | -~~~~~~~~~~~~~~~~~~ |
107 | | - |
108 | | -Simple data types are encoded within the array metadata resource as a |
109 | | -string, following the `NumPy array protocol type string (typestr) |
110 | | -format |
111 | | -<numpy:arrays.interface>`_. The |
112 | | -format consists of 3 parts: a character describing the byteorder of |
113 | | -the data (``<``: little-endian, ``>``: big-endian, ``|``: |
114 | | -not-relevant), a character code giving the basic type of the array, |
115 | | -and an integer providing the number of bytes the type uses. The byte |
116 | | -order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and |
117 | | -``"|S12"`` are valid data types. |
118 | | - |
119 | | -Structured data types (i.e., with multiple named fields) are encoded as |
120 | | -a list of two-element lists, following `NumPy array protocol type |
121 | | -descriptions (descr) |
122 | | -<numpy:arrays.interface>`_. |
123 | | -For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", |
124 | | -"|u1"]]`` defines a data type composed of three single-byte unsigned |
125 | | -integers labelled 'r', 'g' and 'b'. |
126 | | - |
127 | | -Chunks |
128 | | ------- |
129 | | - |
130 | | -Each chunk of the array is compressed by passing the raw bytes for the |
131 | | -chunk through the primary compression library to obtain a new sequence |
132 | | -of bytes comprising the compressed chunk data. No header is added to |
133 | | -the compressed bytes or any other modification made. The internal |
134 | | -structure of the compressed bytes will depend on which primary |
135 | | -compressor was used. For example, the `Blosc compressor |
136 | | -<https://github.com/Blosc/c-blosc/blob/main/README_HEADER.rst>`_ |
137 | | -produces a sequence of bytes that begins with a 16-byte header |
138 | | -followed by compressed data. |
139 | | - |
140 | | -The compressed sequence of bytes for each chunk is stored under a key |
141 | | -formed from the index of the chunk within the grid of chunks |
142 | | -representing the array. To form a string key for a chunk, the indices |
143 | | -are converted to strings and concatenated with the period character |
144 | | -('.') separating each index. For example, given an array with shape |
145 | | -(10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks |
146 | | -laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides |
147 | | -data for rows 0-999 and columns 0-999 and is stored under the key |
148 | | -'0.0'; the chunk with indices (2, 4) provides data for rows 2000-2999 |
149 | | -and columns 4000-4999 and is stored under the key '2.4'; etc. |
150 | | - |
151 | | -There is no need for all chunks to be present within an array |
152 | | -store. If a chunk is not present then it is considered to be in an |
153 | | -uninitialized state. An uninitialized chunk MUST be treated as if it |
154 | | -were uniformly filled with the value of the 'fill_value' field in the |
155 | | -array metadata. If the 'fill_value' field is ``null`` then the |
156 | | -contents of the chunk are undefined. |
157 | | - |
158 | | -Note that all chunks in an array have the same shape. If the length of |
159 | | -any array dimension is not exactly divisible by the length of the |
160 | | -corresponding chunk dimension then some chunks will overhang the edge |
161 | | -of the array. The contents of any chunk region falling outside the |
162 | | -array are undefined. |
163 | | - |
164 | | -Attributes |
165 | | ----------- |
166 | | - |
167 | | -Each array can also be associated with custom attributes, which are |
168 | | -simple key/value items with application-specific meaning. Custom |
169 | | -attributes are encoded as a JSON object and stored under the 'attrs' |
170 | | -key within an array store. Even if the attributes are empty, the |
171 | | -'attrs' key MUST be present within an array store. |
172 | | - |
173 | | -For example, the JSON object below encodes three attributes named |
174 | | -'foo', 'bar' and 'baz':: |
175 | | - |
176 | | - { |
177 | | - "foo": 42, |
178 | | - "bar": "apples", |
179 | | - "baz": [1, 2, 3, 4] |
180 | | - } |
181 | | - |
182 | | -Example |
183 | | -------- |
184 | | - |
185 | | -Below is an example of storing a Zarr array, using a directory on the |
186 | | -local file system as storage. |
187 | | - |
188 | | -Initialize the store:: |
189 | | - |
190 | | - >>> import zarr |
191 | | - >>> store = zarr.DirectoryStore('example.zarr') |
192 | | - >>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10), |
193 | | - ... dtype='i4', fill_value=42, compression='zlib', |
194 | | - ... compression_opts=1, overwrite=True) |
195 | | - |
196 | | -No chunks are initialized yet, so only the 'meta' and 'attrs' keys |
197 | | -have been set:: |
198 | | - |
199 | | - >>> import os |
200 | | - >>> sorted(os.listdir('example.zarr')) |
201 | | - ['attrs', 'meta'] |
202 | | - |
203 | | -Inspect the array metadata:: |
204 | | - |
205 | | - >>> print(open('example.zarr/meta').read()) |
206 | | - { |
207 | | - "chunks": [ |
208 | | - 10, |
209 | | - 10 |
210 | | - ], |
211 | | - "compression": "zlib", |
212 | | - "compression_opts": 1, |
213 | | - "dtype": "<i4", |
214 | | - "fill_value": 42, |
215 | | - "order": "C", |
216 | | - "shape": [ |
217 | | - 20, |
218 | | - 20 |
219 | | - ], |
220 | | - "zarr_format": 1 |
221 | | - } |
222 | | - |
223 | | -Inspect the array attributes:: |
224 | | - |
225 | | - >>> print(open('example.zarr/attrs').read()) |
226 | | - {} |
227 | | - |
228 | | -Set some data:: |
229 | | - |
230 | | - >>> z = zarr.Array(store) |
231 | | - >>> z[0:10, 0:10] = 1 |
232 | | - >>> sorted(os.listdir('example.zarr')) |
233 | | - ['0.0', 'attrs', 'meta'] |
234 | | - |
235 | | -Set some more data:: |
236 | | - |
237 | | - >>> z[0:10, 10:20] = 2 |
238 | | - >>> z[10:20, :] = 3 |
239 | | - >>> sorted(os.listdir('example.zarr')) |
240 | | - ['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta'] |
241 | | - |
242 | | -Manually decompress a single chunk for illustration:: |
243 | | - |
244 | | - >>> import zlib |
245 | | - >>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read()) |
246 | | - >>> import numpy as np |
247 | | - >>> a = np.frombuffer(b, dtype='<i4') |
248 | | - >>> a |
249 | | - array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
250 | | - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
251 | | - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
252 | | - 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
253 | | - 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32) |
254 | | - |
255 | | -Modify the array attributes:: |
256 | | - |
257 | | - >>> z.attrs['foo'] = 42 |
258 | | - >>> z.attrs['bar'] = 'apples' |
259 | | - >>> z.attrs['baz'] = [1, 2, 3, 4] |
260 | | - >>> print(open('example.zarr/attrs').read()) |
261 | | - { |
262 | | - "bar": "apples", |
263 | | - "baz": [ |
264 | | - 1, |
265 | | - 2, |
266 | | - 3, |
267 | | - 4 |
268 | | - ], |
269 | | - "foo": 42 |
270 | | - } |
| 6 | +The V1 Specification has been migrated to its website → |
| 7 | +https://zarr-specs.readthedocs.io/. |
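
For readers of this diff who want the chunk key rule from the removed "Chunks" section in executable form, here is a minimal sketch in plain Python. It is not taken from the zarr code base; the helper names `chunk_key`, `chunk_grid_keys` and `chunk_bounds` are hypothetical and exist only for illustration. It assumes the rule as stated in the removed text: chunk indices are converted to strings and joined with '.', and a dimension whose length is not divisible by the chunk length gets one extra, overhanging chunk.

```python
from itertools import product


def chunk_key(indices):
    """Join chunk grid indices with '.' to form the store key (hypothetical helper)."""
    return ".".join(str(i) for i in indices)


def chunk_grid_keys(shape, chunks):
    """List the keys of every chunk in the grid, in row-major order of indices.

    The number of chunks along a dimension is the ceiling of
    (array length / chunk length), so a final overhanging chunk is counted.
    """
    grid = [-(-s // c) for s, c in zip(shape, chunks)]  # integer ceiling division
    return [chunk_key(idx) for idx in product(*(range(n) for n in grid))]


def chunk_bounds(indices, chunks):
    """Return the inclusive (first, last) array index covered by a chunk per dimension.

    For an overhanging chunk the last index may lie beyond the array edge.
    """
    return [(i * c, (i + 1) * c - 1) for i, c in zip(indices, chunks)]


# The 10000 x 10000 array with 1000 x 1000 chunks from the removed text
keys = chunk_grid_keys((10000, 10000), (1000, 1000))
print(len(keys))                           # number of chunks in the grid
print(chunk_key((2, 4)))                   # key for the chunk at grid position (2, 4)
print(chunk_bounds((2, 4), (1000, 1000)))  # rows and columns covered by that chunk
```

Applied to the 20 x 20 array with 10 x 10 chunks from the removed "Example" section, `chunk_grid_keys` yields the same four chunk keys that the directory listing shows alongside 'meta' and 'attrs'.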