|
| 1 | +.. _spec_v1: |
| 2 | + |
| 3 | +Zarr Storage Specification Version 1 |
| 4 | +==================================== |
| 5 | + |
| 6 | +This document provides a technical specification of the protocol and |
| 7 | +format used for storing a Zarr array. The key words "MUST", "MUST |
| 8 | +NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", |
| 9 | +"RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be |
| 10 | +interpreted as described in `RFC 2119 |
| 11 | +<https://www.ietf.org/rfc/rfc2119.txt>`_. |
| 12 | + |
| 13 | +Status |
| 14 | +------ |
| 15 | + |
| 16 | +This specification is deprecated. See :ref:`spec` for the latest version. |
| 17 | + |
| 18 | +Storage |
| 19 | +------- |
| 20 | + |
| 21 | +A Zarr array can be stored in any storage system that provides a |
| 22 | +key/value interface, where a key is an ASCII string and a value is an |
| 23 | +arbitrary sequence of bytes, and the supported operations are read |
| 24 | +(get the sequence of bytes associated with a given key), write (set |
| 25 | +the sequence of bytes associated with a given key) and delete (remove |
| 26 | +a key/value pair). |
| 27 | + |
| 28 | +For example, a directory in a file system can provide this interface, |
| 29 | +where keys are file names, values are file contents, and files can be |
| 30 | +read, written or deleted via the operating system. Equally, an S3 |
| 31 | +bucket can provide this interface, where keys are resource names, |
| 32 | +values are resource contents, and resources can be read, written or |
| 33 | +deleted via HTTP. |
| 34 | + |
| 35 | +Below, an "array store" refers to any system implementing this |
| 36 | +interface. |
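
As a minimal sketch of the interface described above, the class below keeps key/value pairs in memory. The class name ``DictStore`` and its method names are hypothetical and chosen for illustration only; they are not part of this specification or of any Zarr API.

```python
class DictStore:
    """Hypothetical in-memory array store: ASCII string keys, bytes values."""

    def __init__(self):
        self._items = {}

    def read(self, key):
        # Get the sequence of bytes associated with a given key.
        return self._items[key]

    def write(self, key, value):
        # Keys are ASCII strings; encode() raises UnicodeEncodeError otherwise.
        key.encode("ascii")
        # Values are arbitrary sequences of bytes.
        self._items[key] = bytes(value)

    def delete(self, key):
        # Remove a key/value pair.
        del self._items[key]

    def __contains__(self, key):
        return key in self._items


store = DictStore()
store.write("meta", b"{}")
```

Any object offering these three operations, such as a directory or an S3 bucket wrapped in a similar class, would serve equally well.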
| 37 | + |
| 38 | +Metadata |
| 39 | +-------- |
| 40 | + |
| 41 | +Each array requires essential configuration metadata to be stored, |
| 42 | +enabling correct interpretation of the stored data. This metadata is |
| 43 | +encoded using JSON and stored as the value of the 'meta' key within an |
| 44 | +array store. |
| 45 | + |
| 46 | +The metadata resource is a JSON object. The following keys MUST be |
| 47 | +present within the object: |
| 48 | + |
| 49 | +zarr_format |
| 50 | + An integer defining the version of the storage specification to which the |
| 51 | + array store adheres. |
| 52 | +shape |
| 53 | + A list of integers defining the length of each dimension of the array. |
| 54 | +chunks |
| 55 | + A list of integers defining the length of each dimension of a chunk of the |
| 56 | + array. Note that all chunks within a Zarr array have the same shape. |
| 57 | +dtype |
| 58 | + A string or list defining a valid data type for the array. See also |
| 59 | + the subsection below on data type encoding. |
| 60 | +compression |
| 61 | + A string identifying the primary compression library used to compress |
| 62 | + each chunk of the array. |
| 63 | +compression_opts |
| 64 | + An integer, string or dictionary providing options to the primary |
| 65 | + compression library. |
| 66 | +fill_value |
| 67 | + A scalar value providing the default value to use for uninitialized |
| 68 | + portions of the array. |
| 69 | +order |
| 70 | + Either 'C' or 'F', defining the layout of bytes within each chunk of the |
| 71 | + array. 'C' means row-major order, i.e., the last dimension varies fastest; |
| 72 | + 'F' means column-major order, i.e., the first dimension varies fastest. |
| 73 | + |
| 74 | +Other keys MAY be present within the metadata object; however, they MUST |
| 75 | +NOT alter the interpretation of the required fields defined above. |
| 76 | + |
| 77 | +For example, the JSON object below defines a 2-dimensional array of |
| 78 | +64-bit little-endian floating point numbers with 10000 rows and 10000 |
| 79 | +columns, divided into chunks of 1000 rows and 1000 columns (so there |
| 80 | +will be 100 chunks in total arranged in a 10 by 10 grid). Within each |
| 81 | +chunk the data are laid out in C contiguous order, and each chunk is |
| 82 | +compressed using the Blosc compression library:: |
| 83 | + |
| 84 | + { |
| 85 | + "chunks": [ |
| 86 | + 1000, |
| 87 | + 1000 |
| 88 | + ], |
| 89 | + "compression": "blosc", |
| 90 | + "compression_opts": { |
| 91 | + "clevel": 5, |
| 92 | + "cname": "lz4", |
| 93 | + "shuffle": 1 |
| 94 | + }, |
| 95 | + "dtype": "<f8", |
| 96 | + "fill_value": null, |
| 97 | + "order": "C", |
| 98 | + "shape": [ |
| 99 | + 10000, |
| 100 | + 10000 |
| 101 | + ], |
| 102 | + "zarr_format": 1 |
| 103 | + } |
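
The sketch below parses the metadata object shown above, checks that the required keys are present, and derives the chunk grid from ``shape`` and ``chunks``. The names ``REQUIRED_KEYS``, ``missing``, ``grid`` and ``n_chunks`` are illustrative assumptions, not identifiers defined by this specification.

```python
import json
import math

# The metadata object from the example above, as it would be read from 'meta'.
meta = json.loads("""
{
    "chunks": [1000, 1000],
    "compression": "blosc",
    "compression_opts": {"clevel": 5, "cname": "lz4", "shuffle": 1},
    "dtype": "<f8",
    "fill_value": null,
    "order": "C",
    "shape": [10000, 10000],
    "zarr_format": 1
}
""")

# Keys that MUST be present according to the list above.
REQUIRED_KEYS = {
    "zarr_format", "shape", "chunks", "dtype",
    "compression", "compression_opts", "fill_value", "order",
}
missing = REQUIRED_KEYS - set(meta)

# Number of chunks along each dimension (ceiling division), and in total.
grid = [(s + c - 1) // c for s, c in zip(meta["shape"], meta["chunks"])]
n_chunks = math.prod(grid)
```

For this metadata, ``missing`` is empty, ``grid`` is ``[10, 10]`` and ``n_chunks`` is 100, matching the description above.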
| 104 | + |
| 105 | +Data type encoding |
| 106 | +~~~~~~~~~~~~~~~~~~ |
| 107 | + |
| 108 | +Simple data types are encoded within the array metadata resource as a |
| 109 | +string, following the `NumPy array protocol type string (typestr) |
| 110 | +format |
| 111 | +<https://numpy.org/doc/stable/reference/arrays.interface.html>`_. The |
| 112 | +format consists of 3 parts: a character describing the byteorder of |
| 113 | +the data (``<``: little-endian, ``>``: big-endian, ``|``: |
| 114 | +not-relevant), a character code giving the basic type of the array, |
| 115 | +and an integer providing the number of bytes the type uses. The byte |
| 116 | +order MUST be specified. E.g., ``"<f8"``, ``">i4"``, ``"|b1"`` and |
| 117 | +``"|S12"`` are valid data types. |
| 118 | + |
| 119 | +Structured data types (i.e., with multiple named fields) are encoded as |
| 120 | +a list of two-element lists, following `NumPy array protocol type |
| 121 | +descriptions (descr) |
| 122 | +<https://numpy.org/doc/stable/reference/arrays.interface.html>`_. |
| 123 | +For example, the JSON list ``[["r", "|u1"], ["g", "|u1"], ["b", |
| 124 | +"|u1"]]`` defines a data type composed of three single-byte unsigned |
| 125 | +integers labelled 'r', 'g' and 'b'. |
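
To illustrate the encoding, the following sketch splits a simple type string into its three parts and sums the item size of a structured type. It only covers the simple three-part form described above and takes the trailing integer to be a byte count, as stated in the text. The function names ``parse_typestr`` and ``itemsize`` are hypothetical.

```python
import re

# Byte order, one-character type code, size in bytes (per the text above).
TYPESTR = re.compile(r"^([<>|])([A-Za-z])(\d+)$")


def parse_typestr(typestr):
    """Split a simple type string into (byteorder, kind, size)."""
    match = TYPESTR.match(typestr)
    if match is None:
        # The byte order MUST be specified, so e.g. "f8" is rejected.
        raise ValueError("invalid type string: %r" % (typestr,))
    byteorder, kind, size = match.groups()
    return byteorder, kind, int(size)


def itemsize(dtype):
    """Item size of a simple (string) or structured (list) data type."""
    if isinstance(dtype, str):
        return parse_typestr(dtype)[2]
    # Structured type: a list of [name, type] pairs.
    return sum(itemsize(field_type) for _name, field_type in dtype)


rgb = [["r", "|u1"], ["g", "|u1"], ["b", "|u1"]]
rgb_size = itemsize(rgb)
```

Here ``rgb_size`` is 3, one byte for each of the 'r', 'g' and 'b' fields.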
| 126 | + |
| 127 | +Chunks |
| 128 | +------ |
| 129 | + |
| 130 | +Each chunk of the array is compressed by passing the raw bytes for the |
| 131 | +chunk through the primary compression library to obtain a new sequence |
| 132 | +of bytes comprising the compressed chunk data. No header is added to |
| 133 | +the compressed bytes, nor is any other modification made. The internal |
| 134 | +structure of the compressed bytes will depend on which primary |
| 135 | +compressor was used. For example, the `Blosc compressor |
| 136 | +<https://github.com/Blosc/c-blosc/blob/main/README_CHUNK_FORMAT.rst>`_ |
| 137 | +produces a sequence of bytes that begins with a 16-byte header |
| 138 | +followed by compressed data. |
| 139 | + |
| 140 | +The compressed sequence of bytes for each chunk is stored under a key |
| 141 | +formed from the index of the chunk within the grid of chunks |
| 142 | +representing the array. To form a string key for a chunk, the indices |
| 143 | +are converted to strings and concatenated with the period character |
| 144 | +('.') separating each index. For example, given an array with shape |
| 145 | +(10000, 10000) and chunk shape (1000, 1000) there will be 100 chunks |
| 146 | +laid out in a 10 by 10 grid. The chunk with indices (0, 0) provides |
| 147 | +data for rows 0-999 and columns 0-999 and is stored under the key |
| 148 | +'0.0'; the chunk with indices (2, 4) provides data for rows 2000-2999 |
| 149 | +and columns 4000-4999 and is stored under the key '2.4'; etc. |
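
The key construction described above can be sketched as follows. The helper names ``chunk_key`` and ``chunk_bounds`` are illustrative assumptions; ``chunk_bounds`` returns half-open ``(start, stop)`` ranges, so rows 2000-2999 appear as ``(2000, 3000)``.

```python
def chunk_key(indices):
    """Join the chunk indices with '.' to form the store key."""
    return ".".join(str(i) for i in indices)


def chunk_bounds(indices, chunks):
    """Half-open (start, stop) range covered by the chunk in each dimension."""
    return [(i * c, (i + 1) * c) for i, c in zip(indices, chunks)]


key = chunk_key((2, 4))
bounds = chunk_bounds((2, 4), (1000, 1000))
```

With these inputs ``key`` is ``'2.4'`` and ``bounds`` is ``[(2000, 3000), (4000, 5000)]``.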
| 150 | + |
| 151 | +There is no need for all chunks to be present within an array |
| 152 | +store. If a chunk is not present then it is considered to be in an |
| 153 | +uninitialized state. An uninitialized chunk MUST be treated as if it |
| 154 | +were uniformly filled with the value of the 'fill_value' field in the |
| 155 | +array metadata. If the 'fill_value' field is ``null`` then the |
| 156 | +contents of the chunk are undefined. |
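
A reader might handle missing chunks along the lines of the sketch below. It assumes a plain ``dict`` as the array store, ``'zlib'`` compression, a ``'<i4'`` data type and a non-null fill value; the function name ``read_chunk`` is hypothetical. The ``null`` case is not handled, because the chunk contents are then undefined.

```python
import struct
import zlib


def read_chunk(store, key, n_items, fill_value):
    """Return the values of one chunk as a list of '<i4' integers."""
    if key not in store:
        # Uninitialized chunk: behave as if uniformly filled with fill_value.
        return [fill_value] * n_items
    # Stored bytes are the compressor output with no extra header.
    raw = zlib.decompress(store[key])
    return list(struct.unpack("<%di" % n_items, raw))


# A toy store holding a single 4-item chunk under the key '0.0'.
store = {"0.0": zlib.compress(struct.pack("<4i", 1, 1, 1, 1), 1)}
present = read_chunk(store, "0.0", 4, 42)
absent = read_chunk(store, "0.1", 4, 42)
```

Here ``present`` holds the stored values and ``absent`` holds four copies of the fill value 42.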
| 157 | + |
| 158 | +Note that all chunks in an array have the same shape. If the length of |
| 159 | +any array dimension is not exactly divisible by the length of the |
| 160 | +corresponding chunk dimension then some chunks will overhang the edge |
| 161 | +of the array. The contents of any chunk region falling outside the |
| 162 | +array are undefined. |
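
The overhang can be computed as in the sketch below, which uses a hypothetical array of shape (25, 20) with chunk shape (10, 10). The function names ``chunk_grid`` and ``valid_extent`` are illustrative assumptions.

```python
def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (ceiling division)."""
    return tuple((s + c - 1) // c for s, c in zip(shape, chunks))


def valid_extent(indices, shape, chunks):
    """Length, per dimension, of the part of a chunk that lies inside the array."""
    return tuple(min(c, s - i * c) for i, s, c in zip(indices, shape, chunks))


grid = chunk_grid((25, 20), (10, 10))
edge = valid_extent((2, 1), (25, 20), (10, 10))
```

In this case ``grid`` is ``(3, 2)`` and ``edge`` is ``(5, 10)``: the last chunk along the first dimension holds only 5 defined rows, and the remaining 5 rows of that chunk are undefined.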
| 163 | + |
| 164 | +Attributes |
| 165 | +---------- |
| 166 | + |
| 167 | +Each array can also be associated with custom attributes, which are |
| 168 | +simple key/value items with application-specific meaning. Custom |
| 169 | +attributes are encoded as a JSON object and stored under the 'attrs' |
| 170 | +key within an array store. Even if the attributes are empty, the |
| 171 | +'attrs' key MUST be present within an array store. |
| 172 | + |
| 173 | +For example, the JSON object below encodes three attributes named |
| 174 | +'foo', 'bar' and 'baz':: |
| 175 | + |
| 176 | + { |
| 177 | + "foo": 42, |
| 178 | + "bar": "apples", |
| 179 | + "baz": [1, 2, 3, 4] |
| 180 | + } |
| 181 | + |
| 182 | +Example |
| 183 | +------- |
| 184 | + |
| 185 | +Below is an example of storing a Zarr array, using a directory on the |
| 186 | +local file system as storage. |
| 187 | + |
| 188 | +Initialize the store:: |
| 189 | + |
| 190 | + >>> import zarr |
| 191 | + >>> store = zarr.DirectoryStore('example.zarr') |
| 192 | + >>> zarr.init_store(store, shape=(20, 20), chunks=(10, 10), |
| 193 | + ... dtype='i4', fill_value=42, compression='zlib', |
| 194 | + ... compression_opts=1, overwrite=True) |
| 195 | + |
| 196 | +No chunks are initialized yet, so only the 'meta' and 'attrs' keys |
| 197 | +have been set:: |
| 198 | + |
| 199 | + >>> import os |
| 200 | + >>> sorted(os.listdir('example.zarr')) |
| 201 | + ['attrs', 'meta'] |
| 202 | + |
| 203 | +Inspect the array metadata:: |
| 204 | + |
| 205 | + >>> print(open('example.zarr/meta').read()) |
| 206 | + { |
| 207 | + "chunks": [ |
| 208 | + 10, |
| 209 | + 10 |
| 210 | + ], |
| 211 | + "compression": "zlib", |
| 212 | + "compression_opts": 1, |
| 213 | + "dtype": "<i4", |
| 214 | + "fill_value": 42, |
| 215 | + "order": "C", |
| 216 | + "shape": [ |
| 217 | + 20, |
| 218 | + 20 |
| 219 | + ], |
| 220 | + "zarr_format": 1 |
| 221 | + } |
| 222 | + |
| 223 | +Inspect the array attributes:: |
| 224 | + |
| 225 | + >>> print(open('example.zarr/attrs').read()) |
| 226 | + {} |
| 227 | + |
| 228 | +Set some data:: |
| 229 | + |
| 230 | + >>> z = zarr.Array(store) |
| 231 | + >>> z[0:10, 0:10] = 1 |
| 232 | + >>> sorted(os.listdir('example.zarr')) |
| 233 | + ['0.0', 'attrs', 'meta'] |
| 234 | + |
| 235 | +Set some more data:: |
| 236 | + |
| 237 | + >>> z[0:10, 10:20] = 2 |
| 238 | + >>> z[10:20, :] = 3 |
| 239 | + >>> sorted(os.listdir('example.zarr')) |
| 240 | + ['0.0', '0.1', '1.0', '1.1', 'attrs', 'meta'] |
| 241 | + |
| 242 | +Manually decompress a single chunk for illustration:: |
| 243 | + |
| 244 | + >>> import zlib |
| 245 | + >>> b = zlib.decompress(open('example.zarr/0.0', 'rb').read()) |
| 246 | + >>> import numpy as np |
| 247 | + >>> a = np.frombuffer(b, dtype='<i4') |
| 248 | + >>> a |
| 249 | + array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 250 | + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 251 | + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 252 | + 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
| 253 | + 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32) |
| 254 | + |
| 255 | +Modify the array attributes:: |
| 256 | + |
| 257 | + >>> z.attrs['foo'] = 42 |
| 258 | + >>> z.attrs['bar'] = 'apples' |
| 259 | + >>> z.attrs['baz'] = [1, 2, 3, 4] |
| 260 | + >>> print(open('example.zarr/attrs').read()) |
| 261 | + { |
| 262 | + "bar": "apples", |
| 263 | + "baz": [ |
| 264 | + 1, |
| 265 | + 2, |
| 266 | + 3, |
| 267 | + 4 |
| 268 | + ], |
| 269 | + "foo": 42 |
| 270 | + } |