Commit df8c902

document persistence layer, mostly resolves #5
1 parent c8837f7 commit df8c902

3 files changed, 132 additions and 7 deletions

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -67,3 +67,4 @@ zarr/version.py
6767

6868
# test data
6969
*.zarr
70+
*~

PERSISTENCE.rst

Lines changed: 124 additions & 0 deletions
@@ -0,0 +1,124 @@
1+
zarr - Persistence
2+
==================
3+
4+
This document describes the file organisation and formats used to save zarr
5+
arrays on disk.
6+
7+
All data and metadata associated with a zarr array is stored within a
8+
directory on the file system. Within this directory there are a number
9+
of files and sub-directories storing different components of the data
10+
and metadata. Here I'll refer to a directory containing a zarr array
11+
as a root directory.
12+
13+
Configuration metadata
14+
----------------------
15+
16+
Within a root directory, a file called "__zmeta__" contains essential
17+
configuration metadata about the array. This comprises the shape of the
18+
array, chunk shape, data type (dtype), compression library,
19+
compression level, shuffle filter and default fill value for
20+
uninitialised portions of the array. The format of this file is JSON.
21+
22+
Mandatory fields and allowed values are as follows:
23+
24+
* ``shape`` - list of integers - the size of each dimension of the array
25+
* ``chunks`` - list of integers - the size of each dimension of a chunk, i.e., the chunk shape
26+
* ``dtype`` - string or list of lists - a description of the data type, following Numpy convention
27+
* ``fill_value`` - scalar value - value to use for uninitialised portions of the array
28+
* ``cname`` - string - name of the compression library used
29+
* ``clevel`` - integer - compression level
30+
* ``shuffle`` - integer - shuffle filter (0 = no shuffle, 1 = byte shuffle, 2 = bit shuffle)
31+
32+
For example::
33+
34+
    >>> import zarr
35+
    >>> z = zarr.open('example.zarr', mode='w', shape=(1000000, 1000),
36+
    ...               chunks=(10000, 100), dtype='i4', fill_value=42,
37+
    ...               cname='lz4', clevel=3, shuffle=1)
38+
    >>> print(open('example.zarr/__zmeta__').read())
39+
    {
40+
        "chunks": [
41+
            10000,
42+
            100
43+
        ],
44+
        "clevel": 3,
45+
        "cname": "lz4",
46+
        "dtype": "<i4",
47+
        "fill_value": 42,
48+
        "shape": [
49+
            1000000,
50+
            1000
51+
        ],
52+
        "shuffle": 1
53+
    }
54+
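A reader of this format might sanity-check the configuration file along the following lines. This is a minimal sketch using only the Python standard library, not zarr's own loading code. The helper name ``load_zmeta`` and the particular checks are assumptions made for illustration; the field names come from the list of mandatory fields above.

```python
import json
import os

# Mandatory fields, as listed in the section above.
MANDATORY_FIELDS = ("shape", "chunks", "dtype", "fill_value",
                    "cname", "clevel", "shuffle")


def load_zmeta(root):
    """Read and sanity-check the __zmeta__ file in a root directory.

    Hypothetical helper for illustration; not part of the zarr API.
    """
    with open(os.path.join(root, "__zmeta__")) as f:
        meta = json.load(f)

    # Every mandatory field must be present.
    missing = [k for k in MANDATORY_FIELDS if k not in meta]
    if missing:
        raise ValueError("missing mandatory fields: " + ", ".join(missing))

    # Assumption: a chunk shape has one entry per array dimension.
    if len(meta["shape"]) != len(meta["chunks"]):
        raise ValueError("shape and chunks differ in number of dimensions")

    # 0 = no shuffle, 1 = byte shuffle, 2 = bit shuffle.
    if meta["shuffle"] not in (0, 1, 2):
        raise ValueError("shuffle must be 0, 1 or 2")

    return meta
```

For the array created in the example above, ``load_zmeta('example.zarr')`` would return the parsed JSON as a dict.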
55+
User metadata (attributes)
56+
--------------------------
57+
58+
Within a root directory, a file called "__zattr__" contains user
59+
metadata associated with the array, i.e., user attributes. The format
60+
of this file is JSON.
61+
62+
For example::
63+
64+
    >>> import zarr
65+
    >>> z = zarr.open('example.zarr', mode='w', shape=(1000000, 1000),
66+
    ...               chunks=(10000, 100), dtype='i4', fill_value=42,
67+
    ...               cname='lz4', clevel=3, shuffle=1)
68+
    >>> z.attrs['foo'] = 42
69+
    >>> z.attrs['bar'] = 4.2
70+
    >>> z.attrs['baz'] = 'quux'
71+
    >>> print(open('example.zarr/__zattr__').read())
72+
73+
TODO add results above
74+
75+
Array data
76+
----------
77+
78+
Within a root directory, a sub-directory called "__zdata__" contains
79+
the array data. The array data is divided into chunks, each of which
80+
is compressed using the [blosc meta-compression library](TODO). Each
81+
chunk is stored in a separate file.
82+
83+
The chunk files are named according to the chunk indices. E.g., for a
84+
2-dimensional array with shape (100, 100) and chunk shape (10, 10)
85+
there will be 100 chunks in total. The file "0.0.blosc" stores data
86+
for the chunk with indices (0, 0) within chunk rows and columns
87+
respectively, i.e., the first chunk, containing data for the segment
88+
of the array that would be obtained by the slice ``z[0:10, 0:10]``;
89+
the file "4.2.blosc" stores the chunk in the fifth row third column,
90+
containing data for the slice ``z[40:50, 20:30]``; etc.
91+
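The naming scheme above can be sketched in a few lines of Python. The function names ``chunk_filename`` and ``chunk_slices`` are hypothetical and are not part of zarr. Clipping the last chunk along a dimension to the array bounds is an assumption for shapes that are not an exact multiple of the chunk shape.

```python
def chunk_filename(chunk_indices):
    """File name for the chunk with the given indices, e.g. (4, 2)."""
    return ".".join(str(i) for i in chunk_indices) + ".blosc"


def chunk_slices(chunk_indices, chunks, shape):
    """Slices of the array covered by the chunk with the given indices.

    Assumption: a chunk at the edge of the array is clipped to the
    array shape.
    """
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(chunk_indices, chunks, shape)
    )


print(chunk_filename((4, 2)))                        # 4.2.blosc
print(chunk_slices((4, 2), (10, 10), (100, 100)))    # rows 40:50, columns 20:30
```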
92+
Each chunk file is a binary file following the blosc version 1 format,
93+
comprising a 16 byte header followed by the compressed data. The
94+
header is organised as follows::
95+
96+
    |-0-|-1-|-2-|-3-|-4-|-5-|-6-|-7-|-8-|-9-|-A-|-B-|-C-|-D-|-E-|-F-|
97+
      ^   ^   ^   ^ |     nbytes    |   blocksize   |     cbytes    |
98+
      |   |   |   |
99+
      |   |   |   +--typesize
100+
      |   |   +------flags
101+
      |   +----------blosclz version
102+
      +--------------blosc version
103+
104+
For more details on the header, see the [C-Blosc header
105+
description](https://github.com/Blosc/c-blosc/blob/master/README_HEADER.rst).
106+
107+
If a file does not exist on the file system for any given chunk in an
108+
array, that indicates the chunk has not been initialised, and the
109+
chunk should be interpreted as completely filled with whatever value
110+
has been configured as the fill value for the array. I.e., chunk files
111+
are not required to exist.
112+
113+
For example::
114+
115+
    >>> import zarr
116+
    >>> z = zarr.open('example.zarr', mode='w', shape=(1000000, 1000),
117+
    ...               chunks=(10000, 100), dtype='i4', fill_value=42,
118+
    ...               cname='lz4', clevel=3, shuffle=1)
119+
    >>> import os
120+
    >>> os.listdir('example.zarr/__zdata__')
121+
    []
122+
    >>> z[:] = 0
123+
    >>> sorted(os.listdir('example.zarr/__zdata__'))[:5]
124+
    ['0.0.blosc', '0.1.blosc', '0.2.blosc', '0.3.blosc', '0.4.blosc']
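Because a missing chunk file simply means "not yet initialised", the number of initialised chunks can be found by checking which chunk files exist. The sketch below shows one possible way to do that with the standard library; it is not necessarily how zarr computes the ``initialized`` count shown in its array repr. The function names are hypothetical, and counting a partial chunk at the edge of the array as a full chunk is an assumption.

```python
import itertools
import os


def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (integer ceiling division)."""
    return tuple(-(-s // c) for s, c in zip(shape, chunks))


def count_initialised(root, shape, chunks):
    """Return (n_initialised, n_total) for the array stored under root."""
    data_dir = os.path.join(root, "__zdata__")
    n_init = 0
    n_total = 0
    grid = chunk_grid_shape(shape, chunks)
    # Visit every chunk index tuple, e.g. (0, 0), (0, 1), ...
    for idx in itertools.product(*(range(n) for n in grid)):
        n_total += 1
        name = ".".join(str(i) for i in idx) + ".blosc"
        if os.path.exists(os.path.join(data_dir, name)):
            n_init += 1
    return n_init, n_total
```

Any chunk not counted as initialised would be read back as entirely filled with ``fill_value``.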

README.rst

Lines changed: 7 additions & 7 deletions
@@ -10,8 +10,8 @@ Python.
1010
Installation
1111
------------
1212

13-
Installation requires NumPy and Cython pre-installed. Currently only
14-
compatible with Python >= 3.4.
13+
Installation requires Numpy and Cython pre-installed. Can only be installed on
14+
Linux currently.
1515

1616
Install from PyPI::
1717

@@ -114,10 +114,6 @@ append data to any axis
114114
115115
>>> a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
116116
>>> z = zarr.array(a, chunks=(1000, 100))
117-
>>> z
118-
zarr.ext.SynchronizedArray((10000, 1000), int32, chunks=(1000, 100))
119-
cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE)
120-
nbytes: 38.1M; cbytes: 2.0M; ratio: 19.3; initialized: 100/100
121117
>>> z.append(a+a)
122118
>>> z
123119
zarr.ext.SynchronizedArray((20000, 1000), int32, chunks=(1000, 100))
@@ -129,6 +125,9 @@ append data to any axis
129125
cname: blosclz; clevel: 5; shuffle: 1 (BYTESHUFFLE)
130126
nbytes: 152.6M; cbytes: 7.6M; ratio: 20.2; initialized: 400/400
131127
128+
Persistence
129+
-----------
130+
132131
Create a persistent array (data stored on disk)
133132

134133
.. code-block:: python
@@ -157,7 +156,8 @@ If you're working with really big arrays, try the 'lazy' option
157156
nbytes: 3.6P; cbytes: 0; initialized: 0/1000000000
158157
mode: a; path: big.zarr
159158
160-
Yes, that is 3.6 petabytes.
159+
See the [persistence documentation](PERSISTENCE.rst) for more details of the
160+
file format.
161161

162162
Tuning
163163
------
