Commit f42c176

Use doctests for preformance.rst
1 parent 0477b52 commit f42c176

1 file changed: +121 -94 lines changed

docs/user-guide/performance.rst

Lines changed: 121 additions & 94 deletions
@@ -1,12 +1,10 @@
 user-guide-performance
 
 Optimizing performance
-======================
+======================:
 
-.. ipython:: python
-   :suppress:
-
-   rm -r data/
+   >>> import shutil
+   >>> shutil.rmtree("./data", ignore_errors=True)
 
 .. _user-guide-chunks:
 
@@ -25,48 +23,43 @@ The optimal chunk shape will depend on how you want to access the data. E.g.,
 for a 2-dimensional array, if you only ever take slices along the first
 dimension, then chunk across the second dimension. If you know you want to chunk
 across an entire dimension you can use ``None`` or ``-1`` within the ``chunks``
-argument, e.g.:
-
-.. ipython:: python
-
-   import zarr
+argument, e.g.::
 
-   z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
-   z1.chunks
+   >>> import zarr
+   >>>
+   >>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
+   >>> z1.chunks
+   (100, 10000)
 
 Alternatively, if you only ever take slices along the second dimension, then
-chunk across the first dimension, e.g.:
+chunk across the first dimension, e.g.::
 
-.. ipython:: python
-
-   z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
-   z2.chunks
+   >>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
+   >>> z2.chunks
+   (10000, 100)
 
 If you require reasonable performance for both access patterns then you need to
-find a compromise, e.g.:
-
-.. ipython:: python
+find a compromise, e.g.::
 
-   z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
-   z3.chunks
+   >>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
+   >>> z3.chunks
+   (1000, 1000)
 
 If you are feeling lazy, you can let Zarr guess a chunk shape for your data by
 providing ``chunks=True``, although please note that the algorithm for guessing
-a chunk shape is based on simple heuristics and may be far from optimal. E.g.:
-
-.. ipython:: python
+a chunk shape is based on simple heuristics and may be far from optimal. E.g.::
 
-   z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
-   z4.chunks
+   >>> z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
+   >>> z4.chunks
+   (625, 625)
 
 If you know you are always going to be loading the entire array into memory, you
 can turn off chunks by providing ``chunks=False``, in which case there will be
-one single chunk for the array:
+one single chunk for the array::
 
-.. ipython:: python
-
-   z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
-   z5.chunks
+   >>> z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
+   >>> z5.chunks
+   (10000, 10000)
 
 .. _user-guide-chunks-order:
 
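Reviewer note on the hunk above: the trade-off between the three chunk shapes can be made concrete by counting how many chunks a single-row or single-column read would overlap. The sketch below is plain Python and does not call Zarr; the helper ``chunks_touched`` and the selection format are assumptions made only for illustration.

```python
def chunks_touched(chunks, selection):
    """Count the chunks a rectangular selection overlaps.

    `chunks` is the chunk shape; `selection` is a tuple of (start, stop)
    pairs, one per dimension (hypothetical format, not a Zarr API).
    """
    count = 1
    for (start, stop), chunk_len in zip(selection, chunks):
        first = start // chunk_len        # index of the first chunk hit
        last = (stop - 1) // chunk_len    # index of the last chunk hit
        count *= last - first + 1
    return count


# Selections on a (10000, 10000) array, as in the documentation examples.
row_slice = ((0, 1), (0, 10000))   # one row: a slice along the first dimension
col_slice = ((0, 10000), (0, 1))   # one column: a slice along the second dimension

for chunks in [(100, 10000), (10000, 100), (1000, 1000)]:
    print(chunks, chunks_touched(chunks, row_slice), chunks_touched(chunks, col_slice))
```

With these shapes, a row read overlaps 1, 100 and 10 chunks respectively, and a column read overlaps 100, 1 and 10, which is why ``(1000, 1000)`` is described as a compromise. The count ignores how many bytes each chunk holds, so it is only a rough guide.
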
@@ -76,17 +69,43 @@ Chunk memory layout
 The order of bytes **within each chunk** of an array can be changed via the
 ``order`` config option, to use either C or Fortran layout. For
 multi-dimensional arrays, these two layouts may provide different compression
-ratios, depending on the correlation structure within the data. E.g.:
-
-.. ipython:: python
-
-   a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
-   # TODO: replace with create_array after #2463
-   c = zarr.array(a, chunks=(1000, 1000))
-   c.info_complete()
-   with zarr.config.set({'array.order': 'F'}):
-       f = zarr.array(a, chunks=(1000, 1000))
-   f.info_complete()
+ratios, depending on the correlation structure within the data. E.g.::
+
+   >>> import numpy as np
+   >>>
+   >>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
+   >>> # TODO: replace with create_array after #2463
+   >>> c = zarr.array(a, chunks=(1000, 1000))
+   >>> c.info_complete()
+   Type : Array
+   Zarr format : 3
+   Data type : DataType.int32
+   Shape : (10000, 10000)
+   Chunk shape : (1000, 1000)
+   Order : C
+   Read-only : False
+   Store type : MemoryStore
+   Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+   No. bytes : 400000000 (381.5M)
+   No. bytes stored : 342588717
+   Storage ratio : 1.2
+   Chunks Initialized : 100
+   >>> with zarr.config.set({'array.order': 'F'}):
+   ...     f = zarr.array(a, chunks=(1000, 1000))
+   >>> f.info_complete()
+   Type : Array
+   Zarr format : 3
+   Data type : DataType.int32
+   Shape : (10000, 10000)
+   Chunk shape : (1000, 1000)
+   Order : F
+   Read-only : False
+   Store type : MemoryStore
+   Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+   No. bytes : 400000000 (381.5M)
+   No. bytes stored : 342588717
+   Storage ratio : 1.2
+   Chunks Initialized : 100
 
 In the above example, Fortran order gives a better compression ratio. This is an
 artificial example but illustrates the general point that changing the order of
@@ -112,45 +131,53 @@ If you know that your data will form chunks that are almost always non-empty, th
 In this case, creating an array with ``write_empty_chunks=True`` (the default) will instruct Zarr to write every chunk without checking for emptiness.
 
 The following example illustrates the effect of the ``write_empty_chunks`` flag on
-the time required to write an array with different values.:
-
-.. ipython:: python
-
-   import zarr
-   import numpy as np
-   import time
-
-   def timed_write(write_empty_chunks):
-       """
-       Measure the time required and number of objects created when writing
-       to a Zarr array with random ints or fill value.
-       """
-       chunks = (8192,)
-       shape = (chunks[0] * 1024,)
-       data = np.random.randint(0, 255, shape)
-       dtype = 'uint8'
-       with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
-           arr = zarr.open(
-               f"data/example-{write_empty_chunks}.zarr",
-               shape=shape,
-               chunks=chunks,
-               dtype=dtype,
-               fill_value=0,
-               mode='w'
-           )
-       # initialize all chunks
-       arr[:] = 100
-       result = []
-       for value in (data, arr.fill_value):
-           start = time.time()
-           arr[:] = value
-           elapsed = time.time() - start
-           result.append((elapsed, arr.nchunks_initialized))
-       return result
-   # log results
-   for write_empty_chunks in (True, False):
-       full, empty = timed_write(write_empty_chunks)
-       print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+the time required to write an array with different values.::
+
+   >>> import zarr
+   >>> import numpy as np
+   >>> import time
+   >>>
+   >>> def timed_write(write_empty_chunks):
+   ...     """
+   ...     Measure the time required and number of objects created when writing
+   ...     to a Zarr array with random ints or fill value.
+   ...     """
+   ...     chunks = (8192,)
+   ...     shape = (chunks[0] * 1024,)
+   ...     data = np.random.randint(0, 255, shape)
+   ...     dtype = 'uint8'
+   ...     with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
+   ...         arr = zarr.open(
+   ...             f"data/example-{write_empty_chunks}.zarr",
+   ...             shape=shape,
+   ...             chunks=chunks,
+   ...             dtype=dtype,
+   ...             fill_value=0,
+   ...             mode='w'
+   ...         )
+   ...     # initialize all chunks
+   ...     arr[:] = 100
+   ...     result = []
+   ...     for value in (data, arr.fill_value):
+   ...         start = time.time()
+   ...         arr[:] = value
+   ...         elapsed = time.time() - start
+   ...         result.append((elapsed, arr.nchunks_initialized))
+   ...     return result
+   ... # log results
+   >>> for write_empty_chunks in (True, False):
+   ...     full, empty = timed_write(write_empty_chunks)
+   ...     print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+
+   write_empty_chunks=True:
+   Random Data: 0.2044s, 1024 objects stored
+   Empty Data: 0.2036s, 1024 objects stored
+   <BLANKLINE>
+
+   write_empty_chunks=False:
+   Random Data: 0.2279s, 1024 objects stored
+   Empty Data: 0.1767s, 0 objects stored
+   <BLANKLINE>
 
 In this example, writing random data is slightly slower with ``write_empty_chunks=False``,
 but writing empty data is substantially faster and generates far fewer objects in storage.
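
Reviewer note on the hunk above: the pattern in the recorded timings follows from an extra pass over each chunk when empty chunks are not written. The toy model below is a minimal sketch of that behaviour using only the standard library; ``write_chunks`` and the dict-backed store are hypothetical and are not Zarr's implementation.

```python
def write_chunks(store, data, chunk_len, fill_value=0, write_empty_chunks=True):
    """Toy model of chunked writes into a dict-backed store (not Zarr's code).

    Returns the number of objects held by the store after the write.
    """
    for index, start in enumerate(range(0, len(data), chunk_len)):
        chunk = data[start:start + chunk_len]
        if not write_empty_chunks and all(v == fill_value for v in chunk):
            # The emptiness check costs one pass over the chunk, but an
            # all-fill-value chunk is dropped instead of being stored.
            store.pop(index, None)
            continue
        store[index] = bytes(chunk)
    return len(store)


# Deterministic stand-in for random data: values 1..251, never the fill value.
random_like = [(i * 37) % 251 + 1 for i in range(8192 * 4)]
empty = [0] * len(random_like)

for flag in (True, False):
    store = {}
    n_full = write_chunks(store, random_like, 8192, write_empty_chunks=flag)
    n_empty = write_chunks(store, empty, 8192, write_empty_chunks=flag)
    print(f"write_empty_chunks={flag}: {n_full} objects, then {n_empty} objects")
```

In this model, non-empty data pays for the check without saving anything, while all-fill-value data leaves no objects behind, mirroring the 1024 versus 0 objects in the recorded output.
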
@@ -183,18 +210,18 @@ If an array or group is backed by a persistent store such as the a :class:`zarr.
 **are not** pickled. The only thing that is pickled is the necessary parameters to allow the store
 to re-open any underlying files or databases upon being unpickled.
 
-E.g., pickle/unpickle an local store array:
-
-.. ipython:: python
-
-   import pickle
-
-   # TODO: replace with create_array after #2463
-   z1 = zarr.array(store="data/example-2", data=np.arange(100000))
-   s = pickle.dumps(z1)
-   z2 = pickle.loads(s)
-   z1 == z2
-   np.all(z1[:] == z2[:])
+E.g., pickle/unpickle an local store array::
+
+   >>> import pickle
+   >>>
+   >>> # TODO: replace with create_array after #2463
+   >>> z1 = zarr.array(store="data/example-2", data=np.arange(100000))
+   >>> s = pickle.dumps(z1)
+   >>> z2 = pickle.loads(s)
+   >>> z1 == z2
+   True
+   >>> np.all(z1[:] == z2[:])
+   np.True_
 
 .. _user-guide-tips-blosc:
 
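Reviewer note on the hunk above: the statement that only the parameters needed to reopen the store are pickled can be illustrated with the standard ``__getstate__``/``__setstate__`` protocol. The class ``FileBackedStore`` below is a hypothetical stand-in written for this note; it is not how Zarr's stores are implemented.

```python
import os
import pickle
import tempfile


class FileBackedStore:
    """Hypothetical store that pickles only its path and reopens on unpickle."""

    def __init__(self, path):
        self.path = path
        self._handle = open(path, "rb")  # open file handles cannot be pickled

    def read(self):
        self._handle.seek(0)
        return self._handle.read()

    def close(self):
        self._handle.close()

    def __getstate__(self):
        # Keep only what is needed to reopen; drop the handle and the data.
        return {"path": self.path}

    def __setstate__(self, state):
        # Reopen the underlying file from the saved parameters.
        self.__init__(state["path"])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunk.bin")
    payload = bytes(100_000)
    with open(path, "wb") as fh:
        fh.write(payload)

    store = FileBackedStore(path)
    pickled = pickle.dumps(store)
    restored = pickle.loads(pickled)

    pickled_size = len(pickled)
    same_data = restored.read() == store.read()
    store.close()
    restored.close()

print(pickled_size < len(payload), same_data)
```

The pickled bytes hold little more than the path, so they stay far smaller than the 100,000-byte payload, yet the restored object reads the same data because it reopens the file.
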