.. _user-guide-performance:

Optimizing performance
======================

- .. ipython:: python
-    :suppress:
-
-    rm -r data/
+ >>> import shutil
+ >>> shutil.rmtree("./data", ignore_errors=True)

.. _user-guide-chunks:

@@ -25,48 +23,43 @@ The optimal chunk shape will depend on how you want to access the data. E.g.,
for a 2-dimensional array, if you only ever take slices along the first
dimension, then chunk across the second dimension. If you know you want to chunk
across an entire dimension you can use ``None`` or ``-1`` within the ``chunks``
- argument, e.g.:
-
- .. ipython:: python
-
-    import zarr
+ argument, e.g.::

-    z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
-    z1.chunks
+    >>> import zarr
+    >>>
+    >>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
+    >>> z1.chunks
+    (100, 10000)

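+ Since ``-1`` is described above as equivalent to ``None``, the following quick check
+ (the variable name ``z1b`` is purely illustrative, and the expected output assumes that
+ equivalence) should give the same chunk shape::
+
+    >>> z1b = zarr.zeros((10000, 10000), chunks=(100, -1), dtype='i4')
+    >>> z1b.chunks
+    (100, 10000)
+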
Alternatively, if you only ever take slices along the second dimension, then
- chunk across the first dimension, e.g.:
+ chunk across the first dimension, e.g.::

- .. ipython:: python
-
-    z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
-    z2.chunks
+    >>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
+    >>> z2.chunks
+    (10000, 100)

If you require reasonable performance for both access patterns then you need to
- find a compromise, e.g.:
-
- .. ipython:: python
+ find a compromise, e.g.::

-    z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
-    z3.chunks
+    >>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
+    >>> z3.chunks
+    (1000, 1000)

If you are feeling lazy, you can let Zarr guess a chunk shape for your data by
providing ``chunks=True``, although please note that the algorithm for guessing
- a chunk shape is based on simple heuristics and may be far from optimal. E.g.:
-
- .. ipython:: python
+ a chunk shape is based on simple heuristics and may be far from optimal. E.g.::

-    z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
-    z4.chunks
+    >>> z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
+    >>> z4.chunks
+    (625, 625)

If you know you are always going to be loading the entire array into memory, you
can turn off chunks by providing ``chunks=False``, in which case there will be
- one single chunk for the array:
+ one single chunk for the array::

- .. ipython:: python
-
-    z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
-    z5.chunks
+    >>> z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
+    >>> z5.chunks
+    (10000, 10000)

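+ As a quick illustrative check of the claim above, the single chunk spans the whole array,
+ so the chunk shape equals the array shape::
+
+    >>> z5.chunks == z5.shape
+    True
+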
.. _user-guide-chunks-order:

@@ -76,17 +69,43 @@ Chunk memory layout
The order of bytes **within each chunk** of an array can be changed via the
``order`` config option, to use either C or Fortran layout. For
multi-dimensional arrays, these two layouts may provide different compression
- ratios, depending on the correlation structure within the data. E.g.:
-
- .. ipython:: python
-
-    a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
-    # TODO: replace with create_array after #2463
-    c = zarr.array(a, chunks=(1000, 1000))
-    c.info_complete()
-    with zarr.config.set({'array.order': 'F'}):
-        f = zarr.array(a, chunks=(1000, 1000))
-    f.info_complete()
+ ratios, depending on the correlation structure within the data. E.g.::
+
+    >>> import numpy as np
+    >>>
+    >>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
+    >>> # TODO: replace with create_array after #2463
+    >>> c = zarr.array(a, chunks=(1000, 1000))
+    >>> c.info_complete()
+    Type : Array
+    Zarr format : 3
+    Data type : DataType.int32
+    Shape : (10000, 10000)
+    Chunk shape : (1000, 1000)
+    Order : C
+    Read-only : False
+    Store type : MemoryStore
+    Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+    No. bytes : 400000000 (381.5M)
+    No. bytes stored : 342588717
+    Storage ratio : 1.2
+    Chunks Initialized : 100
+    >>> with zarr.config.set({'array.order': 'F'}):
+    ...     f = zarr.array(a, chunks=(1000, 1000))
+    >>> f.info_complete()
+    Type : Array
+    Zarr format : 3
+    Data type : DataType.int32
+    Shape : (10000, 10000)
+    Chunk shape : (1000, 1000)
+    Order : F
+    Read-only : False
+    Store type : MemoryStore
+    Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+    No. bytes : 400000000 (381.5M)
+    No. bytes stored : 342588717
+    Storage ratio : 1.2
+    Chunks Initialized : 100

In the above example, Fortran order gives a better compression ratio. This is an
artificial example but illustrates the general point that changing the order of
@@ -112,45 +131,53 @@ If you know that your data will form chunks that are almost always non-empty, th
In this case, creating an array with ``write_empty_chunks=True`` (the default) will instruct Zarr to write every chunk without checking for emptiness.

The following example illustrates the effect of the ``write_empty_chunks`` flag on
- the time required to write an array with different values.:
-
- .. ipython:: python
-
-    import zarr
-    import numpy as np
-    import time
-
-    def timed_write(write_empty_chunks):
-        """
-        Measure the time required and number of objects created when writing
-        to a Zarr array with random ints or fill value.
-        """
-        chunks = (8192,)
-        shape = (chunks[0] * 1024,)
-        data = np.random.randint(0, 255, shape)
-        dtype = 'uint8'
-        with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
-            arr = zarr.open(
-                f"data/example-{write_empty_chunks}.zarr",
-                shape=shape,
-                chunks=chunks,
-                dtype=dtype,
-                fill_value=0,
-                mode='w'
-            )
-            # initialize all chunks
-            arr[:] = 100
-            result = []
-            for value in (data, arr.fill_value):
-                start = time.time()
-                arr[:] = value
-                elapsed = time.time() - start
-                result.append((elapsed, arr.nchunks_initialized))
-            return result
-    # log results
-    for write_empty_chunks in (True, False):
-        full, empty = timed_write(write_empty_chunks)
-        print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+ the time required to write an array with different values::
+
+    >>> import zarr
+    >>> import numpy as np
+    >>> import time
+    >>>
+    >>> def timed_write(write_empty_chunks):
+    ...     """
+    ...     Measure the time required and number of objects created when writing
+    ...     to a Zarr array with random ints or fill value.
+    ...     """
+    ...     chunks = (8192,)
+    ...     shape = (chunks[0] * 1024,)
+    ...     data = np.random.randint(0, 255, shape)
+    ...     dtype = 'uint8'
+    ...     with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
+    ...         arr = zarr.open(
+    ...             f"data/example-{write_empty_chunks}.zarr",
+    ...             shape=shape,
+    ...             chunks=chunks,
+    ...             dtype=dtype,
+    ...             fill_value=0,
+    ...             mode='w'
+    ...         )
+    ...         # initialize all chunks
+    ...         arr[:] = 100
+    ...         result = []
+    ...         for value in (data, arr.fill_value):
+    ...             start = time.time()
+    ...             arr[:] = value
+    ...             elapsed = time.time() - start
+    ...             result.append((elapsed, arr.nchunks_initialized))
+    ...         return result
+    ... # log results
+    >>> for write_empty_chunks in (True, False):
+    ...     full, empty = timed_write(write_empty_chunks)
+    ...     print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+
+    write_empty_chunks=True:
+        Random Data: 0.2044s, 1024 objects stored
+         Empty Data: 0.2036s, 1024 objects stored
+    <BLANKLINE>
+
+    write_empty_chunks=False:
+        Random Data: 0.2279s, 1024 objects stored
+         Empty Data: 0.1767s, 0 objects stored
+    <BLANKLINE>

In this example, writing random data is slightly slower with ``write_empty_chunks=True``,
but writing empty data is substantially faster and generates far fewer objects in storage.
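+
+ To see the same effect on a small array of your own, the following condensed sketch
+ (the array name and store path are hypothetical, and the chunk counts assume the
+ behaviour described above) shows that chunks containing only the fill value are skipped
+ when ``write_empty_chunks`` is disabled::
+
+    >>> with zarr.config.set({"array.write_empty_chunks": False}):
+    ...     small = zarr.open("data/example-small.zarr", mode='w', shape=(1024,),
+    ...                       chunks=(256,), dtype='uint8', fill_value=0)
+    ...     small[:] = 0               # all chunks equal the fill value, so nothing is stored
+    >>> small.nchunks_initialized
+    0
+    >>> small[:256] = 7                # the first chunk now holds non-fill data
+    >>> small.nchunks_initialized
+    1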
@@ -183,18 +210,18 @@ If an array or group is backed by a persistent store such as a :class:`zarr.
**are not** pickled. The only things that are pickled are the parameters necessary to allow the store
to re-open any underlying files or databases upon being unpickled.

- E.g., pickle/unpickle an local store array:
-
- .. ipython:: python
-
-    import pickle
-
-    # TODO: replace with create_array after #2463
-    z1 = zarr.array(store="data/example-2", data=np.arange(100000))
-    s = pickle.dumps(z1)
-    z2 = pickle.loads(s)
-    z1 == z2
-    np.all(z1[:] == z2[:])
+ E.g., pickle/unpickle a local store array::
+
+    >>> import pickle
+    >>>
+    >>> # TODO: replace with create_array after #2463
+    >>> z1 = zarr.array(store="data/example-2", data=np.arange(100000))
+    >>> s = pickle.dumps(z1)
+    >>> z2 = pickle.loads(s)
+    >>> z1 == z2
+    True
+    >>> np.all(z1[:] == z2[:])
+    np.True_
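+
+ Because only the store's configuration is pickled, the pickled byte string stays much
+ smaller than the array data itself; a rough, illustrative check (exact sizes will vary,
+ and the ``nbytes`` property is assumed to match the "No. bytes" figure reported by
+ ``info_complete`` above)::
+
+    >>> len(s) < z1.nbytes    # the pickle holds store parameters, not chunk data
+    True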

.. _user-guide-tips-blosc:
