.. _user-guide-performance:

Optimizing performance
======================

- .. ipython:: python
-    :suppress:
-
-    rm -r data/
+ >>> import shutil
+ >>> shutil.rmtree("./data", ignore_errors=True)

.. _user-guide-chunks:

@@ -25,48 +23,43 @@ The optimal chunk shape will depend on how you want to access the data. E.g.,
for a 2-dimensional array, if you only ever take slices along the first
dimension, then chunk across the second dimension. If you know you want to chunk
across an entire dimension you can use ``None`` or ``-1`` within the ``chunks``
- argument, e.g.:
-
- .. ipython:: python
-
-    import zarr
+ argument, e.g.::

-    z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
-    z1.chunks
+    >>> import zarr
+    >>>
+    >>> z1 = zarr.zeros((10000, 10000), chunks=(100, None), dtype='i4')
+    >>> z1.chunks
+    (100, 10000)

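+ Since ``-1`` is described above as equivalent to ``None``, the following quick check
+ (the variable name ``z1b`` is purely illustrative, and the expected output assumes that
+ equivalence) should give the same chunk shape::
+
+    >>> z1b = zarr.zeros((10000, 10000), chunks=(100, -1), dtype='i4')
+    >>> z1b.chunks
+    (100, 10000)
+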
Alternatively, if you only ever take slices along the second dimension, then
- chunk across the first dimension, e.g.:
+ chunk across the first dimension, e.g.::

- .. ipython:: python
-
-    z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
-    z2.chunks
+    >>> z2 = zarr.zeros((10000, 10000), chunks=(None, 100), dtype='i4')
+    >>> z2.chunks
+    (10000, 100)

If you require reasonable performance for both access patterns then you need to
- find a compromise, e.g.:
-
- .. ipython:: python
+ find a compromise, e.g.::

-    z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
-    z3.chunks
+    >>> z3 = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
+    >>> z3.chunks
+    (1000, 1000)

If you are feeling lazy, you can let Zarr guess a chunk shape for your data by
providing ``chunks=True``, although please note that the algorithm for guessing
- a chunk shape is based on simple heuristics and may be far from optimal. E.g.:
-
- .. ipython:: python
+ a chunk shape is based on simple heuristics and may be far from optimal. E.g.::

-    z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
-    z4.chunks
+    >>> z4 = zarr.zeros((10000, 10000), chunks=True, dtype='i4')
+    >>> z4.chunks
+    (625, 625)

If you know you are always going to be loading the entire array into memory, you
can turn off chunks by providing ``chunks=False``, in which case there will be
- one single chunk for the array:
+ one single chunk for the array::

- .. ipython:: python
-
-    z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
-    z5.chunks
+    >>> z5 = zarr.zeros((10000, 10000), chunks=False, dtype='i4')
+    >>> z5.chunks
+    (10000, 10000)

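+ As a quick illustrative check of the claim above, the single chunk spans the whole array,
+ so the chunk shape equals the array shape::
+
+    >>> z5.chunks == z5.shape
+    True
+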
.. _user-guide-chunks-order:

@@ -76,17 +69,43 @@ Chunk memory layout
The order of bytes **within each chunk** of an array can be changed via the
``order`` config option, to use either C or Fortran layout. For
multi-dimensional arrays, these two layouts may provide different compression
- ratios, depending on the correlation structure within the data. E.g.:
-
- .. ipython:: python
-
-    a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
-    # TODO: replace with create_array after #2463
-    c = zarr.array(a, chunks=(1000, 1000))
-    c.info_complete()
-    with zarr.config.set({'array.order': 'F'}):
-        f = zarr.array(a, chunks=(1000, 1000))
-    f.info_complete()
+ ratios, depending on the correlation structure within the data. E.g.::
+
+    >>> import numpy as np
+    >>>
+    >>> a = np.arange(100000000, dtype='i4').reshape(10000, 10000).T
+    >>> # TODO: replace with create_array after #2463
+    >>> c = zarr.array(a, chunks=(1000, 1000))
+    >>> c.info_complete()
+    Type : Array
+    Zarr format : 3
+    Data type : DataType.int32
+    Shape : (10000, 10000)
+    Chunk shape : (1000, 1000)
+    Order : C
+    Read-only : False
+    Store type : MemoryStore
+    Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+    No. bytes : 400000000 (381.5M)
+    No. bytes stored : 342588717
+    Storage ratio : 1.2
+    Chunks Initialized : 100
+    >>> with zarr.config.set({'array.order': 'F'}):
+    ...     f = zarr.array(a, chunks=(1000, 1000))
+    >>> f.info_complete()
+    Type : Array
+    Zarr format : 3
+    Data type : DataType.int32
+    Shape : (10000, 10000)
+    Chunk shape : (1000, 1000)
+    Order : F
+    Read-only : False
+    Store type : MemoryStore
+    Codecs : [{'endian': <Endian.little: 'little'>}, {'level': 0, 'checksum': False}]
+    No. bytes : 400000000 (381.5M)
+    No. bytes stored : 342588717
+    Storage ratio : 1.2
+    Chunks Initialized : 100

In the above example, Fortran order gives a better compression ratio. This is an
artificial example but illustrates the general point that changing the order of
@@ -112,45 +131,53 @@ If you know that your data will form chunks that are almost always non-empty, th
In this case, creating an array with ``write_empty_chunks=True`` (the default) will instruct Zarr to write every chunk without checking for emptiness.

The following example illustrates the effect of the ``write_empty_chunks`` flag on
- the time required to write an array with different values.:
-
- .. ipython:: python
-
-    import zarr
-    import numpy as np
-    import time
-
-    def timed_write(write_empty_chunks):
-        """
-        Measure the time required and number of objects created when writing
-        to a Zarr array with random ints or fill value.
-        """
-        chunks = (8192,)
-        shape = (chunks[0] * 1024,)
-        data = np.random.randint(0, 255, shape)
-        dtype = 'uint8'
-        with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
-            arr = zarr.open(
-                f"data/example-{write_empty_chunks}.zarr",
-                shape=shape,
-                chunks=chunks,
-                dtype=dtype,
-                fill_value=0,
-                mode='w'
-            )
-            # initialize all chunks
-            arr[:] = 100
-            result = []
-            for value in (data, arr.fill_value):
-                start = time.time()
-                arr[:] = value
-                elapsed = time.time() - start
-                result.append((elapsed, arr.nchunks_initialized))
-            return result
-    # log results
-    for write_empty_chunks in (True, False):
-        full, empty = timed_write(write_empty_chunks)
-        print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+ the time required to write an array with different values::
+
+    >>> import zarr
+    >>> import numpy as np
+    >>> import time
+    >>>
+    >>> def timed_write(write_empty_chunks):
+    ...     """
+    ...     Measure the time required and number of objects created when writing
+    ...     to a Zarr array with random ints or fill value.
+    ...     """
+    ...     chunks = (8192,)
+    ...     shape = (chunks[0] * 1024,)
+    ...     data = np.random.randint(0, 255, shape)
+    ...     dtype = 'uint8'
+    ...     with zarr.config.set({"array.write_empty_chunks": write_empty_chunks}):
+    ...         arr = zarr.open(
+    ...             f"data/example-{write_empty_chunks}.zarr",
+    ...             shape=shape,
+    ...             chunks=chunks,
+    ...             dtype=dtype,
+    ...             fill_value=0,
+    ...             mode='w'
+    ...         )
+    ...         # initialize all chunks
+    ...         arr[:] = 100
+    ...         result = []
+    ...         for value in (data, arr.fill_value):
+    ...             start = time.time()
+    ...             arr[:] = value
+    ...             elapsed = time.time() - start
+    ...             result.append((elapsed, arr.nchunks_initialized))
+    ...         return result
+    ... # log results
+    >>> for write_empty_chunks in (True, False):
+    ...     full, empty = timed_write(write_empty_chunks)
+    ...     print(f'\nwrite_empty_chunks={write_empty_chunks}:\n\tRandom Data: {full[0]:.4f}s, {full[1]} objects stored\n\t Empty Data: {empty[0]:.4f}s, {empty[1]} objects stored\n')
+
+    write_empty_chunks=True:
+        Random Data: 0.2044s, 1024 objects stored
+         Empty Data: 0.2036s, 1024 objects stored
+    <BLANKLINE>
+
+    write_empty_chunks=False:
+        Random Data: 0.2279s, 1024 objects stored
+         Empty Data: 0.1767s, 0 objects stored
+    <BLANKLINE>

In this example, writing random data is slightly slower with ``write_empty_chunks=True``,
but writing empty data is substantially faster and generates far fewer objects in storage.
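+
+ To see the same effect on a small array of your own, the following condensed sketch
+ (the array name and store path are hypothetical, and the chunk counts assume the
+ behaviour described above) shows that chunks containing only the fill value are skipped
+ when ``write_empty_chunks`` is disabled::
+
+    >>> with zarr.config.set({"array.write_empty_chunks": False}):
+    ...     small = zarr.open("data/example-small.zarr", mode='w', shape=(1024,),
+    ...                       chunks=(256,), dtype='uint8', fill_value=0)
+    ...     small[:] = 0               # all chunks equal the fill value, so nothing is stored
+    >>> small.nchunks_initialized
+    0
+    >>> small[:256] = 7                # the first chunk now holds non-fill data
+    >>> small.nchunks_initialized
+    1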
@@ -183,18 +210,18 @@ If an array or group is backed by a persistent store such as a :class:`zarr.
**are not** pickled. The only things that are pickled are the parameters necessary to allow the store
to re-open any underlying files or databases upon being unpickled.

- E.g., pickle/unpickle an local store array:
-
- .. ipython:: python
-
-    import pickle
-
-    # TODO: replace with create_array after #2463
-    z1 = zarr.array(store="data/example-2", data=np.arange(100000))
-    s = pickle.dumps(z1)
-    z2 = pickle.loads(s)
-    z1 == z2
-    np.all(z1[:] == z2[:])
+ E.g., pickle/unpickle a local store array::
+
+    >>> import pickle
+    >>>
+    >>> # TODO: replace with create_array after #2463
+    >>> z1 = zarr.array(store="data/example-2", data=np.arange(100000))
+    >>> s = pickle.dumps(z1)
+    >>> z2 = pickle.loads(s)
+    >>> z1 == z2
+    True
+    >>> np.all(z1[:] == z2[:])
+    np.True_
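+
+ Because only the store's configuration is pickled, the pickled byte string stays much
+ smaller than the array data itself; a rough, illustrative check (exact sizes will vary,
+ and the ``nbytes`` property is assumed to match the "No. bytes" figure reported by
+ ``info_complete`` above)::
+
+    >>> len(s) < z1.nbytes    # the pickle holds store parameters, not chunk data
+    True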

.. _user-guide-tips-blosc:
