@@ -223,3 +223,73 @@ You can find the notebooks for these benchmarks at:
https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-small.ipynb

https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-dask-large.ipynb

Reductions and disk-based computations
======================================

One of the most interesting features of the new computing engine in Python-Blosc2 is
the ability to perform reductions on compressed data that can optionally be stored
on disk. This makes it possible to perform calculations on data that is too large
to fit in memory.

Here is a simple example:

.. code-block:: python

    import numpy as np

    import blosc2

    N = 20_000  # for small scenario
    # N = 100_000  # for large scenario
    a = blosc2.linspace(0, 1, N * N, shape=(N, N), urlpath="a.b2nd", mode="w")
    b = blosc2.linspace(1, 2, N * N, shape=(N, N), urlpath="b.b2nd", mode="w")
    c = blosc2.linspace(-10, 10, N * N, shape=(N, N))  # small and in-memory
    # Expression
    expr = np.sum(((a**3 + np.sin(a * 2)) < c) & (b > 0), axis=1)

    # Evaluate and get an NDArray as result
    out = expr.compute()
    print(out.info)

In this case, we are performing a reduction that computes the sum of the boolean
array resulting from an expression. The result is a 1D array that will be
stored in memory, but it can optionally be stored on disk instead (via the
``out=`` parameter of the ``compute()`` or ``sum()`` methods).

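For intuition about the reduction semantics, here is a tiny plain-NumPy replica of the expression above (a sketch only: the NumPy arrays stand in for the compressed Blosc2 operands, and the small ``N`` is just for illustration):

```python
import numpy as np

# Tiny stand-ins for the Blosc2 operands (same linspace parameters as above)
N = 4
a = np.linspace(0, 1, N * N).reshape(N, N)
b = np.linspace(1, 2, N * N).reshape(N, N)
c = np.linspace(-10, 10, N * N).reshape(N, N)

# Same boolean expression and reduction: count, per row, where the
# condition holds; the result is a 1D array of length N
out = np.sum(((a**3 + np.sin(a * 2)) < c) & (b > 0), axis=1)
print(out.shape)  # (4,)
```

Summing a boolean array counts its ``True`` entries, so each element of ``out`` is the per-row count of positions where the condition holds.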
Here you can see a plot of how this performs for a series of operands that
vary in size (run on a modern desktop machine, using Linux and an 8-core
AMD CPU, with a large 96 MB L3 cache and 64 GB of RAM):

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/reduc-float64-amd.png?raw=true
   :width: 100%
   :alt: Performance vs operand sizes for reductions

As you can see, we are expressing the performance in GB/s, which is a very
useful metric when dealing with large datasets. The performance is quite
good and, when compression is used, it stays constant across all operand sizes,
which is a sign that Blosc2 is using the CPU caches (and memory) efficiently.

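The GB/s figure is simply the total number of operand bytes divided by the wall-clock time of the computation; a minimal sketch of how such a number can be obtained (the helper name and the NumPy stand-in workload are ours, not part of the benchmark script):

```python
import time

import numpy as np

def throughput_gbs(nbytes: int, seconds: float) -> float:
    # GB/s = bytes processed / elapsed seconds / 1e9
    return nbytes / seconds / 1e9

# Time a reduction over a float64 operand and report the throughput
a = np.random.default_rng(0).random((2_000, 2_000))  # ~32 MB of float64
t0 = time.perf_counter()
a.sum(axis=1)
elapsed = time.perf_counter() - t0
print(f"{throughput_gbs(a.nbytes, elapsed):.2f} GB/s")
```

Because it normalizes by the data volume, this metric stays comparable across the different operand sizes shown in the plots.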
On the other hand, when compression is not used, the performance degrades as
the operand size increases, a sign that the CPU caches are not being used
efficiently. This is because data takes longer to be fetched from (disk)
storage, and the CPU cannot keep up with the data flow.

Finally, here is a plot for a much larger set of datasets (up to 400,000 x 400,000),
where the operands do not fit in memory even when compressed:

.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/reduc-float64-log-amd.png?raw=true
   :width: 100%
   :alt: Performance vs large operand sizes for reductions

In this case, we see that for operand sizes exceeding 2 TB the performance
degrades significantly as well, but it is still quite good, especially when using
disk-based operands. This demonstrates that Blosc2 is able to load data from disk
more efficiently than the swap subsystem of the operating system.

You can find the script for these benchmarks at:

https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/jit-reduc-sizes.py

All in all, thanks to compression, a partitioning fine-tuned to leverage modern
CPU caches, and efficient I/O that overlaps with computation, Blosc2 makes it
possible to perform calculations on data that is too large to fit in memory, and
that can be stored in memory, on disk or `on the network <https://github.com/ironArray/Caterva2>`_.