@@ -207,7 +207,7 @@ Here, the performance compared to Dask is pretty competitive. Note that, when th
 is compressed (lower plot), the memory consumption is much lower than Dask, and kept constant
 during the computation, which is a testament to the smart use of CPU caches and memory by the
 Blosc2 engine --for example, the CPU used in the experiment has 128 MB of L3, which is very
-close to the amount of memory used by Blosc2. This is a very important point, because
+close to the amount of memory used by Blosc2. This is an important point, because
 fitting the working set in memory is not enough; you also need to
 `use caches and memory efficiently <https://purplesyringa.moe/blog/the-ram-myth>`_
 to get the best performance.
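+
+To make this more concrete, here is a minimal sketch of the kind of computation
+being benchmarked. It assumes the python-blosc2 3.x NDArray API (``blosc2.full``
+constructors, lazy expressions, and reductions on them); the shapes, fill values
+and defaults are purely illustrative:
+
+.. code-block:: python
+
+    import numpy as np
+    import blosc2
+
+    # Compressed in-memory operands: data is held as compressed chunks and
+    # only decompressed block by block while the expression is evaluated.
+    a = blosc2.full((20_000, 20_000), 1.5, dtype=np.float64)
+    b = blosc2.full((20_000, 20_000), 2.5, dtype=np.float64)
+
+    # Building the expression is lazy; no large temporaries are materialized.
+    expr = (a - b) * 2
+
+    # The reduction is evaluated chunk by chunk, so the decompressed working
+    # set stays small enough to live in the CPU caches.
+    total = expr.sum()
+    print(total)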
@@ -268,28 +268,32 @@ useful metric when dealing with large datasets. The performance is quite
 good and, when compression is used, it is kept constant for all operand sizes,
 which is a sign that Blosc2 is using the CPU caches (and memory) efficiently.
 
-On the other hand, when compression is not used the performance degrades as
+On the other hand, when compression is not used, the performance degrades as
 the operand size increases, which is a sign that the CPU caches are not being
 used efficiently. This is because data needs more time to be fetched from
-(disk) storage, and the CPU is not able to keep up with the data flow.
+(slower disk) storage, and the CPU is not able to keep up with the data flow.
 
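+
+For reference, the uncompressed baseline in these plots can be reproduced by simply
+disabling compression on the operands. This is only a sketch: it assumes the
+constructors accept a ``cparams`` dict and that ``clevel=0`` turns compression off,
+as it traditionally does in Blosc2:
+
+.. code-block:: python
+
+    import numpy as np
+    import blosc2
+
+    # Same kind of operand as before, but stored uncompressed (clevel=0),
+    # so chunks travel through memory and storage in raw form.
+    a_plain = blosc2.full((20_000, 20_000), 1.5, dtype=np.float64,
+                          cparams={"clevel": 0})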
-Finally, here is a plot for a much larger set of datasets (up to 400,000 x 400,000),
-where the operands do not fit in memory even when compressed:
+Finally, here is a plot for a much larger set of datasets (up to
+400,000 x 400,000, or 2.3 TB), where the operands do not fit in memory, even
+when compressed:
 
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/reduc-float64-log-amd.png?raw=true
    :width: 100%
    :alt: Performance vs large operand sizes for reductions
 
-In this case, we see that for operand sizes exceeding 2 TB, the performance
+In this case, we see that for operand sizes exceeding ~1 TB, the performance
 degrades significantly as well, but it is still quite good, especially when using
-disk-based operands. This demonstrates that Blosc2 is able to load data from disk
-more efficiently than the swap subsystem of the operating system.
+disk-based operands. This demonstrates how Blosc2 is able to load data from disk
+more efficiently than the swap subsystem of the operating system; it can do so
+because it fetches data from disk while it is computing, overlapping I/O with
+computation.
 
 You can find the script for these benchmarks at:
 
 https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/jit-reduc-sizes.py
 
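+
+For the out-of-core case, a hedged sketch follows: it assumes the 3.x constructors
+accept ``urlpath`` and ``mode`` for persistent storage, and that the ``blosc2.jit``
+decorator (used by the benchmark script above) can wrap a plain NumPy-style
+function; the shapes here are illustrative and far smaller than in the plots:
+
+.. code-block:: python
+
+    import numpy as np
+    import blosc2
+
+    # Persist the operands on disk as compressed NDArray containers.
+    blosc2.full((100_000, 10_000), 0.5, dtype=np.float64,
+                urlpath="a.b2nd", mode="w")
+    blosc2.full((100_000, 10_000), 0.25, dtype=np.float64,
+                urlpath="b.b2nd", mode="w")
+
+    # Re-open them lazily: chunks stay on disk until they are actually needed.
+    a = blosc2.open("a.b2nd")
+    b = blosc2.open("b.b2nd")
+
+    @blosc2.jit
+    def reduc(x, y):
+        # NumPy semantics, executed by the Blosc2 compute engine, which streams
+        # compressed chunks from disk and overlaps that I/O with computation.
+        return np.sum(x + 2 * y, axis=1)
+
+    out = reduc(a, b)
+    print(out.shape)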
-All in all, thanks to compression and a fine-tuned partitioning for leveraging modern
-CPU caches and efficient I/O that overlaps computation, Blosc2 allows to perform
-calculations on data that is too large to fit in memory, and that can be stored in
-memory, on disk or `on the network <https://github.com/ironArray/Caterva2 >`_.
+All in all, thanks to compression, a fine-tuned partitioning for leveraging modern
+CPU caches, and efficient I/O that overlaps with computation, the Blosc2 compute
+engine makes it possible to perform calculations on data that is too large to fit in
+memory, and that can be stored in memory, on disk, or
+`on the network <https://github.com/ironArray/Caterva2>`_.