Commit a10b8a6 (1 parent: 62d7c24)

Introduce the new computing engine in main README

File tree: 3 files changed (+78, -63 lines)

README.rst

Lines changed: 78 additions & 63 deletions
@@ -31,78 +31,30 @@ What it is
 both the C-Blosc1 API and its in-memory format. Python-Blosc2 is a Python package
 that wraps C-Blosc2, the newest version of the Blosc compressor.
 
-Currently Python-Blosc2 already reproduces the API of
-`Python-Blosc <https://github.com/Blosc/python-blosc>`_, so it can be
-used as a drop-in replacement. However, there are a `few exceptions
-for a full compatibility.
-<https://github.com/Blosc/python-blosc2/blob/main/RELEASE_NOTES.md#changes-from-python-blosc-to-python-blosc2>`_
+Starting with version 3.0.0, Python-Blosc2 includes a powerful computing engine
+that can operate on compressed data, whether it lives in memory, on disk or on the network.
+This engine also supports advanced features like reductions, filters, user-defined functions
+and broadcasting (still in beta). You can read our tutorials on how to use this new feature at:
+https://github.com/Blosc/python-blosc2/blob/main/doc/getting_started/tutorials/03.lazyarray-expressions.ipynb and
+https://github.com/Blosc/python-blosc2/blob/main/doc/getting_started/tutorials/03.lazyarray-udf.ipynb
 
 In addition, Python-Blosc2 aims to leverage the full C-Blosc2 functionality to support
 super-chunks (`SChunk <https://www.blosc.org/python-blosc2/reference/schunk_api.html>`_),
 multi-dimensional arrays
 (`NDArray <https://www.blosc.org/python-blosc2/reference/ndarray_api.html>`_),
 metadata, serialization and other bells and whistles introduced in C-Blosc2.
 
-**Note:** Python-Blosc2 is meant to be backward compatible with Python-Blosc data.
-That means that it can read data generated with Python-Blosc, but the opposite
+**Note:** Blosc2 is meant to be backward compatible with Blosc(1) data.
+That means that it can read data generated with Blosc, but the opposite
 is not true (i.e. there is no *forward* compatibility).
 
-SChunk: a 64-bit compressed store
-=================================
-
-A `SChunk <https://www.blosc.org/python-blosc2/reference/schunk_api.html>`_ is a simple data
-container that handles setting, expanding and getting
-data and metadata. Contrarily to chunks, a super-chunk can update and resize the data
-that it contains, supports user metadata, and it does not have the 2 GB storage limitation.
-
-Additionally, you can convert a SChunk into a contiguous, serialized buffer (aka
-`cframe <https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst>`_)
-and vice-versa; as a bonus, the serialization/deserialization process also works with NumPy
-arrays and PyTorch/TensorFlow tensors at a blazing speed:
-
-.. |compress| image:: https://github.com/Blosc/python-blosc2/blob/main/images/linspace-compress.png?raw=true
-   :width: 100%
-   :alt: Compression speed for different codecs
-
-.. |decompress| image:: https://github.com/Blosc/python-blosc2/blob/main/images/linspace-decompress.png?raw=true
-   :width: 100%
-   :alt: Decompression speed for different codecs
-
-+----------------+---------------+
-|   |compress|   | |decompress|  |
-+----------------+---------------+
-
-while reaching excellent compression ratios:
-
-.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/pack-array-cratios.png?raw=true
-   :width: 75%
-   :align: center
-   :alt: Compression ratio for different codecs
-
-Also, if you are a Mac M1/M2 owner, make you a favor and use its native arm64 arch (yes, we are
-distributing Mac arm64 wheels too; you are welcome ;-):
-
-.. |pack_arm| image:: https://github.com/Blosc/python-blosc2/blob/main/images/M1-i386-vs-arm64-pack.png?raw=true
-   :width: 100%
-   :alt: Compression speed for different codecs on Apple M1
-
-.. |unpack_arm| image:: https://github.com/Blosc/python-blosc2/blob/main/images/M1-i386-vs-arm64-unpack.png?raw=true
-   :width: 100%
-   :alt: Decompression speed for different codecs on Apple M1
-
-+------------+--------------+
-| |pack_arm| | |unpack_arm| |
-+------------+--------------+
-
-Read more about `SChunk` features in our blog entry at: https://www.blosc.org/posts/python-blosc2-improvements
-
 NDArray: an N-Dimensional store
 ===============================
 
-One of the latest and more exciting additions in Python-Blosc2 is the
+One of the more useful abstractions in Python-Blosc2 is the
 `NDArray <https://www.blosc.org/python-blosc2/reference/ndarray_api.html>`_ object.
 It can write and read n-dimensional datasets in an extremely efficient way thanks
-to a n-dim 2-level partitioning, allowing to slice and dice arbitrary large and
+to an n-dimensional two-level partitioning, which allows you to slice and dice arbitrarily large
 compressed data in a more fine-grained way:
 
 .. image:: https://github.com/Blosc/python-blosc2/blob/main/images/b2nd-2level-parts.png?raw=true
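As a conceptual aside, the two-level (chunk/block) partitioning that the new text describes can be illustrated with plain index arithmetic. This is a sketch only: it does not use blosc2 at all, and the chunk/block sizes are made up for illustration.

```python
# Toy model of a two-level partitioning: a 2-D array is split into
# chunks, and each chunk is split into blocks (the unit that would be
# decompressed). Sizes here are hypothetical.
CHUNK = 8   # chunk side length
BLOCK = 4   # block side length (must divide CHUNK)

def blocks_touched(rows, cols):
    """Set of (chunk, block) coordinates a 2-D slice would decompress."""
    touched = set()
    for r in range(rows.start, rows.stop):
        for c in range(cols.start, cols.stop):
            chunk = (r // CHUNK, c // CHUNK)
            block = ((r % CHUNK) // BLOCK, (c % CHUNK) // BLOCK)
            touched.add((chunk, block))
    return touched

# A small 3x3 corner slice touches a single block of a single chunk,
# so only that block needs decompressing, not the whole array.
print(len(blocks_touched(slice(0, 3), slice(0, 3))))    # 1
# A slice spanning a chunk boundary touches blocks in several chunks.
print(len(blocks_touched(slice(6, 10), slice(6, 10))))  # 4
```

This fine granularity is why a two-level scheme lets slicing stay cheap even on very large compressed arrays.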
@@ -124,6 +76,68 @@ is useful <https://www.youtube.com/watch?v=LvP9zxMGBng>`_:
    :alt: Slicing a dataset in pineapple-style
    :target: https://www.youtube.com/watch?v=LvP9zxMGBng
 
+Operating with NDArrays
+=======================
+
+`NDArray` objects can be operated on very easily from within Python-Blosc2.
+Here is a simple example:
+
+.. code-block:: python
+
+    import numpy as np
+    import blosc2
+
+    N = 10_000
+    na = np.linspace(0, 1, N * N, dtype=np.float32).reshape(N, N)
+    nb = np.linspace(1, 2, N * N).reshape(N, N)
+    nc = np.linspace(-10, 10, N * N).reshape(N, N)
+
+    # Convert to blosc2
+    a = blosc2.asarray(na)
+    b = blosc2.asarray(nb)
+    c = blosc2.asarray(nc)
+
+    # Expression
+    expr = ((a ** 3 + blosc2.sin(c * 2)) < b) & (c > 0)
+
+    # Evaluate and get a NDArray as result
+    out = expr.eval()
+    print(out.info)
+
+As you can see, `NDArray` instances are very similar to NumPy arrays, but behind the scenes
+they hold compressed data that can be operated on very efficiently with the new computing
+engine included in Python-Blosc2.
+
+To whet your appetite, here is the performance (on a MacBook Air M2 with 24 GB of RAM)
+that you can reach when the operands fit comfortably in memory:
+
+.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/eval-expr-full-mem-M2.png?raw=true
+   :width: 100%
+   :alt: Performance when operands fit in-memory
+
+In this case, performance is still a bit behind top-level libraries like Numexpr or Numba,
+but it is quite good nonetheless (and CPUs with more cores than the M2 would likely close
+the performance gap even further).
+
+It is important to note that the `NDArray` object can use memory-mapped files as well; in
+fact, the benchmark above uses a memory-mapped file as the storage for the operands.
+Memory-mapped files are very useful when the operands do not fit in memory, while keeping
+performance very good. Thanks to Jan Sellner for his implementation in Blosc2.
+
+And here is the performance when the operands do not fit comfortably in memory:
+
+.. image:: https://github.com/Blosc/python-blosc2/blob/main/images/eval-expr-scarce-mem-M2.png?raw=true
+   :width: 100%
+   :alt: Performance when operands do not fit in-memory
+
+In the latter case, the memory consumption lines look a bit erratic, but that is because
+the plot shows real memory consumption, not virtual consumption (during the evaluation,
+the OS has to swap some memory out to disk). Here, performance compared with top-level
+libraries like Numexpr or Numba is very competitive.
+
+You can find the benchmark for the above examples at:
+https://github.com/Blosc/python-blosc2/blob/main/bench/ndarray/lazyarray-expr.ipynb
+
 Installing
 ==========
 
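Stepping back from the diff for a moment: the out-of-core evaluation described in the new section boils down to computing an expression chunk by chunk, so that only one chunk of each operand is materialized (e.g. decompressed) at a time. Below is a minimal NumPy-only sketch of that idea; it is not the actual Blosc2 engine, and `chunk_rows` is a made-up knob.

```python
import numpy as np

def chunked_eval(expr_fn, operands, chunk_rows):
    """Evaluate expr_fn over row-wise chunks of the operands, keeping the
    peak working set to chunk_rows rows per operand."""
    n = operands[0].shape[0]
    parts = []
    for start in range(0, n, chunk_rows):
        sl = slice(start, start + chunk_rows)
        parts.append(expr_fn(*(op[sl] for op in operands)))
    return np.concatenate(parts)

N = 400
a = np.linspace(0, 1, N * N).reshape(N, N)
b = np.linspace(1, 2, N * N).reshape(N, N)

# Chunked evaluation gives the same result as evaluating in one shot.
res = chunked_eval(lambda x, y: (x ** 2 + np.sin(y)) < y, (a, b), chunk_rows=50)
assert np.array_equal(res, (a ** 2 + np.sin(b)) < b)
```

With compressed operands, each chunk would additionally be decompressed on the way in and the result compressed on the way out, which is what makes operating on larger-than-RAM data practical.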
@@ -198,10 +212,11 @@ [email protected]
 
 https://groups.google.es/group/blosc
 
-Twitter
-=======
+Mastodon
+========
 
-Please follow `@Blosc2 <https://twitter.com/Blosc2>`_ to get informed about the latest developments.
+Please follow `@Blosc2 <https://fosstodon.org/@Blosc2>`_ to get informed about the latest
+developments. We have recently moved from Twitter to Mastodon.
 
 Citing Blosc
 ============
@@ -213,11 +228,11 @@ You can cite our work on the different libraries under the Blosc umbrella as:
   @ONLINE{blosc,
     author = {{Blosc Development Team}},
     title = "{A fast, compressed and persistent data store library}",
-    year = {2009-2023},
+    year = {2009-2024},
     note = {https://blosc.org}
   }
 
 
 ----
 
-**Enjoy!**
+**Make compression better!**
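One more aside on the memory-mapped operands mentioned in the new README text: the underlying principle can be demonstrated with NumPy's `memmap`, which pages data in from disk on demand instead of loading the whole array into RAM. This is plain NumPy, not the Blosc2 memory-mapping implementation credited to Jan Sellner, and the file path is a throwaway temporary.

```python
import os
import tempfile
import numpy as np

# Create a small on-disk array backed by a memory-mapped file; pages are
# loaded lazily, so RAM holds only the parts of the array actually touched.
path = os.path.join(tempfile.mkdtemp(), "operand.dat")
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=(1000, 1000))
mm[:] = 1.0
mm.flush()

# Reopen read-only and reduce over a slice: only the touched pages are read.
ro = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 1000))
print(float(ro[:10, :10].sum()))  # 100.0
```

Blosc2 adds compression on top of this scheme, so the mapped file is also much smaller than the logical array it stores.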

images/eval-expr-full-mem-M2.png (46.9 KB)

images/eval-expr-scarce-mem-M2.png (58.4 KB)
