
Commit 6cd09d0

Add slicing and beyond tutorial
1 parent b0b8d65 commit 6cd09d0

File tree

1 file changed: +312 -0 lines changed


examples/slicing_and_beyond.ipynb

Lines changed: 312 additions & 0 deletions
@@ -0,0 +1,312 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Slicing chunks and beyond\n",
"\n",
"The newest way to store data in python-blosc2 is through a SChunk (super-chunk) object, where the data is split into chunks of the same size. Until now, the only way of working with it was chunk by chunk (see tutorials-basics.ipynb), but python-blosc2 can now retrieve, update or append data all at once, avoiding the chunk-by-chunk loop. To see how this works, let's first create our SChunk."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import blosc2\n",
"import numpy as np\n",
"\n",
"nchunks = 10\n",
"data = np.arange(200 * 1000 * nchunks, dtype=np.int32)\n",
"cparams = {\"typesize\": 4}\n",
"schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is important to set the `typesize` correctly, since these methods work with items rather than with bytes.\n",
"\n",
"## Getting data in a SChunk\n",
"\n",
"Let's begin by retrieving the data from the whole SChunk. We could do it chunk by chunk with the `decompress_chunk` method:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"out = np.empty(200 * 1000 * nchunks, dtype=np.int32)\n",
"for i in range(nchunks):\n",
"    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But instead of the code above, we can simply use the `__getitem__` or `get_slice` methods. Let's begin with `__getitem__`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'\\x00\\x00\\x00\\x00'\n"
]
}
],
"source": [
"out_slice = schunk[:]\n",
"print(out_slice[:4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the data is returned as a bytestring. To visualize the data better, we can use `get_slice`: any Python object supporting the Buffer Protocol (e.g. a NumPy array) can be passed as the `out` param, and it will be filled with the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 1 2 3]\n"
]
}
],
"source": [
"out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)\n",
"schunk.get_slice(out=out_slice)\n",
"assert np.array_equal(out, out_slice)\n",
"print(out_slice[:4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's much better!\n",
"\n",
"## Setting data in a SChunk\n",
"\n",
"We can also overwrite an area of the SChunk with any Python object supporting the Buffer Protocol. Let's see a quick example:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"start = 34\n",
"stop = 1000 * 200 * 4\n",
"new_value = np.ones(stop - start, dtype=np.int32)\n",
"schunk[start:stop] = new_value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we are able to get or set data all at once. But what if we would like to append data? That can also be done with `__setitem__`: this method can update and append data at the same time. To append, just let `stop` exceed the current number of items; it will become the new SChunk nitems:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"schunk_nelems = 1000 * 200 * nchunks\n",
"\n",
"new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)\n",
"start = schunk_nelems - 123\n",
"new_nitems = start + new_value.size\n",
"schunk[start:new_nitems] = new_value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting a SChunk from/as a contiguous buffer\n",
"\n",
"Furthermore, you can convert a SChunk into a contiguous, serialized buffer and vice versa. Let's get that buffer:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "b'\\x9e\\xa8b2f'"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"buf = schunk.to_cframe()\n",
"buf[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now the other way around:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In this case we set the `copy` param to `True`. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it alive for as long as you use the SChunk.\n",
"\n",
"## Compressing NumPy arrays\n",
"\n",
"If the object you want to get as a compressed buffer is a NumPy array, you can use the newer and faster functions to store it in-memory or on-disk.\n",
"\n",
"### In-memory\n",
"\n",
"To store it in-memory you can use `pack_array2`. Compared with its former version (`pack_array`), it is faster (see the `pack_compress.py` bench) and does not have the 2 GB size limitation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"np_array = np.arange(2**30, dtype=np.int32)\n",
"\n",
"packed_arr2 = blosc2.pack_array2(np_array)\n",
"unpacked_arr2 = blosc2.unpack_array2(packed_arr2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### On-disk\n",
"\n",
"To do the same but with the buffer stored on-disk, use `save_array` and `load_array` like so:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"blosc2.save_array(np_array, urlpath=\"ondisk_array.b2frame\", mode=\"w\")\n",
"np_array2 = blosc2.load_array(\"ondisk_array.b2frame\")\n",
"assert np.array_equal(np_array, np_array2)\n",
"\n",
"# Remove it\n",
"blosc2.remove_urlpath(\"ondisk_array.b2frame\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Conclusions\n",
"\n",
"python-blosc2 now has an easy way of creating, getting, setting, deleting and expanding data in a SChunk. Moreover, you can get a contiguous compressed representation (aka [cframe](https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst)) of it and recreate the SChunk from it later. And you can do the same with NumPy arrays faster than with the former functions.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
