
Commit 6cd09d0

Add slicing and beyond tutorial
1 parent b0b8d65 commit 6cd09d0

File tree

1 file changed: +312 -0 lines changed


examples/slicing_and_beyond.ipynb

Lines changed: 312 additions & 0 deletions
@@ -0,0 +1,312 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Slicing chunks and beyond\n",
"\n",
"The newest way to store data in python-blosc2 is through a SChunk (super-chunk) object, where the data is split into chunks of the same size. Until now, the only way of working with it was chunk by chunk (see tutorials-basics.ipynb), but python-blosc2 can now retrieve, update or append data all at once, avoiding the chunk-by-chunk loop. To see how this works, let's first create our SChunk."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import blosc2\n",
"import numpy as np\n",
"\n",
"nchunks = 10\n",
"data = np.arange(200 * 1000 * nchunks, dtype=np.int32)\n",
"cparams = {\"typesize\": 4}\n",
"schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is important to set the `typesize` correctly, since these methods work with items rather than with bytes.\n",
"\n",
"## Getting data in a SChunk\n",
"\n",
"Let's begin by retrieving the data from the whole SChunk. We could do it chunk by chunk with the `decompress_chunk` method:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"out = np.empty(200 * 1000 * nchunks, dtype=np.int32)\n",
"for i in range(nchunks):\n",
"    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"But instead of the code above, we can simply use the `__getitem__` or `get_slice` methods. Let's begin with `__getitem__`:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"b'\\x00\\x00\\x00\\x00'\n"
]
}
],
"source": [
"out_slice = schunk[:]\n",
"print(out_slice[:4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the data is returned as a bytestring. To visualize the data better, we can use `get_slice`: any Python object supporting the Buffer Protocol (e.g. a NumPy array) can be passed as the `out` param, and it will be filled with the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[0 1 2 3]\n"
]
}
],
"source": [
"out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)\n",
"schunk.get_slice(out=out_slice)\n",
"assert np.array_equal(out, out_slice)\n",
"print(out_slice[:4])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's much better!\n",
"\n",
"## Setting data in a SChunk\n",
"\n",
"We can also overwrite an area of the SChunk with any Python object supporting the Buffer Protocol. Let's see a quick example:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"start = 34\n",
"stop = 1000 * 200 * 4\n",
"new_value = np.ones(stop - start, dtype=np.int32)\n",
"schunk[start:stop] = new_value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So now we are able to get or set data all at once. But what if we would like to append data? That can also be done with `__setitem__`: this method can update and append data at the same time. To append, just let `stop` exceed the current number of items; it will become the new SChunk nitems:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"schunk_nelems = 1000 * 200 * nchunks\n",
"\n",
"new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)\n",
"start = schunk_nelems - 123\n",
"new_nitems = start + new_value.size\n",
"schunk[start:new_nitems] = new_value"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting a SChunk from/as a contiguous buffer\n",
"\n",
"Furthermore, you can convert a SChunk into a contiguous, serialized buffer and vice versa. Let's get that buffer:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": "b'\\x9e\\xa8b2f'"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"buf = schunk.to_cframe()\n",
"buf[:5]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now the other way around:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"In this case we set the `copy` param to `True`. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it alive for as long as you use the SChunk.\n",
"\n",
"## Compressing NumPy arrays\n",
"\n",
"If the object you want to get as a compressed buffer is a NumPy array, you can use the newer and faster functions to store it in-memory or on-disk.\n",
"\n",
"### In-memory\n",
"\n",
"To store it in-memory you can use `pack_array2`. Compared with its former version (`pack_array`), it is faster (see the `pack_compress.py` bench) and does not have the 2 GB size limitation."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"np_array = np.arange(2**30, dtype=np.int32)\n",
"\n",
"packed_arr2 = blosc2.pack_array2(np_array)\n",
"unpacked_arr2 = blosc2.unpack_array2(packed_arr2)"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### On-disk\n",
"\n",
"To do the same but with the buffer stored on-disk, use `save_array` and `load_array` like so:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"blosc2.save_array(np_array, urlpath=\"ondisk_array.b2frame\", mode=\"w\")\n",
"np_array2 = blosc2.load_array(\"ondisk_array.b2frame\")\n",
"assert np.array_equal(np_array, np_array2)\n",
"\n",
"# Remove it\n",
"blosc2.remove_urlpath(\"ondisk_array.b2frame\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"# Conclusions\n",
"\n",
"python-blosc2 now has an easy way of creating, getting, setting, deleting and expanding data in a SChunk. Moreover, you can get a contiguous compressed representation (aka [cframe](https://github.com/Blosc/c-blosc2/blob/main/README_CFRAME_FORMAT.rst)) of it and recreate the SChunk from it later. And you can do the same with NumPy arrays faster than with the former functions.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
