sharding

negin513 · negin513 · commit a12435ee9330 · 2025-07-04T14:53:24.000-06:00
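
For a sense of scale, the shapes used in the new sharding cell can be compared with a little arithmetic. This is a minimal sketch and not part of the notebook: `grid_size` is a hypothetical helper, and the sketch assumes one stored object per chunk without sharding and one per shard with it, as the notebook text describes.

```python
from math import ceil, prod


def grid_size(shape, block):
    """Number of blocks of shape `block` needed to tile an array of shape `shape`."""
    return prod(ceil(s / b) for s, b in zip(shape, block))


# Shapes copied from the sharding cell in this diff
shape = (10000, 10000, 1000)
chunks = (100, 100, 100)
shards = (1000, 1000, 1000)

n_chunks = grid_size(shape, chunks)           # stored objects without sharding
n_shards = grid_size(shape, shards)           # stored objects with sharding
chunks_per_shard = grid_size(shards, chunks)  # chunks packed into each shard

print(n_chunks, n_shards, chunks_per_shard)
```

With these shapes the array has 100,000 chunks but only 100 shards, each holding 1,000 chunks.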
diff --git a/intermediate/intro-to-zarr.ipynb b/intermediate/intro-to-zarr.ipynb
@@ -516,13 +516,147 @@
    "id": "a63ebdd7",
    "metadata": {},
    "source": [
-    "### How to Examine and Modify the Chunk Shape\n",
+    "#### Chunking\n",
+    "Chunking is the process of dividing the data arrays into smaller pieces. This allows for parallel processing and efficient storage.\n",
     "\n",
-    "If your data is sufficiently large, Zarr will chose a chunksize for you.\n",
+    "The chunk shape is one of the most important parameters of a Zarr array: it sets how many elements along each dimension are stored together in one chunk, and therefore how much data every read or write has to touch.\n",
     "\n",
+    "To examine the chunk shape of a Zarr array, you can use the `chunks` attribute. This will show you the size of each chunk in each dimension."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "cd5e7ec0",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(10, 10)"
+      ]
+     },
+     "execution_count": 44,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "z.chunks"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a1f62ab3",
+   "metadata": {},
+   "source": [
+    "When selecting a chunk shape, keep two properties of chunks in mind:\n",
+    "\n",
+    "- Concurrent writes are possible as long as different processes write to separate chunks, which enables highly parallel data writing.\n",
+    "- When reading data, if any part of a chunk is needed, the entire chunk has to be loaded.\n",
+    "\n",
+    "The optimal chunk shape depends on how you want to access the data. For example, if you only ever read slices that run along the first dimension of a 2-dimensional array (`a[:, j]`), make each chunk span the whole first dimension and split the array along the second.\n",
+    "\n",
+    "Here we compare two chunking strategies for the same read, `[:, 0, 0]`, on a 200 x 200 x 200 array.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "b7929741",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# One chunk per index along the first axis, so c[:, 0, 0] touches all 200 chunks\n",
+    "c = zarr.create(shape=(200, 200, 200), chunks=(1, 200, 200), dtype='f8', store='c.zarr')\n",
+    "c[:] = np.random.randn(*c.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "68d6d671",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 112 ms, sys: 55.4 ms, total: 167 ms\n",
+      "Wall time: 67.5 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%time _ = c[:, 0, 0]\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "9ad7e371",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Each chunk spans the full first two axes, so d[:, 0, 0] lies in a single chunk\n",
+    "d = zarr.create(shape=(200, 200, 200), chunks=(200, 200, 1), dtype='f8', store='d.zarr')\n",
+    "d[:] = np.random.randn(*d.shape)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 49,
+   "id": "51094774",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "CPU times: user 1.63 ms, sys: 1.3 ms, total: 2.93 ms\n",
+      "Wall time: 2.14 ms\n"
+     ]
+    }
+   ],
+   "source": [
+    "%time _ = d[:, 0, 0]\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3fa2b41a",
+   "metadata": {},
+   "source": [
+    "#### Sharding\n",
+    "When working with large arrays and small chunks, Zarr’s sharding feature can improve storage efficiency and performance. Instead of writing each chunk to a separate file—which can overwhelm file systems and cloud object stores—sharding groups multiple chunks into a single storage object.\n",
+    "\n",
+    "Why Use Sharding?\n",
+    "\n",
+    "- File systems struggle with too many small files.\n",
+    "- Small files (e.g., 1 MB or less) may waste space due to filesystem block size.\n",
+    "- Object storage systems (e.g., S3) can slow down with a high number of objects.\n",
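+    "\n",
+    "With sharding, you choose two shapes: `shards`, the shape of each stored object, and `chunks`, the shape of the pieces inside each shard, which can still be read individually. The example below sets both.\n",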
     "\n"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4fec37ba",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import zarr\n",
+    "\n",
+    "z6 = zarr.create_array(\n",
+    "    store={},\n",
+    "    shape=(10000, 10000, 1000),\n",
+    "    chunks=(100, 100, 100),     # shape of each chunk inside a shard\n",
+    "    shards=(1000, 1000, 1000),  # shape of each stored object (10 x 10 x 10 chunks)\n",
+    "    dtype='uint8'\n",
+    ")\n",
+    "\n",
+    "z6.info"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "1e0d1a8e",
@@ -636,12 +770,6 @@
     "So far we have only been dealing in single array Zarr data stores. In this next example, we will create a zarr store with multiple arrays and then consolidate metadata. The speed up is significant when dealing in remote storage options, which we will see in the following example on accessing cloud storage."
    ]
   },
-  {
-   "cell_type": "markdown",
-   "id": "cdb3f822",
-   "metadata": {},
-   "source": []
-  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -748,7 +876,7 @@
    "id": "c6454acf",
    "metadata": {},
    "source": [
-    "## Object Storage as a Zarr Store\n",
+    "### Object Storage as a Zarr Store\n",
     "\n",
     "Zarr’s layout (many files/chunks per array) maps perfectly onto object storage, such as Amazon S3, Google Cloud Storage, or Azure Blob Storage. Each chunk is stored as a separate object, enabling distributed reads/writes.\n",
     "\n"
@@ -2294,12 +2422,6 @@
     "- [Cloud Optimized Geospatial Formats](https://guide.cloudnativegeo.org/zarr/zarr-in-practice.html)\n",
     "- [Scalable and Computationally Reproducible Approaches to Arctic Research](https://learning.nceas.ucsb.edu/2025-04-arctic/sections/zarr.html)\n"
    ]
-  },
-  {
-   "cell_type": "markdown",
-   "id": "5ed9088a",
-   "metadata": {},
-   "source": []
   }
  ],
  "metadata": {