Commit 1674e45

minor updates to zarr tutorial

1 parent 5856441 commit 1674e45

File tree

1 file changed

+53 -25 lines changed

intermediate/intro-to-zarr.ipynb

Lines changed: 53 additions & 25 deletions
@@ -9,14 +9,14 @@
 "\n",
 "## Learning Objectives:\n",
 "\n",
-"- Understand the principles of the Zarr file format\n",
-"- Learn how to read and write Zarr files using the `zarr-python` library\n",
-"- Explore how to use Zarr files with `xarray` for data analysis and visualization\n",
+"- Understand the principles of the Zarr data format\n",
+"- Learn how to read and write Zarr stores using the `zarr-python` library\n",
+"- Explore how to use Zarr stores with `xarray` for data analysis and visualization\n",
 "\n",
 "This notebook provides a brief introduction to Zarr and how to\n",
 "use it in cloud environments for scalable, chunked, and compressed data storage.\n",
 "\n",
-"Zarr is a file format with implementations in different languages. In this tutorial, we will look at an example of how to use the Zarr format by looking at some features of the `zarr-python` library and how Zarr files can be opened with `xarray`.\n",
+"Zarr is a data format with implementations in different languages. In this tutorial, we will explore some features of the `zarr-python` library and see how Zarr stores can be opened with `xarray`.\n",
 "\n",
 "## What is Zarr?\n",
 "\n",
@@ -25,17 +25,16 @@
 "### Zarr Data Organization:\n",
 "- **Arrays**: N-dimensional arrays that can be chunked and compressed.\n",
 "- **Groups**: A container for organizing multiple arrays and other groups with a hierarchical structure.\n",
-"- **Metadata**: JSON-like metadata describing the arrays and groups, including information about dimensions, data types, groups, and compression.\n",
+"- **Metadata**: JSON-like metadata describing the arrays and groups, including information about data types, dimensions, chunking, compression, and user-defined key-value fields.\n",
 "- **Dimensions and Shape**: Arrays can have any number of dimensions, and their shape is defined by the number of elements in each dimension.\n",
 "- **Coordinates & Indexing**: Zarr supports coordinate arrays for each dimension, allowing for efficient indexing and slicing.\n",
 "\n",
-"The diagram below from [the NASA Earthdata wiki](https://wiki.earthdata.nasa.gov/display/ESO/Zarr+Format) showing the structure of a Zarr store:\n",
+"The diagram below from [the Zarr v3 specification](https://zarr-specs.readthedocs.io/en/latest/) shows the structure of a Zarr store:\n",
 "\n",
-"![EarthData](https://learning.nceas.ucsb.edu/2025-04-arctic/images/zarr-chunks.png)\n",
+"![ZarrSpec](https://zarr-specs.readthedocs.io/en/latest/_images/terminology-hierarchy.excalidraw.png)\n",
 "\n",
 "\n",
-"NetCDF and Zarr share similar terminology and functionality, but the key difference is that NetCDF is a single file, while Zarr is a directory-based “store” composed of many chunked files—making it better suited for distributed and cloud-based workflows.\n",
-"\n"
+"NetCDF and Zarr share similar terminology and functionality, but the key difference is that NetCDF is a single file, while Zarr is a directory-based “store” composed of many chunked files, making it better suited for distributed and cloud-based workflows."
 ]
 },
 {
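To make the "directory of chunked files" point concrete, here is a small sketch (not part of the commit) that creates a chunked array on disk and lists the files in the resulting store; with zarr-python 3 you should see a `zarr.json` metadata document plus one object per written chunk (under the `c/` prefix in the default v3 layout). The `example.zarr` name is illustrative:

```python
import os

import numpy as np
import zarr

# create a chunked array backed by a local directory store
z = zarr.create_array(store='example.zarr', shape=(40, 50), chunks=(10, 10), dtype='f8')
z[:] = np.random.randn(40, 50)  # writing data materializes the chunk files

# walk the store: JSON metadata plus one file per 10x10 chunk
for dirpath, _, filenames in os.walk('example.zarr'):
    for name in filenames:
        print(os.path.join(dirpath, name))
```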
@@ -69,7 +68,7 @@
 "source": [
 "import zarr\n",
 "\n",
-"z = zarr.create(shape=(40, 50), chunks=(10, 10), dtype='f8', store='test.zarr', mode='w')\n",
+"z = zarr.create_array(shape=(40, 50), chunks=(10, 10), dtype='f8', store='test.zarr')\n",
 "z"
 ]
 },
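(`zarr.create` is the zarr-python 2-era entry point; `zarr.create_array` is its replacement in zarr-python 3, which this commit tracks.) A hedged round-trip check on the store created above, assuming it runs after that cell:

```python
import numpy as np
import zarr

# fill the array, then reopen the store from disk to confirm persistence
z[:] = np.arange(40 * 50, dtype='f8').reshape(40, 50)
z2 = zarr.open_array(store='test.zarr')

print(z2.shape, z2.chunks)  # (40, 50) (10, 10)
print(z2[0, :5])            # first five values of the first row
```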
@@ -228,11 +227,13 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"root = zarr.group()\n",
+"store = zarr.storage.MemoryStore()\n",
+"root = zarr.create_group(store=store)\n",
 "temp = root.create_group('temp')\n",
 "precip = root.create_group('precip')\n",
 "t2m = temp.create_array('t2m', shape=(100, 100), chunks=(10, 10), dtype='i4')\n",
-"prcp = precip.create_array('prcp', shape=(1000, 1000), chunks=(10, 10), dtype='i4')"
+"prcp = precip.create_array('prcp', shape=(1000, 1000), chunks=(10, 10), dtype='i4')\n",
+"root.tree()"
 ]
 },
 {
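The hierarchy above is also where the "user-defined key-value fields" mentioned in the metadata bullet come in; a small sketch (assuming the `temp` and `t2m` objects from the cell above) of attaching attributes:

```python
# user-defined key-value metadata lives in .attrs on groups and arrays;
# it is persisted alongside the structural metadata in the store
temp.attrs['description'] = 'near-surface temperature fields'
t2m.attrs['units'] = 'K'

print(dict(temp.attrs))  # {'description': 'near-surface temperature fields'}
print(dict(t2m.attrs))   # {'units': 'K'}
```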
@@ -251,7 +252,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"root['temp']\n",
+"display(root['temp'])\n",
 "root['temp/t2m'][:, 3]"
 ]
 },
@@ -281,7 +282,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"root.tree(expand=True)"
+"root.tree()"
 ]
 },
 {
@@ -290,7 +291,7 @@
 "metadata": {},
 "source": [
 "#### Chunking\n",
-"Chunking is the process of dividing the data arrays into smaller pieces. This allows for parallel processing and efficient storage.\n",
+"Chunking is the process of dividing Zarr arrays into smaller pieces. This allows for parallel processing and efficient storage.\n",
 "\n",
 "One of the important parameters in Zarr is the chunk shape, which determines how the data is divided into smaller, manageable pieces. This is crucial for performance, especially when working with large datasets.\n",
 "\n",
@@ -329,7 +330,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"c = zarr.create(shape=(200, 200, 200), chunks=(1, 200, 200), dtype='f8', store='c.zarr')\n",
+"c = zarr.create_array(shape=(200, 200, 200), chunks=(1, 200, 200), dtype='f8', store='c.zarr')\n",
 "c[:] = np.random.randn(*c.shape)"
 ]
 },
@@ -350,7 +351,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"d = zarr.create(shape=(200, 200, 200), chunks=(200, 200, 1), dtype='f8', store='d.zarr')\n",
+"d = zarr.create_array(shape=(200, 200, 200), chunks=(200, 200, 1), dtype='f8', store='d.zarr')\n",
 "d[:] = np.random.randn(*d.shape)"
 ]
 },
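The two arrays above differ only in chunk shape, so a quick timing sketch shows why the choice matters: slicing `c` along its first axis reads a single chunk, while the same slice of `d` must touch all 200 of its chunks. A hedged illustration (assumes the `c` and `d` arrays from the cells above; absolute timings vary by machine):

```python
import time

def timed(label, fn):
    # crude wall-clock timing; good enough to show the asymmetry
    t0 = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - t0:.3f} s')

timed('c[0, :, :] (chunk-aligned)', lambda: c[0, :, :])  # reads 1 chunk
timed('d[0, :, :] (cross-chunk)', lambda: d[0, :, :])    # reads 200 chunks
```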
@@ -377,8 +378,12 @@
 "- File systems struggle with too many small files.\n",
 "- Small files (e.g., 1 MB or less) may waste space due to filesystem block size.\n",
 "- Object storage systems (e.g., S3) can slow down with a high number of objects.\n",
+"\n",
 "With sharding, you choose:\n",
-"\n"
+"- Shard size: the logical shape of each shard, which is expected to include one or more chunks\n",
+"- Chunk size: the shape of each compressed chunk\n",
+"\n",
+"It is important to remember that the shard is the minimum unit of writing: a writer must be able to fit an entire shard, including all of its compressed chunks, in memory before writing it to the store.\n"
 ]
 },
 {
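For readers who want to try this, a minimal sketch of sharding via the `shards=` argument of `zarr.create_array` in zarr-python 3 (the `sharded.zarr` name is illustrative; each 500x500 shard bundles a 5x5 grid of 100x100 chunks and is buffered whole when written):

```python
import numpy as np
import zarr

# one shard = one object in the store, bundling many compressed chunks
s = zarr.create_array(
    store='sharded.zarr',
    shape=(1000, 1000),
    shards=(500, 500),  # shard shape: the minimum unit of writing
    chunks=(100, 100),  # chunk shape: the unit of compression and reading
    dtype='f8',
)
s[:] = np.random.randn(1000, 1000)
print(s.info)  # should report both the chunk shape and the shard shape
```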
@@ -526,13 +531,27 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"from pprint import pprint\n",
+"\n",
 "consolidated = zarr.open_group(store=store)\n",
 "consolidated_metadata = consolidated.metadata.consolidated_metadata.metadata\n",
-"from pprint import pprint\n",
 "\n",
 "pprint(dict(sorted(consolidated_metadata.items())))"
 ]
 },
+{
+"cell_type": "markdown",
+"id": "a571acec-7a65-4a51-ad1e-c80b17494cd3",
+"metadata": {},
+"source": [
+"Note that while Zarr-Python supports consolidated metadata for v2 and v3 formatted Zarr stores, it is not technically part of the specification (hence the warning above).\n",
+"\n",
+"⚠️ Use caution with: ⚠️\n",
+"- **Stale or incomplete consolidated metadata**: If the dataset is updated but the consolidated metadata isn't regenerated, readers may miss chunks or metadata. Always run `zarr.consolidate_metadata()` after changes.\n",
+"- **Concurrent writes or multi-writer pipelines**: Consolidated metadata can lead to inconsistent reads if multiple processes write without coordination. Use with caution in dynamic or shared write environments.\n",
+"- **Local filesystems or mixed toolchains**: On local storage, consolidation offers little benefit, as hierarchy discovery is generally quite cheap.\n"
+]
+},
 {
 "cell_type": "markdown",
 "id": "46",
@@ -575,6 +594,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"import xarray as xr\n",
+"\n",
 "store = 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge/gpcp-feedstock/gpcp.zarr'\n",
 "\n",
 "ds = xr.open_dataset(store, engine='zarr', chunks={}, consolidated=True)\n",
@@ -599,14 +620,13 @@
 "::::{admonition} Exercise\n",
 ":class: tip\n",
 "\n",
-"Can you calculate the mean precipitation over the time dimension in the GPCP dataset and plot it?\n",
+"Can you calculate the mean precipitation for January 2020 in the GPCP dataset and plot it?\n",
 "\n",
 ":::{admonition} Solution\n",
 ":class: dropdown\n",
 "\n",
 "```python\n",
-"ds.precip.mean(dim='time').plot()\n",
-"\n",
+"ds.precip.sel(time=slice('2020-01-01', '2020-01-31')).mean(dim='time').plot()\n",
 "```\n",
 ":::\n",
 "::::"
@@ -628,13 +648,20 @@
 ]
 },
 {
-"cell_type": "markdown",
-"id": "53",
+"cell_type": "code",
+"execution_count": null,
+"id": "09c50842-b522-4f3f-b04a-da22f9131b86",
 "metadata": {},
+"outputs": [],
 "source": []
 }
 ],
 "metadata": {
+"kernelspec": {
+"display_name": "Python 3 (ipykernel)",
+"language": "python",
+"name": "python3"
+},
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
@@ -644,7 +671,8 @@
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython3"
+"pygments_lexer": "ipython3",
+"version": "3.12.11"
 }
 },
 "nbformat": 4,
