diff --git a/myst.yml b/myst.yml index b048b5ed..842a87e3 100644 --- a/myst.yml +++ b/myst.yml @@ -13,6 +13,10 @@ project: - title: Preamble children: - file: notebooks/how-to-cite.md + - title: Xbatcher fundamentals + children: + - file: notebooks/xbatcher_dataloading.ipynb + - file: notebooks/xbatcher_reconstruction.ipynb - title: Testing model inference children: - file: notebooks/inference-testing.ipynb diff --git a/notebooks/xbatcher_dataloading.ipynb b/notebooks/xbatcher_dataloading.ipynb new file mode 100644 index 00000000..3dfeee52 --- /dev/null +++ b/notebooks/xbatcher_dataloading.ipynb @@ -0,0 +1,2033 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Dataloading from Xarray Datasets" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Working with large, multi-dimensional datasets, common in fields like climate science and oceanography, presents a significant challenge when preparing data for machine learning models. The `xbatcher` library is designed to simplify this crucial preprocessing step.\n", + "\n", + "`xbatcher` is a Python package that facilitates the generation of data batches from `xarray` objects for machine learning. It serves as a bridge between the labeled, multi-dimensional data structures of `xarray` and the tensor-based inputs required by deep learning frameworks such as PyTorch and TensorFlow.\n", + "\n", + "This guide provides an introduction to the fundamentals of `xbatcher`. We will cover how to create a `BatchGenerator`, customize it for specific needs, and prepare the resulting data for integration with a PyTorch model." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Imports" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "import xarray as xr\n", + "import numpy as np\n", + "import torch\n", + "import xbatcher\n", + "from xbatcher.loaders.torch import MapDataset, IterableDataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Creating a Sample Dataset\n", + "\n", + "To begin, we will create a sample `xarray.Dataset`. This allows us to focus on the mechanics of `xbatcher` without the overhead of a specific real-world dataset. This sample can be replaced by any `xarray.Dataset` loaded from a file (e.g., NetCDF, Zarr)." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.Dataset> Size: 8MB\n", + "Dimensions: (x: 100, y: 100, time: 50)\n", + "Coordinates:\n", + " * x (x) int64 800B 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99\n", + " * y (y) int64 800B 0 1 2 3 4 5 6 7 8 ... 92 93 94 95 96 97 98 99\n", + " * time (time) int64 400B 0 1 2 3 4 5 6 7 ... 42 43 44 45 46 47 48 49\n", + "Data variables:\n", + " temperature (x, y, time) float64 4MB 0.6357 0.8989 ... 0.1376 0.1089\n", + " precipitation (x, y, time) float64 4MB 0.05915 0.2899 ... 0.0906 0.969
<xarray.Dataset> Size: 81kB\n", + "Dimensions: (x: 10, y: 10, time: 50)\n", + "Coordinates:\n", + " * x (x) int64 80B 0 1 2 3 4 5 6 7 8 9\n", + " * y (y) int64 80B 0 1 2 3 4 5 6 7 8 9\n", + " * time (time) int64 400B 0 1 2 3 4 5 6 7 ... 42 43 44 45 46 47 48 49\n", + "Data variables:\n", + " temperature (x, y, time) float64 40kB 0.6357 0.8989 ... 0.7347 0.4043\n", + " precipitation (x, y, time) float64 40kB 0.05915 0.2899 ... 0.1648 0.06016
<xarray.Dataset> Size: 81kB\n", + "Dimensions: (x: 10, y: 10, time: 50)\n", + "Coordinates:\n", + " * x (x) int64 80B 0 1 2 3 4 5 6 7 8 9\n", + " * y (y) int64 80B 0 1 2 3 4 5 6 7 8 9\n", + " * time (time) int64 400B 0 1 2 3 4 5 6 7 ... 42 43 44 45 46 47 48 49\n", + "Data variables:\n", + " temperature (x, y, time) float64 40kB 0.6357 0.8989 ... 0.7347 0.4043\n", + " precipitation (x, y, time) float64 40kB 0.05915 0.2899 ... 0.1648 0.06016
<xarray.DataArray (x: 50, y: 40)> Size: 8kB\n", + "array([[0.94426095, 0.7027894 , 0.02029528, ..., 0.16328041, 0.5883387 ,\n", + " 0.8879921 ],\n", + " [0.6830533 , 0.8331848 , 0.44004276, ..., 0.6508039 , 0.8455495 ,\n", + " 0.66443324],\n", + " [0.36509654, 0.9623709 , 0.44621307, ..., 0.66530186, 0.31605566,\n", + " 0.9226282 ],\n", + " ...,\n", + " [0.2908776 , 0.3381197 , 0.7494014 , ..., 0.19071114, 0.10994843,\n", + " 0.17150152],\n", + " [0.6378889 , 0.95425236, 0.51718473, ..., 0.52702767, 0.9290716 ,\n", + " 0.819217 ],\n", + " [0.59220934, 0.6537968 , 0.06189981, ..., 0.75576884, 0.0942427 ,\n", + " 0.36704108]], shape=(50, 40), dtype=float32)\n", + "Coordinates:\n", + " * x (x) int64 400B 0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49\n", + " * y (y) int64 320B 0 1 2 3 4 5 6 7 8 9 ... 31 32 33 34 35 36 37 38 39
<xarray.DataArray (x: 100, y: 40)> Size: 32kB\n", + "array([[0.94426095, 0.70278943, 0.02029528, ..., 0.16328041, 0.58833867,\n", + " 0.88799208],\n", + " [0.94426095, 0.70278943, 0.02029528, ..., 0.16328041, 0.58833867,\n", + " 0.88799208],\n", + " [0.68305331, 0.83318478, 0.44004276, ..., 0.65080392, 0.84554952,\n", + " 0.66443324],\n", + " ...,\n", + " [0.63788891, 0.95425236, 0.51718473, ..., 0.52702767, 0.92907161,\n", + " 0.81921703],\n", + " [0.59220934, 0.65379679, 0.06189981, ..., 0.75576884, 0.0942427 ,\n", + " 0.36704108],\n", + " [0.59220934, 0.65379679, 0.06189981, ..., 0.75576884, 0.0942427 ,\n", + " 0.36704108]], shape=(100, 40))\n", + "Coordinates:\n", + " * x (x) float64 800B 0.0 0.5 1.0 1.5 2.0 ... 47.5 48.0 48.5 49.0 49.5\n", + " * y (y) float64 320B 0.0 1.0 2.0 3.0 4.0 ... 35.0 36.0 37.0 38.0 39.0