|
9 | 9 | "In this lesson, we cover the basics of Xarray data structures. Our\n",
|
10 | 10 | "learning goals are as follows. By the end of the lesson, we will be able to:\n",
|
11 | 11 | "\n",
|
| 12 | + ":::{admonition} Learning Goals\n", |
12 | 13 | "- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n",
|
13 |
| - "\n", |
14 |
| - "---\n", |
15 |
| - "\n", |
16 |
| - "## Introduction\n", |
17 |
| - "\n", |
18 |
| - "Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n", |
19 |
| - "are an essential part of computational science. They are encountered in a wide\n", |
20 |
| - "range of fields, including physics, astronomy, geoscience, bioinformatics,\n", |
21 |
| - "engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n", |
22 |
| - "provides the fundamental data structure and API for working with raw ND arrays.\n", |
23 |
| - "However, real-world datasets are usually more than just raw numbers; they have\n", |
24 |
| - "labels which encode information about how the array values map to locations in\n", |
25 |
| - "space, time, etc.\n", |
26 |
| - "\n", |
27 |
| - "Here is an example of how we might structure a dataset for a weather forecast:\n", |
28 |
| - "\n", |
29 |
| - "<img src=\"https://docs.xarray.dev/en/stable/_images/dataset-diagram.png\" align=\"center\" width=\"80%\">\n", |
30 |
| - "\n", |
31 |
| - "You'll notice multiple data variables (temperature, precipitation), coordinate\n", |
32 |
| - "variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n", |
33 |
| - "fit into Xarray's data structures below.\n", |
34 |
| - "\n", |
35 |
| - "Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n", |
36 |
| - "powerful and concise interface. For example:\n", |
37 |
| - "\n", |
38 |
| - "- Apply operations over dimensions by name: `x.sum('time')`.\n", |
39 |
| - "\n", |
40 |
| - "- Select values by label (or logical location) instead of integer location:\n", |
41 |
| - " `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n", |
42 |
| - "\n", |
43 |
| - "- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n", |
44 |
| - " (array broadcasting) based on dimension names, not shape.\n", |
45 |
| - "\n", |
46 |
| - "- Easily use the split-apply-combine paradigm with groupby:\n", |
47 |
| - " `x.groupby('time.dayofyear').mean()`.\n", |
48 |
| - "\n", |
49 |
| - "- Database-like alignment based on coordinate labels that smoothly handles\n", |
50 |
| - " missing values: `x, y = xr.align(x, y, join='outer')`.\n", |
51 |
| - "\n", |
52 |
| - "- Keep track of arbitrary metadata in the form of a Python dictionary:\n", |
53 |
| - " `x.attrs`.\n", |
54 |
| - "\n", |
55 |
| - "The N-dimensional nature of xarray’s data structures makes it suitable for\n", |
56 |
| - "dealing with multi-dimensional scientific data, and its use of dimension names\n", |
57 |
| - "instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n", |
58 |
| - "more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n", |
59 |
| - "track of the order of an array’s dimensions or insert dummy dimensions of size 1\n", |
60 |
| - "to align arrays (e.g., using np.newaxis).\n", |
61 |
| - "\n", |
62 |
| - "The immediate payoff of using xarray is that you’ll write less code. The\n", |
63 |
| - "long-term payoff is that you’ll understand what you were thinking when you come\n", |
64 |
| - "back to look at it weeks or months later.\n" |
| 14 | + "- Customize the display of Xarray objects\n", |
| 15 | + "- Access variables, coordinates, and arbitrary metadata\n", |
| 16 | + "- Transform to tabular Pandas data structures\n", |
| 17 | + ":::" |
65 | 18 | ]
|
66 | 19 | },
|
67 | 20 | {
|
|
72 | 25 | "\n",
|
73 | 26 | "Xarray provides two data structures: the `DataArray` and `Dataset`. The\n",
|
74 | 27 | "`DataArray` class attaches dimension names, coordinates and attributes to\n",
|
75 |
| - "multi-dimensional arrays while `Dataset` combines multiple arrays.\n", |
| 28 | + "multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n", |
76 | 29 | "\n",
|
77 | 30 | "Both classes are most commonly created by reading data.\n",
|
78 |
| - "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n", |
79 |
| - "\n", |
80 |
| - "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
81 |
| - "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
| 31 | + "To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial." |
82 | 32 | ]
|
83 | 33 | },
|
84 | 34 | {
|
|
88 | 38 | "outputs": [],
|
89 | 39 | "source": [
|
90 | 40 | "import numpy as np\n",
|
91 |
| - "import xarray as xr" |
| 41 | + "import xarray as xr\n", |
| 42 | + "import pandas as pd\n", |
| 43 | + "\n", |
| 44 | + "# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking\n", |
| 45 | + "# The following settings reduce the amount of data displayed by default\n", |
| 46 | + "xr.set_options(display_expand_attrs=False, display_expand_data=False)\n", |
| 47 | + "np.set_printoptions(threshold=10, edgeitems=2)" |
92 | 48 | ]
|
93 | 49 | },
|
94 | 50 | {
|
|
97 | 53 | "source": [
|
98 | 54 | "### Dataset\n",
|
99 | 55 | "\n",
|
100 |
| - "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n" |
| 56 | + "`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n", |
| 57 | + "\n", |
| 58 | + "Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n", |
| 59 | + "We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name." |
101 | 60 | ]
|
102 | 61 | },
|
103 | 62 | {
|
|
147 | 106 | "cell_type": "markdown",
|
148 | 107 | "metadata": {},
|
149 | 108 | "source": [
|
150 |
| - "#### What is all this anyway? (String representations)\n", |
| 109 | + "#### HTML vs text representations\n", |
151 | 110 | "\n",
|
152 | 111 | "Xarray has two representation types: `\"html\"` (which is only available in\n",
|
153 | 112 | "notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n",
|
154 | 113 | "\n",
|
155 | 114 | "So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n",
|
156 |
| - "The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n", |
157 |
| - "view attributes and values for each value (right hand sheet icon and data symbol)." |
| 115 | + "The `\"html\"` representation is interactive, allowing you to collapse sections (▶) and\n", |
| 116 | + "view attributes and values for each variable (📄 and ≡)." |
158 | 117 | ]
|
159 | 118 | },
|
160 | 119 | {
|
|
180 | 139 | "- an unordered list of *coordinates* or dimensions with coordinates with one item\n",
|
181 | 140 | " per line. Each item has a name, one or more dimensions in parentheses, a dtype\n",
|
182 | 141 | " and a preview of the values. Also, if it is a dimension coordinate, it will be\n",
|
183 |
| - " marked with a `*`.\n", |
| 142 | + " printed in **bold** font.\n", |
184 | 143 | "- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n",
|
185 | 144 | "- an unordered list of *attributes*, or metadata"
|
186 | 145 | ]
|
|
379 | 338 | "methods on `xarray` objects:\n"
|
380 | 339 | ]
|
381 | 340 | },
|
382 |
| - { |
383 |
| - "cell_type": "code", |
384 |
| - "execution_count": null, |
385 |
| - "metadata": {}, |
386 |
| - "outputs": [], |
387 |
| - "source": [ |
388 |
| - "import pandas as pd" |
389 |
| - ] |
390 |
| - }, |
391 | 341 | {
|
392 | 342 | "cell_type": "code",
|
393 | 343 | "execution_count": null,
|
|
429 | 379 | "cell_type": "markdown",
|
430 | 380 | "metadata": {},
|
431 | 381 | "source": [
|
432 |
| - "**<code>to_series</code>**: This will always convert `DataArray` objects to\n", |
433 |
| - "`pandas.Series`, using a `MultiIndex` for higher dimensions\n" |
| 382 | + "### to_series\n", |
| 383 | + "This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions.\n" |
434 | 384 | ]
|
435 | 385 | },
|
436 | 386 | {
|
|
446 | 396 | "cell_type": "markdown",
|
447 | 397 | "metadata": {},
|
448 | 398 | "source": [
|
449 |
| - "**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`\n", |
450 |
| - "objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n", |
451 |
| - "for this.\n" |
| 399 | + "### to_dataframe\n", |
| 400 | + "\n", |
| 401 | + "This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since columns in a `DataFrame` need to share the same index, variables with fewer dimensions are\n", |
| 402 | + "broadcast against the others." |
452 | 403 | ]
|
453 | 404 | },
|
454 | 405 | {
|
|
459 | 410 | "source": [
|
460 | 411 | "ds.air.to_dataframe()"
|
461 | 412 | ]
|
462 |
| - }, |
463 |
| - { |
464 |
| - "cell_type": "markdown", |
465 |
| - "metadata": {}, |
466 |
| - "source": [ |
467 |
| - "Since columns in a `DataFrame` need to have the same index, they are\n", |
468 |
| - "broadcasted.\n" |
469 |
| - ] |
470 |
| - }, |
471 |
| - { |
472 |
| - "cell_type": "code", |
473 |
| - "execution_count": null, |
474 |
| - "metadata": {}, |
475 |
| - "outputs": [], |
476 |
| - "source": [ |
477 |
| - "ds.to_dataframe()" |
478 |
| - ] |
479 | 413 | }
|
480 | 414 | ],
|
481 | 415 | "metadata": {
|
|
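The `to_series` and `to_dataframe` cells touched by this change describe a `MultiIndex` for higher dimensions and broadcasting of columns onto a shared index. A minimal sketch of that behavior follows. It uses a small synthetic `Dataset` rather than the `air_temperature` tutorial data, so it runs without a download; the variable name `lat_weight`, the sizes, and the values are assumptions made up for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for the tutorial dataset: names, sizes, and values
# are invented for this sketch.
ds = xr.Dataset(
    data_vars={
        # 2-D variable over (time, lat)
        "air": (("time", "lat"), np.arange(6.0).reshape(2, 3)),
        # 1-D variable over lat only; it has to be broadcast along time
        "lat_weight": (("lat",), np.array([0.5, 1.0, 0.5])),
    },
    coords={
        "time": pd.date_range("2013-01-01", periods=2),
        "lat": [10.0, 20.0, 30.0],
    },
)

# DataArray -> Series: the two dimensions become a two-level MultiIndex.
series = ds["air"].to_series()
print(series)

# Dataset -> DataFrame: every column shares one index, so the 1-D
# `lat_weight` values are repeated for each time step.
frame = ds.to_dataframe()
print(frame)
```

In the printed `frame`, each `lat_weight` value appears once per time step, which is the broadcasting the `to_dataframe` text refers to.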