Commit df11caa

reorg data structure intro, add malaria example
1 parent 1d75a4f commit df11caa

3 files changed

+102
-96
lines changed

fundamentals/01_data_structures.md

Lines changed: 64 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,69 @@
11
# Data Structures
22

3+
Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)
4+
are an essential part of computational science. They are encountered in a wide
5+
range of fields, including physics, astronomy, geoscience, bioinformatics,
6+
engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)
7+
provides the fundamental data structure and API for working with raw ND arrays.
8+
However, real-world datasets are usually more than just raw numbers; they have
9+
labels which encode information about how the array values map to locations in
10+
space, time, etc.
11+
12+
The N-dimensional nature of xarray’s data structures makes it suitable for
13+
dealing with multi-dimensional scientific data, and its use of dimension names
14+
instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much
15+
more manageable than the raw numpy ndarray: with xarray, you don’t need to keep
16+
track of the order of an array’s dimensions or insert dummy dimensions of size 1
17+
to align arrays (e.g., using np.newaxis).
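
As a minimal sketch of the point above (the array values and the dimension names `time` and `space` are made up for illustration), the same outer sum can be written with raw NumPy and with Xarray. NumPy needs dummy axes to line the shapes up, while Xarray broadcasts by dimension name:

```python
import numpy as np
import xarray as xr

# Illustrative 1-D inputs; the values carry no meaning.
time_vals = np.arange(3.0)
space_vals = np.arange(4.0)

# Raw NumPy: shapes (3,) and (4,) do not broadcast against each other,
# so each array needs a dummy axis of size 1 in the right position.
np_result = time_vals[:, np.newaxis] + space_vals[np.newaxis, :]

# Xarray: the dimension names drive broadcasting, so no dummy axes are needed.
t = xr.DataArray(time_vals, dims="time")
s = xr.DataArray(space_vals, dims="space")
xr_result = t + s

print(xr_result.dims)
print(np.array_equal(np_result, xr_result.values))
```

Both results hold the same numbers; the Xarray version also records which axis is which.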
18+
19+
The immediate payoff of using xarray is that you’ll write less code. The
20+
long-term payoff is that you’ll understand what you were thinking when you come
21+
back to look at it weeks or months later.
22+
23+
## Example: Weather forecast
24+
25+
Here is an example of how we might structure a dataset for a weather forecast:
26+
27+
<img src="https://docs.xarray.dev/en/stable/_images/dataset-diagram.png" align="center" width="80%">
28+
29+
You'll notice multiple data variables (temperature, precipitation), coordinate
30+
variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these
31+
fit into Xarray's data structures below.
32+
33+
Xarray doesn’t just keep track of labels on arrays – it uses them to provide a
34+
powerful and concise interface. For example:
35+
36+
- Apply operations over dimensions by name: `x.sum('time')`.
37+
38+
- Select values by label (or logical location) instead of integer location:
39+
`x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.
40+
41+
- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions
42+
(array broadcasting) based on dimension names, not shape.
43+
44+
- Easily use the split-apply-combine paradigm with groupby:
45+
`x.groupby('time.dayofyear').mean()`.
46+
47+
- Database-like alignment based on coordinate labels that smoothly handles
48+
missing values: `x, y = xr.align(x, y, join='outer')`.
49+
50+
- Keep track of arbitrary metadata in the form of a Python dictionary:
51+
`x.attrs`.
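
The operations listed above can be tried on a small synthetic array. This is only a sketch: the values, the `space` labels, and the `units` attribute are invented for illustration and are not taken from any tutorial dataset.

```python
import numpy as np
import pandas as pd
import xarray as xr

# A tiny labeled array: 4 daily time steps by 2 locations (made-up values).
time = pd.date_range("2014-01-01", periods=4, freq="D")
x = xr.DataArray(
    np.arange(8.0).reshape(4, 2),
    dims=("time", "space"),
    coords={"time": time, "space": ["a", "b"]},
    attrs={"units": "degC"},
)
y = x.isel(time=slice(1, None))  # same data without the first day

total = x.sum("time")                           # reduce over a dimension by name
first_day = x.sel(time="2014-01-01")            # select by label
same_day = x.loc["2014-01-01"]                  # label-based selection via .loc
diff = x - y                                    # arithmetic on the shared labels only
daily = x.groupby("time.dayofyear").mean()      # split-apply-combine
x2, y2 = xr.align(x, y, join="outer")           # y2 is padded with NaN on day one
units = x.attrs["units"]                        # metadata lives in a plain dict

print(total.values, diff.sizes["time"], y2.sizes["time"], units)
```

Note that `x - y` keeps only the three time steps the two arrays share, whereas the outer `xr.align` keeps all four and fills the gap in `y2` with NaN.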
52+
53+
## Example: Mosquito genetics
54+
55+
Although the Xarray library was originally developed with Earth Science datasets in mind, its data structures work well across many other domains! For example, the figure below shows a data schematic (left) alongside the corresponding Xarray Dataset representation (right), taken from a mosquito genetics analysis:
56+
57+
<img src="https://vobs-resources.cog.sanger.ac.uk/training/img/workshop-4/mosquito-genotype-array.png" align="center" width="80%">
58+
59+
The data can be stored as a 3-dimensional array, where one dimension of the array corresponds to positions (**variants**) within a reference genome, another dimension corresponds to the individual mosquitoes that were sequenced (**samples**), and a third dimension corresponds to the number of genomes within each individual (**ploidy**).
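
A rough sketch of that layout, using randomly generated 0/1 calls rather than real genotype data. The array sizes and the variable name `call_genotype` are assumptions chosen for illustration, not taken from the linked course:

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for genotype calls: 0 = reference allele, 1 = alternate.
rng = np.random.default_rng(0)
n_variants, n_samples, n_ploidy = 5, 3, 2  # hypothetical sizes
calls = rng.integers(0, 2, size=(n_variants, n_samples, n_ploidy), dtype=np.int8)

genotypes = xr.DataArray(
    calls,
    dims=("variants", "samples", "ploidy"),
    name="call_genotype",  # hypothetical variable name
)

# Count alternate alleles per variant and sample by summing over "ploidy".
alt_count = genotypes.sum("ploidy")

print(genotypes.dims, alt_count.dims)
```

Because the reduction names the `ploidy` dimension, it does not matter where that axis sits in the underlying array.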
60+
61+
You can explore this dataset in detail via the [training course in data analysis for genomic surveillance of African malaria vectors](https://anopheles-genomic-surveillance.github.io/workshop-5/module-1-xarray.html)!
62+
63+
## Explore on your own
64+
65+
The following collection of notebooks provides interactive code examples for working with example datasets and constructing Xarray data structures manually.
66+
367
```{tableofcontents}
468
569
```

fundamentals/01_datastructures.ipynb

Lines changed: 28 additions & 94 deletions
Original file line numberDiff line numberDiff line change
@@ -9,59 +9,12 @@
99
"In this lesson, we cover the basics of Xarray data structures. Our\n",
1010
"learning goals are as follows. By the end of the lesson, we will be able to:\n",
1111
"\n",
12+
":::{admonition} Learning Goals\n",
1213
"- Understand the basic data structures (`DataArray` and `Dataset` objects) in Xarray\n",
13-
"\n",
14-
"---\n",
15-
"\n",
16-
"## Introduction\n",
17-
"\n",
18-
"Multi-dimensional (a.k.a. N-dimensional, ND) arrays (sometimes called “tensors”)\n",
19-
"are an essential part of computational science. They are encountered in a wide\n",
20-
"range of fields, including physics, astronomy, geoscience, bioinformatics,\n",
21-
"engineering, finance, and deep learning. In Python, [NumPy](https://numpy.org/)\n",
22-
"provides the fundamental data structure and API for working with raw ND arrays.\n",
23-
"However, real-world datasets are usually more than just raw numbers; they have\n",
24-
"labels which encode information about how the array values map to locations in\n",
25-
"space, time, etc.\n",
26-
"\n",
27-
"Here is an example of how we might structure a dataset for a weather forecast:\n",
28-
"\n",
29-
"<img src=\"https://docs.xarray.dev/en/stable/_images/dataset-diagram.png\" align=\"center\" width=\"80%\">\n",
30-
"\n",
31-
"You'll notice multiple data variables (temperature, precipitation), coordinate\n",
32-
"variables (latitude, longitude), and dimensions (x, y, t). We'll cover how these\n",
33-
"fit into Xarray's data structures below.\n",
34-
"\n",
35-
"Xarray doesn’t just keep track of labels on arrays – it uses them to provide a\n",
36-
"powerful and concise interface. For example:\n",
37-
"\n",
38-
"- Apply operations over dimensions by name: `x.sum('time')`.\n",
39-
"\n",
40-
"- Select values by label (or logical location) instead of integer location:\n",
41-
" `x.loc['2014-01-01']` or `x.sel(time='2014-01-01')`.\n",
42-
"\n",
43-
"- Mathematical operations (e.g., `x - y`) vectorize across multiple dimensions\n",
44-
" (array broadcasting) based on dimension names, not shape.\n",
45-
"\n",
46-
"- Easily use the split-apply-combine paradigm with groupby:\n",
47-
" `x.groupby('time.dayofyear').mean()`.\n",
48-
"\n",
49-
"- Database-like alignment based on coordinate labels that smoothly handles\n",
50-
" missing values: `x, y = xr.align(x, y, join='outer')`.\n",
51-
"\n",
52-
"- Keep track of arbitrary metadata in the form of a Python dictionary:\n",
53-
" `x.attrs`.\n",
54-
"\n",
55-
"The N-dimensional nature of xarray’s data structures makes it suitable for\n",
56-
"dealing with multi-dimensional scientific data, and its use of dimension names\n",
57-
"instead of axis labels (`dim='time'` instead of `axis=0`) makes such arrays much\n",
58-
"more manageable than the raw numpy ndarray: with xarray, you don’t need to keep\n",
59-
"track of the order of an array’s dimensions or insert dummy dimensions of size 1\n",
60-
"to align arrays (e.g., using np.newaxis).\n",
61-
"\n",
62-
"The immediate payoff of using xarray is that you’ll write less code. The\n",
63-
"long-term payoff is that you’ll understand what you were thinking when you come\n",
64-
"back to look at it weeks or months later.\n"
14+
"- Customize the display of Xarray objects\n",
15+
"- Access variables, coordinates, and arbitrary metadata\n",
16+
"- Transform to tabular Pandas data structures\n",
17+
":::"
6518
]
6619
},
6720
{
@@ -72,13 +25,10 @@
7225
"\n",
7326
"Xarray provides two data structures: the `DataArray` and `Dataset`. The\n",
7427
"`DataArray` class attaches dimension names, coordinates and attributes to\n",
75-
"multi-dimensional arrays while `Dataset` combines multiple arrays.\n",
28+
"multi-dimensional arrays while `Dataset` combines multiple DataArrays.\n",
7629
"\n",
7730
"Both classes are most commonly created by reading data.\n",
78-
"To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial.\n",
79-
"\n",
80-
"Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
81-
"We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
31+
"To learn how to create a DataArray or Dataset manually, see the [Creating Data Structures](01.1_creating_data_structures.ipynb) tutorial."
8232
]
8333
},
8434
{
@@ -88,7 +38,13 @@
8838
"outputs": [],
8939
"source": [
9040
"import numpy as np\n",
91-
"import xarray as xr"
41+
"import xarray as xr\n",
42+
"import pandas as pd\n",
43+
"\n",
44+
"# When working in a Jupyter Notebook you might want to customize Xarray display settings to your liking\n",
45+
"# The following settings reduce the amount of data displayed out by default\n",
46+
"xr.set_options(display_expand_attrs=False, display_expand_data=False)\n",
47+
"np.set_printoptions(threshold=10, edgeitems=2)"
9248
]
9349
},
9450
{
@@ -97,7 +53,10 @@
9753
"source": [
9854
"### Dataset\n",
9955
"\n",
100-
"`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n"
56+
"`Dataset` objects are dictionary-like containers of DataArrays, mapping a variable name to each DataArray.\n",
57+
"\n",
58+
"Xarray has a few small real-world tutorial datasets hosted in this GitHub repository https://github.com/pydata/xarray-data.\n",
59+
"We'll use the [xarray.tutorial.load_dataset](https://docs.xarray.dev/en/stable/generated/xarray.tutorial.open_dataset.html#xarray.tutorial.open_dataset) convenience function to download and open the `air_temperature` (National Centers for Environmental Prediction) Dataset by name."
10160
]
10261
},
10362
{
@@ -147,14 +106,14 @@
147106
"cell_type": "markdown",
148107
"metadata": {},
149108
"source": [
150-
"#### What is all this anyway? (String representations)\n",
109+
"#### HTML vs text representations\n",
151110
"\n",
152111
"Xarray has two representation types: `\"html\"` (which is only available in\n",
153112
"notebooks) and `\"text\"`. To choose between them, use the `display_style` option.\n",
154113
"\n",
155114
"So far, our notebook has automatically displayed the `\"html\"` representation (which we will continue using).\n",
156-
"The `\"html\"` representation is interactive, allowing you to collapse sections (left arrows) and\n",
157-
"view attributes and values for each value (right hand sheet icon and data symbol)."
115+
"The `\"html\"` representation is interactive, allowing you to collapse sections () and\n",
116+
"view attributes and values for each value (📄 and )."
158117
]
159118
},
160119
{
@@ -180,7 +139,7 @@
180139
"- an unordered list of *coordinates* or dimensions with coordinates with one item\n",
181140
" per line. Each item has a name, one or more dimensions in parentheses, a dtype\n",
182141
" and a preview of the values. Also, if it is a dimension coordinate, it will be\n",
183-
" marked with a `*`.\n",
142+
" printed in **bold** font.\n",
184143
"- an alphabetically sorted list of *dimensions without coordinates* (if there are any)\n",
185144
"- an unordered list of *attributes*, or metadata"
186145
]
@@ -379,15 +338,6 @@
379338
"methods on `xarray` objects:\n"
380339
]
381340
},
382-
{
383-
"cell_type": "code",
384-
"execution_count": null,
385-
"metadata": {},
386-
"outputs": [],
387-
"source": [
388-
"import pandas as pd"
389-
]
390-
},
391341
{
392342
"cell_type": "code",
393343
"execution_count": null,
@@ -429,8 +379,8 @@
429379
"cell_type": "markdown",
430380
"metadata": {},
431381
"source": [
432-
"**<code>to_series</code>**: This will always convert `DataArray` objects to\n",
433-
"`pandas.Series`, using a `MultiIndex` for higher dimensions\n"
382+
"### to_series\n",
383+
"This will always convert `DataArray` objects to `pandas.Series`, using a `MultiIndex` for higher dimensions\n"
434384
]
435385
},
436386
{
@@ -446,9 +396,10 @@
446396
"cell_type": "markdown",
447397
"metadata": {},
448398
"source": [
449-
"**<code>to_dataframe</code>**: This will always convert `DataArray` or `Dataset`\n",
450-
"objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named\n",
451-
"for this.\n"
399+
"### to_dataframe\n",
400+
"\n",
401+
"This will always convert `DataArray` or `Dataset` objects to a `pandas.DataFrame`. Note that `DataArray` objects have to be named for this. Since columns in a `DataFrame` need to have the same index, they are\n",
402+
"broadcasted."
452403
]
453404
},
454405
{
@@ -459,23 +410,6 @@
459410
"source": [
460411
"ds.air.to_dataframe()"
461412
]
462-
},
463-
{
464-
"cell_type": "markdown",
465-
"metadata": {},
466-
"source": [
467-
"Since columns in a `DataFrame` need to have the same index, they are\n",
468-
"broadcasted.\n"
469-
]
470-
},
471-
{
472-
"cell_type": "code",
473-
"execution_count": null,
474-
"metadata": {},
475-
"outputs": [],
476-
"source": [
477-
"ds.to_dataframe()"
478-
]
479413
}
480414
],
481415
"metadata": {

workshops/scipy2024/index.ipynb

Lines changed: 10 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -20,7 +20,7 @@
2020
":::{admonition} Learning Goals\n",
2121
"- Orient yourself to Xarray resources to continue on your Xarray journey!\n",
2222
"- Effectively use Xarray’s multidimensional indexing and computational patterns\n",
23-
"- Understand how Xarray can wrap other array types in the scientific Python ecosystem\n",
23+
"- Understand how Xarray integrates with other libraries in the scientific Python ecosystem\n",
2424
"- Learn how to leverage Xarray’s powerful backend and extension capabilities to customize workflows and open a variety of scientific datasets\n",
2525
":::\n",
2626
"\n",
@@ -35,7 +35,7 @@
3535
"| Introduction and Setup | 1:30 (10 min) | --- | \n",
3636
"| The Xarray Data Model | 1:40 (40 min) | [Data structures](../../fundamentals/01_datastructures.ipynb) <br> [Basic Indexing](../../fundamentals/02.1_indexing_Basic.ipynb) | \n",
3737
"| *10 minute Break* \n",
38-
"| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md) <br> [Computation Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) <br> | \n",
38+
"| Indexing & Computational Patterns | 2:30 (50 min) | [Advanced Indexing](../../intermediate/indexing/indexing.md) <br> [Computational Patterns](../../intermediate/01-high-level-computation-patterns.ipynb) <br> | \n",
3939
"| *10 minute Break* | \n",
4040
"| Xarray Integrations and Extensions | 3:30 (50 min) | [The Xarray Ecosystem](../../intermediate/xarray_ecosystem.ipynb) | \n",
4141
"| *10 minute Break* | \n",
@@ -81,6 +81,14 @@
8181
"- Max Jones (CarbonPlan)\n",
8282
"- Wietze Suijker (Space Intelligence)"
8383
]
84+
},
85+
{
86+
"cell_type": "code",
87+
"execution_count": null,
88+
"id": "1",
89+
"metadata": {},
90+
"outputs": [],
91+
"source": []
8492
}
8593
],
8694
"metadata": {
