Commit 378abec

410 edits (#61)

* fix vrt dir
* fix refs
* remove header substitutes
* typo, internal refs, superscript, remove substitutions
* fix typo in s1 nb1
* typo
* move env files, add vector data desc, open_zarr, some typos
* standardize datacube -> data cube
* fix create_vrt fn
* add intermediate asf s1 cube
* fix s1 code to support other data locs
* small fixes
* only calc season gb for one pol in pc nb
* few more fixes

1 parent c84ca00

File tree

143 files changed: +11958 −213958 lines changed

book/_config.yml (2 additions, 184 deletions)

@@ -1,7 +1,7 @@
 # Book settings
 # Learn more at https://jupyterbook.org/customize/config.html
 
-title: Cloud-native geospatial datacube workflows with open-source tools
+title: Cloud-native geospatial data cube workflows with open-source tools
 author: Emma Marshall
 copyright: "2025" #, Emma Marshall
 #logo: logo.png
@@ -33,7 +33,7 @@ bibtex_bibfiles:
 
 # Information about where the book exists on the web
 repository:
-  url: https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows
+  url: https://github.com/e-marshall/cloud-open-source-geospatial-data-cube-workflows
   branch: main
 
 launch_buttons:
@@ -74,192 +74,10 @@ sphinx:
     - substitution
 
   myst_substitutions:
-    part1_title: "Part 2: Background"
-    part2_title: "ITS_LIVE ice velocity data tutorial"
-    #part2_title: "Using Xarray to examine cloud-based glacier surface velocity data"
-    part3_title: "Sentinel-1 RTC imagery tutorial"
-    #part3_title: "Sentinel-1 RTC data workflows with xarray"
-    part4_title: "Part 5: Conclusion"
-
-    #tutorial 1 nb titles
-    title_its_nb1: "# 3.1 Accessing cloud-hosted ITS_LIVE data"
-    title_its_nb2: "# 3.2 Working with larger than memory data"
-    title_its_nb3: "# 3.3 Handling raster and vector data"
-    title_its_nb4: "# 3.4 Exploratory data analysis of a single glacier"
-    title_its_nb5: "# 3.5 Exploratory data analysis of multiple glaciers"
-
-    #tutorial 2 nb titles
-    title_s1_1: "# 4.1 Read Sentinel-1 data processed by ASF"
-    title_s1_2: "# 4.2 Wrangle metadata"
-    title_s1_3: "# 4.3 Exploratory analysis of ASF S1 imagery"
-    title_s1_4: "# 4.4 Read Sentinel-1 RTC data from Microsoft Planetary Computer"
-    title_s1_5: "# 4.5 Comparing Sentinel-1 RTC datasets"
-    #title_s1_6: "# 6. Example of Sentinel-1 RTC time series analysis"
 
     #global nb sections
-    intro: "## Introduction"
-    overview: "### Overview"
-    outline: "### Outline"
-    learning_goals: "### Learning goals"
-    concepts: "#### Concepts"
-    techniques: "#### Techniques"
-    conclusion: "## Conclusion"
     break: "----"
 
-    #nb1
-    #can't get subs + links to headings to work
-    # so not using lettered headings for now
-    # but still using numbered subsections (a1_...)
-    a_its_nb1: "A. Overview of ITS_LIVE data"
-    a1_its_nb1: "1) Data structure overview"
-    a2_its_nb1: "2) Climate Forecast (CF) Metadata Conventions"
-
-    b_its_nb1: "B. Read ITS_LIVE data from AWS S3 using Xarray"
-    b1_its_nb1: "1) Overview of ITS_LIVE data storage and catalog"
-    b2_its_nb1: "2) Read ITS_LIVE data from S3 storage into memory"
-    b3_its_nb1: "3) Check spatial footprint of data"
-
-    c_its_nb1: "C. Query ITS_LIVE catalog"
-    c1_its_nb1: "1) Find ITS_LIVE granule for a point of interest"
-    c2_its_nb1: "2) Read + visualize spatial footprint of ITS_LIVE data"
-
-
-    #nb2
-    a_its_nb2: "A. Compare approaches for reading larger than memory data"
-    a1_its_nb2: "1) `chunks = 'auto'`"
-    a2_its_nb2: "2) `chunks = {}`"
-    a3_its_nb2: "3) An out-of-order time dimension"
-    a4_its_nb2: "4) Read the dataset without Dask"
-    b_its_nb2: "B. Organize data once it's in memory"
-    b1_its_nb2: "1) Arrange dataset in chronological order"
-    b2_its_nb2: "2) Convert to a Dask-backed `Xarray.Dataset`"
-
-    #nb3
-    a_its_nb3: "Read data using strategy identified in previous notebook"
-    b_its_nb3: "Incorporate glacier outline (vector) data"
-    b1_its_nb3: "1) Read and reproject vector data"
-    b2_its_nb3: "2) Visualize spatial extents of glacier outlines and ITS_LIVE data cube"
-    b3_its_nb3: "3) Crop vector data to spatial extent of raster data"
-    c_its_nb3: "C. Combine raster and vector data"
-    c1_its_nb3: "1) Use vector data to crop raster data"
-    c2_its_nb3: "2) Write clipped raster data cube to disk"
-
-    #nb4
-    a_its_nb4: "A. Data exploration"
-    a1_its_nb4: "1) Load raster data and visualize with vector data"
-    a2_its_nb4: "2) Examine data coverage along the time dimension"
-    a3_its_nb4: "3) Look at data by sensor"
-    b_its_nb4: "B. Comparing different satellites"
-    b1_its_nb4: "1) DataTree approach"
-    b2_its_nb4: "2) GroupBy approach"
-    c_its_nb4: "C. Examine velocity variability"
-    c1_its_nb4: "1) Histograms and summary statistics"
-    c2_its_nb4: "2) Spatial velocity variablity"
-    c3_its_nb4: "3) Temporal velocity variability"
-    d_its_nb4: "D. Dimensional computations"
-    d1_its_nb4: "1) Temporal resampling"
-    d2_its_nb4: "2) Grouped analysis by season"
-
-    #nb5
-    a_its_nb5: "A. Read and organize data"
-    a1_its_nb5: "1) Raster data"
-    a2_its_nb5: "2) Vector data"
-
-    b_its_nb5: "B. Combine raster and vector to create a vector data cube"
-    b1_its_nb5: "1) Make a vector data cube"
-    b2_its_nb5: "2) Add attribute data to vector cube"
-    b3_its_nb5: "3) Write vector cube to disk"
-
-    c_its_nb5: "C. Data visualization"
-    c1_its_nb5: "1) Read vector data cube into memory"
-    c2_its_nb5: "2) Visualize velocity data"
-    c3_its_nb5: "3) Visualize associations between velocity and attribute data"
-
-
-    #sentinel nb1
-    a_s1_nb1: "A. Prepare to read data into memory"
-    a1_s1_nb1: "1) Build lists of file names and paths needed for VRT objects"
-    a2_s1_nb1: "2) Create VRT objects"
-    b_s1_nb1: "B. Read data"
-    b1_s1_nb1: "1) Take a look at chunking"
-    # sentinel nb2
-    a_s1_nb2: "A. Read and inspect initial metadata"
-    a1_s1_nb2: "1) Add appropriate names to variables"
-    a2_s1_nb2: "2) What metadata currently exists?"
-
-    b_s1_nb2: "B. Add metadata from file name"
-    b1_s1_nb2: "1) Parse file name"
-    b2_s1_nb2: "2) Extract and format acquisition dates"
-    b3_s1_nb2: "3) Combine data cubes"
-
-    c_s1_nb2: "C. Time-varying metadata"
-    c1_s1_nb2: "1) Extract attributes as list of dictionaries"
-    c2_s1_nb2: "2) Create tuple of metadata for each type of information"
-    c3_s1_nb2: "3) Assign metadata tuple to Xarray dataset as a coordinate variable"
-
-    d_s1_nb2: "D. Add metadata from a markdown file"
-    d1_s1_nb2: "1) Extract granule ID"
-    d2_s1_nb2: "2) Build coordinate `xr.DataArray`"
-
-    #sentinel nb3
-    a_s1_nb3: "A. Read and prepare data"
-    a1_s1_nb3: "1) Clip to spatial area of interest"
-
-    b_s1_nb3: "B. Layover-shadow map"
-    b1_s1_nb3: "1) Interactive visualization of layover-shadow maps"
-
-    c_s1_nb3: "C. Orbital direction"
-    c1_s1_nb3: "1) Is a pass ascending or descending?"
-    c2_s1_nb3: "2) Assign orbital direction as a coordinate variable"
-
-    d_s1_nb3: "D. Duplicate time steps"
-    d1_s1_nb3: "1) Identify duplicate time steps"
-    d2_s1_nb3: "2) Visualize duplicates"
-    d3_s1_nb3: "3) Drop duplicates"
-
-    e_s1_nb3: "E. Examine coverage over time series"
-
-    f_s1_nb3: "F. Data visualization"
-    f1_s1_nb3: "1) Mean backscatter over time"
-    f2_s1_nb3: "2) Seasonal backscatter variability"
-    f3_s1_nb3: "3) Backscatter time series"
-
-
-
-
-    #s1 nb4
-    a_s1_nb4: "A. Connect to Microsoft Planetary Computer"
-    a1_s1_nb4: "1) Explore STAC metadata"
-
-    b_s1_nb4: "B. Read data and create Xarray data cube"
-    b1_s1_nb4: "1) Create a Dask distributed cluster"
-    b2_s1_nb4: "2) Use `stackstac` to pull queried data from Planetary Computer"
-    b3_s1_nb4: "3) Inspect dataset"
-    #b4_s1_nb4: "4) Convert a 4-d `xr.DataArray` to a 3-d `xr.Dataset`"
-
-    c_s1_nb4: "C. Visualize data"
-    c1_s1_nb4: "1) Ascending and descending pass acquisitions"
-    c2_s1_nb4: "2) Variability over time"
-    c3_s1_nb4: "3) Seasonal variability"
-
-    #s1 nb5
-    a_s1_nb5: "A. Read and prepare data"
-    a1_s1_nb5: "1) Check coordinate reference system information"
-
-    b_s1_nb5: "B. Ensure direct comparison between datasets"
-    b1_s1_nb5: "1) Subset time series to common time steps"
-    b2_s1_nb5: "2) Handle differences in spatial resolution"
-    b3_s1_nb5: "3) Mask missing data from one dataset"
-
-    c_s1_nb5: "C. Combine objects"
-    c1_s1_nb5: "1) `expand_dims()` to add 'source' dimension"
-    c2_s1_nb5: "2) `combine_by_coords()`"
-
-    d_s1_nb5: "D. Visualize comparison"
-    d1_s1_nb5: "1) Mean over time"
-    d2_s1_nb5: "2) Mean over space"
-    d3_s1_nb5: "3) Difference"
-
 # Not sure why but uncommenting these causes all of the md
 # substitution variables and formatting like tabs to not work
 #sphinx:
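For context, entries under `myst_substitutions` are consumed in the book's markdown via MyST substitution syntax (e.g. `{{part2_title}}`, which this commit replaces with literal headings in `3_tutorials_overview.md`). A minimal sketch of how such an entry is defined and used; the exact nesting inside a Jupyter Book `_config.yml` may differ from this:

```yaml
# _config.yml (sketch; exact nesting may differ)
sphinx:
  config:
    myst_enable_extensions:
      - substitution
    myst_substitutions:
      part2_title: "ITS_LIVE ice velocity data tutorial"
```

In a markdown page, `## *Part 1: {{part2_title}}*` then renders with the substituted title, which is why removing these entries requires the literal-text replacements seen in the diffs below.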

book/background/2_data_cubes.md (4 additions, 1 deletion)

@@ -5,8 +5,9 @@ The term **data cube** is used frequently throughout this book. This page contai
 
 [^mynote2]: Geffner et al. frame this distinction as *measure attributes* ("attributes whose values are of interest") and *functional attributes* that contextualize the measure attribute values {cite:t}`geffner_2000_dynamic`.
 
+The key object of analysis in this book is a data cube. Many scientific workflows examine how a given variable (such as temperature, wind speed, relative humidity, etc.) varies over time and/or space. Data cubes are a way of organizing geospatial data that allow us to ask these questions. Most of the examples are [raster data cubes](https://openeo.org/documentation/1.0/datacubes.html). Raster data cubes are n-dimensional objects that store continuous measurements or estimates of physical quantities that exist along a given dimension(s).
 
-The key object of analysis in this book is a [raster data cube](https://openeo.org/documentation/1.0/datacubes.html). Raster data cubes are n-dimensional objects that store continuous measurements or estimates of physical quantities that exist along given dimension(s). Many scientific workflows involve examining how a variable (such as temperature, windspeed, relative humidity, etc.) varies over time and/or space. Data cubes are a way of organizing geospatial data that let us ask these questions.
+Many examples in the book also include vector data. In contrast to raster data, where continuous measurements are stored on a grid, vector data represent geographic features such as roads, rivers, and political borders using points, lines, and polygons. Vector data are often stored as table-like data frames, where geometry and attribute information for individual features are stored in each row of the table. A relatively new development in the Xarray and Python ecosystem is support for vector data cubes. Vector data cubes are similar to raster data cubes, except that one of the cube's dimensions is an array of geometry objects. This allows you to store multi-dimensional data associated with each geometry.
 
 A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions ({cite:t}`Baumann_2019_datacube,giuliani_2019_EarthObservationOpen,mahecha_2020_EarthSystemData,montero_2024_EarthSystemData`). While this is a relatively intuitive concept,in practice, the amount and types of information contained within a single dataset and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Center or [DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html)), and then we are responsible for organizing the data in a way that let's us ask questions of it. While some of these decisions are straightforward (eg. *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).
 
@@ -70,6 +71,8 @@ In the second [tutorial](../sentinel1/s1_intro.md), we work with two Sentinel-1
 
 ### See also
 - [OpenEO - Data Cubes](https://openeo.org/documentation/1.0/datacubes.html)
+- [r-spatial - Vector Data Cubes](https://r-spatial.org/r/2022/09/12/vdc.html)
+- [Xvec - Vector data cubes for Xarray](https://xvec.readthedocs.io/en/stable/)
 - [Open Data Cube initiative](https://www.opendatacube.org/about-draft)
 - [The Datacube Manifesto](http://www.earthserver.eu/tech/datacube-manifesto/The-Datacube-Manifesto.pdf)
 - [ARCO: The smartest way to access big geospatial data - Lobelia Earth](https://blog.lobelia.earth/arco-the-smartest-way-to-access-big-geospatial-data-eaf689eff3c9)
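The (`x`, `y`, `time`) raster data cube structure described in the changed text above can be sketched directly with Xarray; the dimension sizes, coordinate values, and the `velocity` variable name here are illustrative, not taken from the book's datasets:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative (time, y, x) raster data cube: 5 monthly time steps on a
# 3 x 4 spatial grid, holding a hypothetical "velocity" variable.
rng = np.random.default_rng(0)
cube = xr.Dataset(
    data_vars={"velocity": (("time", "y", "x"), rng.random((5, 3, 4)))},
    coords={
        "time": pd.date_range("2020-01-01", periods=5, freq="MS"),
        "y": np.arange(3) * 100.0,
        "x": np.arange(4) * 100.0,
    },
)

# Typical data cube operations: reduce along time, select one time slice.
mean_velocity = cube["velocity"].mean(dim="time")  # 2-d (y, x) map
first_step = cube["velocity"].isel(time=0)         # one time slice
```

Stacking observations along a shared `time` dimension like this is exactly the "straightforward decision" the page refers to; the harder, open-ended choices concern where metadata lives on such an object.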

book/background/3_tutorials_overview.md (2 additions, 2 deletions)

@@ -2,7 +2,7 @@
 
 This book contains two distinct tutorials, each of which focuses on a different cloud-optimized geospatial dataset and different cloud-computing resources. Read more about the datasets used [here](4_tutorial_data.md).
 
-## *Part 1: {{part2_title}}*
+## *Part 1: ITS_LIVE ice velocity data tutorial*
 
 This tutorial focuses on a dataset of ice velocity observations derived from satellite image pairs, using a number of different satellite sensors. This dataset is accessed as Zarr data cubes from AWS S3 cloud object storage. The notebooks in this tutorial focus on:
 
@@ -12,7 +12,7 @@ This tutorial focuses on a dataset of ice velocity observations derived from sat
 4) Inspecting metadata and using metadata to subset and visualize the dataset,
 5) Exploratory data analysis and visualization at the scale of a single glacier
 
-## *Part 2: {{part3_title}}*
+## *Part 2: Sentinel-1 RTC imagery tutorial*
 
 This tutorial focuses on data from Sentinel-1, a synthetic aperture radar (SAR) dataset containing imagery collected at C-band. Specifically, we are looking at Sentinel-1 Radiometric Terrain Corrected (RTC) imagery (for more detail on this, see [tutorial data](4_tutorial_data.md)). We demonstrate how to access and work with two Sentinel-1 RTC datasets as well as how to set up and perform an initial comparison between the two and time series analysis of Sentinel-1 backscatter variability. These notebooks cover:

book/background/4_tutorial_data.md (1 addition, 1 deletion)

@@ -1,6 +1,6 @@
 # 2.4 Data used in tutorials
 
-We use a many different datasets throughout these tutorials. While each tutorial is focused on a different raster time series (ITS_LIVE ice velocity data and Sentinel-1 imagery), we also use vector data to represent points of interest.
+We use many different datasets throughout these tutorials. While each tutorial is focused on a different raster time series (ITS_LIVE ice velocity data and Sentinel-1 imagery), we also use vector data to represent points of interest.
 
 Most of the examples in this book use data accessed programmatically from cloud-object storage. We make subset of the data available in this books Github repository to remove the need for computationally-intensive operations in the tutorials. In one example, working with Sentinel-1 data processed by Alaska Satellite Facility, we start with data downloaded locally. Users who would like to complete this processing step on their own may do so (and access the data [here](https://zenodo.org/records/15036782)), but a smaller subset of this data is stored in the repository.

book/background/5_software.md (11 additions, 9 deletions)

@@ -1,20 +1,21 @@
 # 2.5 Software and computing environment
 
-On this page you'll find information about the computing environment that will be used in both of the tutorials in this book. We provide instructions for Running locally (on laptop), or on a hosted JupyterHub in AWS us-west-2.
+On this page you'll find information about the computing environment that will be used for both of the tutorials in this book. We provide instructions for running locally (on a laptop), or on a hosted JupyterHub in AWS us-west-2.
 
 ## *Running tutorial materials locally*
 
 There are two options for creating a software environment: [pixi](https://pixi.sh/latest/) or [mamba](https://mamba.readthedocs.io/en/latest/) / [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html). We recommend using pixi to create a consistent environment on different operating systems. If you have pixi installed, follow the steps below, otherwise, follow the steps for conda/mamba below.
 
 ### To use pixi
 1. Clone the book's GitHub repository:
-```git clone https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows.git```
+```git clone https://github.com/e-marshall/cloud-open-source-geospatial-data-cube-workflows.git```
 
 2. Navigate into the repo environment:
-```cd cloud-open-source-geospatial-datacube-workflows```
+```cd cloud-open-source-geospatial-data-cube-workflows```
 
-3. There is a small data cube included in the repo that is used in the tutorials. We don't want git to track this so we tell it to ignore this file path.
-```git update-index --assume-unchanged book/itslive/data/raster_data/regional_glacier_velocity_vector_cube.zarr/.```
+3. There are two small data cubes included in the repo that are used in the tutorials. We don't want git to track these so we tell git to ignore these file paths:
+
+```git update-index --assume-unchanged book/itslive/data/raster_data/regional_glacier_velocity_vector_cube.zarr/. book/sentinel/data/raster_data/full_timeseries/intermediate_cubes/s1_asf_clipped_cube.zarr/.```
 
 4. Execute `pixi run` for each tutorial:
 ```pixi run itslive```
@@ -25,17 +26,18 @@ Note that the first `pixi run` will download specific versions of all required P
 ### To use conda/mamba
 
 1. Clone this book's GitHub repository:
-```git clone https://github.com/e-marshall/cloud-open-source-geospatial-datacube-workflows.git```
+```git clone https://github.com/e-marshall/cloud-open-source-geospatial-data-cube-workflows.git```
 
 2. Navigate into the `book` sub-directory:
-```cd cloud-open-source-geospatial-datacube-workflows/book```
+```cd cloud-open-source-geospatial-data-cube-workflows/book```
 
 3. Create and activate a conda environment from the `environment.yml` file located in the repo:
 ```conda env create -f environment.yml```
 ```conda activate book```
 
-4. There is a small data cube included in the repo that is used in the tutorials. We don't want git to track this so we tell it to ignore this file path.
-```git update-index --assume-unchanged book/itslive/data/raster_data/regional_glacier_velocity_vector_cube.zarr/.```
+4. There are two small data cubes included in the repo that are used in the tutorials. We don't want git to track these so we tell git to ignore these file paths:
+
+```git update-index --assume-unchanged book/itslive/data/raster_data/regional_glacier_velocity_vector_cube.zarr/. book/sentinel/data/raster_data/full_timeseries/intermediate_cubes/s1_asf_clipped_cube.zarr/.```
 
 5. Start Jupyterlab and navigate to the directories containing the Jupyter notebooks (`itslive/nbs` and `s1/nbs`):
 ```jupyterlab```
