book/background/context_motivation.md (5 additions, 4 deletions)
Technological developments in recent decades have engendered fundamental shifts in the nature of scientific data and how it is used for analysis.
```{epigraph}
"Traditionally, scientific data have been distributed via a “download model,” wherein scientists download individual data files to local computers for analysis.After downloading many files, scientists typically have to do extensive processing and organizing to make them useful for the data analysis; this creates a barrier to reproducibility, since a scientist’s analysis code must account for this unique “local” organization. Furthermore, the sheer size of the datasets (many terabytes to petabytes) can make downloading effectively impossible. Analysis of such data volumes also can benefit from parallel / distributed computing, which is not always readily available on local computers. Finally, this model reinforces inequality between privileged institutions that have the resources to host local copies of the data and those that don’t. This restricts who can participate in science."
11
+
12
+
-- {cite:t}`abernathey_2021_cloud`
```
### *II. Increasingly large, cloud-optimized data means new tools and approaches for data management*
The increase in publicly available earth observation data has transformed scientific workflows across a range of fields, prompting analysts to gain new skills to work with larger volumes of data in new formats and locations, and to use distributed cloud-computing resources in their analysis {cite:p}`abernathey_2021_cloud,gentemann_2021_science,mathieu_2017_esas,ramachandran_2021_open,Sudmanns_2020_big,wagemann_2021_user`.
```{figure} imgs/fy24-projection-chart.png
Volume of NASA Earth Science Data archives, including growth of existing-mission ...
```
### *III. Asking questions of complex datasets*
Scientific workflows involve asking complex questions of diverse types of data. Earth observation and related datasets often contain two types of information: measurements of a physical observable (e.g., temperature) and metadata that provides the auxiliary information required to interpret the physical observable (time and location of measurement, information about the sensor, etc.). With the increasingly complex and large volumes of earth observation data currently available, storing, managing, and organizing these types of data can quickly become a complex and challenging task, especially for students and early-career analysts {cite:p}`mathieu_esas_2017,palumbo_2017_building,Sudmanns_2020_big,wagemann_2021_user`.
This book provides detailed examples of scientific workflow steps that ingest complex, multi-dimensional datasets, introduces users to the landscape of popular, actively maintained open-source software packages for working with geospatial data in Python, and includes strategies for working with larger-than-memory data stored in publicly available cloud-hosted repositories. These demonstrations are accompanied by detailed discussion of concepts involved in analyzing earth observation data, such as dataset inspection, manipulation, and exploratory analysis and visualization. Overall, we emphasize the importance of understanding the structure of multi-dimensional earth observation datasets within the context of a given data model and demonstrate how such an understanding can enable more efficient and intuitive scientific workflows.
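As a rough illustration of that last point, the sketch below (which assumes a hypothetical Zarr store URL and that `xarray`, `dask`, and `s3fs` are installed) opens a cloud-hosted dataset lazily, so that only the chunks a computation actually touches are ever transferred:

```python
import xarray as xr

# Hypothetical cloud-hosted Zarr store; a real workflow would substitute the URL
# of the dataset it needs.
url = "s3://example-bucket/example-datacube.zarr"

# Open lazily: metadata and coordinates are read up front, but the underlying
# arrays are represented as dask chunks and only loaded when a computation needs them.
ds = xr.open_zarr(url, chunks="auto", storage_options={"anon": True})

print(ds)  # inspect dimensions, coordinates, and variables without downloading the data
```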
book/background/data_cubes.md (9 additions, 6 deletions)
The key object of analysis in this book is a [raster data cube](https://openeo.org/documentation/1.0/datacubes.html). Raster data cubes are n-dimensional objects that store continuous measurements or estimates of physical quantities that exist along given dimension(s). Many scientific workflows involve examining how a variable (such as temperature, wind speed, relative humidity, etc.) varies over time and/or space. Data cubes are a way of organizing geospatial data that lets us ask these questions.
A very common data cube structure is a 3-dimensional object with (`x`,`y`,`time`) dimensions {cite:p}`baumann_2017_datacube,mahecha_2020_EarthSystemData,giuliani_2019_EarthObservationOpen,montero_2024_EarthSystemData`. While this is a relatively intuitive concept, in practice, the amount and types of information contained within a single dataset, and the operations involved in managing them, can become complicated and unwieldy. As analysts, we access data (usually from providers such as Distributed Active Archive Centers ([DAACs](https://nssdc.gsfc.nasa.gov/earth/daacs.html))), and then we are responsible for organizing the data in a way that lets us ask questions of it. While some of these decisions are straightforward (e.g., *It makes sense to stack observations from different points in time along a time dimension*), some can be more open-ended (*Where and how should important metadata be stored so that it will propagate across appropriate operations and be accessible when it is needed?*).
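To make this concrete, here is a minimal sketch of such a cube built with `xarray` (the values, coordinate ranges, and attribute names are hypothetical):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical (time, y, x) cube: 4 monthly rasters of temperature on a 10 x 10 grid.
temperature = xr.DataArray(
    np.random.rand(4, 10, 10),
    dims=("time", "y", "x"),
    coords={
        "time": pd.date_range("2020-01-01", periods=4, freq="MS"),
        "y": np.linspace(45.0, 44.1, 10),  # latitude-like coordinate values
        "x": np.linspace(7.0, 7.9, 10),    # longitude-like coordinate values
    },
    name="temperature",
    attrs={"units": "degC", "sensor": "hypothetical-sensor-A"},  # metadata stored with the data
)

# Questions about variability over time and space become simple reductions.
# keep_attrs=True asks xarray to carry the metadata through the operation.
spatial_mean = temperature.mean(dim=("x", "y"), keep_attrs=True)  # one value per time step
temporal_mean = temperature.mean(dim="time", keep_attrs=True)     # one value per pixel
```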
### *Two types of information*
Fundamentally, many of these complexities can be reduced to one distinction: is a particular piece of information a physical observable (the main focus, or target, of the dataset), or is it metadata that provides the information necessary to properly interpret and handle the physical observable? Answering this question will help you understand how to situate a piece of information within the broader data object.
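In `xarray` terms, one way (though not the only one) to express this distinction is to store the physical observable as a data variable and the interpretive information as coordinates and attributes; the names and values below are purely illustrative:

```python
import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset(
    # Physical observable: the measurement itself lives in a data variable.
    data_vars={"surface_temperature": (("time", "y", "x"), np.random.rand(3, 5, 5))},
    # Metadata needed to locate and interpret it lives in coordinates...
    coords={
        "time": pd.date_range("2020-06-01", periods=3, freq="16D"),
        "y": np.arange(5.0),
        "x": np.arange(5.0),
    },
    # ...and in attributes describing the platform, units, reference system, etc.
    attrs={"platform": "hypothetical-satellite", "units": "degC", "crs": "EPSG:4326"},
)
```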
[^mynote1]: "An image collection is a set of n images, where images contain m variables or spectral bands. Band data from one image share a common spatial footprint, acquisition date/time, and spatial reference system but may have different pixel sizes. Technically, the data of bands may come from one or more files, depending on the organization of a particular data product." {cite:p}`appel_2019_ondemand`
### *Consider an example*
We have a time series of [NDVI](https://www.usgs.gov/landsat-missions/landsat-normalized-difference-vegetation-index) imagery generated from a stack of Landsat scenes. Before a user accesses a satellite imagery dataset, it has likely already undergone many levels of processing, transformation, and reorganization. For more background on these steps, see {cite:t}`montero_2024_EarthSystemData`, *Section 3: 'The Earth System Data Cube Life cycle'*.
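As a reminder of what that derived quantity is, NDVI is computed per pixel from the red and near-infrared surface reflectance bands as (NIR - Red) / (NIR + Red). A minimal sketch of that calculation is shown below; it assumes each Landsat scene has been loaded as an `xarray` Dataset with band variables named `"red"` and `"nir"` (the actual band names depend on the product):

```python
import xarray as xr

def compute_ndvi(scene: xr.Dataset) -> xr.DataArray:
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    ndvi = (scene["nir"] - scene["red"]) / (scene["nir"] + scene["red"])
    return ndvi.rename("ndvi")
```

Repeating this for every scene and stacking the results along a time dimension is what produces an NDVI time series like the one described here.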
In this example, we're accessing the dataset at a common dissemination point, an 'image collection'[^mynote1]. In the image collection, each satellite image contains information such as the following:
- Acquisition date,
- X-coordinate values,
- Y-coordinate values,
```{epigraph}
CEOS Analysis Ready Data (CEOS-ARD) are satellite data that have been processed to a minimum set of requirements and organized into a form that allows immediate analysis with a minimum of additional user effort and interoperability both through time and with other datasets.

-- Committee on Earth Observation Satellites ([CEOS](https://ceos.org/ard/index.html)) Analysis-Ready Data {cite:p}`lewis_2018_CEOSAnalysisReady`
```
The development and increasing adoption of analysis-ready specifications for satellite imagery datasets represent an exciting and transformative opportunity to increase the use of earth observation data.
The tutorials in this book contain examples of data at varying degrees of 'analysis-readiness'. [Tutorial 1](../tutorial1/itslive_intro.md) uses a dataset of multi-sensor observations that is already organized as an `(x,y,time)` cube on a common grid. In [Tutorial 2](../tutorial2/s1_intro.md), we will see an example of a dataset that has undergone intensive processing to make it 'analysis-ready' but requires further manipulation to arrive at the `(x,y,time)` cube format that will be easiest to work with.
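That further manipulation often boils down to stacking per-scene arrays along a new time dimension. A rough sketch of the pattern (assuming a list of 2-D `(y, x)` DataArrays on a common grid, each carrying a scalar `time` coordinate) might look like:

```python
import xarray as xr

def stack_scenes(scenes: list[xr.DataArray]) -> xr.DataArray:
    """Combine per-acquisition rasters into a single (time, y, x) cube."""
    cube = xr.concat(scenes, dim="time")  # scalar time coords become the new time dimension
    return cube.sortby("time")            # keep the time axis in chronological order
```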