Commit 3a9e3e7

Merge pull request #249 from s22s/feature/courtney-edits
Feature/courtney edits
2 parents 3426ac6 + e8e250b commit 3a9e3e7

11 files changed, +142 -92 lines changed

Lines changed: 15 additions & 17 deletions
Original file line number | Diff line number | Diff line change
@@ -1,14 +1,18 @@
11
# Concepts
22

3-
There are a number of Earth-observation (EO) concepts that crop up in the discussion of RasterFrames features. We'll cover these briefly in the sections below. However, here are a few links providing a more extensive introduction to working with Earth observation data.
3+
There are a number of Earth-observation (EO) concepts that crop up in the discussion of RasterFrames features. We'll cover these briefly in the sections below. However, here are a few links providing a more extensive introduction to working with Earth observation data.
44

55
* [_Fundamentals of Remote Sensing_](https://www.nrcan.gc.ca/maps-tools-and-publications/satellite-imagery-and-air-photos/tutorial-fundamentals-remote-sensing/9309)
66
* [_Newcomers Earth Observation Guide_](https://business.esa.int/newcomers-earth-observation-guide)
77
* [_Earth Observation Markets and Applications_](https://www.ofcom.org.uk/__data/assets/pdf_file/0021/82047/introduction_eo_for_ofcom_june_2015_no_video.pdf)
88

9+
## Raster
10+
11+
A raster is a regular grid of numeric values. A raster can be thought of as an image, as is the case if the values in the grid represent brightness along a greyscale. More generally a raster can measure many different phenomena or encode a variety of different discrete classifications.
12+
913
## Cell
1014

11-
A cell is a single sample from a sensor encoded as a scalar value asssociated with a specific spatiotemporal location and time. It can be thought of as an image pixel associated with a place and time.
15+
A cell is a single row and column intersection in the raster grid. It is a single pixel in an image. A cell's value often represents one sample from a sensor encoded as a scalar value associated with a specific location and time.
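The raster and cell definitions above can be sketched with plain Python lists. This is only an illustration of the concepts, not how RasterFrames stores data; the grid values and the `cell_value` helper are made up for the example.

```python
# A tiny 3x4 raster held as a list of rows.
# The values are illustrative greyscale brightness samples.
raster = [
    [0, 64, 128, 255],
    [12, 80, 140, 250],
    [30, 90, 160, 240],
]


def cell_value(grid, row, col):
    """Return the value at one row/column intersection, i.e. one cell."""
    return grid[row][col]


n_rows, n_cols = len(raster), len(raster[0])
value = cell_value(raster, 1, 2)  # second row, third column
```

Here `value` is the single scalar stored in the cell at row 1, column 2 of the grid.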
1216

1317
## Cell Type
1418

@@ -18,7 +22,7 @@ A numeric cell value may be encoded in a number of different computer numeric fo
1822
* integral vs floating-point
1923

2024

21-
The cell types most frequent in RasterFrames are as follows:
25+
The most frequently encountered cell types in RasterFrames are below.
2226

2327
| Name | Abbreviation | Description | Range |
2428
| --- | --- | --- | --- |
@@ -31,38 +35,32 @@ The cell types most frequent in RasterFrames are as follows:
3135
| Float | `float32` | 32-bit floating-point | -3.4028235E38 to 3.4028235E38 |
3236
| Double | `float64` | 64-bit floating-point | -1.7976931348623157E308 to 1.7976931348623157E308 |
3337

34-
See the section on [“NoData” Handling](nodata-handling.md) for additional discussion on cell types.
38+
See the section on [“NoData” Handling](nodata-handling.md) for additional discussion on cell types and more exhaustive coverage of available cell types.
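The ranges in the table follow directly from the bit width and encoding of each cell type. The sketch below recomputes them with the standard library; the `int_range` helper and the particular integer widths are assumptions for illustration, since only the two floating-point rows are visible in this excerpt.

```python
import sys


def int_range(bits, signed):
    """Return (min, max) for an integer cell type of the given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


# Largest finite IEEE 754 single-precision value: (2 - 2^-23) * 2^127.
FLOAT32_MAX = (2 - 2 ** -23) * 2.0 ** 127

# Python floats are IEEE 754 double precision, so this is the float64 maximum.
FLOAT64_MAX = sys.float_info.max

uint8_range = int_range(8, signed=False)
int16_range = int_range(16, signed=True)
```

`FLOAT32_MAX` and `FLOAT64_MAX` agree with the upper bounds listed for `float32` and `float64` above, up to the rounding used in the table.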
3539

3640
## NoData
3741

38-
A "NoData" (or N/A) value is a specifically identified value for a cell type used to indicate the absence of data. See the section on @ref:[“NoData” Handling](nodata-handling.md) for additional discussion on NoData
42+
A "NoData" (or N/A) value is a specifically identified value for a cell type used to indicate the absence of data. See the section on @ref:[“NoData” Handling](nodata-handling.md) for additional discussion on "NoData".
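As a rough sketch of the idea, a "NoData" value is a sentinel that computations skip. The sentinel `-9999` and the helper name below are hypothetical choices for this example; they are not taken from RasterFrames.

```python
NODATA = -9999  # hypothetical sentinel chosen for this example


def mean_ignoring_nodata(cells, nodata=NODATA):
    """Average all cell values, skipping cells that hold the NoData sentinel."""
    valid = [v for row in cells for v in row if v != nodata]
    return sum(valid) / len(valid) if valid else None


tile = [
    [10, 20],
    [NODATA, 30],
]
average = mean_ignoring_nodata(tile)
```

Only the three valid cells contribute to `average`; a tile made up entirely of NoData cells yields `None`.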
3943

4044
## Scene
4145

42-
A scene (or granule) is a discrete instance of EO data with a specific extent (region), date-time, and projection/CRS.
46+
A scene (or granule) is a discrete instance of EO @ref:[raster data](concepts.md#raster) with a specific extent (region), date-time, and map projection (or CRS).
4347

4448
## Coordinate Reference System (CRS)
4549

46-
A coordinate reference system (or spatial reference system) is a set of mathematical constructs used to map cells to specific locations on the Earth (or other surface). A CRS typcially accompanies any EO data so it can be precicely located.
50+
A [coordinate reference system (or spatial reference system)][CRS] is a set of mathematical constructs used to translate locations on the three-dimensional surface of the earth to the two dimensional raster grid. A CRS typically accompanies any EO data so it can be precisely located.
4751

4852
## Extent
4953

50-
An extent (or bounding box) is a rectangular region specifying the geospatial coverage of a two-dimensional array of cells in a singular CRS.
54+
An extent (or bounding box) is a rectangular region specifying the geospatial coverage of a @ref:[raster](concepts.md#raster) or @ref:[tile](concepts.md#tile), a two-dimensional array of @ref:[cells](concepts.md#cell) within a single CRS.
5155

5256
## Tile
5357

54-
A tile (sometimes called a "chip") is a rectangular subset of a @ref:[scene](concepts.md#scene). A tile can conceptually be though of as a two-dimensional array.
58+
A tile (sometimes called a "chip") is a rectangular subset of a @ref:[scene](concepts.md#scene). As a scene is a raster, a tile is also a raster. A tile can conceptually be thought of as a two-dimensional array.
5559

5660
Some EO data has many bands or channels. Tiles in this context are conceptually a three-dimensional array, with the extra dimension representing the bands.
5761

58-
Tiles are often square and the dimensions are some power of two, for example 256 by 256.
62+
Tiles are often square and the dimensions are some power of two, for example 256 by 256.
5963

6064
The tile is the primary discretization unit used in RasterFrames. Each band of a scene is in a separate column. The scene's overall @ref:[extent](concepts.md#extent) is carved up into smaller extents for each tile. Each row of the DataFrame contains a two-dimensional tile per band column.
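The carving of a scene's extent into per-tile extents can be sketched as follows. This is a minimal illustration that assumes a north-up scene whose row 0 lies along the top edge; the `Extent` tuple and `tile_extents` function are names invented for the example, not RasterFrames API.

```python
from typing import List, NamedTuple


class Extent(NamedTuple):
    """Bounding box in the units of a single CRS."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def tile_extents(scene: Extent, cols: int, rows: int, tile_size: int = 256) -> List[Extent]:
    """Split a scene of cols x rows cells into extents of tile_size x tile_size tiles."""
    cell_w = (scene.xmax - scene.xmin) / cols
    cell_h = (scene.ymax - scene.ymin) / rows
    extents = []
    for r0 in range(0, rows, tile_size):
        for c0 in range(0, cols, tile_size):
            # Edge tiles may be smaller than tile_size.
            r1 = min(r0 + tile_size, rows)
            c1 = min(c0 + tile_size, cols)
            extents.append(Extent(
                scene.xmin + c0 * cell_w,
                scene.ymax - r1 * cell_h,  # rows count downward from the top edge
                scene.xmin + c1 * cell_w,
                scene.ymax - r0 * cell_h,
            ))
    return extents


# A 512 x 512 cell scene at 10 units per cell yields four 256 x 256 tiles.
scene = Extent(0.0, 0.0, 5120.0, 5120.0)
tiles = tile_extents(scene, cols=512, rows=512)
```

Each returned extent would accompany one row of the DataFrame, alongside that row's tile in every band column.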
6165

62-
## Projected Extent
63-
64-
An extent paired with a CRS
65-
66-
## Projected Raster
67-
68-
A tile or scene paired with a CRS and extent.
66+
[CRS]: https://en.wikipedia.org/wiki/Spatial_reference_system
Lines changed: 13 additions & 12 deletions
@@ -1,38 +1,39 @@
11
# Overview
22

3-
RasterFrames provides a DataFrame-centric view over arbitrary EO data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of Spark ML algorithms. It provides APIs in @ref:[Python, SQL, and Scala](languages.md), and can horizontally scale from a laptop to a supercomputer, enabling _global_ analysis with satellite imagery in a wholly new, flexible and convenient way.
3+
RasterFrames® provides a DataFrame-centric view over arbitrary Earth-observation (EO) data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of [Apache Spark](https://spark.apache.org/docs/latest/) [ML](https://spark.apache.org/docs/latest/ml-guide.html) algorithms. It provides APIs in @ref:[Python, SQL, and Scala](languages.md), and can scale from a laptop to a large distributed cluster, enabling _global_ analysis with satellite imagery in a wholly new, flexible and convenient way.
44

55
## Context
66

7-
We have a millennia-long history of organizing information in tabular form. Typically, rows represent independent events or observations, and columns represent measurements from the observations. The forms have evolved, from hand-written agricultural records and transaction ledgers, to the advent of spreadsheets on the personal computer, and on to the creation of the _DataFrame_ data structure as found in [R Data Frames][R] and [Python Pandas][Pandas]. The table-oriented data structure remains a common and critical component of organizing data across industries, and is the mental model employed by many data scientists across diverse forms of modeling and analysis.
7+
We have a millennia-long history of organizing information in tabular form. Typically, rows represent independent events or observations, and columns represent attributes and measurements from the observations. The forms have evolved, from hand-written agricultural records and transaction ledgers, to the advent of spreadsheets on the personal computer, and on to the creation of the _DataFrame_ data structure as found in [R Data Frames][R] and [Python Pandas][Pandas]. The table-oriented data structure remains a common and critical component of organizing data across industries, and is the mental model employed by many data scientists across diverse forms of modeling and analysis.
88

9-
Today, DataFrames are the _lingua franca_ of data science. The evolution of the tabular form has continued with Apache Spark SQL, which brings DataFrames to the big data distributed compute space. Through several novel innovations, Spark SQL enables interactive and batch-oriented cluster computing without having to be versed in the highly specialized skills typically required for high-performance computing. As suggested by the name, these DataFrames are manipulatable via standard SQL, as well as the more general-purpose programming languages Python, R, Java, and Scala.
9+
The evolution of the DataFrame form has continued with [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html), which brings DataFrames to the big data distributed compute space. Through several novel innovations, Spark SQL enables data scientists to work with DataFrames too large for the memory of a single computer. As suggested by the name, these DataFrames are manipulatable via standard SQL, as well as the more general-purpose programming languages Python, R, Java, and Scala.
1010

11-
RasterFrames®, an incubating Eclipse Foundation LocationTech project, brings together Earth-observing (EO) data analysis, big data computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity as well as a challenge to the data analysis community. It is _Big Data_ in the truest sense, and its footprint is rapidly getting bigger. According to a World Bank document on assets for post-disaster situation awareness[^1]:
11+
RasterFrames, an incubating Eclipse Foundation LocationTech project, brings together EO data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity as well as a challenge to the data analysis community. It is _Big Data_ in the truest sense, and its footprint is rapidly getting bigger. According to a World Bank document on assets for post-disaster situation awareness[^1]:
1212

1313
> Of the 1,738 operational satellites currently orbiting the earth (as of 9/[20]17), 596 are earth observation satellites and 477 of these are non-military assets (ie available to civil society including commercial entities and governments for earth observation, according to the Union of Concerned Scientists). This number is expected to increase significantly over the next ten years. The 200 or so planned remote sensing satellites have a value of over 27 billion USD (Forecast International). This estimate does not include the burgeoning fleets of smallsats as well as micro, nano and even smaller satellites... All this enthusiasm has, not unexpectedly, led to a veritable fire-hose of remotely sensed data which is becoming difficult to navigate even for seasoned experts.
1414
1515
## Benefit
1616

17-
By using DataFrames as the core cognitive and compute data model for processing EO data, RasterFrames is able to deliver sophisticated computational and algorithmic capabilities in a tabular form that is familiar and accessible to the general computing public. Because it is built on Apache Spark, solutions prototyped on a laptop can be scaled to run on cluster and cloud compute resources in a way not easily achieved with other toolchains.
17+
By using DataFrames as the core cognitive and compute data model for processing EO data, RasterFrames is able to deliver sophisticated computational and algorithmic capabilities in a tabular form that is familiar and accessible to the general computing public. Because it is built on Apache Spark, solutions prototyped on a laptop can be easily scaled to run on cluster and cloud compute resources. Apache Spark also provides integration between its DataFrame libraries and machine learning, with which RasterFrames is fully compatible.
1818

1919
## Architecture
2020

21-
RasterFrames takes the Spark SQL DataFrame and extends it to support standard EO operations. It does this with the help of several other LocationTech projects:
21+
RasterFrames builds upon several other LocationTech projects:
2222
[GeoTrellis](https://geotrellis.io/), [GeoMesa](https://www.geomesa.org/),
2323
[JTS](https://github.com/locationtech/jts), and
24-
[SFCurve](https://github.com/locationtech/sfcurve) (see below).
24+
[SFCurve](https://github.com/locationtech/sfcurve).
2525

2626
![LocationTech Stack](static/rasterframes-locationtech-stack.png)
2727

28-
RasterFrames introduces georectified raster imagery to Spark SQL. It quantizes scenes into chunks called "tiles". Each tile contains a 2-D matrix of "cell" (pixel) values along with information on how to numerically interpret those cells. As shown in the figure below, a "RasterFrame" is a Spark DataFrame with one or more columns of type `tile`. A `tile` column typically represents a single frequency band of sensor data, such as "blue" or "near infrared", but can also be quality assurance information, land classification assignments, or any other rasterized spatiotemporal data. Along with `tile` columns there is typically an `extent` specifying the geographic location of the data, the map projection of that geometry (`crs`), and a `timestamp` column representing the acquisition time. These columns can all be used in the `WHERE` clause when filtering
28+
RasterFrames introduces georectified raster imagery to Spark SQL. It quantizes scenes into chunks called @ref:[_tiles_](concepts.md#tile). Each tile contains a 2-D matrix of @ref:[_cell_](concepts.md#cell) or pixel values along with information on how to numerically interpret those cells.
2929

30-
RasterFrames also includes support for working with vector data, such as [GeoJSON][GeoJSON]. You can use vector data to filter DataFrame rows, using geospatial predicates (e.g. contains, intersects, overlaps, etc.), to mask cells, and to be rasterzied into training data appropriate for machine learning.
30+
As shown in the figure below, a "RasterFrame" is a Spark DataFrame with one or more columns of type @ref:[`tile`](concepts.md#tile). A `tile` column typically represents a single frequency band of sensor data, such as "blue" or "near infrared", but can also be quality assurance information, land classification assignments, or any other raster spatial data. Along with `tile` columns there is typically an @ref:[`extent`](concepts.md#extent) specifying the geographic location of the data, the map projection of that geometry (@ref:[`crs`](concepts.md#coordinate-reference-system--crs-)), and a `timestamp` column representing the acquisition time. These columns can all be used in the `WHERE` clause when filtering.
3131

32+
@@include[RasterFrame Example](static/rasterframe-sample.md)
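The column layout described above can be mimicked in plain Python to show what filtering on `extent` and `timestamp` means. This is only a conceptual sketch: the rows, the `intersects` helper, and the area of interest are invented for the example, and a real RasterFrame would express the same filter through Spark SQL.

```python
from datetime import datetime

# Each dict stands in for one DataFrame row: a tile column ("blue") plus
# extent, crs, and timestamp columns. All values are illustrative.
rows = [
    {"scene": "a", "extent": (0, 0, 10, 10), "crs": "EPSG:32618",
     "timestamp": datetime(2018, 7, 1), "blue": [[1, 2], [3, 4]]},
    {"scene": "b", "extent": (10, 0, 20, 10), "crs": "EPSG:32618",
     "timestamp": datetime(2018, 7, 1), "blue": [[5, 6], [7, 8]]},
    {"scene": "c", "extent": (0, 0, 10, 10), "crs": "EPSG:32618",
     "timestamp": datetime(2017, 1, 1), "blue": [[9, 9], [9, 9]]},
]


def intersects(a, b):
    """True when two (xmin, ymin, xmax, ymax) boxes overlap."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


area_of_interest = (2, 2, 8, 8)
since = datetime(2018, 1, 1)

# Analogous to a WHERE clause over the extent and timestamp columns.
selected = [
    row for row in rows
    if intersects(row["extent"], area_of_interest) and row["timestamp"] >= since
]
selected_scenes = [row["scene"] for row in selected]
```

Only rows whose extent overlaps the area of interest and whose acquisition time falls on or after `since` remain in `selected`.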
3233

33-
![RasterFrame Anatomy](static/rasterframe-anatomy.png)
34+
RasterFrames also includes support for working with vector data, such as [GeoJSON][GeoJSON]. RasterFrames vector data operations let you filter with geospatial relationships like contains or intersects, mask cells, convert vectors to rasters, and more.
3435

35-
Raster data can be read from a number of sources. Through the flexible Spark SQL DataSource API, RasterFrames can be constructed from collections of georectified imagery (including Cloud Optimized GeoTIFFs or [COGS][COGS]), [GeoTrellis Layers][GTLayer], and from catalog of Landsat 8 and MODIS data sets on the [Amazon Web Services (AWS) Public Data Set (PDS)][PDS]. See @ref:[Raster Data I/O](raster-io.md) for details.
36+
Raster data can be read from a @ref:[number of sources](raster-io.md). Through the flexible Spark SQL DataSource API, RasterFrames can be constructed from collections of imagery (including Cloud Optimized GeoTIFFs or [COGS][COGS]), [GeoTrellis Layers][GTLayer], and from catalogs of large datasets like Landsat 8 and MODIS data sets on the @ref:[AWS Public Data Set (PDS)](raster-catalogs.md#using-external-catalogs).
3637

3738
[R]:https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/data.frame
3839
[Pandas]:https://pandas.pydata.org/
@@ -42,4 +43,4 @@ Raster data can be read from a number of sources. Through the flexible Spark SQL
4243
[COGS]:https://www.cogeo.org/
4344

4445
[^1]: [_Demystifying Satellite Assets for Post-Disaster Situation Awareness_](https://docs.google.com/document/d/11bIw5HcEiZy8SKli6ZFQC2chVEiiIJ-f0o6btA4LU48).
45-
World Bank via [OpenDRI.org](https://opendri.org/resource/demystifying-satellite-assets-for-post-disaster-situation-awareness/). Accessed November 28, 2018.
46+
World Bank via [OpenDRI.org](https://opendri.org/resource/demystifying-satellite-assets-for-post-disaster-situation-awareness/). Accessed November 28, 2018.

pyrasterframes/src/main/python/docs/getting-started.pymd

Lines changed: 1 addition & 1 deletion
@@ -110,7 +110,7 @@ To support GeoTIFF and JPEG2000 formats, you should look for the following drive
110110

111111
Do the following to see if RasterFrames was able to find GDAL:
112112

113-
```python
113+
```python, evaluate=False
114114
from pyrasterframes.utils import gdal_version
115115
print(gdal_version())
116116
```

pyrasterframes/src/main/python/docs/index.md

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,8 @@
11
# RasterFrames
22

3-
RasterFrames® brings together Earth-observing (EO) data analysis, big data computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity as well as a challenge to the data analysis community. It is _Big Data_ in the truest sense, and its footprint is rapidly getting bigger.
3+
RasterFrames® brings together Earth-observation (EO) data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity as well as a challenge to the data analysis community. It is _Big Data_ in the truest sense, and its footprint is rapidly getting bigger.
44

5-
RasterFrames provides a DataFrame-centric view over arbitrary EO data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of Spark ML algorithms. By using DataFrames as the core cognitive and compute data model, it is able to deliver these features in a form that is accessible to general analysts while handling the rapidly growing data footprint.
5+
RasterFrames provides a DataFrame-centric view over arbitrary raster data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of Spark ML algorithms. By using DataFrames as the core cognitive and compute data model, it is able to deliver these features in a form that is both accessible to general analysts and scalable along with the rapidly growing data footprint.
66

77
To learn more, please see the @ref:[Getting Started](getting-started.md) section of this manual.
88
