
Commit ade36ab

Merge pull request #396 from s22s/feature/docs-intro-update
Updated intro section.
2 parents f226ff0 + 3f87e34 commit ade36ab

2 files changed: +76 -48 lines changed


pyrasterframes/src/main/python/docs/index.md

Lines changed: 11 additions & 5 deletions
@@ -2,15 +2,21 @@

 RasterFrames® brings together Earth-observation (EO) data access, cloud computing, and DataFrame-based data science. The recent explosion of EO data from public and private satellite operators presents both a huge opportunity and a huge challenge to the data analysis community. It is _Big Data_ in the truest sense, and its footprint is rapidly getting bigger.

-RasterFrames provides a DataFrame-centric view over arbitrary raster data, enabling spatiotemporal queries, map algebra raster operations, and compatibility with the ecosystem of Spark ML algorithms. By using DataFrames as the core cognitive and compute data model, it is able to deliver these features in a form that is both accessible to general analysts and scalable along with the rapidly growing data footprint.
+RasterFrames provides a DataFrame-centric view over arbitrary geospatial raster data, enabling spatiotemporal queries, map algebra raster operations, and interoperability with Spark ML. By using the DataFrame as the core cognitive and compute data model, RasterFrames is able to deliver an extensive set of functionality in a form that is both horizontally scalable and familiar to general analysts and data scientists. It provides APIs for Python, SQL, and Scala.

-To learn more, please see the @ref:[Getting Started](getting-started.md) section of this manual.
+![RasterFrames](static/rasterframes-pipeline-nologo.png)

-The source code can be found on GitHub at [locationtech/rasterframes](https://github.com/locationtech/rasterframes).
+Through its custom [Spark DataSource](https://rasterframes.io/raster-read.html), RasterFrames can read various raster formats -- including GeoTIFF, JP2000, MRF, and HDF -- from an [array of services](https://rasterframes.io/raster-read.html#uri-formats) such as HTTP, FTP, HDFS, S3, and WASB. It also supports reading the vector formats GeoJSON and WKT/WKB. RasterFrame contents can be filtered, transformed, summarized, resampled, and rasterized through [200+ raster and vector functions](https://rasterframes.io/reference.html).
+
+As part of the LocationTech family of projects, RasterFrames builds upon the strong foundations provided by GeoMesa (spatial operations), GeoTrellis (raster operations), JTS (geometry modeling), and SFCurve (spatiotemporal indexing), integrating various aspects of these projects into a unified, DataFrame-centric analytics package.
+
+![](static/rasterframes-locationtech-stack.png)

-RasterFrames is released under the [Apache 2.0 License](https://github.com/locationtech/rasterframes/blob/develop/LICENSE).
+RasterFrames is released under the commercial-friendly [Apache 2.0](https://github.com/locationtech/rasterframes/blob/develop/LICENSE) open source license.

-![RasterFrames](static/rasterframes-pipeline.png)
+To learn more, please see the @ref:[Getting Started](getting-started.md) section of this manual.
+
+The source code can be found on GitHub at [locationtech/rasterframes](https://github.com/locationtech/rasterframes).

 <hr/>

pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 65 additions & 43 deletions
@@ -14,7 +14,7 @@ RasterFrames registers a DataSource named `raster` that enables reading of GeoTI

 RasterFrames can also read from @ref:[GeoTrellis catalogs and layers](raster-read.md#geotrellis).

-## Single Raster
+## Single Rasters

 The simplest way to use the `raster` reader is with a single raster from a single URI or file. In the examples that follow we'll be reading from a Sentinel-2 scene stored in an AWS S3 bucket.
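Editor's note: the read call itself is unchanged by this commit and so is elided from the diff above. As a rough, hypothetical sketch of what reading a single raster with the `raster` reader looks like (the URI below is a placeholder, not taken from the commit):

```python
# A minimal sketch, not part of the commit: read one cloud-hosted GeoTIFF
# with the `raster` DataSource and inspect each tile's CRS and extent.
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_crs, rf_extent

spark = create_rf_spark_session()

uri = 'https://example.com/sentinel2/B01.tif'  # placeholder URI
rf = spark.read.raster(uri)

# Each row carries a `proj_raster` tile cut from the source image.
rf.select(rf_crs('proj_raster'), rf_extent('proj_raster')).show(3, truncate=False)
```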

@@ -33,14 +33,12 @@ print("CRS", crs.value.crsProj4)
 ```

 ```python, raster_parts
-parts = rf.select(
+rf.select(
     rf_extent("proj_raster").alias("extent"),
     rf_tile("proj_raster").alias("tile")
 )
-parts
 ```

-
 You can also see that the single raster has been broken out into many arbitrary non-overlapping regions. Doing so takes advantage of parallel in-memory reads from the cloud hosted data source and allows Spark to work on manageable amounts of data per task. The following code fragment shows us how many subtiles were created from a single source image.

 ```python, count_by_uri
@@ -55,6 +53,69 @@ tile = rf.select(rf_tile("proj_raster")).first()[0]
 display(tile)
 ```

+## Multiple Singleband Rasters
+
+In this example, we show reading @ref:[two bands](concepts.md#band) of [Landsat 8](https://landsat.gsfc.nasa.gov/landsat-8/) imagery (red and near-infrared), combining them with `rf_normalized_difference` to compute [NDVI](https://en.wikipedia.org/wiki/Normalized_difference_vegetation_index), a common measure of vegetation health. As described in the section on @ref:[catalogs](raster-catalogs.md), image URIs in a single row are assumed to be from the same scene/granule, and therefore compatible. This pattern is commonly used when multiple bands are stored in separate files.
+
+```python, multi_singleband
+bands = [f'B{b}' for b in [4, 5]]
+uris = [f'https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/014/032/LC08_L1TP_014032_20190720_20190731_01_T1/LC08_L1TP_014032_20190720_20190731_01_T1_{b}.TIF' for b in bands]
+catalog = ','.join(bands) + '\n' + ','.join(uris)
+
+rf = (spark.read.raster(catalog, bands)
+    # Adding semantic names
+    .withColumnRenamed('B4', 'red').withColumnRenamed('B5', 'NIR')
+    # Adding tile center point for reference
+    .withColumn('longitude_latitude', st_reproject(st_centroid(rf_geometry('red')), rf_crs('red'), lit('EPSG:4326')))
+    # Compute NDVI
+    .withColumn('NDVI', rf_normalized_difference('NIR', 'red'))
+    # For the purposes of inspection, filter out rows where there's not much vegetation
+    .where(rf_tile_sum('NDVI') > 10000)
+    # Order output
+    .select('longitude_latitude', 'red', 'NIR', 'NDVI'))
+display(rf)
+```
+
+## Multiband Rasters
+
+A multiband raster is represented by a three dimensional numeric array stored in a single file. The first two dimensions are spatial, and the third dimension is typically designated for different spectral @ref:[bands](concepts.md#band). The bands could represent intensity of different wavelengths of light (or other electromagnetic radiation), or they could measure other phenomena such as time, quality indications, or additional gas concentrations, etc.
+
+Multiband raster files have a strictly ordered set of bands, which are typically indexed from 1. Some files have metadata tags associated with each band. Some files have a color interpretation metadata tag indicating how to interpret the bands.
+
+When reading a multiband raster or a @ref:[_catalog_](#raster-catalogs) describing multiband rasters, you will need to know ahead of time which bands you want to read. You will specify the bands to read, **indexed from zero**, as a list of integers in the `band_indexes` parameter of the `raster` reader.
+
+For example, we can read a four-band (red, green, blue, and near-infrared) image as follows. The individual rows of the resulting DataFrame still represent distinct spatial extents, with a projected raster column for each band specified by `band_indexes`.
+
+```python, multiband
+mb = spark.read.raster(
+    's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
+    band_indexes=[0, 1, 2, 3],
+)
+display(mb)
+```
+
+If a band index passed to `band_indexes` exceeds the number of bands in the raster, a projected raster column will still be generated in the schema, but the column will be full of `null` values.
+
+You can also pass a _catalog_ and `band_indexes` together into the `raster` reader. This will create a projected raster column for the combination of all items in `catalog_col_names` and `band_indexes`. Again, if a band in `band_indexes` exceeds the number of bands in a raster, it will have a `null` value for the corresponding column.
+
+Here is a trivial example with a _catalog_ over multiband rasters. We specify two columns containing URIs and two bands, resulting in four projected raster columns.
+
+```python, multiband_catalog
+import pandas as pd
+mb_cat = pd.DataFrame([
+    {'foo': 's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
+     'bar': 's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif'
+    },
+])
+mb2 = spark.read.raster(
+    spark.createDataFrame(mb_cat),
+    catalog_col_names=['foo', 'bar'],
+    band_indexes=[0, 1],
+    tile_dimensions=(64, 64)
+)
+mb2.printSchema()
+```
+
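Editor's aside, not part of the commit: the out-of-range `band_indexes` behavior described above can be checked directly. The sketch below assumes the per-band column naming pattern `proj_raster_b<N>` (verify with `printSchema()`); both that naming and the band count of the file are assumptions to confirm against your build.

```python
# Hypothetical illustration of requesting a band index beyond what the file provides.
# Assumes the reader names per-band columns `proj_raster_b<N>`; check printSchema() first.
oob = spark.read.raster(
    's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
    band_indexes=[0, 7]  # NAIP imagery has 4 bands, so index 7 is out of range
)
oob.printSchema()  # both columns still appear in the schema
# Expected to print 0: every value in the out-of-range column is null.
print(oob.where(oob.proj_raster_b7.isNotNull()).count())
```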
 ## URI Formats

 RasterFrames relies on three different I/O drivers, selected based on a combination of scheme, file extensions, and library availability. GDAL is used by default if a compatible version of GDAL (>= 2.4) is installed, and if GDAL supports the specified scheme. If GDAL is not available, either the _Java I/O_ or _Hadoop_ driver will be selected, depending on scheme.
@@ -154,45 +215,6 @@ non_lazy

 In the initial examples on this page, you may have noticed that the realized (non-lazy) _tiles_ are shown, but we did not change `lazy_tiles`. Instead, we used @ref:[`rf_tile`](reference.md#rf-tile) to explicitly request the realized _tile_ from the lazy representation.
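Editor's sketch, not part of the commit: the two routes to realized tiles mentioned above are either keeping the default lazy read and calling `rf_tile` where realized cells are needed, or opting out of laziness at read time via the `lazy_tiles` option referenced in the surrounding documentation. The DataFrame `rf` and the URI are assumed from the earlier single-raster example.

```python
# Two ways to obtain realized (in-memory) tiles, assuming `rf` from the earlier read.
from pyrasterframes.rasterfunctions import rf_tile

# 1. Keep the default lazy read and realize cells on demand with rf_tile.
realized = rf.select(rf_tile('proj_raster').alias('tile'))
realized.show(2)

# 2. Ask the reader to realize tiles up front (lazy_tiles defaults to True).
eager = spark.read.raster('https://example.com/sentinel2/B01.tif',  # placeholder URI
                          lazy_tiles=False)
```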

-## Multiband Rasters
-
-A multiband raster represents a three dimensional numeric array. The first two dimensions are spatial, and the third dimension is typically designated for different spectral @ref:[bands](concepts.md#band). The bands could represent intensity of different wavelengths of light (or other electromagnetic radiation), or they could measure other phenomena such as time, quality indications, or additional gas concentrations, etc.
-
-Multiband rasters files have a strictly ordered set of bands, which are typically indexed from 1. Some files have metadata tags associated with each band. Some files have a color interpetation metadata tag indicating how to interpret the bands.
-
-When reading a multiband raster or a _catalog_ describing multiband rasters, you will need to know ahead of time which bands you want to read. You will specify the bands to read, **indexed from zero**, as a list of integers into the `band_indexes` parameter of the `raster` reader.
-
-For example, we can read a four-band (red, green, blue, and near-infrared) image as follows. The individual rows of the resulting DataFrame still represent distinct spatial extents, with a projected raster column for each band specified by `band_indexes`.
-
-```python, multiband
-mb = spark.read.raster(
-    's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
-    band_indexes=[0, 1, 2, 3],
-)
-mb.printSchema()
-```
-
-If a band is passed into `band_indexes` that exceeds the number of bands in the raster, a projected raster column will still be generated in the schema but the column will be full of `null` values.
-
-You can also pass a _catalog_ and `band_indexes` together into the `raster` reader. This will create a projected raster column for the combination of all items in `catalog_col_names` and `band_indexes`. Again if a band in `band_indexes` exceeds the number of bands in a raster, it will have a `null` value for the corresponding column.
-
-Here is a trivial example with a _catalog_ over multiband rasters. We specify two columns containing URIs and two bands, resulting in four projected raster columns.
-
-```python, multiband_catalog
-import pandas as pd
-mb_cat = pd.DataFrame([
-    {'foo': 's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif',
-     'bar': 's3://s22s-test-geotiffs/naip/m_3807863_nw_17_1_20160620.tif'
-    },
-])
-mb2 = spark.read.raster(
-    spark.createDataFrame(mb_cat),
-    catalog_col_names=['foo', 'bar'],
-    band_indexes=[0, 1],
-    tile_dimensions=(64,64)
-)
-mb2.printSchema()
-```

 ## GeoTrellis
