
Commit 3ffc0ca

Zonal stats page and other refactoring (#342)
* zonal map algebra doc page
* proper rendering of folium.Map in docs
* Simplify time series doc by ref to zonal map algebra
* Remove local path in docs, favoring https urls to improve portability in rf-notebook

Signed-off-by: Jason T. Brown <[email protected]>
1 parent 78bfe71 commit 3ffc0ca

File tree: 9 files changed, 216 additions & 112 deletions


pyrasterframes/src/main/python/docs/__init__.py

Lines changed: 1 addition & 4 deletions
@@ -19,6 +19,7 @@
 #
 
 import os
+
 from pweave import PwebPandocFormatter
 
 
@@ -36,10 +37,6 @@ def resource_dir():
     return test_resource
 
 
-def resource_dir_uri():
-    return 'file://' + resource_dir()
-
-
 class PegdownMarkdownFormatter(PwebPandocFormatter):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ np.set_printoptions(precision=3, floatmode='maxprec')
 spark = create_rf_spark_session()
 ```
 
-There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
+There are three types of aggregate functions available in RasterFrames: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_. In the latter two cases, when @ref:[vector data](vector-data.md) is the grouping column, the results are @ref:[zonal statistics](zonal-algebra.md).
 
 ## Tile Mean Example
 
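
Not part of the diff above, but as orientation for the paragraph it changes: a minimal sketch contrasting the three aggregate flavors, assuming a hypothetical DataFrame `rf` with a tile column named `tile`.

```python
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_agg_mean, rf_agg_local_mean

# Tile aggregate: one scalar per row, summarizing the cells of that row's tile
per_row = rf.select(rf_tile_mean('tile').alias('tile_mean'))

# DataFrame aggregate: one scalar summarizing all cells across all rows (or per group)
overall = rf.agg(rf_agg_mean('tile').alias('mean_of_all_cells'))

# Element-wise local aggregate: one tile whose cells are per-position means across rows
local = rf.agg(rf_agg_local_mean('tile').alias('local_mean_tile'))
```

Grouping the latter two by a vector-derived column is what the new zonal map algebra page describes as zonal statistics.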

pyrasterframes/src/main/python/docs/raster-processing.md

Lines changed: 2 additions & 1 deletion
@@ -2,8 +2,9 @@
 
 @@@ index
 
-* @ref:[Local Algebra](local-algebra.md)
+* @ref:[Local Map Algebra](local-algebra.md)
 * @ref:["NoData" Handling](nodata-handling.md)
+* @ref:[Zonal Map Algebra](zonal-algebra.md)
 * @ref:[Aggregation](aggregation.md)
 * @ref:[Time Series](time-series.md)
 * @ref:[Machine Learning](machine-learning.md)

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,6 @@ from pyspark.sql.functions import lit
 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
-from docs import resource_dir_uri
 
 import os
 
@@ -70,8 +69,10 @@ crses = df.select('crs.crsProj4').distinct().collect()
 print('Found ', len(crses), 'distinct CRS.')
 crs = crses[0][0]
 
-label_df = spark.read.geojson(
-    os.path.join(resource_dir_uri(), 'luray-labels.geojson')) \
+from pyspark import SparkFiles
+spark.sparkContext.addFile('https://github.com/locationtech/rasterframes/raw/develop/pyrasterframes/src/test/resources/luray-labels.geojson')
+
+label_df = spark.read.geojson(SparkFiles.get('luray-labels.geojson')) \
     .select('id', st_reproject('geometry', lit('EPSG:4326'), lit(crs)).alias('geometry')) \
     .hint('broadcast')
 
@@ -81,7 +82,6 @@ df_joined = df.join(label_df, st_intersects(st_geometry('extent'), 'geometry'))
 df_labeled = df_joined.withColumn('label',
     rf_rasterize('geometry', st_geometry('extent'), 'id', 'dims.cols', 'dims.rows')
 )
-
 ```
 
 ## Masking Poor Quality Cells
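
The replacement above swaps a local test resource for a remote GeoJSON staged through `SparkFiles`. A hedged sketch of that pattern in isolation, with a hypothetical URL and assuming an active RasterFrames-enabled `spark` session:

```python
from pyspark import SparkFiles

# addFile stages the remote file so it can later be read by its base name
labels_url = 'https://example.com/data/labels.geojson'  # hypothetical
spark.sparkContext.addFile(labels_url)

# SparkFiles.get resolves the staged file's local path for the geojson reader
label_df = spark.read.geojson(SparkFiles.get('labels.geojson'))
label_df.printSchema()
```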

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 24 additions & 79 deletions
@@ -9,60 +9,28 @@ import pyrasterframes
 from pyrasterframes.rasterfunctions import *
 import pyrasterframes.rf_ipython
 
-import folium
-
 from pyspark.sql.functions import udf, lit
 from geomesa_pyspark.types import MultiPolygonUDT
 
 # This job is more memory bound, so reduce the concurrent tasks.
 spark = pyrasterframes.get_spark_session("local[4]")
 ```
 
-In this example, we will show how the flexibility of the DataFrame concept for raster data allows a simple and intuitive way to extract a time series from Earth observation data. We will start with our @ref:[built-in MODIS data catalog](raster-catalogs.md#using-built-in-experimental-catalogs).
-
-```python catalog
-cat = spark.read.format('aws-pds-modis-catalog').load().repartition(200)
-cat.printSchema()
-```
-
-We will summarize the change in NDVI over 2018 in the Cuyahoga Valley National Park in Ohio, USA. First, we will retrieve open vector data delineating the park boundary from the US National Park Service's LandsNet.
+In this example, we will show how the flexibility of the DataFrame concept for raster data allows a simple and intuitive way to extract a time series from Earth observation data. We will continue our example from the @ref:[Zonal Map Algebra page](zonal-algebra.md).
 
-## Vector Data
+We will summarize the change in @ref:[NDVI](local-algebra.md#computing-ndvi) over the spring and early summer of 2018 in the Cuyahoga Valley National Park in Ohio, USA.
 
-First we will get the vector data from LandsNet service by a REST query. The data is saved to a geojson file.
-
-```python get_park_boundary
+```python vector, echo=False, results='hidden'
+cat = spark.read.format('aws-pds-modis-catalog').load().repartition(200)
 import requests
 nps_filepath = '/tmp/parks.geojson'
 nps_data_query_url = 'https://services1.arcgis.com/fBc8EJBxQRMcHlei/arcgis/rest/services/' \
     'NPS_Park_Boundaries/FeatureServer/0/query' \
-    '?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&f=geojson'
+    '?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&outFields=*&f=geojson'
 r = requests.get(nps_data_query_url)
 with open(nps_filepath,'wb') as f:
     f.write(r.content)
-```
 
-```python, folium_map,
-m = folium.Map((41.25,-81.6), zoom_start=10).add_child(folium.GeoJson(nps_filepath))
-```
-
-```python, folium_persist, echo=False
-# this is the work around for ability to render the folium map in the docs build
-import base64
-temp_folium = 'docs/static/__cuya__.html'
-m.save(temp_folium)
-with open(temp_folium, 'rb') as f:
-    b64 = base64.b64encode(f.read())
-with open('docs/static/cuya.md', 'w') as md:
-    md.write('<iframe src="data:text/html;charset=utf-8;base64,{}" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" style="position:relative;width:100%;height:500px"></iframe>'.format(b64.decode('utf-8')))
-# seems that the height is not correct?
-```
-
-@@include[folium_static](static/cuya.md)
-
-Now we read the park boundary vector data as a Spark DataFrame using the built-in @ref:[geojson DataSource](vector-data.md#geojson-datasource). The geometry is very detailed, and the EO cells are relatively coarse. To speed up the processing, the geometry is "simplified" by combining vertices within about 100 meters of each other. For more on this see the section on Shapely support in @ref:[user defined functions](vector-data.md#shapely-geometry-support).
-
-```python read_cuya_vector
 park_vector = spark.read.geojson(nps_filepath)
 
 @udf(MultiPolygonUDT())
@@ -72,11 +40,13 @@ def simplify(g, tol):
 park_vector = park_vector.withColumn('geo_simp', simplify('geometry', lit(0.001))) \
     .select('geo_simp') \
     .hint('broadcast')
+
+
 ```
 
 ## Catalog Read
 
-The entire park boundary is contained in MODIS granule h11 v04. We will simply filter on this granule, rather than using a @ref:[spatial relation](vector-data.md#geomesa-functions-and-spatial-relations). The time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer.
+As in our other example, we will query for a single known MODIS granule directly. We limit the vector data to the single park of interest. The longer time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer. The definitions of `cat` and `park_vector` are as in the @ref:[Zonal Map Algebra page](zonal-algebra.md).
 
 ```python query_catalog
 park_cat = cat \
@@ -85,65 +55,40 @@ park_cat = cat \
         (cat.acquisition_date > lit('2018-02-19')) &
         (cat.acquisition_date < lit('2018-07-01'))
     ) \
-    .crossJoin(park_vector)
+    .crossJoin(park_vector.filter('OBJECTID == 380'))  # only Cuyahoga
 
-park_cat.printSchema()
 ```
 
-Now we have a catalog with several months of MODIS data for a single granule. However, the granule is larger than our park boundary. We will combine the park geometry with the catalog, and read only the bands of interest to compute NDVI, which we discussed in a @ref:[previous section](local-algebra.md#computing-ndvi).
+## Vector and Raster Data Interaction
 
-We then [reproject](https://gis.stackexchange.com/questions/247770/understanding-reprojection) the park geometry to the same @ref:[CRS](concepts.md#coordinate-reference-system--crs-) as the imagery. Then we will filter to only the _tiles_ intersecting the park.
+We follow the same steps as the Zonal Map Algebra analysis: reprojecting the park geometry, filtering for intersection, rasterizing the geometry, and masking the NDVI by the _zone_ _tiles_. The code from that analysis is condensed here for reference.
 
-```python read_catalog
+```python raster_prep
 raster_cols = ['B01', 'B02',] # red and near-infrared respectively
-park_rf = spark.read.raster(
+
+rf_park_tile = spark.read.raster(
     park_cat.select(['acquisition_date', 'granule_id', 'geo_simp'] + raster_cols),
     catalog_col_names=raster_cols) \
     .withColumn('park_native', st_reproject('geo_simp', lit('EPSG:4326'), rf_crs('B01'))) \
-    .filter(st_intersects('park_native', rf_geometry('B01')))
-
-park_rf.printSchema()
-```
-
-```python persist_catalog, echo=False
-# park_rf.persist()
-```
-
-## Vector and Raster Data Interaction
-
-Now we have the vector representation of the park boundary alongside the _tiles_ of red and near infrared bands. Next, we need to create a _tile_ representation of the park to allow us to limit the time series analysis to pixels within the park. This is similar to the masking operation demonstrated in @ref:[NoData handling](nodata-handling.md#masking).
-
-We do this using two transformations. The first one will reproject the park boundary from coordinates to the MODIS sinusoidal projection. The second one will create a new _tile_ aligned with the imagery containing a value of 1 where the pixels are contained within the park and NoData elsewhere.
-
-```python burn_in
-rf_park_tile = park_rf \
+    .filter(st_intersects('park_native', rf_geometry('B01'))) \
     .withColumn('dims', rf_dimensions('B01')) \
     .withColumn('park_tile', rf_rasterize('park_native', rf_geometry('B01'), lit(1), 'dims.cols', 'dims.rows')) \
-    .persist()
-
-rf_park_tile.printSchema()
+    .withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
+    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
 ```
 
 ## Create Time Series
 
-Next, we will compute NDVI as the normalized difference of near infrared (band 2) and red (band 1). The _tiles_ are masked by the `park_tile`. We will then aggregate across the remaining values to arrive at an average NDVI for each week of the year. Note that the computation is creating a weighted average, which is weighted by the number of valid observations per week.
+We next aggregate across the cell values to arrive at an average NDVI for each week of the year. We use `pyspark`'s built-in `groupby` and time functions with a RasterFrames @ref:[aggregate function](aggregation.md) to do this. Note that the computation is creating a weighted average, which is weighted by the number of valid observations per week.
 
 ```python ndvi_time_series
 from pyspark.sql.functions import col, year, weekofyear, month
-from pyspark.sql.functions import sum as sql_sum
-
-rf_ndvi = rf_park_tile \
-    .withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
-    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
 
-time_series = rf_ndvi \
-    .withColumn('ndvi_wt', rf_tile_sum('ndvi_masked')) \
-    .withColumn('wt', rf_data_cells('ndvi_masked')) \
-    .groupby(year('acquisition_date').alias('year'), weekofyear('acquisition_date').alias('week')) \
-    .agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
-    .withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))
-
-time_series.printSchema()
+time_series = rf_park_tile \
+    .groupby(
+        year('acquisition_date').alias('year'),
+        weekofyear('acquisition_date').alias('week')) \
+    .agg(rf_agg_mean('ndvi_masked').alias('ndvi'))
 ```
 
 Finally, we will take a look at the NDVI over time.
@@ -152,7 +97,7 @@ Finally, we will take a look at the NDVI over time.
 import matplotlib.pyplot as plt
 
 time_series_pdf = time_series.toPandas()
-time_series_pdf = time_series_pdf.sort_values('week')
+time_series_pdf.sort_values('week', inplace=True)
 plt.plot(time_series_pdf['week'], time_series_pdf['ndvi'], 'go-')
 plt.xlabel('Week of year, 2018')
 plt.ylabel('NDVI')
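
The new `ndvi_time_series` block relies on `rf_agg_mean` over the grouped, masked NDVI tiles; that is the same cell-count-weighted weekly average the deleted code computed explicitly. A sketch of the equivalence (not part of the commit), assuming `rf_park_tile` as built in `raster_prep` above:

```python
from pyspark.sql.functions import year, weekofyear
from pyspark.sql.functions import sum as sql_sum
from pyrasterframes.rasterfunctions import rf_tile_sum, rf_data_cells

# Weekly mean NDVI "by hand": total of valid cell values divided by the
# count of valid (non-NoData) cells, i.e. weighted by observations per week.
weekly_ndvi = rf_park_tile \
    .groupby(year('acquisition_date').alias('year'),
             weekofyear('acquisition_date').alias('week')) \
    .agg((sql_sum(rf_tile_sum('ndvi_masked')) /
          sql_sum(rf_data_cells('ndvi_masked'))).alias('ndvi'))
```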

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 7 additions & 15 deletions
@@ -6,7 +6,6 @@ In this example, we will demonstrate how to fit and score an unsupervised learni
 
 ```python, setup, echo=False
 from IPython.core.display import display
-from docs import resource_dir_uri
 import pyrasterframes.rf_ipython
 from pyrasterframes.utils import create_rf_spark_session
 
@@ -29,26 +28,19 @@ from pyspark.ml import Pipeline
 
 ```
 
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The function `resource_dir_uri` gives a local file system path to the sample Landsat data. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
 
 ```python, catalog
-filenamePattern = "L8-B{}-Elkton-VA.tiff"
+filenamePattern = "https://github.com/locationtech/rasterframes/" \
+    "raw/develop/core/src/test/resources/L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
-    {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
+    {'b' + str(b): filenamePattern.format(b) for b in range(1, 8)}
])
+
 df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
-df = df.select(
-    rf_crs(df.b1).alias('crs'),
-    rf_extent(df.b1).alias('extent'),
-    rf_tile(df.b1).alias('b1'),
-    rf_tile(df.b2).alias('b2'),
-    rf_tile(df.b3).alias('b3'),
-    rf_tile(df.b4).alias('b4'),
-    rf_tile(df.b5).alias('b5'),
-    rf_tile(df.b6).alias('b6'),
-    rf_tile(df.b7).alias('b7'),
-)
+df = df.withColumn('crs', rf_crs(df.b1)) \
+    .withColumn('extent', rf_crs(df.b1))
 df.printSchema()
 ```
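
The new `catalog` block reads the sample scene directly from https URIs. As a side sketch (not from this commit), the same single-scene, multi-band catalog pattern with a hypothetical URI template, using `rf_extent` alongside `rf_crs` to carry each tile's bounding box:

```python
import pandas as pd
from pyrasterframes.rasterfunctions import rf_crs, rf_extent

# Hypothetical scene with bands 1-7 published over https
uri_pattern = 'https://example.com/scenes/my-scene-B{}.tiff'
catalog_df = pd.DataFrame([
    {'b' + str(b): uri_pattern.format(b) for b in range(1, 8)}
])

# Assumes `spark` is an active RasterFrames-enabled session
df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
df = df.withColumn('crs', rf_crs(df.b1)) \
       .withColumn('extent', rf_extent(df.b1))
```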

pyrasterframes/src/main/python/docs/vector-data.pymd

Lines changed: 30 additions & 8 deletions
@@ -1,19 +1,28 @@
 # Vector Data
 
-RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data. There is a convenience DataSource for the GeoJSON format, as well as the ability to convert from [GeoPandas][GeoPandas] to Spark. Representation of vector geometries in PySpark is through [Shapely][Shapely], providing a great deal of interoperability. RasterFrames also provides access to Spark functions for working with geometries.
+RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data.
 
-## GeoJSON DataSource
+* DataSource for GeoJSON format
+* Ability to convert between [GeoPandas][GeoPandas] and Spark DataFrames
+* In PySpark, geometries are [Shapely][Shapely] objects, providing a great deal of interoperability
+* Many Spark functions for working with columns of geometries
+* Vector data is also the basis for @ref:[zonal map algebra](zonal-algebra.md) operations.
 
 ```python, setup, echo=False
 import pyrasterframes
 import pyrasterframes.rf_ipython
-from pyrasterframes.utils import create_rf_spark_session
-spark = create_rf_spark_session()
+import geomesa_pyspark.types
+import geopandas
+import folium
+spark = pyrasterframes.get_spark_session('local[2]')
 ```
 
+## GeoJSON DataSource
+
 ```python, read_geojson
 from pyspark import SparkFiles
-spark.sparkContext.addFile('https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson')
+admin1_us_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'
+spark.sparkContext.addFile(admin1_us_url)  # this lets us read http scheme uri's in spark
 
 df = spark.read.geojson(SparkFiles.get('admin1-us.geojson'))
 df.printSchema()
@@ -36,7 +45,7 @@ def poly_or_mp_to_mp(g):
     else:
         return MultiPolygon([g])
 
-gdf = geopandas.read_file('https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson')
+gdf = geopandas.read_file(admin1_us_url)
 gdf.geometry = gdf.geometry.apply(poly_or_mp_to_mp)
 df2 = spark.createDataFrame(gdf)
 df2.printSchema()
@@ -92,8 +101,21 @@ l8 = l8.withColumn('paducah', st_point(lit(-88.628), lit(37.072))) # col of poi
 l8_filtered = l8 \
     .filter(st_intersects(l8.geom, st_bufferPoint(l8.paducah, lit(50000.0)))) \
     .filter(l8.acquisition_date > '2018-02-01') \
-    .filter(l8.acquisition_date < '2018-04-01')
-l8_filtered.select('product_id', 'entity_id', 'acquisition_date', 'cloud_cover_pct')
+    .filter(l8.acquisition_date < '2018-03-11')
+```
+
+```python, folium, echo=False
+geo_df = geopandas.GeoDataFrame(
+    l8_filtered.select('geom', 'bounds_wgs84').toPandas(),
+    crs='EPSG:4326',
+    geometry='geom')
+
+# display as folium / leaflet map
+m = folium.Map()
+layer = folium.GeoJson(geo_df.to_json())
+m.fit_bounds(layer.get_bounds())
+m.add_child(layer)
+m
 ```
 
 [GeoPandas]: http://geopandas.org
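
One point from the rewritten introduction worth a concrete illustration: geometries brought back to the Python driver are Shapely objects. A small sketch, assuming the `df` produced by the GeoJSON read above:

```python
# Collecting a row deserializes the geometry as a Shapely object,
# so ordinary Shapely attributes and methods apply directly.
first_geom = df.select('geometry').first()['geometry']
print(type(first_geom))    # a shapely geometry class, e.g. MultiPolygon
print(first_geom.bounds)   # (minx, miny, maxx, maxy)
print(first_geom.is_valid)
```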
