
Commit 3ffc0ca

Zonal stats page and other refactoring (#342)
* zonal map algebra doc page
* proper rendering of folium.Map in docs
* Simplify time series doc by ref to zonal map algebra
* Remove local path in docs, favoring https urls to improve portability in rf-notebook

Signed-off-by: Jason T. Brown <[email protected]>
1 parent 78bfe71 commit 3ffc0ca

File tree: 9 files changed, 216 additions & 112 deletions


pyrasterframes/src/main/python/docs/__init__.py

Lines changed: 1 addition & 4 deletions
@@ -19,6 +19,7 @@
 #
 
 import os
+
 from pweave import PwebPandocFormatter
 
 
@@ -36,10 +37,6 @@ def resource_dir():
     return test_resource
 
 
-def resource_dir_uri():
-    return 'file://' + resource_dir()
-
-
 class PegdownMarkdownFormatter(PwebPandocFormatter):
     def __init__(self, *args, **kwargs):
         super().__init__(*args, **kwargs)

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ np.set_printoptions(precision=3, floatmode='maxprec')
 spark = create_rf_spark_session()
 ```
 
-There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
+There are three types of aggregate functions available in RasterFrames: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_. In the latter two cases, when @ref:[vector data](vector-data.md) is the grouping column, the results are @ref:[zonal statistics](zonal-algebra.md).
 
 ## Tile Mean Example
 
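
Not part of the diff above, but as orientation for the paragraph it changes: a minimal sketch contrasting the three aggregate flavors, assuming a hypothetical DataFrame `rf` with a tile column named `tile`.

```python
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_agg_mean, rf_agg_local_mean

# Tile aggregate: one scalar per row, summarizing the cells of that row's tile
per_row = rf.select(rf_tile_mean('tile').alias('tile_mean'))

# DataFrame aggregate: one scalar summarizing all cells across all rows (or per group)
overall = rf.agg(rf_agg_mean('tile').alias('mean_of_all_cells'))

# Element-wise local aggregate: one tile whose cells are per-position means across rows
local = rf.agg(rf_agg_local_mean('tile').alias('local_mean_tile'))
```

Grouping the latter two by a vector-derived column is what the new zonal map algebra page describes as zonal statistics.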

pyrasterframes/src/main/python/docs/raster-processing.md

Lines changed: 2 additions & 1 deletion
@@ -2,8 +2,9 @@
 
 @@@ index
 
-* @ref:[Local Algebra](local-algebra.md)
+* @ref:[Local Map Algebra](local-algebra.md)
 * @ref:["NoData" Handling](nodata-handling.md)
+* @ref:[Zonal Map Algebra](zonal-algebra.md)
 * @ref:[Aggregation](aggregation.md)
 * @ref:[Time Series](time-series.md)
 * @ref:[Machine Learning](machine-learning.md)

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,6 @@ from pyspark.sql.functions import lit
 import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
-from docs import resource_dir_uri
 
 import os
 
@@ -70,8 +69,10 @@ crses = df.select('crs.crsProj4').distinct().collect()
 print('Found ', len(crses), 'distinct CRS.')
 crs = crses[0][0]
 
-label_df = spark.read.geojson(
-    os.path.join(resource_dir_uri(), 'luray-labels.geojson')) \
+from pyspark import SparkFiles
+spark.sparkContext.addFile('https://github.com/locationtech/rasterframes/raw/develop/pyrasterframes/src/test/resources/luray-labels.geojson')
+
+label_df = spark.read.geojson(SparkFiles.get('luray-labels.geojson')) \
     .select('id', st_reproject('geometry', lit('EPSG:4326'), lit(crs)).alias('geometry')) \
     .hint('broadcast')
 
@@ -81,7 +82,6 @@ df_joined = df.join(label_df, st_intersects(st_geometry('extent'), 'geometry'))
 df_labeled = df_joined.withColumn('label',
     rf_rasterize('geometry', st_geometry('extent'), 'id', 'dims.cols', 'dims.rows')
 )
-
 ```
 
 ## Masking Poor Quality Cells
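
The replacement above swaps a local test resource for a remote GeoJSON staged through `SparkFiles`. A hedged sketch of that pattern in isolation, with a hypothetical URL and assuming an active RasterFrames-enabled `spark` session:

```python
from pyspark import SparkFiles

# addFile stages the remote file so it can later be read by its base name
labels_url = 'https://example.com/data/labels.geojson'  # hypothetical
spark.sparkContext.addFile(labels_url)

# SparkFiles.get resolves the staged file's local path for the geojson reader
label_df = spark.read.geojson(SparkFiles.get('labels.geojson'))
label_df.printSchema()
```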

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 24 additions & 79 deletions
@@ -9,60 +9,28 @@ import pyrasterframes
 from pyrasterframes.rasterfunctions import *
 import pyrasterframes.rf_ipython
 
-import folium
-
 from pyspark.sql.functions import udf, lit
 from geomesa_pyspark.types import MultiPolygonUDT
 
 # This job is more memory bound, so reduce the concurrent tasks.
 spark = pyrasterframes.get_spark_session("local[4]")
 ```
 
-In this example, we will show how the flexibility of the DataFrame concept for raster data allows a simple and intuitive way to extract a time series from Earth observation data. We will start with our @ref:[built-in MODIS data catalog](raster-catalogs.md#using-built-in-experimental-catalogs).
-
-```python catalog
-cat = spark.read.format('aws-pds-modis-catalog').load().repartition(200)
-cat.printSchema()
-```
-
-We will summarize the change in NDVI over 2018 in the Cuyahoga Valley National Park in Ohio, USA. First, we will retrieve open vector data delineating the park boundary from the US National Park Service's LandsNet.
+In this example, we will show how the flexibility of the DataFrame concept for raster data allows a simple and intuitive way to extract a time series from Earth observation data. We will continue our example from the @ref:[Zonal Map Algebra page](zonal-algebra.md).
 
-## Vector Data
+We will summarize the change in @ref:[NDVI](local-algebra.md#computing-ndvi) over the spring and early summer of 2018 in the Cuyahoga Valley National Park in Ohio, USA.
 
-First we will get the vector data from LandsNet service by a REST query. The data is saved to a geojson file.
-
-```python get_park_boundary
+```python vector, echo=False, results='hidden'
+cat = spark.read.format('aws-pds-modis-catalog').load().repartition(200)
 import requests
 nps_filepath = '/tmp/parks.geojson'
 nps_data_query_url = 'https://services1.arcgis.com/fBc8EJBxQRMcHlei/arcgis/rest/services/' \
     'NPS_Park_Boundaries/FeatureServer/0/query' \
-    '?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&f=geojson'
+    '?geometry=-82.451,41.075,-80.682,41.436&inSR=4326&outSR=4326&outFields=*&f=geojson'
 r = requests.get(nps_data_query_url)
 with open(nps_filepath,'wb') as f:
     f.write(r.content)
-```
 
-```python, folium_map,
-m = folium.Map((41.25,-81.6), zoom_start=10).add_child(folium.GeoJson(nps_filepath))
-```
-
-```python, folium_persist, echo=False
-# this is the work around for ability to render the folium map in the docs build
-import base64
-temp_folium = 'docs/static/__cuya__.html'
-m.save(temp_folium)
-with open(temp_folium, 'rb') as f:
-    b64 = base64.b64encode(f.read())
-with open('docs/static/cuya.md', 'w') as md:
-    md.write('<iframe src="data:text/html;charset=utf-8;base64,{}" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" style="position:relative;width:100%;height:500px"></iframe>'.format(b64.decode('utf-8')))
-# seems that the height is not correct?
-```
-
-@@include[folium_static](static/cuya.md)
-
-Now we read the park boundary vector data as a Spark DataFrame using the built-in @ref:[geojson DataSource](vector-data.md#geojson-datasource). The geometry is very detailed, and the EO cells are relatively coarse. To speed up the processing, the geometry is "simplified" by combining vertices within about 100 meters of each other. For more on this see the section on Shapely support in @ref:[user defined functions](vector-data.md#shapely-geometry-support).
-
-```python read_cuya_vector
 park_vector = spark.read.geojson(nps_filepath)
 
 @udf(MultiPolygonUDT())
@@ -72,11 +40,13 @@ def simplify(g, tol):
 park_vector = park_vector.withColumn('geo_simp', simplify('geometry', lit(0.001))) \
     .select('geo_simp') \
     .hint('broadcast')
+
+
 ```
 
 ## Catalog Read
 
-The entire park boundary is contained in MODIS granule h11 v04. We will simply filter on this granule, rather than using a @ref:[spatial relation](vector-data.md#geomesa-functions-and-spatial-relations). The time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer.
+As in our other example, we will query for a single known MODIS granule directly. We limit the vector data to the single park of interest. The longer time period selected should show the change in plant vigor as leaves emerge over the spring and into early summer. The definitions of `cat` and `park_vector` are as in the @ref:[Zonal Map Algebra page](zonal-algebra.md).
 
 ```python query_catalog
 park_cat = cat \
@@ -85,65 +55,40 @@ park_cat = cat \
         (cat.acquisition_date > lit('2018-02-19')) &
         (cat.acquisition_date < lit('2018-07-01'))
     ) \
-    .crossJoin(park_vector)
+    .crossJoin(park_vector.filter('OBJECTID == 380'))  # only Cuyahoga
 
-park_cat.printSchema()
 ```
 
-Now we have a catalog with several months of MODIS data for a single granule. However, the granule is larger than our park boundary. We will combine the park geometry with the catalog, and read only the bands of interest to compute NDVI, which we discussed in a @ref:[previous section](local-algebra.md#computing-ndvi).
+## Vector and Raster Data Interaction
 
-We then [reproject](https://gis.stackexchange.com/questions/247770/understanding-reprojection) the park geometry to the same @ref:[CRS](concepts.md#coordinate-reference-system--crs-) as the imagery. Then we will filter to only the _tiles_ intersecting the park.
+We follow the same steps as the Zonal Map Algebra analysis: reprojecting the park geometry, filtering for intersection, rasterizing the geometry, and masking the NDVI by the _zone_ _tiles_. The code from that analysis is condensed here for reference.
 
-```python read_catalog
+```python raster_prep
 raster_cols = ['B01', 'B02',] # red and near-infrared respectively
-park_rf = spark.read.raster(
+
+rf_park_tile = spark.read.raster(
     park_cat.select(['acquisition_date', 'granule_id', 'geo_simp'] + raster_cols),
     catalog_col_names=raster_cols) \
     .withColumn('park_native', st_reproject('geo_simp', lit('EPSG:4326'), rf_crs('B01'))) \
-    .filter(st_intersects('park_native', rf_geometry('B01')))
-
-park_rf.printSchema()
-```
-
-```python persist_catalog, echo=False
-# park_rf.persist()
-```
-
-## Vector and Raster Data Interaction
-
-Now we have the vector representation of the park boundary alongside the _tiles_ of red and near infrared bands. Next, we need to create a _tile_ representation of the park to allow us to limit the time series analysis to pixels within the park. This is similar to the masking operation demonstrated in @ref:[NoData handling](nodata-handling.md#masking).
-
-We do this using two transformations. The first one will reproject the park boundary from coordinates to the MODIS sinusoidal projection. The second one will create a new _tile_ aligned with the imagery containing a value of 1 where the pixels are contained within the park and NoData elsewhere.
-
-```python burn_in
-rf_park_tile = park_rf \
+    .filter(st_intersects('park_native', rf_geometry('B01'))) \
     .withColumn('dims', rf_dimensions('B01')) \
     .withColumn('park_tile', rf_rasterize('park_native', rf_geometry('B01'), lit(1), 'dims.cols', 'dims.rows')) \
-    .persist()
-
-rf_park_tile.printSchema()
+    .withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
+    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
 ```
 
 ## Create Time Series
 
-Next, we will compute NDVI as the normalized difference of near infrared (band 2) and red (band 1). The _tiles_ are masked by the `park_tile`. We will then aggregate across the remaining values to arrive at an average NDVI for each week of the year. Note that the computation is creating a weighted average, which is weighted by the number of valid observations per week.
+We next aggregate across the cell values to arrive at an average NDVI for each week of the year. We use `pyspark`'s built-in `groupby` and time functions with a RasterFrames @ref:[aggregate function](aggregation.md) to do this. Note that the computation is creating a weighted average, which is weighted by the number of valid observations per week.
 
 ```python ndvi_time_series
 from pyspark.sql.functions import col, year, weekofyear, month
-from pyspark.sql.functions import sum as sql_sum
-
-rf_ndvi = rf_park_tile \
-    .withColumn('ndvi', rf_normalized_difference('B02', 'B01')) \
-    .withColumn('ndvi_masked', rf_mask('ndvi', 'park_tile'))
 
-time_series = rf_ndvi \
-    .withColumn('ndvi_wt', rf_tile_sum('ndvi_masked')) \
-    .withColumn('wt', rf_data_cells('ndvi_masked')) \
-    .groupby(year('acquisition_date').alias('year'), weekofyear('acquisition_date').alias('week')) \
-    .agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
-    .withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))
-
-time_series.printSchema()
+time_series = rf_park_tile \
+    .groupby(
+        year('acquisition_date').alias('year'),
+        weekofyear('acquisition_date').alias('week')) \
+    .agg(rf_agg_mean('ndvi_masked').alias('ndvi'))
 ```
 
 Finally, we will take a look at the NDVI over time.
@@ -152,7 +97,7 @@ Finally, we will take a look at the NDVI over time.
 import matplotlib.pyplot as plt
 
 time_series_pdf = time_series.toPandas()
-time_series_pdf = time_series_pdf.sort_values('week')
+time_series_pdf.sort_values('week', inplace=True)
 plt.plot(time_series_pdf['week'], time_series_pdf['ndvi'], 'go-')
 plt.xlabel('Week of year, 2018')
 plt.ylabel('NDVI')
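
The new `ndvi_time_series` block relies on `rf_agg_mean` over the grouped, masked NDVI tiles; that is the same cell-count-weighted weekly average the deleted code computed explicitly. A sketch of the equivalence (not part of the commit), assuming `rf_park_tile` as built in `raster_prep` above:

```python
from pyspark.sql.functions import year, weekofyear
from pyspark.sql.functions import sum as sql_sum
from pyrasterframes.rasterfunctions import rf_tile_sum, rf_data_cells

# Weekly mean NDVI "by hand": total of valid cell values divided by the
# count of valid (non-NoData) cells, i.e. weighted by observations per week.
weekly_ndvi = rf_park_tile \
    .groupby(year('acquisition_date').alias('year'),
             weekofyear('acquisition_date').alias('week')) \
    .agg((sql_sum(rf_tile_sum('ndvi_masked')) /
          sql_sum(rf_data_cells('ndvi_masked'))).alias('ndvi'))
```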

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 7 additions & 15 deletions
@@ -6,7 +6,6 @@ In this example, we will demonstrate how to fit and score an unsupervised learni
 
 ```python, setup, echo=False
 from IPython.core.display import display
-from docs import resource_dir_uri
 import pyrasterframes.rf_ipython
 from pyrasterframes.utils import create_rf_spark_session
 
@@ -29,26 +28,19 @@ from pyspark.ml import Pipeline
 
 ```
 
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The function `resource_dir_uri` gives a local file system path to the sample Landsat data. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
 
 ```python, catalog
-filenamePattern = "L8-B{}-Elkton-VA.tiff"
+filenamePattern = "https://github.com/locationtech/rasterframes/" \
+    "raw/develop/core/src/test/resources/L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
-    {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
+    {'b' + str(b): filenamePattern.format(b) for b in range(1, 8)}
])
+
 df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
-df = df.select(
-    rf_crs(df.b1).alias('crs'),
-    rf_extent(df.b1).alias('extent'),
-    rf_tile(df.b1).alias('b1'),
-    rf_tile(df.b2).alias('b2'),
-    rf_tile(df.b3).alias('b3'),
-    rf_tile(df.b4).alias('b4'),
-    rf_tile(df.b5).alias('b5'),
-    rf_tile(df.b6).alias('b6'),
-    rf_tile(df.b7).alias('b7'),
-)
+df = df.withColumn('crs', rf_crs(df.b1)) \
+    .withColumn('extent', rf_crs(df.b1))
 df.printSchema()
 ```
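
The new `catalog` block reads the sample scene directly from https URIs. As a side sketch (not from this commit), the same single-scene, multi-band catalog pattern with a hypothetical URI template, using `rf_extent` alongside `rf_crs` to carry each tile's bounding box:

```python
import pandas as pd
from pyrasterframes.rasterfunctions import rf_crs, rf_extent

# Hypothetical scene with bands 1-7 published over https
uri_pattern = 'https://example.com/scenes/my-scene-B{}.tiff'
catalog_df = pd.DataFrame([
    {'b' + str(b): uri_pattern.format(b) for b in range(1, 8)}
])

# Assumes `spark` is an active RasterFrames-enabled session
df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
df = df.withColumn('crs', rf_crs(df.b1)) \
       .withColumn('extent', rf_extent(df.b1))
```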

pyrasterframes/src/main/python/docs/vector-data.pymd

Lines changed: 30 additions & 8 deletions
@@ -1,19 +1,28 @@
 # Vector Data
 
-RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data. There is a convenience DataSource for the GeoJSON format, as well as the ability to convert from [GeoPandas][GeoPandas] to Spark. Representation of vector geometries in PySpark is through [Shapely][Shapely], providing a great deal of interoperability. RasterFrames also provides access to Spark functions for working with geometries.
+RasterFrames provides a variety of ways to work with spatial vector data (points, lines, and polygons) alongside raster data.
 
-## GeoJSON DataSource
+* DataSource for GeoJSON format
+* Ability to convert between [GeoPandas][GeoPandas] and Spark DataFrames
+* In PySpark, geometries are [Shapely][Shapely] objects, providing a great deal of interoperability
+* Many Spark functions for working with columns of geometries
+* Vector data is also the basis for @ref:[zonal map algebra](zonal-algebra.md) operations.
 
 ```python, setup, echo=False
 import pyrasterframes
 import pyrasterframes.rf_ipython
-from pyrasterframes.utils import create_rf_spark_session
-spark = create_rf_spark_session()
+import geomesa_pyspark.types
+import geopandas
+import folium
+spark = pyrasterframes.get_spark_session('local[2]')
 ```
 
+## GeoJSON DataSource
+
 ```python, read_geojson
 from pyspark import SparkFiles
-spark.sparkContext.addFile('https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson')
+admin1_us_url = 'https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson'
+spark.sparkContext.addFile(admin1_us_url)  # this lets us read http scheme uri's in spark
 
 df = spark.read.geojson(SparkFiles.get('admin1-us.geojson'))
 df.printSchema()
@@ -36,7 +45,7 @@ def poly_or_mp_to_mp(g):
     else:
         return MultiPolygon([g])
 
-gdf = geopandas.read_file('https://raw.githubusercontent.com/datasets/geo-admin1-us/master/data/admin1-us.geojson')
+gdf = geopandas.read_file(admin1_us_url)
 gdf.geometry = gdf.geometry.apply(poly_or_mp_to_mp)
 df2 = spark.createDataFrame(gdf)
 df2.printSchema()
@@ -92,8 +101,21 @@ l8 = l8.withColumn('paducah', st_point(lit(-88.628), lit(37.072))) # col of poi
 l8_filtered = l8 \
     .filter(st_intersects(l8.geom, st_bufferPoint(l8.paducah, lit(50000.0)))) \
     .filter(l8.acquisition_date > '2018-02-01') \
-    .filter(l8.acquisition_date < '2018-04-01')
-l8_filtered.select('product_id', 'entity_id', 'acquisition_date', 'cloud_cover_pct')
+    .filter(l8.acquisition_date < '2018-03-11')
+```
+
+```python, folium, echo=False
+geo_df = geopandas.GeoDataFrame(
+    l8_filtered.select('geom', 'bounds_wgs84').toPandas(),
+    crs='EPSG:4326',
+    geometry='geom')
+
+# display as folium / leaflet map
+m = folium.Map()
+layer = folium.GeoJson(geo_df.to_json())
+m.fit_bounds(layer.get_bounds())
+m.add_child(layer)
+m
 ```
 
 [GeoPandas]: http://geopandas.org
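
One point from the rewritten introduction worth a concrete illustration: geometries brought back to the Python driver are Shapely objects. A small sketch, assuming the `df` produced by the GeoJSON read above:

```python
# Collecting a row deserializes the geometry as a Shapely object,
# so ordinary Shapely attributes and methods apply directly.
first_geom = df.select('geometry').first()['geometry']
print(type(first_geom))    # a shapely geometry class, e.g. MultiPolygon
print(first_geom.bounds)   # (minx, miny, maxx, maxy)
print(first_geom.is_valid)
```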
