Add new label data set for supervised learning docs; flesh out initial text

vpipkt · vpipkt · commit 6502d0dc01f9 · 2019-07-29T14:55:39.000-04:00
Signed-off-by: Jason T. Brown <jason@astraea.earth>
diff --git a/pyrasterframes/src/main/python/docs/raster-write.pymd b/pyrasterframes/src/main/python/docs/raster-write.pymd
@@ -33,7 +33,7 @@ display(tile) # IPython.display function
 
 Within an IPython or Jupyter interpreter a Pandas DataFrame containing a column of _tiles_ will be rendered as the samples discussed above. Simply import the `rf_ipython` submodule to enable enhanced HTML rendering of a Pandas DataFrame.
 
-In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](pandas-numpy.md).
+In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](numpy-pandas.md).
 
 ```python toPandas, evaluate=True
 import pyrasterframes.rf_ipython
diff --git a/pyrasterframes/src/main/python/docs/supervised-learning.pymd b/pyrasterframes/src/main/python/docs/supervised-learning.pymd
@@ -1,6 +1,6 @@
 # Supervised Machine Learning
 
-In this example we will demonstrate how to fit and score an unsupervised learning model with a sample of Sentinel-2 data and labels from the US [National Land Cover Dataset](https://www.mrlc.gov/) (NLCD).
+In this example we will demonstrate how to fit and score a supervised learning model with a sample of Sentinel-2 data and hand-drawn vector labels over different [land cover](https://en.wikipedia.org/wiki/Land_cover) types.
 
 ```python, setup, echo=False
 from IPython.core.display import display
@@ -16,19 +16,7 @@ spark = create_rf_spark_session()
 
 ## Create and Read Raster Catalog
 
-We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html).
-
-```python, imports, echo=True
-from pyrasterframes import TileExploder
-from pyrasterframes.rf_types import NoDataFilter
-
-from pyspark.ml.feature import VectorAssembler
-from pyspark.ml.classification import DecisionTreeClassifier
-from pyspark.ml.evaluation import MulticlassClassificationEvaluator
-from pyspark.ml import Pipeline
-```
-
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time; and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog has just one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
 ```python, read_bands, term=True
 uri_base = 's3://s22s-test-geotiffs/luray_snp/{}.tif'
@@ -67,14 +55,16 @@ df.printSchema()
 
 ### Label Data
 
-[](https://github.com/locationtech/rasterframes/blob/develop/core/src/test/resources/L8-Labels-Elkton-VA.geojson)
+The land classification labels are based on a small set of hand-drawn polygons in the GeoJSON file [here](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/test/resources/luray-labels.geojson). The property `id` indicates the type of land cover in each area: 1 is forest, 2 is cropland, and 3 is developed areas.
+
+We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tiles_ are the `id` property of the GeoJSON features, which we will use as labels in our supervised learning task.
 
 ```python
 crses = df.select('crs.crsProj4').distinct().collect()
 print('Found ', len(crses), 'distinct CRS.')
 crs = crses[0][0]
 
-label_df = spark.read.geojson(os.path.join(resource_dir_uri(), "L8-Labels-Elkton-VA.geojson")) \
+label_df = spark.read.geojson(os.path.join(resource_dir_uri(), 'luray-labels.geojson')) \
 					 .select('id', st_reproject('geometry', 'EPSG:4326', crs).alias('geometry')) \
 					 .hint('broadcast')
 
@@ -87,11 +77,9 @@ FROM df_joined
 """)
 ```
 
-
-
 ## Masking NoData
 
-We will follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking we will just sort on the mask cell values later in the process
+To filter only for good quality pixels, we follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking, we will just filter on the mask cell values later in the process.
 
 ```python make_mask
 from pyspark.sql.functions import lit
@@ -111,10 +99,23 @@ df_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
 df_mask.printSchema()
 ```
 
-
 ## Create ML Pipeline
 
-The data preparation modeling pipeline is next. SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel. Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`. 
+We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html). These are the objects that will work in sequence to conduct the data preparation and modeling.
+
+```python, imports, echo=True
+from pyrasterframes import TileExploder
+from pyrasterframes.rf_types import NoDataFilter
+
+from pyspark.ml.feature import VectorAssembler
+from pyspark.ml.classification import DecisionTreeClassifier
+from pyspark.ml.evaluation import MulticlassClassificationEvaluator
+from pyspark.ml import Pipeline
+```
+
+SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf-explode-tiles)). Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+
+Two details of the `NoDataFilter` are worth noting. First, we filter by the `mask` column. This restricts the observations to good pixels without having to explicitly mask and assign NoData in all eleven columns of raster data. Second, we also specify the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
 
 ```python, transformers
 exploder = TileExploder()
@@ -151,35 +152,38 @@ model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())
 
 ## Model Evaluation
 
-To view the model's performance ....
+To view the model's performance, we first call the pipeline's `transform` method on the training dataset. This transformed dataset will have the model's prediction included in each row. We next construct an evaluator and pass it the transformed dataset to easily compute the performance metric. We could also use a variety of DataFrame or SQL transformations to compute our metric if we like.
 
 ```python eval
+prediction_df = model.transform(df_mask) \
+                       .drop(assembler.getOutputCol()).cache()
+prediction_df.printSchema()
+
 eval = MulticlassClassificationEvaluator(predictionCol=classifier.getPredictionCol(),
 										 labelCol=classifier.getLabelCol(),
 										 metricName='accuracy',
 )
 
-prediction_df = model.transform(df_mask) \
-                       .drop(assembler.getOutputCol()).cache()
 accuracy = eval.evaluate(prediction_df)
-accuracy
+print("\nAccuracy:", accuracy)
 ```
 
-
-We can take a quick look at the resulting confusion matrix. 
+As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix. 
 
 ```python confusion_mtrx
 prediction_df.groupBy(classifier.getPredictionCol()) \
     .pivot(classifier.getLabelCol()) \
-    .count().show(20, False)
+    .count() \
+    .sort(classifier.getPredictionCol()).show(20, False)
 ```
 
 ## Visualize Prediction
 
-We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.
+Because the pipeline included a `TileExploder`, we will recreate the tiled data structure. The explosion transformation includes metadata enabling us to recreate the _tiles_. See the @ref:[`rf_assemble_tile`](reference.md#rf-assemble-tile) function documentation for more details. In this case, the pipeline is scoring on all areas, regardless of whether they intersect the label polygons. This is simply done by removing the `label` column, as @ref:[discussed above](supervised-learning.md#create-ml-pipeline). 
 
 ```python assemble_prediction
-model.transform(df_mask.drop('label')).createOrReplaceTempView('scored')
+model.transform(df_mask.drop('label')).createOrReplaceTempView('scored')
+
 retiled = spark.sql("""
 SELECT extent, crs, rf_assemble_tile(column_index, row_index, prediction, 128, 128) as prediction
 FROM scored
@@ -189,9 +193,13 @@ GROUP BY extent, crs
 retiled.printSchema()
 ```
 
-Take a look at a sample of the resulting output.
+Take a look at a sample of the resulting output. Recall the label coding: 1 is forest (purple), 2 is cropland (green), and 3 is developed areas (yellow).
 
 ```python display_prediction
-display(retiled.select('prediction').first()['prediction'])
+display(
+  retiled.select('prediction') \
+    .sort(-rf_tile_sum(rf_local_equal('prediction', lit(3.0)))) \
+    .first()['prediction']
+)
 ```
 
diff --git a/pyrasterframes/src/test/resources/luray-labels.geojson b/pyrasterframes/src/test/resources/luray-labels.geojson
@@ -0,0 +1,18 @@
+{
+"type": "FeatureCollection",
+"name": "L8-Labels-Elkton-VA",
+"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
+"features": [
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.074908035755897, 38.325593965393153 ], [ -78.078776658402262, 38.327823897217776 ], [ -78.092819443861529, 38.32571146899194 ], [ -78.100161396610659, 38.313995346890302 ], [ -78.089376916671426, 38.31282539893752 ], [ -78.077634077459933, 38.317374423878938 ], [ -78.073898621221971, 38.322204512883999 ], [ -78.074908035755897, 38.325593965393153 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 1 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.369170650882523, 38.58065284389172 ], [ -78.383640370680567, 38.54472654334991 ], [ -78.327374251542963, 38.538843940894999 ], [ -78.314259254758397, 38.575748016690518 ], [ -78.369170650882523, 38.58065284389172 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.565452768282938, 38.692698239200176 ], [ -77.54853756075201, 38.687441203624303 ], [ -77.547985098699257, 38.68151785733005 ], [ -77.564314874957617, 38.685855270536351 ], [ -77.565452768282938, 38.692698239200176 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 1 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.625973620483947, 38.384086145975928 ], [ -78.619781683426993, 38.388373376437947 ], [ -78.613688820548219, 38.386918574305838 ], [ -78.609339103717076, 38.395958714056761 ], [ -78.600276924123435, 38.395057914051847 ], [ -78.598004559194038, 38.384358026518214 ], [ -78.600712105415838, 38.372726630839622 ], [ -78.61155677616398, 38.374617746567012 ], [ -78.625973620483947, 38.384086145975928 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.499872299697259, 38.265959576364672 ], [ -77.490962536589549, 38.262548509310811 ], [ -77.494072526256261, 38.251792574227593 ], [ -77.508081452628034, 38.250717051596098 ], [ -77.512193148300739, 38.258543877561848 ], [ -77.499872299697259, 38.265959576364672 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.896560416594369, 38.384489202082833 ], [ -77.885446228394855, 38.379901618618995 ], [ -77.886558613990559, 38.386926446793112 ], [ -77.892167110596205, 38.388029313414151 ], [ -77.896560416594369, 38.384489202082833 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.104587721849782, 38.230347976602772 ], [ -78.096623830061731, 38.232933469701152 ], [ -78.093733463904584, 38.227058631140601 ], [ -78.100422718389751, 38.223567355393556 ], [ -78.104587721849782, 38.230347976602772 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.09500918449011, 38.235516869468242 ], [ -78.084617397721388, 38.240823731215684 ], [ -78.079911154013928, 38.23752678606543 ], [ -78.084755595273549, 38.232321061869477 ], [ -78.09500918449011, 38.235516869468242 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.505050042952476, 38.306327713885693 ], [ -77.5048207527187, 38.300040419530752 ], [ -77.511898783174473, 38.289086070422542 ], [ -77.522735680615511, 38.292059924398963 ], [ -77.514377553898413, 38.300185554775908 ], [ -77.513706480506912, 38.310493467061114 ], [ -77.505050042952476, 38.306327713885693 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.89809802814473, 38.532542129891446 ], [ -77.901444154311747, 38.525317699576725 ], [ -77.893247775998773, 38.518264149754522 ], [ -77.897703956491853, 38.513453996400983 ], [ -77.901049016994193, 38.520378827806503 ], [ -77.916274443433977, 38.528730212529119 ], [ -77.909492522815626, 38.535864361205661 ], [ -77.89809802814473, 38.532542129891446 ] ] ] } },
+{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.516032435253635, 38.794413501916893 ], [ -77.523118384311459, 38.785034143507744 ], [ -77.505102142982878, 38.775534328871899 ], [ -77.504411687954828, 38.760891842932352 ], [ -77.483604951868131, 38.747060201442238 ], [ -77.463488423830327, 38.747867575574077 ], [ -77.449558893048575, 38.757506997981018 ], [ -77.516032435253635, 38.794413501916893 ] ] ] } }
+]
+}
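
When reviewing the new label file, it can help to tally how many polygons back each class before reading the accuracy and confusion matrix. The sketch below is not part of this commit: `count_label_polygons` and `LABEL_NAMES` are hypothetical names introduced for illustration, and it uses only the Python standard library rather than RasterFrames. The two sample features are copied from `luray-labels.geojson` above, and the class names follow the coding given in the docs text (1 forest, 2 cropland, 3 developed areas).

```python
import json
from collections import Counter

# Label coding as stated in the docs text; the dict name is an assumption.
LABEL_NAMES = {1: 'forest', 2: 'cropland', 3: 'developed'}


def count_label_polygons(feature_collection):
    """Count features per land cover class in a GeoJSON FeatureCollection dict."""
    ids = (f['properties']['id'] for f in feature_collection['features'])
    # Any id outside the documented coding is reported as 'unknown'.
    return Counter(LABEL_NAMES.get(i, 'unknown') for i in ids)


# Two features copied verbatim from luray-labels.geojson (one forest, one cropland).
sample = json.loads("""
{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "id": 1 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.369170650882523, 38.58065284389172 ], [ -78.383640370680567, 38.54472654334991 ], [ -78.327374251542963, 38.538843940894999 ], [ -78.314259254758397, 38.575748016690518 ], [ -78.369170650882523, 38.58065284389172 ] ] ] } },
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.104587721849782, 38.230347976602772 ], [ -78.096623830061731, 38.232933469701152 ], [ -78.093733463904584, 38.227058631140601 ], [ -78.100422718389751, 38.223567355393556 ], [ -78.104587721849782, 38.230347976602772 ] ] ] } }
]
}
""")

counts = count_label_polygons(sample)
print(dict(counts))
```

Applied to the full file as added in this commit (for example after loading it with `json.load`), the same tally would cover all eleven features: two with `id` 1, six with `id` 2, and three with `id` 3. Polygon counts are only a rough proxy for class balance, since the model is trained on pixels and the polygons differ in area.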