
Commit c26d33f

Tweaks from PR feedback
Signed-off-by: Jason T. Brown <[email protected]>
1 parent 83e22b7 commit c26d33f


pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 49 additions & 23 deletions
@@ -1,12 +1,14 @@
 # Supervised Machine Learning
 
-In this example we will demonstrate how to fit and score an unsupervised learning model with a sample of Sentinel-2 data and hand-drawn vector labels over different [land cover](https://en.wikipedia.org/wiki/Land_cover) types.
+In this example we will demonstrate how to fit and score a supervised learning model with a sample of Sentinel-2 data and hand-drawn vector labels over different [land cover](https://en.wikipedia.org/wiki/Land_cover) types.
 
 ```python, setup, echo=False
 from IPython.core.display import display
 from pyrasterframes.utils import create_rf_spark_session
 from pyrasterframes.rasterfunctions import *
 import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
 from docs import resource_dir_uri
 
 import os
@@ -18,6 +20,8 @@ spark = create_rf_spark_session()
 
 The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
+The imagery for feature data will come from [eleven bands of 60 meter resolution Sentinel 2](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial) imagery. We will also use the [scene classification (SCL)](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm) data to identify high quality, non-cloudy pixels.
+
 ```python, read_bands, term=True
 uri_base = 's3://s22s-test-geotiffs/luray_snp/{}.tif'
 bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B09', 'B11', 'B12']
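The `read_bands` chunk is cut off in this diff; for orientation, the catalog pattern the paragraph above describes can be sketched roughly as follows. `spark.read.raster` and its `catalog_col_names` parameter are from the RasterFrames API; the one-row catalog construction, including the `scl` column, is an illustrative assumption.

```python
import pandas as pd

uri_base = 's3://s22s-test-geotiffs/luray_snp/{}.tif'
bands = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06',
         'B07', 'B08', 'B09', 'B11', 'B12']

# One-row catalog: each column holds the URI of one band's GeoTIFF,
# plus the scene classification (scl) band used later for masking.
catalog = pd.DataFrame([
    {b: uri_base.format(b) for b in bands + ['scl']}
])

# Each band column in the catalog becomes a tile column in the DataFrame.
df = spark.read.raster(catalog, catalog_col_names=bands + ['scl'])
```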
@@ -55,11 +59,13 @@ df.printSchema()
 
 ### Label Data
 
-The land classification labels are based on a smalls set of hand drawn polygons in the geojson file [here](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/test/resources/luray-labels.geojson). The property `id` indicates the type of land cover in each area. For these integer values 1 is forest, 2 is cropland, 3 is developed areas.
+The land classification labels are based on a small set of hand-drawn polygons in the geojson file [here](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/test/resources/luray-labels.geojson). The property `id` indicates the type of land cover in each area. Among these integer values, 1 is forest, 2 is cropland, and 3 is developed areas.
+
+We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. To speed up the join for the small vector DataFrame, we put the `broadcast` hint on it. Spark will put a copy of it on each Spark executor.
 
-We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive but in this case both datasets are quite small. After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tiles_ are the `id` property of the geojson; which we will use as labels in our supervised learning task.
+After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tile_ cells are the `id` property of the geojson, which we will use as labels in our supervised learning task. In areas where the geometry does not intersect, the cells will contain NoData.
 
-```python
+```python join_and_rasterize
 crses = df.select('crs.crsProj4').distinct().collect()
 print('Found ', len(crses), 'distinct CRS.')
 crs = crses[0][0]
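The body of `join_and_rasterize` is also elided here, so a rough sketch of the broadcast join and burn-in described above may help. The geojson reader, file path, and 128x128 tile dimensions are assumptions for illustration; `st_reproject`, `st_intersects`, `st_geometry`, and `rf_rasterize` are RasterFrames functions.

```python
from pyspark.sql.functions import broadcast, lit

# Read the small label layer and reproject it to the imagery CRS
# (reader and path are assumptions for illustration).
label_df = spark.read.geojson('luray-labels.geojson') \
    .select('id', st_reproject('geometry', lit('EPSG:4326'), lit(crs)).alias('geometry'))

# Broadcast the tiny vector side so the spatial join avoids shuffling
# the much larger raster DataFrame.
df_joined = df.join(broadcast(label_df),
                    st_intersects(st_geometry('extent'), 'geometry'))

# "Burn in" the polygon id over each tile's extent; cells the geometry
# does not cover come back as NoData.
df_labeled = df_joined.withColumn('label',
    rf_rasterize('geometry', st_geometry('extent'), 'id', lit(128), lit(128)))
```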
@@ -77,9 +83,9 @@ FROM df_joined
 """)
 ```
 
-## Masking NoData
+## Masking Poor Quality Cells
 
-To filter only for good quality pixels, we follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking we will just sort on the mask cell values later in the process.
+To filter only for good quality pixels, we follow roughly the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually setting NoData values in the unwanted cells of any of the imagery bands, we will simply filter out rows based on the mask cell values later in the process.
 
 ```python make_mask
 from pyspark.sql.functions import lit
@@ -90,11 +96,16 @@ mask_part = df_labeled.withColumn('nodata', rf_local_equal('scl', lit(0))) \
     .withColumn('cloud9', rf_local_equal('scl', lit(9))) \
     .withColumn('cirrus', rf_local_equal('scl', lit(10)))
 
-df_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
+df_mask_inv = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
     .withColumn('mask', rf_local_add('mask', 'cloud8')) \
     .withColumn('mask', rf_local_add('mask', 'cloud9')) \
    .withColumn('mask', rf_local_add('mask', 'cirrus')) \
     .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus')
+# At this point the mask contains 0 for good cells and 1 for defect, cloud, etc.
+# Convert the cell type and declare the value 1 to be NoData.
+df_mask = df_mask_inv.withColumn('mask',
+    rf_with_no_data(rf_convert_cell_type('mask', 'uint8'), 1.0)
+)
 
 df_mask.printSchema()
 ```
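One way to confirm the inverted-mask trick behaves as intended is to count data versus NoData cells in the new `mask` column; a small check along these lines, using the RasterFrames cell-counting functions:

```python
# Good cells remain data (value 0); any cell flagged by one of the
# quality tests (value 1) is now NoData in the mask tile.
df_mask.select(
    rf_data_cells('mask').alias('good_cells'),
    rf_no_data_cells('mask').alias('masked_cells')
).show(5)
```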
@@ -113,23 +124,22 @@ from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 from pyspark.ml import Pipeline
 ```
 
-SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell or pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). If a _tile_ cell contains NoData, it will become a null value after the exploder stage. Then we filter out any rows that have missing or null values, which would cause an error during training. Finally we use the SparkML `VectorAssembler` to create that `Vector`.
 
-It is worth discussing a couple of interesting things about the `NoDataFilter`. First we filter by the mask column. This achieves the filtering of observations only to the good pixels, without having to explicitly mask and assign NoData in all eleven columns of raster data. The other column specified is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
+It is worth discussing a couple of interesting things about the `NoDataFilter`. First, we filter out missing values in the mask column. Recall that above we set undesirable pixels to NoData, so they will be removed at this stage. The other column for the `NoDataFilter` is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
 
 ```python, transformers
 exploder = TileExploder()
 
 noDataFilter = NoDataFilter() \
-    .setInputCols(cols + ['label', 'mask'])
+    .setInputCols(['label', 'mask'])
 
 assembler = VectorAssembler() \
     .setInputCols(bands) \
     .setOutputCol("features")
 ```
 
-We are going to use a decision tree for classification. You can swap out one of the other multi-class
-classification algorithms if you like. With the algorithm selected we can assemble our modeling pipeline.
+We are going to use a decision tree for classification. You can swap out one of the other multi-class classification algorithms if you like. With the algorithm selected we can assemble our modeling pipeline.
 
 ```python, pipeline
 classifier = DecisionTreeClassifier() \
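The diff cuts the `pipeline` chunk off after its first line; the assembly it refers to presumably follows the standard SparkML pattern, roughly like this sketch (the label and feature column wiring is inferred from the chunks above, not quoted from the file):

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier

# Wire the classifier to the rasterized label column and the assembled
# feature vector, then chain all four stages in order.
classifier = DecisionTreeClassifier() \
    .setLabelCol('label') \
    .setFeaturesCol(assembler.getOutputCol())

pipeline = Pipeline().setStages([exploder, noDataFilter, assembler, classifier])
pipeline.getStages()
```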
@@ -144,15 +154,15 @@ pipeline.getStages()
 
 ## Train the Model
 
-Push the "go button"! This will actually run each step of the Pipeline we created including fitting the decision tree model.
+Push the "go button"! This will actually run each step of the Pipeline we created, including fitting the decision tree model. We filter the DataFrame to only those tiles intersecting the label raster, because the label shapes are relatively sparse over the imagery. The result is logically equivalent with or without this filter, but filtering first is more efficient because less data flows into the pipeline.
 
 ```python, train
 model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())
 ```
 
 ## Model Evaluation
 
-To view the model's performance, we first call the pipeline's `transform` method on the training dataset. This transformed dataset will have the model's prediction included in each row. We next construct an evaluator and pass it the transformed dataset to easily compute the performance metric. We could also use a variety of DataFrame or SQL transformations to compute our metric if we like.
+To view the model's performance, we first call the pipeline's `transform` method on the training dataset. This transformed dataset will have the model's prediction included in each row. We next construct an evaluator and pass it the transformed dataset to easily compute the performance metric. We can also create custom metrics using a variety of DataFrame or SQL transformations.
 
 ```python eval
 prediction_df = model.transform(df_mask) \
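The `eval` chunk is likewise truncated here. An evaluator constructed along the usual SparkML lines would compute the accuracy metric the paragraph mentions (a sketch; the column wiring is inferred, not quoted from the file):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Score the training data, then compute overall accuracy.
prediction_df = model.transform(df_mask).cache()

eval = MulticlassClassificationEvaluator(
    predictionCol=classifier.getPredictionCol(),
    labelCol=classifier.getLabelCol(),
    metricName='accuracy')

accuracy = eval.evaluate(prediction_df)
print("\nAccuracy:", accuracy)
```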
@@ -168,7 +178,7 @@ accuracy = eval.evaluate(prediction_df)
 print("\nAccuracy:", accuracy)
 ```
 
-As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix.
+As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix. The categories down the rows are the predictions, and the truth labels are across the columns.
 
 ```python confusion_mtrx
 prediction_df.groupBy(classifier.getPredictionCol()) \
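The `confusion_mtrx` chunk is also cut off in the diff; a groupBy-pivot in this vein would produce the matrix oriented as described, predictions down the rows and truth labels across the columns (a sketch; the pivot step is an assumption):

```python
# Pivot true labels into columns, with one row per predicted class.
prediction_df.groupBy(classifier.getPredictionCol()) \
    .pivot(classifier.getLabelCol()) \
    .count() \
    .sort(classifier.getPredictionCol()) \
    .show()
```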
@@ -182,24 +192,40 @@ prediction_df.groupBy(classifier.getPredictionCol()) \
 Because the pipeline included a `TileExploder`, we will recreate the tiled data structure. The explosion transformation includes metadata enabling us to recreate the _tiles_. See the @ref:[`rf_assemble_tile`](reference.md#rf-assemble-tile) function documentation for more details. In this case, the pipeline is scoring all areas, regardless of whether they intersect the label polygons. This is simply done by removing the `label` column, as @ref:[discussed above](supervised-learning.md#create-ml-pipeline).
 
 ```python assemble_prediction
-scored = model.transform(df_mask.drop('label')).createOrReplaceTempView('scored')
+scored = model.transform(df_mask.drop('label'))
+scored.createOrReplaceTempView('scored')
 
 retiled = spark.sql("""
-SELECT extent, crs, rf_assemble_tile(column_index, row_index, prediction, 128, 128) as prediction
+SELECT extent, crs,
+    rf_assemble_tile(column_index, row_index, prediction, 128, 128) as prediction,
+    rf_assemble_tile(column_index, row_index, B04, 128, 128) as red,
+    rf_assemble_tile(column_index, row_index, B03, 128, 128) as grn,
+    rf_assemble_tile(column_index, row_index, B02, 128, 128) as blu
 FROM scored
 GROUP BY extent, crs
 """)
 
 retiled.printSchema()
 ```
 
-Take a look at a sample of the resulting output. Recall the label coding: 1 is forest (purple), 2 is cropland (green) and 3 is developed areas(yellow).
+Take a look at a sample of the resulting output and the corresponding area's red-green-blue composite image.
 
-```python display_prediction
-display(
-    retiled.select('prediction') \
+```python display_rgb
+sample = retiled.select('prediction', 'red', 'grn', 'blu') \
     .sort(-rf_tile_sum(rf_local_equal('prediction', lit(3.0)))) \
-    .first()['prediction']
-)
+    .first()
+
+sample_prediction = sample['prediction']
+
+red = sample['red'].cells
+grn = sample['grn'].cells
+blu = sample['blu'].cells
+sample_rgb = np.concatenate([red[:, :, None], grn[:, :, None], blu[:, :, None]], axis=2)
+plt.imshow(sample_rgb)
 ```
 
+Recall the label coding: 1 is forest (purple), 2 is cropland (green) and 3 is developed areas (yellow).
+
+```python display_prediction
+display(sample_prediction)
+```
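One caveat on the new RGB composite: `plt.imshow` expects RGB float data scaled to [0, 1], and raw Sentinel-2 cell values will exceed that by orders of magnitude. A rescaling step along these lines (a hypothetical addition, not part of this commit) gives a usable image:

```python
def rescale(band):
    # Hypothetical helper: stretch each band to [0, 1], clipping the
    # extreme 2% tails so a few bright pixels do not wash out contrast.
    lo, hi = np.nanpercentile(band, (2, 98))
    return np.clip((band - lo) / (hi - lo), 0.0, 1.0)

sample_rgb = np.concatenate(
    [rescale(b)[:, :, None] for b in (red, grn, blu)], axis=2)
plt.imshow(sample_rgb)
```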
