Commit 6502d0d

Add new label data set for supervised learning docs; flesh out initial text
Signed-off-by: Jason T. Brown <jason@astraea.earth>
1 parent 7f2e668 commit 6502d0d

3 files changed: +59 -33 lines changed
pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ display(tile) # IPython.display function
3333

3434
Within an IPython or Jupyter interpreter a Pandas DataFrame containing a column of _tiles_ will be rendered as the samples discussed above. Simply import the `rf_ipython` submodule to enable enhanced HTML rendering of a Pandas DataFrame.
3535

36-
In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](pandas-numpy.md).
36+
In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](numpy-pandas.md).
3737

3838
```python toPandas, evaluate=True
3939
import pyrasterframes.rf_ipython

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 40 additions & 32 deletions
@@ -1,6 +1,6 @@
11
# Supervised Machine Learning
22

3-
In this example we will demonstrate how to fit and score an unsupervised learning model with a sample of Sentinel-2 data and labels from the US [National Land Cover Dataset](https://www.mrlc.gov/) (NLCD).
3+
In this example we will demonstrate how to fit and score a supervised learning model with a sample of Sentinel-2 data and hand-drawn vector labels over different [land cover](https://en.wikipedia.org/wiki/Land_cover) types.
44

55
```python, setup, echo=False
66
from IPython.core.display import display
@@ -16,19 +16,7 @@ spark = create_rf_spark_session()
1616

1717
## Create and Read Raster Catalog
1818

19-
We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html).
20-
21-
```python, imports, echo=True
22-
from pyrasterframes import TileExploder
23-
from pyrasterframes.rf_types import NoDataFilter
24-
25-
from pyspark.ml.feature import VectorAssembler
26-
from pyspark.ml.classification import DecisionTreeClassifier
27-
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
28-
from pyspark.ml import Pipeline
29-
```
30-
31-
The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time; and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
19+
The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time; and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
3220

3321
```python, read_bands, term=True
3422
uri_base = 's3://s22s-test-geotiffs/luray_snp/{}.tif'
@@ -67,14 +55,16 @@ df.printSchema()
6755

6856
### Label Data
6957

70-
[](https://github.com/locationtech/rasterframes/blob/develop/core/src/test/resources/L8-Labels-Elkton-VA.geojson)
58+
The land classification labels are based on a small set of hand-drawn polygons in the GeoJSON file [here](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/test/resources/luray-labels.geojson). The property `id` indicates the type of land cover in each area. For these integer values, 1 is forest, 2 is cropland, and 3 is developed areas.
59+
60+
We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tiles_ are the `id` property of the GeoJSON, which we will use as labels in our supervised learning task.
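To make the idea of "burning in" concrete, here is a minimal plain-Python sketch. It is not the `rf_rasterize` implementation. It assumes a tile is a nested list, that a cell takes the polygon's value when the cell center falls inside the polygon, and that unlabeled cells are 0. The helper names `point_in_polygon` and `burn_in` and the square polygon are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Even-odd ray casting test; `ring` is a list of (x, y) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def burn_in(ring, value, extent, cols, rows, nodata=0):
    """Return a rows x cols grid holding `value` where cell centers fall inside `ring`."""
    xmin, ymin, xmax, ymax = extent
    cell_w = (xmax - xmin) / cols
    cell_h = (ymax - ymin) / rows
    grid = []
    for r in range(rows):
        # Row 0 is the top of the extent (an assumption of this sketch).
        y = ymax - (r + 0.5) * cell_h
        row = []
        for c in range(cols):
            x = xmin + (c + 0.5) * cell_w
            row.append(value if point_in_polygon(x, y, ring) else nodata)
        grid.append(row)
    return grid


# Hypothetical square "cropland" polygon (id 2) in the middle of a 4 x 4 tile.
square = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
label_tile = burn_in(square, value=2, extent=(0.0, 0.0, 4.0, 4.0), cols=4, rows=4)
for row in label_tile:
    print(row)
```

Only the four center cells receive the value 2; every other cell keeps the fill value, which is why a later step can select labeled tiles by checking that the tile sum is greater than zero.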
7161

7262
```python
7363
crses = df.select('crs.crsProj4').distinct().collect()
7464
print('Found ', len(crses), 'distinct CRS.')
7565
crs = crses[0][0]
7666

77-
label_df = spark.read.geojson(os.path.join(resource_dir_uri(), "L8-Labels-Elkton-VA.geojson")) \
67+
label_df = spark.read.geojson(os.path.join(resource_dir_uri(), 'luray-labels.geojson')) \
7868
.select('id', st_reproject('geometry', 'EPSG:4326', crs).alias('geometry')) \
7969
.hint('broadcast')
8070

@@ -87,11 +77,9 @@ FROM df_joined
8777
""")
8878
```
8979

90-
91-
9280
## Masking NoData
9381

94-
We will follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking we will just sort on the mask cell values later in the process
82+
To keep only good-quality pixels, we follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking, we will just filter on the mask cell values later in the process.
9583

9684
```python make_mask
9785
from pyspark.sql.functions import lit
@@ -111,10 +99,23 @@ df_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
11199
df_mask.printSchema()
112100
```
113101

114-
115102
## Create ML Pipeline
116103

117-
The data preparation modeling pipeline is next. SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel. Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
104+
We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html). These are the objects that will work in sequence to conduct the data preparation and modeling.
105+
106+
```python, imports, echo=True
107+
from pyrasterframes import TileExploder
108+
from pyrasterframes.rf_types import NoDataFilter
109+
110+
from pyspark.ml.feature import VectorAssembler
111+
from pyspark.ml.classification import DecisionTreeClassifier
112+
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
113+
from pyspark.ml import Pipeline
114+
```
115+
116+
SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
117+
118+
Two things about the `NoDataFilter` are worth noting. First, we filter by the mask column. This restricts the observations to the good pixels without having to explicitly mask and assign NoData in all eleven columns of raster data. The other column specified is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
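The three preparation stages can be illustrated with a plain-Python sketch. This is a conceptual model only, not the `TileExploder`, `NoDataFilter`, or `VectorAssembler` code. It assumes a tile is a nested list, that NoData is represented by `None`, and that a filter column missing from the input is simply skipped. All function names and sample values are hypothetical.

```python
def explode_tiles(tiles):
    """Yield one record per cell from a dict of equally sized 2-D tiles."""
    names = list(tiles)
    rows = len(tiles[names[0]])
    cols = len(tiles[names[0]][0])
    for r in range(rows):
        for c in range(cols):
            record = {'column_index': c, 'row_index': r}
            for name in names:
                record[name] = tiles[name][r][c]
            yield record


def no_data_filter(records, columns):
    """Keep records with a value in every listed column; absent columns are ignored."""
    for rec in records:
        if all(rec[col] is not None for col in columns if col in rec):
            yield rec


def assemble(records, input_cols, output_col='features'):
    """Pack the input columns of each record into a single feature list."""
    for rec in records:
        out = dict(rec)
        out[output_col] = [float(rec[col]) for col in input_cols]
        yield out


# Hypothetical 2 x 2 tiles: two bands, a mask (None marks a bad pixel), and labels.
bands = {'B01': [[10, 20], [30, 40]], 'B02': [[1, 2], [3, 4]]}
mask = {'mask': [[1, None], [1, 1]]}
label = {'label': [[2, 2], [None, 1]]}


def prepare(tiles):
    cells = explode_tiles(tiles)
    good = no_data_filter(cells, ['mask', 'label'])
    return list(assemble(good, ['B01', 'B02']))


train = prepare({**bands, **mask, **label})  # fitting: labels present
score = prepare({**bands, **mask})           # scoring: no label column
print(len(train), len(score))
```

With labels present, a cell must have both a good mask value and a label to survive. When the `label` tile is left out, the same filter only checks the mask, so every good pixel is kept for scoring.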
118119

119120
```python, transformers
120121
exploder = TileExploder()
@@ -151,35 +152,38 @@ model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())
151152

152153
## Model Evaluation
153154

154-
To view the model's performance ....
155+
To view the model's performance, we first call the pipeline's `transform` method on the training dataset. This transformed dataset will have the model's prediction included in each row. We next construct an evaluator and pass it the transformed dataset to easily compute the performance metric. We could also use a variety of DataFrame or SQL transformations to compute our metric if we like.
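As a reminder of what the accuracy metric measures, here is a small plain-Python sketch that computes it by hand. It is independent of Spark; the rows and the helper name `accuracy` are hypothetical, and the column names `prediction` and `label` are assumed for illustration.

```python
def accuracy(rows, prediction_col='prediction', label_col='label'):
    """Fraction of rows whose prediction equals the label."""
    rows = list(rows)
    correct = sum(1 for row in rows if row[prediction_col] == row[label_col])
    return correct / len(rows)


# Hypothetical scored rows: three of the four predictions match the label.
rows = [
    {'prediction': 1.0, 'label': 1},
    {'prediction': 2.0, 'label': 2},
    {'prediction': 3.0, 'label': 2},
    {'prediction': 1.0, 'label': 1},
]
acc = accuracy(rows)
print('Accuracy:', acc)
```

Because this score is computed on the same data used for fitting, it describes how well the model fits the training set rather than how well it generalizes.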
155156

156157
```python eval
158+
prediction_df = model.transform(df_mask) \
159+
.drop(assembler.getOutputCol()).cache()
160+
prediction_df.printSchema()
161+
157162
eval = MulticlassClassificationEvaluator(predictionCol=classifier.getPredictionCol(),
158163
labelCol=classifier.getLabelCol(),
159164
metricName='accuracy',
160165
)
161166

162-
prediction_df = model.transform(df_mask) \
163-
.drop(assembler.getOutputCol()).cache()
164167
accuracy = eval.evaluate(prediction_df)
165-
accuracy
168+
print("\nAccuracy:", accuracy)
166169
```
167170

168-
169-
We can take a quick look at the resulting confusion matrix.
171+
As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix.
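The group-and-pivot logic can be sketched in plain Python to show what each cell of the matrix counts. This is an illustration with hypothetical `(prediction, label)` pairs, not Spark code. One difference to keep in mind: this sketch fills missing combinations with 0, whereas a Spark pivot leaves them as null.

```python
from collections import Counter

# Hypothetical (prediction, label) pairs for six scored cells.
pairs = [(1, 1), (1, 1), (2, 2), (2, 3), (3, 3), (1, 2)]

counts = Counter(pairs)
labels = sorted({label for _, label in pairs})
predictions = sorted({pred for pred, _ in pairs})

# One row per predicted class, one column per true label.
matrix = {
    pred: [counts[(pred, label)] for label in labels]
    for pred in predictions
}

print('prediction', labels)
for pred in predictions:
    print(pred, matrix[pred])
```

Entries where the prediction equals the label are the correctly classified cells; everything else is a misclassification.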
170172

171173
```python confusion_mtrx
172174
prediction_df.groupBy(classifier.getPredictionCol()) \
173175
.pivot(classifier.getLabelCol()) \
174-
.count().show(20, False)
176+
.count() \
177+
.sort(classifier.getPredictionCol()).show(20, False)
175178
```
176179

177180
## Visualize Prediction
178181

179-
We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.
182+
Because the pipeline included a `TileExploder`, we will recreate the tiled data structure. The explosion transformation includes metadata enabling us to recreate the _tiles_. See the @ref:[`rf_assemble_tile`](reference.md#rf-assemble-tile) function documentation for more details. In this case, the pipeline is scoring on all areas, regardless of whether they intersect the label polygons. This is simply done by removing the `label` column, as @ref:[discussed above](supervised-learning.md#create-ml-pipeline).
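Conceptually, reassembly places each exploded cell back at its `column_index` and `row_index` within a tile of known size. The plain-Python sketch below illustrates the idea. It is not the `rf_assemble_tile` implementation; the helper name `assemble_tile`, the use of `None` for unfilled cells, and the sample cells are assumptions.

```python
def assemble_tile(cells, cols, rows, nodata=None):
    """Rebuild a rows x cols tile from (column_index, row_index, value) triples."""
    tile = [[nodata] * cols for _ in range(rows)]
    for column_index, row_index, value in cells:
        tile[row_index][column_index] = value
    return tile


# Hypothetical predictions for a 2 x 2 tile, arriving in arbitrary order.
cells = [(1, 1, 1.0), (0, 0, 1.0), (0, 1, 3.0), (1, 0, 2.0)]
prediction_tile = assemble_tile(cells, cols=2, rows=2)
print(prediction_tile)
```

The order in which cells arrive does not matter, which is why the cells of each tile can simply be grouped (here by `extent` and `crs`) and aggregated.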
180183

181184
```python assemble_prediction
182-
model.transform(df_mask.drop('label')).createOrReplaceTempView('scored')
185+
scored = model.transform(df_mask.drop('label'))
scored.createOrReplaceTempView('scored')  # returns None, so it is called in its own statement
186+
183187
retiled = spark.sql("""
184188
SELECT extent, crs, rf_assemble_tile(column_index, row_index, prediction, 128, 128) as prediction
185189
FROM scored
@@ -189,9 +193,13 @@ GROUP BY extent, crs
189193
retiled.printSchema()
190194
```
191195

192-
Take a look at a sample of the resulting output.
196+
Take a look at a sample of the resulting output. Recall the label coding: 1 is forest (purple), 2 is cropland (green), and 3 is developed areas (yellow).
193197

194198
```python display_prediction
195-
display(retiled.select('prediction').first()['prediction'])
199+
display(
200+
retiled.select('prediction') \
201+
.sort(-rf_tile_sum(rf_local_equal('prediction', lit(3.0)))) \
202+
.first()['prediction']
203+
)
196204
```
197205

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
1+
{
2+
"type": "FeatureCollection",
3+
"name": "L8-Labels-Elkton-VA",
4+
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:OGC:1.3:CRS84" } },
5+
"features": [
6+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.074908035755897, 38.325593965393153 ], [ -78.078776658402262, 38.327823897217776 ], [ -78.092819443861529, 38.32571146899194 ], [ -78.100161396610659, 38.313995346890302 ], [ -78.089376916671426, 38.31282539893752 ], [ -78.077634077459933, 38.317374423878938 ], [ -78.073898621221971, 38.322204512883999 ], [ -78.074908035755897, 38.325593965393153 ] ] ] } },
7+
{ "type": "Feature", "properties": { "id": 1 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.369170650882523, 38.58065284389172 ], [ -78.383640370680567, 38.54472654334991 ], [ -78.327374251542963, 38.538843940894999 ], [ -78.314259254758397, 38.575748016690518 ], [ -78.369170650882523, 38.58065284389172 ] ] ] } },
8+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.565452768282938, 38.692698239200176 ], [ -77.54853756075201, 38.687441203624303 ], [ -77.547985098699257, 38.68151785733005 ], [ -77.564314874957617, 38.685855270536351 ], [ -77.565452768282938, 38.692698239200176 ] ] ] } },
9+
{ "type": "Feature", "properties": { "id": 1 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.625973620483947, 38.384086145975928 ], [ -78.619781683426993, 38.388373376437947 ], [ -78.613688820548219, 38.386918574305838 ], [ -78.609339103717076, 38.395958714056761 ], [ -78.600276924123435, 38.395057914051847 ], [ -78.598004559194038, 38.384358026518214 ], [ -78.600712105415838, 38.372726630839622 ], [ -78.61155677616398, 38.374617746567012 ], [ -78.625973620483947, 38.384086145975928 ] ] ] } },
10+
{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.499872299697259, 38.265959576364672 ], [ -77.490962536589549, 38.262548509310811 ], [ -77.494072526256261, 38.251792574227593 ], [ -77.508081452628034, 38.250717051596098 ], [ -77.512193148300739, 38.258543877561848 ], [ -77.499872299697259, 38.265959576364672 ] ] ] } },
11+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.896560416594369, 38.384489202082833 ], [ -77.885446228394855, 38.379901618618995 ], [ -77.886558613990559, 38.386926446793112 ], [ -77.892167110596205, 38.388029313414151 ], [ -77.896560416594369, 38.384489202082833 ] ] ] } },
12+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.104587721849782, 38.230347976602772 ], [ -78.096623830061731, 38.232933469701152 ], [ -78.093733463904584, 38.227058631140601 ], [ -78.100422718389751, 38.223567355393556 ], [ -78.104587721849782, 38.230347976602772 ] ] ] } },
13+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -78.09500918449011, 38.235516869468242 ], [ -78.084617397721388, 38.240823731215684 ], [ -78.079911154013928, 38.23752678606543 ], [ -78.084755595273549, 38.232321061869477 ], [ -78.09500918449011, 38.235516869468242 ] ] ] } },
14+
{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.505050042952476, 38.306327713885693 ], [ -77.5048207527187, 38.300040419530752 ], [ -77.511898783174473, 38.289086070422542 ], [ -77.522735680615511, 38.292059924398963 ], [ -77.514377553898413, 38.300185554775908 ], [ -77.513706480506912, 38.310493467061114 ], [ -77.505050042952476, 38.306327713885693 ] ] ] } },
15+
{ "type": "Feature", "properties": { "id": 2 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.89809802814473, 38.532542129891446 ], [ -77.901444154311747, 38.525317699576725 ], [ -77.893247775998773, 38.518264149754522 ], [ -77.897703956491853, 38.513453996400983 ], [ -77.901049016994193, 38.520378827806503 ], [ -77.916274443433977, 38.528730212529119 ], [ -77.909492522815626, 38.535864361205661 ], [ -77.89809802814473, 38.532542129891446 ] ] ] } },
16+
{ "type": "Feature", "properties": { "id": 3 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -77.516032435253635, 38.794413501916893 ], [ -77.523118384311459, 38.785034143507744 ], [ -77.505102142982878, 38.775534328871899 ], [ -77.504411687954828, 38.760891842932352 ], [ -77.483604951868131, 38.747060201442238 ], [ -77.463488423830327, 38.747867575574077 ], [ -77.449558893048575, 38.757506997981018 ], [ -77.516032435253635, 38.794413501916893 ] ] ] } }
17+
]
18+
}
