pyrasterframes/src/main/python/docs/raster-write.pymd (1 addition, 1 deletion)

@@ -33,7 +33,7 @@ display(tile) # IPython.display function
Within an IPython or Jupyter interpreter a Pandas DataFrame containing a column of _tiles_ will be rendered as the samples discussed above. Simply import the `rf_ipython` submodule to enable enhanced HTML rendering of a Pandas DataFrame.

-In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](pandas-numpy.md).
+In the example below, notice the result is limited to a small subset. For more discussion about why this is important, see the @ref:[Pandas and NumPy discussion](numpy-pandas.md).
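
For orientation, here is a minimal sketch of the rendering workflow described above; the DataFrame `df` and its `tile` column are hypothetical stand-ins for whatever RasterFrame is in scope.

```python
# Hypothetical: `df` is any Spark DataFrame with a tile column named 'tile'.
import pyrasterframes.rf_ipython  # noqa: F401  (import registers the HTML renderers)

sample = df.select('tile').limit(3).toPandas()  # keep the subset small
sample  # in Jupyter, each tile cell renders as an image thumbnail
```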
pyrasterframes/src/main/python/docs/supervised-learning.pymd (40 additions, 32 deletions)

@@ -1,6 +1,6 @@
# Supervised Machine Learning

-In this example we will demonstrate how to fit and score an unsupervised learning model with a sample of Sentinel-2 data and labels from the US [National Land Cover Dataset](https://www.mrlc.gov/) (NLCD).
+In this example we will demonstrate how to fit and score a supervised learning model with a sample of Sentinel-2 data and hand-drawn vector labels over different [land cover](https://en.wikipedia.org/wiki/Land_cover) types.

-We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html).
-
-```python, imports, echo=True
-from pyrasterframes import TileExploder
-from pyrasterframes.rf_types import NoDataFilter
-
-from pyspark.ml.feature import VectorAssembler
-from pyspark.ml.classification import DecisionTreeClassifier
-from pyspark.ml.evaluation import MulticlassClassificationEvaluator
-from pyspark.ml import Pipeline
-```
-
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time; and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog has just one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
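
For orientation, a minimal sketch of such a one-row catalog and read follows; the session helper and reader are real pyrasterframes API, but the band URIs and column names are hypothetical.

```python
import pandas as pd
from pyrasterframes.utils import create_rf_spark_session

spark = create_rf_spark_session()

# One catalog row = one scene/time; one column per band (URIs are hypothetical).
catalog = pd.DataFrame([{
    'red': 'https://example.com/sentinel2/scene/B04.tif',
    'nir': 'https://example.com/sentinel2/scene/B08.tif',
}])

# Each named column becomes a tile column; a large scene is split into many tile rows.
df = spark.read.raster(catalog, catalog_col_names=['red', 'nir'])
```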
+The land classification labels are based on a small set of hand-drawn polygons in the geojson file [here](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/test/resources/luray-labels.geojson). The property `id` indicates the type of land cover in each area. For these integer values, 1 is forest, 2 is cropland, and 3 is developed areas.
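
A sketch of loading those labels follows; it assumes the GeoJSON `DataFrameReader` that pyrasterframes registers on import, and the variable names are illustrative.

```python
from pyspark import SparkFiles

labels_url = ('https://github.com/locationtech/rasterframes/raw/develop/'
              'pyrasterframes/src/test/resources/luray-labels.geojson')
spark.sparkContext.addFile(labels_url)

# `geojson` is a reader pyrasterframes adds; each feature becomes a row.
labels_df = spark.read.geojson(SparkFiles.get('luray-labels.geojson'))
labels_df.printSchema()  # expect a `geometry` column plus the `id` property
```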
+
+We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tiles_ are the `id` property of the geojson, which we will use as labels in our supervised learning task.
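
A sketch of that join and burn-in, assuming the raster DataFrame `df` has a `red` tile column and the labels carry `geometry` and `id`; the exact columns in the commit may differ.

```python
from pyspark.sql.functions import col
from pyrasterframes.rasterfunctions import (
    st_intersects, st_reproject, rf_mk_crs,
    rf_geometry, rf_crs, rf_dimensions, rf_rasterize,
)

# GeoJSON geometries are EPSG:4326; express them in the raster's CRS.
geom_native = st_reproject(col('geometry'), rf_mk_crs('EPSG:4326'), rf_crs('red'))

df_labeled = (
    df.join(labels_df, st_intersects(rf_geometry('red'), geom_native))
      .withColumn('label',
                  # Burn each polygon's `id` into a tile aligned with the raster grid.
                  rf_rasterize(geom_native, rf_geometry('red'), 'id',
                               rf_dimensions('red').cols, rf_dimensions('red').rows))
)
```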
-We will follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking we will just sort on the mask cell values later in the process
+To filter only for good-quality pixels, we follow the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually masking, we will just filter on the mask cell values later in the process.
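
For instance, a mask column might be derived from a scene-classification band along these lines; the `scl` column and the class values kept are assumptions, not the commit's exact choices.

```python
from pyspark.sql.functions import array, lit
from pyrasterframes.rasterfunctions import rf_local_is_in

good_classes = array(lit(4), lit(5), lit(6))  # hypothetical "good" SCL classes

# Cells become 1 where the class is acceptable, 0 elsewhere; we filter on this later.
df_mask = df_labeled.withColumn('mask', rf_local_is_in('scl', good_classes))
```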
-The data preparation modeling pipeline is next. SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel. Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html). These are the objects that will work in sequence to conduct the data preparation and modeling.
+
+```python, imports, echo=True
+from pyrasterframes import TileExploder
+from pyrasterframes.rf_types import NoDataFilter
+
+from pyspark.ml.feature import VectorAssembler
+from pyspark.ml.classification import DecisionTreeClassifier
+from pyspark.ml.evaluation import MulticlassClassificationEvaluator
+from pyspark.ml import Pipeline
+```
+
+SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the tiles into a single row per cell/pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). Then we filter out any rows that have `NoData` values (which will cause an error during training). Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+
+It is worth discussing a couple of interesting things about the `NoDataFilter`. First we filter by the mask column. This achieves the filtering of observations only to the good pixels, without having to explicitly mask and assign NoData in all eleven columns of raster data. The other column specified is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.

```python, transformers
exploder = TileExploder()
```
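
The rest of the pipeline wiring sits outside this hunk. A sketch of how these stages typically fit together follows, continuing from the imports and `exploder` above; the column lists and the `NoDataFilter` setter are assumptions based on the surrounding prose, not the file's exact code.

```python
# Hypothetical feature columns; the real file assembles eleven raster bands.
feature_cols = ['red', 'nir']

# Drop exploded rows with NoData in the label or mask columns (per the text above).
no_data_filter = NoDataFilter().setInputCols(['label', 'mask'])

assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
classifier = DecisionTreeClassifier(labelCol='label', featuresCol='features')

pipeline = Pipeline(stages=[exploder, no_data_filter, assembler, classifier])
```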

@@ -151,35 +152,38 @@ model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())

## Model Evaluation

-To view the model's performance ....
+To view the model's performance, we first call the pipeline's `transform` method on the training dataset. This transformed dataset will have the model's prediction included in each row. We next construct an evaluator and pass it the transformed dataset to easily compute the performance metric. We could also use a variety of DataFrame or SQL transformations to compute our metric if we like.
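
A sketch of that evaluation; the metric name is illustrative, and the `rf_tile_sum` filter mirrors the one used at fit time in the hunk header above.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyrasterframes.rasterfunctions import rf_tile_sum

# Score the labeled training subset; `transform` appends a `prediction` column.
prediction_df = model.transform(df_mask.filter(rf_tile_sum('label') > 0)).cache()

evaluator = MulticlassClassificationEvaluator(
    predictionCol='prediction', labelCol='label',
    metricName='f1')  # metric choice is illustrative
print(evaluator.evaluate(prediction_df))
```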

-We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.
+Because the pipeline included a `TileExploder`, we will recreate the tiled data structure. The explosion transformation includes metadata enabling us to recreate the _tiles_. See the @ref:[`rf_assemble_tile`](reference.md#rf-assemble-tile) function documentation for more details. In this case, the pipeline is scoring all areas, regardless of whether they intersect the label polygons. This is done simply by removing the `label` column, as @ref:[discussed above](supervised-learning.md#create-ml-pipeline).

SELECT extent, crs, rf_assemble_tile(column_index, row_index, prediction, 128, 128) as prediction
FROM scored

@@ -189,9 +193,13 @@ GROUP BY extent, crs

retiled.printSchema()
```

-Take a look at a sample of the resulting output.
+Take a look at a sample of the resulting output. Recall the label coding: 1 is forest (purple), 2 is cropland (green), and 3 is developed areas (yellow).
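
One way to pull such a sample for display, reusing the `rf_ipython` rendering noted in the other file; `retiled` comes from the context lines above, and the limit size is arbitrary.

```python
import pyrasterframes.rf_ipython  # noqa: F401  (enables HTML tile rendering)

# A small sample only; the prediction tiles render as colored thumbnails.
retiled.select('extent', 'prediction').limit(3).toPandas()
```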