Commit 07ee233

Edits: aggregation.pymd, supervised-learning.pymd, unsupervised-learning.pymd, time-series.pymd
1 parent 3f3682d commit 07ee233

6 files changed: 37 additions and 34 deletions

experimental/src/main/scala/org/locationtech/rasterframes/experimental/datasource/awspds/MODISCatalogDataSource.scala

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ object MODISCatalogDataSource extends LazyLogging with ResourceCacheSupport {
 "2018-03-12",
 "2018-03-13",
 "2018-03-14",
+"2018-03-15",
 "2018-05-16",
 "2018-05-17",
 "2018-05-18",

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 20 additions & 17 deletions
@@ -11,13 +11,13 @@ import os
 spark = create_rf_spark_session()
 ```
 
-There are 3 types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
+There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
 
 ## Tile Mean Example
 
-We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
+We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
 
-```python, sql_dataframe, results='raw'
+```python, sql_dataframe
 import pyspark.sql.functions as F
 
 rf = spark.sql("""
@@ -26,33 +26,36 @@ UNION
 SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
 """)
 
-rf.select("id", rf_render_matrix("tile")).show(truncate=False)
+tiles = rf.select("tile").collect()
+print(tiles[0]['tile'].cells)
+print(tiles[1]['tile'].cells)
 ```
 
-
-In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
+We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
 
 ```python, tile_mean, results='raw'
-rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(truncate=False)
+rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show()
 ```
 
-In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
+We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
 
 ```python, agg_mean, results='raw'
 rf.agg(rf_agg_mean(F.col('tile'))).show()
 ```
 
-In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
+We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
 
-To compute an element-wise local aggregate, _tiles_ need have the same dimensions as in the example below where both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
+To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
 
-```python, local_mean, results='raw'
-rf.agg(rf_agg_local_mean(F.col('tile')).alias("local_mean")).select(rf_render_matrix("local_mean")).show(truncate=False)
+```python, local_mean
+t = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
+    .collect()[0]['local_mean']
+print(t.cells)
 ```
 
 ## Cell Counts Example
 
-We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
+We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are ~3.8 million data cells and ~1.9 million NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
 
 ```python, cell_counts, results='raw'
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
@@ -86,7 +89,7 @@ rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
 .show()
 ```
 
-The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks, has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
+The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
 
 ```python, agg_local_stats
 rf = spark.sql("""
@@ -106,7 +109,7 @@ for r in agg_local_stats:
 
 ## Histogram
 
-The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of _tile_ in this DataFrame, but this histogram is just computed for the _tile_ in the first row.
+The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted each bin's `value` on the x-axis and `count` on the y-axis for the _tile_ in the first row of the DataFrame.
 
 
 ```python, tile_histogram
@@ -118,8 +121,8 @@ hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
 hist_df.printSchema()
 
 bins_row = hist_df.first()
-values = [int(row['value']) for row in bins_row.bins]
-counts = [int(row['count']) for row in bins_row.bins]
+values = [int(bin['value']) for bin in bins_row.bins]
+counts = [int(bin['count']) for bin in bins_row.bins]
 
 plt.hist(values, weights=counts, bins=100)
 plt.show()
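
Below is a minimal sketch, not part of this commit, of what the revised `local_mean` chunk prints. It assumes `create_rf_spark_session` lives in `pyrasterframes.utils` as in the docs' setup, that a collected `Tile` exposes its values as a NumPy array via `.cells` (which the new chunks rely on), and that the toy DataFrame's first `SELECT` builds the all-ones tile described in the prose above.

```python
import numpy as np
import pyspark.sql.functions as F
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_agg_local_mean

spark = create_rf_spark_session()

# Toy DataFrame from the doc: one tile of 1.0s and one tile of 3.0s.
# The first SELECT is an assumption reconstructed from the surrounding prose.
rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
""")

# Element-wise local mean of the two tiles: every cell should be 2.0.
local_mean = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
    .collect()[0]['local_mean']

expected = np.full((5, 5), 2.0)
assert np.allclose(np.asarray(local_mean.cells), expected)
```

Printing `t.cells` in the doc's chunk shows the same 5x5 array of 2.0 values that the assertion checks here.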

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 8 additions & 9 deletions
@@ -20,7 +20,7 @@ spark = create_rf_spark_session()
 
 ## Create and Read Raster Catalog
 
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame containing our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
 The imagery for feature data will come from [eleven bands of 60 meter resolution Sentinel-2](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial) imagery. We also will use the [scene classification (SCL)](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm) data to identify high quality, non-cloudy pixels.
 
@@ -65,7 +65,7 @@ The land classification labels are based on a small set of hand drawn polygons i
 
 We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. To speed up the join for the small vector DataFrame, we put the `broadcast` hint on it, which will tell Spark to put a copy of it on each Spark executor.
 
-After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tile_ cells are the `id` property of the GeoJSON; which we will use as labels in our supervised learning task. In areas where the geometry does not intersect, the cells will contain a NoData.
+After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tile_ cells are the `id` property of the GeoJSON, which we will use as labels in our supervised learning task. In areas where the geometry does not intersect, the cells will contain NoData.
 
 ```python join_and_rasterize
 crses = df.select('crs.crsProj4').distinct().collect()
@@ -87,7 +87,7 @@ FROM df_joined
 
 ## Masking Poor Quality Cells
 
-To filter only for good quality pixels, we follow roughly the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually setting NoData values in the unwanted cells of any of the imagery bands, we will just on filter out the mask cell values later in the process.
+To filter only for good quality pixels, we follow roughly the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually setting NoData values in the unwanted cells of any of the imagery bands, we will just filter out the mask cell values later in the process.
 
 ```python, make_mask
 from pyspark.sql.functions import lit
@@ -126,9 +126,9 @@ from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 from pyspark.ml import Pipeline
 ```
 
-SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the _tiles_ into a single row per cell or pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). If a _tile_ cell contains a NoData it will become a null value after the exploder stage. Then we filter out any rows that missing or null values, which will cause an error during training. Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the _tiles_ into a single row per cell or pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). If a _tile_ cell contains NoData it will become a null value after the exploder stage. Then we use the `NoDataFilter` to filter out any rows that have missing or null values, which would otherwise cause an error during training. Finally we use the SparkML `VectorAssembler` to create that `Vector`.
 
-It is worth discussing a couple of interesting things about the `NoDataFilter`. First, we filter out missing values in the mask column. Recall above we set undesirable pixels to NoData, so they will be removed at this stage. The other column for the `NoDataFilter` is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
+Recall that above we set undesirable pixels to NoData, so the `NoDataFilter` will remove them at this stage. We apply the filter to the `mask` column and the `label` column, the latter being used during training. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
 
 ```python, transformers
 exploder = TileExploder()
@@ -156,7 +156,7 @@ pipeline.getStages()
 
 ## Train the Model
 
-Push the "go button"! This will actually run each step of the Pipeline we created including fitting the decision tree model. We filter the DataFrame for only _tiles_ intersecting the label raster, because the label shapes are relatively sparse over the imagery. It would be logically equivalent to either include or exclude this, but it is more efficient to do this filter because it will mean less data going into the pipeline.
+The next step is to actually run each step of the Pipeline we created, including fitting the decision tree model. We filter the DataFrame for only _tiles_ intersecting the label raster because the label shapes are relatively sparse over the imagery. It would be logically equivalent to either include or exclude this step, but it is more efficient to filter because it will mean less data going into the pipeline.
 
 ```python, train
 model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())
@@ -173,14 +173,13 @@ prediction_df.printSchema()
 
 eval = MulticlassClassificationEvaluator(predictionCol=classifier.getPredictionCol(),
 labelCol=classifier.getLabelCol(),
-metricName='accuracy',
-)
+metricName='accuracy')
 
 accuracy = eval.evaluate(prediction_df)
 print("\nAccuracy:", accuracy)
 ```
 
-As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix. The categories down the rows are the predictions, and the truth labels are across the columns.
+As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix.
 
 ```python, confusion_mtrx, results='raw'
 prediction_df.groupBy(classifier.getPredictionCol()) \
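
The hunks above describe the `TileExploder` -> `NoDataFilter` -> `VectorAssembler` -> `DecisionTreeClassifier` pipeline but only show the first line of the `transformers` chunk. Below is a hedged sketch of how those stages could be wired; it is not the commit's code, the band column names are hypothetical, and it assumes `NoDataFilter` is importable from `pyrasterframes` and exposes an `inputCols` param, as the prose implies.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyrasterframes import TileExploder, NoDataFilter  # NoDataFilter import is an assumption

band_cols = ['b' + str(i) for i in range(1, 12)]        # hypothetical exploded band column names

exploder = TileExploder()                               # one row per cell/pixel
no_data_filter = NoDataFilter() \
    .setInputCols(['mask', 'label'])                    # drop rows where mask or label is null
assembler = VectorAssembler() \
    .setInputCols(band_cols) \
    .setOutputCol('features')                           # pack band values into a single Vector
classifier = DecisionTreeClassifier() \
    .setLabelCol('label') \
    .setFeaturesCol(assembler.getOutputCol())

pipeline = Pipeline().setStages([exploder, no_data_filter, assembler, classifier])
```

As in the doc's `train` chunk, calling `pipeline.fit(...)` on the joined, masked DataFrame would then run every stage, including fitting the decision tree.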

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 3 additions & 3 deletions
@@ -95,8 +95,8 @@ time_series = rf_ndvi \
 .withColumn('ndvi_wt', rf_tile_sum('ndvi_masked')) \
 .withColumn('wt', rf_data_cells('ndvi_masked')) \
 .groupby(year('acquisition_date').alias('year'), weekofyear('acquisition_date').alias('week')) \
-.agg(sql_sum('ndvi_wt').alias('ndvi_wt'), sql_sum('wt').alias('wt')) \
-.withColumn('ndvi', col('ndvi_wt') / col('wt'))
+.agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
+.withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))
 time_series.printSchema()
 time_series.persist()
 ```
@@ -114,4 +114,4 @@ plt.ylabel('NDVI')
 plt.title('Cuyahoga Valley NP Green-up')
 ```
 
-We can see two fairly clear elbows in the curve at week 17 and week 21, indicating the start and end of the green up period. Estimation of such parameters is one technique [phenology](https://en.wikipedia.org/wiki/Phenology) researchers use to monitor changes in climate and environment!
+We can see two fairly clear elbows in the curve at week 17 and week 21, indicating the start and end of the green-up period. Estimation of such parameters is one technique [phenology](https://en.wikipedia.org/wiki/Phenology) researchers use to monitor changes in climate and environment.
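
The renamed aggregate columns make the weighted weekly mean easier to follow: per-tile NDVI sums and data-cell counts are summed within each week, then divided. Below is a standalone toy sketch of that pattern in plain PySpark; it is not from the commit, and the data and session setup are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sql_sum, col

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Per-tile NDVI sum and data-cell count, tagged with the week it was acquired.
per_tile = spark.createDataFrame(
    [(17, 12.0, 20), (17, 6.0, 10), (18, 9.0, 10)],
    ['week', 'ndvi_wt', 'wt'])

# Weighted weekly mean: sum the per-tile sums and counts, then divide.
weekly = per_tile.groupBy('week') \
    .agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
    .withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))

weekly.show()   # week 17 -> (12 + 6) / (20 + 10) = 0.6
```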

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 4 additions & 4 deletions
@@ -14,12 +14,12 @@ import os
 
 spark = create_rf_spark_session()
 
-import pandas as pd
 ```
 
-We import various Spark components that we need to construct our `Pipeline`.
+We import various Spark components needed to construct our `Pipeline`.
 
 ```python, imports, echo=True
+import pandas as pd
 from pyrasterframes import TileExploder
 from pyrasterframes.rasterfunctions import rf_assemble_tile, rf_crs, rf_extent, rf_tile, rf_dimensions
 
@@ -37,7 +37,7 @@ filenamePattern = "L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
 {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
 ])
-df = spark.read.raster(catalog=catalog_df.to_csv(index=None), catalog_col_names=catalog_df.columns)
+df = spark.read.raster(catalog=catalog_df, catalog_col_names=catalog_df.columns)
 df = df.select(
 rf_crs(df.b1).alias('crs'),
 rf_extent(df.b1).alias('extent'),
@@ -95,7 +95,7 @@ clustered = model.transform(df)
 clustered.show(8)
 ```
 
-If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion:
+If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion to access the clustering results:
 
 ```python, cluster_stats
 cluster_stage = model.stages[2]
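
The `cluster_stats` hunk pulls the fitted clustering stage out of the `PipelineModel` with `model.stages[2]`. Below is a standalone sketch of that kind of "contortion" using plain SparkML; it is not from the commit, and the toy data and stage index are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Two obvious clusters of 2-D points.
df = spark.createDataFrame(
    [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 8.9)], ['x', 'y'])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=['x', 'y'], outputCol='features'),
    KMeans(k=2, featuresCol='features')])
model = pipeline.fit(df)

# The fitted KMeansModel is buried in the PipelineModel's stages list;
# here it is the last stage, in the doc it happens to be stages[2].
kmeans_model = model.stages[-1]
print(kmeans_model.clusterCenters())
if kmeans_model.hasSummary:                  # training summary kept on the fitted model
    print(kmeans_model.summary.clusterSizes)
```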

pyrasterframes/src/main/python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ def run(self):
 import traceback
 import pweave
 bad_words = ["Error"]
-pweave.rcParams["chunk"]["defaultoptions"].update({'wrap': False, 'dpi': 100})
+pweave.rcParams["chunk"]["defaultoptions"].update({'wrap': False, 'dpi': 175})
 if self.format == 'markdown':
 pweave.PwebFormats.formats['markdown'] = {
 'class': PegdownMarkdownFormatter,
