
Commit 04bd1d1

Merge branch 'develop' into feature/raster-read-uris

* develop:
  * Expanded the TOC to include machine learning subsections
  * Fix image rendering in the supervised ML docs
  * Remove dupe of test l8 label geojson, not needed
  * Tweaks from PR feedback
  * Update link to numpy doc
  * Add new label data set for supervised learning docs; flesh out initial text
  * Remove deep learning page, for now
  * Correct NoDataFilter implementation in Python; supervised ML doc example
  * Remove stray paren from Tile PNG header
  * Some minor style tweaks to aggregation docs
  * fix wording
  * fix wording of rf_agg_approx_histogram description
  * add histogram & pretty printing
  * jason's edits
  * add links
  * add tile aggregates aggregation functions
  * move example code to aggregation
  * add picture, fix display for pyweave

2 parents 51b5a2f + 7f71409

File tree

12 files changed: +400 −59 lines


build.sbt

Lines changed: 1 addition & 0 deletions

@@ -139,6 +139,7 @@ lazy val docs = project
     "version" -> version.value,
     "scaladoc.org.apache.spark.sql.rf" -> "http://rasterframes.io/latest"
   ),
+  paradoxNavigationExpandDepth := Some(3),
   paradoxTheme := Some(builtinParadoxTheme("generic")),
   makeSite := makeSite.dependsOn(Compile / unidoc).dependsOn(Compile / paradox).value,
   Compile / paradox / sourceDirectories += (pyrasterframes / Python / doc / target).value,

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 115 additions & 4 deletions

@@ -10,13 +10,124 @@ import os
 spark = create_rf_spark_session()
 ```
 
-## Cell Counts & Tile Mean
+There are 3 types of aggregate functions: tile aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a `tile` column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of tiles.
+
+## Tile Mean Example
+
+We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 tiles, where the first tile is composed of 25 values of 1.0 and the second tile is composed of 25 values of 3.0.
+
+```python
+import pyspark.sql.functions as F
+
+rf = spark.sql("""
+SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
+UNION
+SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
+""")
+
+rf.select("id", rf_render_matrix("tile")).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the tile aggregate mean of cells in each row of column `tile`. The mean of each tile is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
+
+```python
+rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned, since the average is computed over the full DataFrame.
+
+```python
+rf.agg(rf_agg_mean(F.col('tile'))).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it computes the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, doing so twenty-five times, once for each position in the `tile`.
+
+To compute an element-wise local aggregate, tiles need to have the same dimensions, as in the example below where both tiles have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over a DataFrame with unequal tile dimensions, we would get a runtime error.
 
 ```python
-rf = spark.read.geotiff(os.path.join(resource_dir(), 'L8-B8-Robinson-IL.tiff'))
-rf.show(5, False)
+rf.agg(rf_agg_local_mean(F.col('tile')).alias("local_mean")).select(rf_render_matrix("local_mean")).show(10, False)
+```
+
+## Cell Counts Example
+
+We can also count the total number of data and NoData cells over all the tiles in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
+
+```python
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
+stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
 
-stats = rf.agg(rf_agg_no_data_cells('tile'), rf_agg_data_cells('tile'), rf_agg_mean('tile'))
 stats.show(5, False)
 ```
 
+## Statistical Summaries
+
+The statistical summary functions return a summary of cell values: number of data cells, number of NoData cells, minimum, maximum, mean, and variance. Each can be computed as a tile aggregate, a DataFrame aggregate, or an element-wise local aggregate.
+
+The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a `tile` column as shown below.
+
+```python
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
+stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))
+
+stats.printSchema()
+stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').show(10, False)
+```
+
+The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the tiles in a DataFrame and returns a statistical summary of all cell values as shown below.
+
+```python
+rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
+    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance') \
+    .show(10, False)
+```
+
+The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal tile dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
+
+```python
+rf = spark.sql("""
+SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
+UNION
+SELECT 2 as id, rf_make_constant_tile(3, 5, 5, 'float32') as tile
+UNION
+SELECT 3 as id, rf_make_constant_tile(5, 5, 5, 'float32') as tile
+""").agg(rf_agg_local_stats('tile').alias('stats'))
+
+agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()
+
+for r in agg_local_stats:
+    for stat in r.asDict():
+        print(stat, ':\n', r[stat], '\n')
+```
+
+## Histogram
+
+The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of `tile` and outputs a `bins` array with the schema below. In the graph below, we plot `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of `tile` in this DataFrame, but this histogram is computed only for the `tile` in the first row.
+
+```python
+import matplotlib.pyplot as plt
+
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
+
+hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
+hist_df.printSchema()
+
+bins_row = hist_df.first()
+values = [int(row['value']) for row in bins_row.bins]
+counts = [int(row['count']) for row in bins_row.bins]
+
+plt.hist(values, weights=counts, bins=100)
+plt.show()
+```
+
+The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of `tile` in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than on the previous histogram, since this histogram is computed over all cell values in the DataFrame.
+
+```python
+bins_list = rf.agg(
+    rf_agg_approx_histogram('proj_raster')['bins'].alias('bins')
+).collect()
+values = [int(row['value']) for row in bins_list[0].bins]
+counts = [int(row['count']) for row in bins_list[0].bins]
+
+plt.hist(values, weights=counts, bins=100)
+plt.show()
+```
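The distinction the aggregation docs draw between tile, DataFrame, and element-wise local aggregates can be sketched in plain Python, with tiles modeled as nested lists. This is a hedged illustration of the semantics only, not RasterFrames code; the helpers `tile_mean`, `agg_mean`, and `agg_local_mean` are hypothetical stand-ins for `rf_tile_mean`, `rf_agg_mean`, and `rf_agg_local_mean`.

```python
# Conceptual sketch (plain Python, no Spark) of the three aggregate means.
# Tiles are nested lists; these helpers are hypothetical stand-ins for the
# RasterFrames functions, not their implementation.

def tile_mean(tile):
    # Tile aggregate (cf. rf_tile_mean): one mean per tile, per row.
    cells = [v for row in tile for v in row]
    return sum(cells) / len(cells)

def agg_mean(tiles):
    # DataFrame aggregate (cf. rf_agg_mean): one mean over every cell
    # of every tile in the group.
    cells = [v for t in tiles for row in t for v in row]
    return sum(cells) / len(cells)

def agg_local_mean(tiles):
    # Element-wise local aggregate (cf. rf_agg_local_mean): a tile of
    # per-position means; all tiles must share the same dimensions.
    rows, cols = len(tiles[0]), len(tiles[0][0])
    return [[sum(t[r][c] for t in tiles) / len(tiles) for c in range(cols)]
            for r in range(rows)]

ones = [[1.0] * 5 for _ in range(5)]    # stand-in for the tile of 1.0s
threes = [[3.0] * 5 for _ in range(5)]  # stand-in for the tile of 3.0s

print([tile_mean(t) for t in (ones, threes)])  # [1.0, 3.0]
print(agg_mean([ones, threes]))                # 2.0
print(agg_local_mean([ones, threes])[0])       # [2.0, 2.0, 2.0, 2.0, 2.0]
```

Note how the three calls mirror the doc text: two per-row means, one scalar over all fifty cells, and one 5x5 tile of element-wise means.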

pyrasterframes/src/main/python/docs/local-algebra.pymd

Lines changed: 6 additions & 3 deletions

@@ -14,6 +14,10 @@ spark = create_rf_spark_session()
 
 [Local map algebra](https://gisgeography.com/map-algebra-global-zonal-focal-local/) raster operations are element-wise operations on a single `tile`, between a `tile` and a scalar, between two `tile`s, or among many `tile`s. These operations are common in processing of earth observation and other image data.
 
+<img src="https://gisgeography.com/wp-content/uploads/2015/05/Local-Operation-Raster.png" alt="local op" style="width:200px;"/>
+
+Credit: [GISGeography](https://gisgeography.com)
+
 
 ## Computing NDVI
 
@@ -57,10 +61,9 @@ df.printSchema()
 We can inspect a sample of the data. Yellow indicates very healthy vegetation, and purple represents bare soil or impervious surfaces.
 
 ```python
-df.select(rf_tile('ndvi').alias('ndvi')).limit(5).toPandas()
-```
 
-**TODO** fix the display once issue #166 fix.
+display(df.select(rf_tile('ndvi').alias('ndvi')).first()['ndvi'])
+```
 
 We continue examining NDVI in the @ref:[time series](time-series.md) section.
 
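The element-wise ("local") operations this doc builds NDVI from can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the RasterFrames API: the `ndvi` helper is hypothetical, tiles are nested lists, and a zero denominator is mapped to 0.0 for simplicity.

```python
# Conceptual sketch (plain Python, no Spark) of element-wise NDVI,
# (NIR - Red) / (NIR + Red), computed cell by cell over two tiles of
# equal dimensions. Not the RasterFrames implementation.

def ndvi(nir, red):
    # Pair up cells position by position and apply the NDVI formula.
    return [[(n - r) / (n + r) if (n + r) != 0 else 0.0
             for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir, red)]

nir = [[0.8, 0.6], [0.7, 0.5]]  # hypothetical near-infrared reflectance tile
red = [[0.2, 0.2], [0.3, 0.5]]  # hypothetical red reflectance tile
print(ndvi(nir, red))
```

As with the local aggregate functions, the two input tiles must have the same dimensions for the element-wise pairing to make sense.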

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+# Machine Learning
+
+RasterFrames provides facilities to train and predict with a wide variety of machine learning models through [Spark ML Pipelines](https://spark.apache.org/docs/latest/ml-guide.html). It supplies pipeline components for supervised learning, unsupervised learning, and data preparation that can be used to represent and repeatably conduct machine learning tasks.
+
+The following sections provide some examples of how to integrate these workflows with RasterFrames.
+
+@@@ index
+
+* [Unsupervised Machine Learning](unsupervised-learning.md)
+* [Supervised Machine Learning](supervised-learning.md)
+
+@@@
+
+
+@@toc { depth=2 }

pyrasterframes/src/main/python/docs/raster-processing.md

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
 * @ref:["NoData" Handling](nodata-handling.md)
 * @ref:[Aggregation](aggregation.md)
 * @ref:[Time Series](time-series.md)
-* @ref:[Machine Learning](spark-ml.md)
+* @ref:[Machine Learning](machine-learning.md)
 
 @@@
 

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # Writing Raster Data
 
-RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](spark-ml.md), or some other result that is generally much smaller than the input data set.
+RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](machine-learning.md), or some other result that is generally much smaller than the input data set.
 
 However there are times in any analysis where writing a representative sample of the work in progress provides invaluable feedback on the process and results.
 

pyrasterframes/src/main/python/docs/reference.pymd

Lines changed: 3 additions & 41 deletions

@@ -399,17 +399,7 @@ Performs natural logarithm of cell values plus one. Inverse of @ref:[`rf_expm1`]
 
 ## Tile Statistics
 
-The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row. Consider the following example.
-
-```python
-import pyspark.sql.functions as F
-spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t
-""").select(F.col('id'), rf_tile_sum(F.col('t'))).show()
-```
-
+The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row.
 
 ### rf_tile_sum
 
@@ -489,17 +479,7 @@ spark.sql("SELECT rf_tile_histogram(rf_make_ones_tile(5, 5, 'float32')) as tile_
 
 ## Aggregate Tile Statistics
 
-These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group. Example use below computes a single double-valued mean per month, across all data cells in the `red_band` `tile` type column. This would return at most twelve rows.
-
-Continuing our example from the @ref:[Tile Statistics](reference.md#tile-statistics) section, consider the following. Note that only a single row is returned. It is averaging 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows.
-
-```python
-spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3.0) as t
-""").agg(rf_agg_mean(F.col('t'))).show(10, False)
-```
+These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group.
 
 ### rf_agg_mean
 
@@ -538,31 +518,13 @@ Aggregates over the `tile` and returns statistical summaries of cell values: num
 Struct[Array[Struct[Double, Long]]] rf_agg_approx_histogram(Tile tile)
 
 
-Aggregates over the `tile` return a count of each cell value to create a histogram with values are plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function which operates on a single row at a time.
+Aggregates over all of the rows of `tile` in a DataFrame and returns a count of each cell value, creating a histogram with values plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function, which operates on a single row at a time.
 
 
 ## Tile Local Aggregate Statistics
 
 Local statistics compute the element-wise statistics across a DataFrame or group of `tile`s, resulting in a `tile` that has the same dimension.
 
-Consider again our example for Tile Statistics and Aggregate Tile Statistics, this time apply @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean). We see that it is computing the element-wise mean across the two rows. In this case it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
-
-```python
-import pyspark.sql.functions as F
-alm = spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t
-""").agg(rf_agg_local_mean(F.col('t')).alias('l')) \
-
-## local_agg_mean returns a tile
-alm.select(rf_dimensions(alm.l)).show()
-
-alm.select(rf_explode_tiles(alm.l)).show(10, False)
-```
-
-
 ### rf_agg_local_max
 
 Tile rf_agg_local_max(Tile tile)
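The histogram result shape the reference describes — an array of `(value, count)` bins — can be sketched in plain Python. This is a conceptual illustration only: `tile_histogram` and `agg_approx_histogram` here are hypothetical helpers, and the exact counting below only mimics the shape of the result (the real `rf_agg_approx_histogram` is approximate).

```python
# Conceptual sketch (plain Python, no Spark) of the bins structure returned
# by the histogram functions: a list of (value, count) pairs. These helpers
# are hypothetical, not the RasterFrames implementation.
from collections import Counter

def tile_histogram(tile):
    # Per-tile histogram: count each distinct cell value in one tile.
    counts = Counter(v for row in tile for v in row)
    return sorted(counts.items())

def agg_approx_histogram(tiles):
    # DataFrame-level analogue: pool the cells of every tile, then count.
    counts = Counter(v for t in tiles for row in t for v in row)
    return sorted(counts.items())

t1 = [[1.0, 1.0, 3.0], [3.0, 3.0, 5.0]]
t2 = [[5.0, 5.0, 1.0], [1.0, 3.0, 3.0]]
print(tile_histogram(t1))              # [(1.0, 2), (3.0, 3), (5.0, 1)]
print(agg_approx_histogram([t1, t2]))  # [(1.0, 4), (3.0, 5), (5.0, 3)]
```

The per-tile and DataFrame-level results differ exactly as the reference text says: the first counts one row's cells, the second counts cells across all rows.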

0 commit comments
