Merge branch 'develop' into feature/raster-read-uris
* develop:
Expanded the TOC to include machine learning subsections
Fix image rendering in the supervised ML docs
Remove dupe of test l8 label geojson, not needed
Tweaks from PR feedback
Update link to numpy doc
Add new label data set for supervised learning docs; flesh out initial text
Remove deep learning page, for now
Correct NoDataFilter implementation in Python; supervised ML doc example
Remove stray paren from Tile PNG header
Some minor style tweaks to aggregation docs
fix wording
fix wording of rf_agg_approx_histogram description
add histogram & pretty printing
jason's edits
add links
add tile aggregates
aggregation functions
move example code to aggregation
add picture, fix display for pyweave
pyrasterframes/src/main/python/docs/aggregation.pymd (115 additions, 4 deletions)
@@ -10,13 +10,124 @@ import os
```python
spark = create_rf_spark_session()
```

- ## Cell Counts & Tile Mean

There are 3 types of aggregate functions: tile aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a `tile` column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of tiles.

## Tile Mean Example

We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 tiles where the first tile is composed of 25 values of 1.0 and the second tile is composed of 25 values of 3.0.

```python
import pyspark.sql.functions as F

rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
""")
```

In this code block we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the tile aggregate mean of cells in each row of column `tile`. The mean of each tile is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
In this code block we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
In this code block we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
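Because the diff collapses the output of these examples, the three behaviors can be sketched in plain Python (no Spark or RasterFrames required; the list-based tiles and `tile_mean` helper are illustrative stand-ins, not RasterFrames API):

```python
# Two 5x5 "tiles": one of all 1.0, one of all 3.0, mirroring the rf DataFrame above.
tile1 = [[1.0] * 5 for _ in range(5)]
tile2 = [[3.0] * 5 for _ in range(5)]
tiles = [tile1, tile2]

def tile_mean(tile):
    """Analogue of rf_tile_mean: the mean of the cells in a single tile."""
    cells = [c for row in tile for c in row]
    return sum(cells) / len(cells)

# Tile aggregate: one mean per row -> [1.0, 3.0]
per_tile = [tile_mean(t) for t in tiles]

# DataFrame aggregate (rf_agg_mean analogue): one mean over all 50 cells -> 2.0
all_cells = [c for t in tiles for row in t for c in row]
df_mean = sum(all_cells) / len(all_cells)

# Element-wise local aggregate (rf_agg_local_mean analogue): a 5x5 tile where
# each cell is the mean of the co-located cells of the two input tiles -> all 2.0
local_mean = [[(tile1[i][j] + tile2[i][j]) / 2 for j in range(5)] for i in range(5)]
```

Note how the three results differ in shape: a column of scalars, a single scalar, and a tile.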
To compute an element-wise local aggregate, tiles need to have the same dimensions, as in the example below where both tiles have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal tile dimensions, we would get a runtime error.
We can also count the total number of data and NoData cells over all the tiles in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
The statistical summary functions return a summary of cell values: number of data cells, number of NoData cells, minimum, maximum, mean, and variance, which can be computed as a tile aggregate, a DataFrame aggregate, or an element-wise local aggregate.
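As an illustration of those summary fields, here is a minimal plain-Python sketch (not RasterFrames code; `None` stands in for NoData, and the variance is the population variance over data cells):

```python
# Compute the summary fields listed above over a tiny flattened "tile".
cells = [1.0, 3.0, None, 5.0, None, 7.0]

data = [c for c in cells if c is not None]
data_cells = len(data)                   # number of data cells
no_data_cells = len(cells) - data_cells  # number of NoData cells
minimum, maximum = min(data), max(data)
mean = sum(data) / data_cells
variance = sum((c - mean) ** 2 for c in data) / data_cells
```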
The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a `tile` column as shown below.
The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the tiles in a DataFrame and returns a statistical summary of all cell values as shown below.
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal tile dimensions, so a different DataFrame is used in this code block to avoid a runtime error.

```python
rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
SELECT 2 as id, rf_make_constant_tile(3, 5, 5, 'float32') as tile
UNION
SELECT 3 as id, rf_make_constant_tile(5, 5, 5, 'float32') as tile
""")
```

The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of `tile` and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of `tile` in this DataFrame, but this histogram is computed only for the `tile` in the first row.
```python
import matplotlib.pyplot as plt

# bins_row holds the collected rf_tile_histogram result (from the elided context)
values = [int(row['value']) for row in bins_row.bins]
counts = [int(row['count']) for row in bins_row.bins]

plt.hist(values, weights=counts, bins=100)
plt.show()
```
The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of `tile` in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.
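The per-row versus whole-DataFrame distinction can be sketched in plain Python (illustrative only; `Counter` stands in for the histogram aggregation, and the exact counts depend on the data):

```python
from collections import Counter

# Three tiny flattened "tiles" standing in for rows of a tile column.
tiles = [[1, 1, 2], [2, 2, 3], [3, 3, 3]]

# Per-row histogram (rf_tile_histogram analogue): counts within the first tile only.
per_row_hist = Counter(tiles[0])

# DataFrame-level histogram (rf_agg_approx_histogram analogue): counts across all tiles,
# which is why its y-axis range is wider than the per-row histogram's.
df_hist = Counter(v for t in tiles for v in t)
```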
[Local map algebra](https://gisgeography.com/map-algebra-global-zonal-focal-local/) raster operations are element-wise operations on a single `tile`, between a `tile` and a scalar, between two `tile`s, or among many `tile`s. These operations are common in processing of earth observation and other image data.
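A minimal plain-Python sketch of what "element-wise" means here (nested lists stand in for tiles; illustrative only, not RasterFrames API):

```python
# Two 2x2 "tiles".
a = [[1.0, 2.0], [3.0, 4.0]]
b = [[10.0, 20.0], [30.0, 40.0]]

# Tile op scalar: the scalar is applied to every cell.
scalar_add = [[c + 1.0 for c in row] for row in a]

# Tile op tile: co-located cells are combined, so dimensions must match.
tile_add = [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]
```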
RasterFrames provides facilities to train and predict with a wide variety of machine learning models through [Spark ML Pipelines](https://spark.apache.org/docs/latest/ml-guide.html). This library provides a variety of pipeline components for supervised learning, unsupervised learning, and data preparation that can be used to represent and repeatably conduct a variety of tasks in machine learning.
The following sections provide some examples on how to integrate these workflows with RasterFrames.
pyrasterframes/src/main/python/docs/raster-write.pymd (1 addition, 1 deletion)
@@ -1,6 +1,6 @@
# Writing Raster Data

- RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](spark-ml.md), or some other result that is generally much smaller than the input data set.
+ RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](machine-learning.md), or some other result that is generally much smaller than the input data set.
However, there are times in any analysis when writing a representative sample of the work in progress provides invaluable feedback on the process and results.
pyrasterframes/src/main/python/docs/reference.pymd (3 additions, 41 deletions)
@@ -399,17 +399,7 @@ Performs natural logarithm of cell values plus one. Inverse of @ref:[`rf_expm1`]
## Tile Statistics

- The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row. Consider the following example.
-
- ```python
- import pyspark.sql.functions as F
- spark.sql("""
- SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
- UNION
- SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t
+ The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row.
### rf_tile_sum
@@ -489,17 +479,7 @@ spark.sql("SELECT rf_tile_histogram(rf_make_ones_tile(5, 5, 'float32')) as tile_
## Aggregate Tile Statistics

- These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group. Example use below computes a single double-valued mean per month, across all data cells in the `red_band` `tile` type column. This would return at most twelve rows.
-
- Continuing our example from the @ref:[Tile Statistics](reference.md#tile-statistics) section, consider the following. Note that only a single row is returned. It is averaging 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows.
-
- ```python
- spark.sql("""
- SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
- UNION
- SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3.0) as t
- """).agg(rf_agg_mean(F.col('t'))).show(10, False)
- ```
+ These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group.
### rf_agg_mean
@@ -538,31 +518,13 @@ Aggregates over the `tile` and returns statistical summaries of cell values: num
- Aggregates over the `tile` and returns a count of each cell value to create a histogram with values are plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function which operates on a single row at a time.
+ Aggregates over all of the rows in a DataFrame of `tile` and returns a count of each cell value, creating a histogram in which values are plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function, which operates on a single row at a time.
## Tile Local Aggregate Statistics
Local statistics compute the element-wise statistics across a DataFrame or group of `tile`s, resulting in a `tile` of the same dimensions.

- Consider again our example for Tile Statistics and Aggregate Tile Statistics, this time apply @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean). We see that it is computing the element-wise mean across the two rows. In this case it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
-
- ```python
- import pyspark.sql.functions as F
- alm = spark.sql("""
- SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
- UNION
- SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t