
Commit 04bd1d1

Merge branch 'develop' into feature/raster-read-uris

* develop:
  * Expanded the TOC to include machine learning subsections
  * Fix image rendering in the supervised ML docs
  * Remove dupe of test l8 label geojson, not needed
  * Tweaks from PR feedback
  * Update link to numpy doc
  * Add new label data set for supervised learning docs; flesh out initial text
  * Remove deep learning page, for now
  * Correct NoDataFilter implementation in Python; supervised ML doc example
  * Remove stray paren from Tile PNG header
  * Some minor style tweaks to aggregation docs
  * fix wording
  * fix wording of rf_agg_approx_histogram description
  * add histogram & pretty printing
  * jason's edits
  * add links
  * add tile aggregates aggregation functions
  * move example code to aggregation
  * add picture, fix display for pyweave

2 parents 51b5a2f + 7f71409

File tree

12 files changed: +400 −59 lines


build.sbt

Lines changed: 1 addition & 0 deletions

@@ -139,6 +139,7 @@ lazy val docs = project
     "version" -> version.value,
     "scaladoc.org.apache.spark.sql.rf" -> "http://rasterframes.io/latest"
   ),
+  paradoxNavigationExpandDepth := Some(3),
   paradoxTheme := Some(builtinParadoxTheme("generic")),
   makeSite := makeSite.dependsOn(Compile / unidoc).dependsOn(Compile / paradox).value,
   Compile / paradox / sourceDirectories += (pyrasterframes / Python / doc / target).value,

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 115 additions & 4 deletions

@@ -10,13 +10,124 @@ import os
 spark = create_rf_spark_session()
 ```
 
-## Cell Counts & Tile Mean
+There are 3 types of aggregate functions: tile aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a `tile` column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of tiles.
+
+## Tile Mean Example
+
+We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 tiles, where the first tile is composed of 25 values of 1.0 and the second tile is composed of 25 values of 3.0.
+
+```python
+import pyspark.sql.functions as F
+
+rf = spark.sql("""
+SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
+UNION
+SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
+""")
+
+rf.select("id", rf_render_matrix("tile")).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the tile aggregate mean of cells in each row of column `tile`. The mean of each tile is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
+
+```python
+rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned, since the average is computed over the full DataFrame.
+
+```python
+rf.agg(rf_agg_mean(F.col('tile'))).show(10, False)
+```
+
+In this code block we use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it computes the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, doing so twenty-five times, once for each position in the `tile`.
+
+To compute an element-wise local aggregate, tiles need to have the same dimensions, as in the example below where both tiles have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over a DataFrame with unequal tile dimensions, we would get a runtime error.
 
 ```python
-rf = spark.read.geotiff(os.path.join(resource_dir(), 'L8-B8-Robinson-IL.tiff'))
-rf.show(5, False)
+rf.agg(rf_agg_local_mean(F.col('tile')).alias("local_mean")).select(rf_render_matrix("local_mean")).show(10, False)
+```
+
+## Cell Counts Example
+
+We can also count the total number of data and NoData cells over all the tiles in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
+
+```python
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
+stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
 
-stats = rf.agg(rf_agg_no_data_cells('tile'), rf_agg_data_cells('tile'), rf_agg_mean('tile'))
 stats.show(5, False)
 ```
 
+## Statistical Summaries
+
+The statistical summary functions return a summary of cell values: number of data cells, number of NoData cells, minimum, maximum, mean, and variance. Each can be computed as a tile aggregate, a DataFrame aggregate, or an element-wise local aggregate.
+
+The @ref:[`rf_tile_stats`](reference.md#rf-tile-stats) function computes summary statistics separately for each row in a `tile` column as shown below.
+
+```python
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
+stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))
+
+stats.printSchema()
+stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').show(10, False)
+```
+
+The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the tiles in a DataFrame and returns a statistical summary of all cell values as shown below.
+
+```python
+rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
+    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance') \
+    .show(10, False)
+```
+
+The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal tile dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
+
+```python
+rf = spark.sql("""
+SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
+UNION
+SELECT 2 as id, rf_make_constant_tile(3, 5, 5, 'float32') as tile
+UNION
+SELECT 3 as id, rf_make_constant_tile(5, 5, 5, 'float32') as tile
+""").agg(rf_agg_local_stats('tile').alias('stats'))
+
+agg_local_stats = rf.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').collect()
+
+for r in agg_local_stats:
+    for stat in r.asDict():
+        print(stat, ':\n', r[stat], '\n')
+```
+
+## Histogram
+
+The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of `tile` and outputs a `bins` array with the schema below. In the graph below, we plot `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of `tile` in this DataFrame, but this histogram is computed only for the `tile` in the first row.
+
+```python
+import matplotlib.pyplot as plt
+
+rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
+
+hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
+hist_df.printSchema()
+
+bins_row = hist_df.first()
+values = [int(row['value']) for row in bins_row.bins]
+counts = [int(row['count']) for row in bins_row.bins]
+
+plt.hist(values, weights=counts, bins=100)
+plt.show()
+```
+
+The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of `tile` in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than on the previous histogram, since this histogram is computed over all cell values in the DataFrame.
+
+```python
+bins_list = rf.agg(
+    rf_agg_approx_histogram('proj_raster')['bins'].alias('bins')
+).collect()
+values = [int(row['value']) for row in bins_list[0].bins]
+counts = [int(row['count']) for row in bins_list[0].bins]
+
+plt.hist(values, weights=counts, bins=100)
+plt.show()
+```
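The distinction the aggregation docs draw between tile, DataFrame, and element-wise local aggregates can be sketched in plain Python, with tiles modeled as nested lists. This is a hedged illustration of the semantics only, not RasterFrames code; the helpers `tile_mean`, `agg_mean`, and `agg_local_mean` are hypothetical stand-ins for `rf_tile_mean`, `rf_agg_mean`, and `rf_agg_local_mean`.

```python
# Conceptual sketch (plain Python, no Spark) of the three aggregate means.
# Tiles are nested lists; these helpers are hypothetical stand-ins for the
# RasterFrames functions, not their implementation.

def tile_mean(tile):
    # Tile aggregate (cf. rf_tile_mean): one mean per tile, per row.
    cells = [v for row in tile for v in row]
    return sum(cells) / len(cells)

def agg_mean(tiles):
    # DataFrame aggregate (cf. rf_agg_mean): one mean over every cell
    # of every tile in the group.
    cells = [v for t in tiles for row in t for v in row]
    return sum(cells) / len(cells)

def agg_local_mean(tiles):
    # Element-wise local aggregate (cf. rf_agg_local_mean): a tile of
    # per-position means; all tiles must share the same dimensions.
    rows, cols = len(tiles[0]), len(tiles[0][0])
    return [[sum(t[r][c] for t in tiles) / len(tiles) for c in range(cols)]
            for r in range(rows)]

ones = [[1.0] * 5 for _ in range(5)]    # stand-in for the tile of 1.0s
threes = [[3.0] * 5 for _ in range(5)]  # stand-in for the tile of 3.0s

print([tile_mean(t) for t in (ones, threes)])  # [1.0, 3.0]
print(agg_mean([ones, threes]))                # 2.0
print(agg_local_mean([ones, threes])[0])       # [2.0, 2.0, 2.0, 2.0, 2.0]
```

Note how the three calls mirror the doc text: two per-row means, one scalar over all fifty cells, and one 5x5 tile of element-wise means.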

pyrasterframes/src/main/python/docs/local-algebra.pymd

Lines changed: 6 additions & 3 deletions

@@ -14,6 +14,10 @@ spark = create_rf_spark_session()
 
 [Local map algebra](https://gisgeography.com/map-algebra-global-zonal-focal-local/) raster operations are element-wise operations on a single `tile`, between a `tile` and a scalar, between two `tile`s, or among many `tile`s. These operations are common in processing of earth observation and other image data.
 
+<img src="https://gisgeography.com/wp-content/uploads/2015/05/Local-Operation-Raster.png" alt="local op" style="width:200px;"/>
+
+Credit: [GISGeography](https://gisgeography.com)
+
 
 ## Computing NDVI
 
@@ -57,10 +61,9 @@ df.printSchema()
 We can inspect a sample of the data. Yellow indicates very healthy vegetation, and purple represents bare soil or impervious surfaces.
 
 ```python
-df.select(rf_tile('ndvi').alias('ndvi')).limit(5).toPandas()
-```
 
-**TODO** fix the display once issue #166 fix.
+display(df.select(rf_tile('ndvi').alias('ndvi')).first()['ndvi'])
+```
 
 We continue examining NDVI in the @ref:[time series](time-series.md) section.
 
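The element-wise ("local") operations this doc builds NDVI from can be sketched in plain Python. This is a conceptual illustration under stated assumptions, not the RasterFrames API: the `ndvi` helper is hypothetical, tiles are nested lists, and a zero denominator is mapped to 0.0 for simplicity.

```python
# Conceptual sketch (plain Python, no Spark) of element-wise NDVI,
# (NIR - Red) / (NIR + Red), computed cell by cell over two tiles of
# equal dimensions. Not the RasterFrames implementation.

def ndvi(nir, red):
    # Pair up cells position by position and apply the NDVI formula.
    return [[(n - r) / (n + r) if (n + r) != 0 else 0.0
             for n, r in zip(nir_row, red_row)]
            for nir_row, red_row in zip(nir, red)]

nir = [[0.8, 0.6], [0.7, 0.5]]  # hypothetical near-infrared reflectance tile
red = [[0.2, 0.2], [0.3, 0.5]]  # hypothetical red reflectance tile
print(ndvi(nir, red))
```

As with the local aggregate functions, the two input tiles must have the same dimensions for the element-wise pairing to make sense.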

Lines changed: 15 additions & 0 deletions

@@ -0,0 +1,15 @@
+# Machine Learning
+
+RasterFrames provides facilities to train and predict with a wide variety of machine learning models through [Spark ML Pipelines](https://spark.apache.org/docs/latest/ml-guide.html). It supplies pipeline components for supervised learning, unsupervised learning, and data preparation that can be used to represent and repeatably conduct machine learning tasks.
+
+The following sections provide some examples of how to integrate these workflows with RasterFrames.
+
+@@@ index
+
+* [Unsupervised Machine Learning](unsupervised-learning.md)
+* [Supervised Machine Learning](supervised-learning.md)
+
+@@@
+
+
+@@toc { depth=2 }

pyrasterframes/src/main/python/docs/raster-processing.md

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@
 * @ref:["NoData" Handling](nodata-handling.md)
 * @ref:[Aggregation](aggregation.md)
 * @ref:[Time Series](time-series.md)
-* @ref:[Machine Learning](spark-ml.md)
+* @ref:[Machine Learning](machine-learning.md)
 
 @@@
 

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 # Writing Raster Data
 
-RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](spark-ml.md), or some other result that is generally much smaller than the input data set.
+RasterFrames is oriented toward large scale analyses of spatial data. The primary output for most use cases may be a @ref:[statistical summary](aggregation.md), a @ref:[machine learning model](machine-learning.md), or some other result that is generally much smaller than the input data set.
 
 However there are times in any analysis where writing a representative sample of the work in progress provides invaluable feedback on the process and results.
 

pyrasterframes/src/main/python/docs/reference.pymd

Lines changed: 3 additions & 41 deletions

@@ -399,17 +399,7 @@ Performs natural logarithm of cell values plus one. Inverse of @ref:[`rf_expm1`]
 
 ## Tile Statistics
 
-The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row. Consider the following example.
-
-```python
-import pyspark.sql.functions as F
-spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t
-""").select(F.col('id'), rf_tile_sum(F.col('t'))).show()
-```
-
+The following functions compute a statistical summary per row of a `tile` column. The statistics are computed across the cells of a single `tile`, within each DataFrame Row.
 
 ### rf_tile_sum
 
@@ -489,17 +479,7 @@ spark.sql("SELECT rf_tile_histogram(rf_make_ones_tile(5, 5, 'float32')) as tile_
 
 ## Aggregate Tile Statistics
 
-These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group. Example use below computes a single double-valued mean per month, across all data cells in the `red_band` `tile` type column. This would return at most twelve rows.
-
-Continuing our example from the @ref:[Tile Statistics](reference.md#tile-statistics) section, consider the following. Note that only a single row is returned. It is averaging 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows.
-
-```python
-spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3.0) as t
-""").agg(rf_agg_mean(F.col('t'))).show(10, False)
-```
+These functions compute statistical summaries over all of the cell values *and* across all the rows in the DataFrame or group.
 
 ### rf_agg_mean
 
@@ -538,31 +518,13 @@ Aggregates over the `tile` and returns statistical summaries of cell values: num
 Struct[Array[Struct[Double, Long]]] rf_agg_approx_histogram(Tile tile)
 
 
-Aggregates over the `tile` return a count of each cell value to create a histogram with values are plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function which operates on a single row at a time.
+Aggregates over all of the rows of `tile` in a DataFrame and returns a count of each cell value, creating a histogram with values plotted on the x-axis and counts on the y-axis. Related is the @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function, which operates on a single row at a time.
 
 
 ## Tile Local Aggregate Statistics
 
 Local statistics compute the element-wise statistics across a DataFrame or group of `tile`s, resulting in a `tile` that has the same dimension.
 
-Consider again our example for Tile Statistics and Aggregate Tile Statistics, this time apply @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean). We see that it is computing the element-wise mean across the two rows. In this case it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
-
-```python
-import pyspark.sql.functions as F
-alm = spark.sql("""
-SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as t
-UNION
-SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as t
-""").agg(rf_agg_local_mean(F.col('t')).alias('l')) \
-
-## local_agg_mean returns a tile
-alm.select(rf_dimensions(alm.l)).show()
-
-alm.select(rf_explode_tiles(alm.l)).show(10, False)
-```
-
-
 ### rf_agg_local_max
 
 Tile rf_agg_local_max(Tile tile)
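The histogram result shape the reference describes — an array of `(value, count)` bins — can be sketched in plain Python. This is a conceptual illustration only: `tile_histogram` and `agg_approx_histogram` here are hypothetical helpers, and the exact counting below only mimics the shape of the result (the real `rf_agg_approx_histogram` is approximate).

```python
# Conceptual sketch (plain Python, no Spark) of the bins structure returned
# by the histogram functions: a list of (value, count) pairs. These helpers
# are hypothetical, not the RasterFrames implementation.
from collections import Counter

def tile_histogram(tile):
    # Per-tile histogram: count each distinct cell value in one tile.
    counts = Counter(v for row in tile for v in row)
    return sorted(counts.items())

def agg_approx_histogram(tiles):
    # DataFrame-level analogue: pool the cells of every tile, then count.
    counts = Counter(v for t in tiles for row in t for v in row)
    return sorted(counts.items())

t1 = [[1.0, 1.0, 3.0], [3.0, 3.0, 5.0]]
t2 = [[5.0, 5.0, 1.0], [1.0, 3.0, 3.0]]
print(tile_histogram(t1))              # [(1.0, 2), (3.0, 3), (5.0, 1)]
print(agg_approx_histogram([t1, t2]))  # [(1.0, 4), (3.0, 5), (5.0, 3)]
```

The per-tile and DataFrame-level results differ exactly as the reference text says: the first counts one row's cells, the second counts cells across all rows.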

0 commit comments
