pyrasterframes/src/main/python/docs/aggregation.pymd
+11 -11
@@ -1,6 +1,6 @@
# Aggregation

-```python, echo=False
+```python, setup, echo=False
from docs import *
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import *
@@ -16,7 +16,7 @@ There are 3 types of aggregate functions: _tile_ aggregate, DataFrame aggregate,
We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, once for each position in the _tile_.
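The three code blocks these paragraphs refer to are collapsed out of this diff. As a minimal sketch (not the original blocks), assuming an active RasterFrames Spark session, an illustrative `means_rf` name, and that `rf_local_multiply` accepts a scalar operand for building the _tile_ of 3.0 values:

```python
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_agg_mean, rf_agg_local_mean

# Two 5x5 tiles: one of all 1.0 values and one of all 3.0 values
means_rf = spark.sql("""
SELECT 1 AS id, rf_make_ones_tile(5, 5, 'float32') AS tile
UNION
SELECT 2 AS id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) AS tile
""")

# Tile aggregate: one mean per row, 1.0 and 3.0
means_rf.select(rf_tile_mean('tile')).show()

# DataFrame aggregate: a single mean of 2.0 over all fifty cells
means_rf.agg(rf_agg_mean('tile')).show()

# Element-wise local aggregate: a 5x5 tile whose every cell is 2.0
means_rf.agg(rf_agg_local_mean('tile')).show()
```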
To compute an element-wise local aggregate, _tiles_ need to have the same dimensions, as in the example below where both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.
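The code blocks referenced by the two preceding paragraphs are likewise not part of this hunk. A sketch of the calls, assuming a DataFrame `rf` with a _tile_ column named `tile` (such as the Landsat DataFrame used earlier in the page):

```python
from pyrasterframes.rasterfunctions import rf_agg_data_cells, rf_agg_no_data_cells, rf_agg_stats

# Total counts of data and NoData cells over every tile in the DataFrame
rf.agg(rf_agg_data_cells('tile'), rf_agg_no_data_cells('tile')).show()

# Statistical summary (min, max, mean, variance, cell counts) of all cell values
rf.agg(rf_agg_stats('tile')).show()
```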
The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
-```python
+```python, agg_local_stats
rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
@@ -103,7 +103,7 @@ for r in agg_local_stats:
The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of _tile_ in this DataFrame, but this histogram is just computed for the _tile_ in the first row.
The @ref:[`rf_agg_approx_histogram`](reference.md#rf-agg-approx-histogram) function computes a count of cell values across all of the rows of _tile_ in a DataFrame or group. In the example below, the range of the y-axis is significantly wider than the range of the y-axis on the previous histogram since this histogram was computed for all cell values in the DataFrame.
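Neither histogram block is shown in this hunk; a sketch of the two calls, again assuming a DataFrame `rf` with a `tile` column:

```python
from pyrasterframes.rasterfunctions import rf_tile_histogram, rf_agg_approx_histogram

# Per-row histogram; the returned struct contains the `bins` array described above
rf.select(rf_tile_histogram('tile')).first()

# Approximate histogram over every cell value in the DataFrame
rf.agg(rf_agg_approx_histogram('tile')).first()
```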
pyrasterframes/src/main/python/docs/unsupervised-learning.pymd
+13 -13
@@ -4,7 +4,7 @@ In this example, we will demonstrate how to fit and score an unsupervised learni
## Imports and Data Preparation

-```python, echo=False
+```python, setup, echo=False
from IPython.core.display import display
from docs import resource_dir_uri
from pyrasterframes.utils import create_rf_spark_session
@@ -18,7 +18,7 @@ import pandas as pd
We import various Spark components that we need to construct our `Pipeline`.

-```python, echo=True
+```python, imports, echo=True
from pyrasterframes import TileExploder
from pyrasterframes.rasterfunctions import rf_assemble_tile, rf_crs, rf_extent, rf_tile, rf_dimensions
@@ -31,7 +31,7 @@ from pyspark.ml import Pipeline
The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The function `resource_dir_uri` gives a local file system path to the sample Landsat data. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
-```python, term=True
+```python, catalog, term=True
filenamePattern = "L8-B{}-Elkton-VA.tiff"
catalog_df = pd.DataFrame([
    {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
@@ -55,54 +55,54 @@ df.printSchema()
SparkML requires that each observation be in its own row, and features for each observation be packed into a single `Vector`. For this unsupervised learning problem, we will treat each _pixel_ as an observation and each band as a feature. The first step is to "explode" the _tiles_ into a single row per pixel. In RasterFrames, generally a pixel is called a @ref:[`cell`](concepts.md#cell).
-```python
+```python, exploder
exploder = TileExploder()
```
To "vectorize" the the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
-```python
+```python, assembler
assembler = VectorAssembler() \
    .setInputCols(list(catalog_df.columns)) \
    .setOutputCol("features")
```
For this problem, we will use the K-means clustering algorithm and configure our model to have 5 clusters.
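The stage-construction code falls between hunks and is not shown. A plausible sketch, assuming `KMeans` from `pyspark.ml.clustering` and the `exploder` and `assembler` stages defined above (consistent with the three-stage pipeline implied by `model.stages[2]` below):

```python
from pyspark.ml.clustering import KMeans

# Cluster the per-pixel feature vectors into 5 groups
kmeans = KMeans().setK(5).setFeaturesCol('features')

# Explode tiles to pixels, assemble band features, then fit K-means
pipeline = Pipeline().setStages([exploder, assembler, kmeans])
```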
Fitting the _pipeline_ actually executes exploding the _tiles_, assembling the feature _vectors_, and fitting the K-means clustering model.
-```python
+```python, fit
model = pipeline.fit(df)
```
We can use the `transform` function to score the training data in the fitted _pipeline_ model. This will add a column called `prediction` with the closest cluster identifier.
-```python
+```python, transform
clustered = model.transform(df)
clustered.show(8)
```
If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion:
-```python
+```python, cluster_stats
cluster_stage = model.stages[2]
```
We can then compute the sum of squared distances of points to their nearest center, which is elemental to most cluster quality metrics.
-```python
+```python, distance
metric = cluster_stage.computeCost(clustered)
print("Within set sum of squared errors: %s" % metric)
```
@@ -111,7 +111,7 @@ print("Within set sum of squared errors: %s" % metric)
We can recreate the tiled data structure using the metadata added by the `TileExploder` pipeline stage.
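The re-assembly code itself is outside this hunk. A rough sketch under stated assumptions: `TileExploder` adds `column_index` and `row_index` columns, `crs` and `extent` columns were added to `df` before fitting, and the band column `b1` is available for recovering dimensions via `rf_dimensions`:

```python
# Recover the original tile dimensions from one of the source band columns
tile_dims = df.select(rf_dimensions('b1').alias('dims')).first()['dims']

# Group the exploded pixels back into one tile of cluster ids per scene extent
retiled = clustered.groupBy('crs', 'extent') \
    .agg(rf_assemble_tile('column_index', 'row_index', 'prediction',
                          tile_dims['cols'], tile_dims['rows']))
```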
@@ -24,7 +24,7 @@ The properties of each feature are available as columns of the DataFrame, along
You can also convert a [GeoPandas][GeoPandas] GeoDataFrame to a Spark DataFrame, preserving the geometry column. This means that any vector format that can be read with [OGR][OGR] can be converted to a Spark DataFrame. In the example below, we expect the same schema as the `df` defined above by the GeoJSON reader. Note that a GeoPandas GeoDataFrame can contain heterogeneous geometry types in its geometry column, but this may cause Spark's schema inference to fail.
-```python
+```python, read_and_normalize
import geopandas
from shapely.geometry import MultiPolygon
@@ -45,20 +45,20 @@ df2.printSchema()
The `geometry` column will have a Spark user-defined type that is compatible with [Shapely][Shapely] when working on the Python side. This means that when the data is collected to the driver, it will be a Shapely geometry object.
-```python
+```python, show_geom
the_first = df.first()
print(type(the_first['geometry']))
```
Since it is a geometry, we can do things like this:
-```python
+```python, show_wkt
the_first['geometry'].wkt
```
You can also write user-defined functions that take geometries as input, output, or both, via user-defined types in the [geomesa_pyspark.types](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/geomesa_pyspark/types.py) module. Here is a simple example of a user-defined function that uses both a geometry input and output to compute the centroid of a geometry.
-```python
+```python, add_centroid
from pyspark.sql.functions import udf
from geomesa_pyspark.types import PointUDT
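# The remainder of this block is not shown in the hunk. A sketch of a
# geometry-in/geometry-out UDF of the kind described above (function and
# column names here are illustrative, not necessarily those in the document):
@udf(PointUDT())
def py_centroid(g):
    return g.centroid

df = df.withColumn('py_centroid', py_centroid(df['geometry']))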
@@ -72,7 +72,7 @@ df.printSchema()
We can take a look at a sample of the data. Notice the geometry columns print as well-known text (WKT).
-```python
+```python, show_centroid
df.show(4)
```
@@ -82,7 +82,7 @@ df.show(4)
As documented in the @ref:[function reference](reference.md), various user-defined functions implemented by GeoMesa are also available for use. The example below uses a GeoMesa user-defined function to compute the centroid of a geometry. It is logically equivalent to the example above, but more efficient.
-```python
+```python, native_centroid
from pyrasterframes.rasterfunctions import st_centroid
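# The rest of this block is not shown in the hunk. Presumably st_centroid is
# applied directly to the geometry column, e.g. (column name assumed):
df = df.withColumn('centroid', st_centroid(df['geometry']))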