Commit 07ee233

Edits: aggregation.pymd, supervised-learning.pymd, unsupervised-learning.pymd, time-series.pymd
1 parent 3f3682d commit 07ee233

6 files changed: 37 additions and 34 deletions

experimental/src/main/scala/org/locationtech/rasterframes/experimental/datasource/awspds/MODISCatalogDataSource.scala

Lines changed: 1 addition & 0 deletions
@@ -93,6 +93,7 @@ object MODISCatalogDataSource extends LazyLogging with ResourceCacheSupport {
 "2018-03-12",
 "2018-03-13",
 "2018-03-14",
+"2018-03-15",
 "2018-05-16",
 "2018-05-17",
 "2018-05-18",

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 20 additions & 17 deletions
@@ -11,13 +11,13 @@ import os
 spark = create_rf_spark_session()
 ```
 
-There are 3 types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
+There are three types of aggregate functions: _tile_ aggregate, DataFrame aggregate, and element-wise local aggregate. In the @ref:[tile aggregate functions](reference.md#tile-statistics), we are computing a statistical summary per row of a _tile_ column in a DataFrame. In the @ref:[DataFrame aggregate functions](reference.md#aggregate-tile-statistics), we are computing statistical summaries over all of the cell values *and* across all of the rows in the DataFrame or group. In the @ref:[element-wise local aggregate functions](reference.md#tile-local-aggregate-statistics), we are computing the element-wise statistical summary across a DataFrame or group of _tiles_.
 
 ## Tile Mean Example
 
-We can illustrate these differences in computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
+We can illustrate aggregate differences by computing an aggregate mean. First, we create a sample DataFrame of 2 _tiles_ where the first _tile_ is composed of 25 values of 1.0 and the second _tile_ is composed of 25 values of 3.0.
 
-```python, sql_dataframe, results='raw'
+```python, sql_dataframe
 import pyspark.sql.functions as F
 
 rf = spark.sql("""
@@ -26,33 +26,36 @@ UNION
 SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
 """)
 
-rf.select("id", rf_render_matrix("tile")).show(truncate=False)
+tiles = rf.select("tile").collect()
+print(tiles[0]['tile'].cells)
+print(tiles[1]['tile'].cells)
 ```
 
-
-In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
+We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
 
 ```python, tile_mean, results='raw'
-rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(truncate=False)
+rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show()
 ```
 
-In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
+We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
 
 ```python, agg_mean, results='raw'
 rf.agg(rf_agg_mean(F.col('tile'))).show()
 ```
 
-In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
+We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
 
-To compute an element-wise local aggregate, _tiles_ need have the same dimensions as in the example below where both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
+To compute an element-wise local aggregate, _tiles_ need to have the same dimensions. In this case, both _tiles_ have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal _tile_ dimensions, we would get a runtime error.
 
-```python, local_mean, results='raw'
-rf.agg(rf_agg_local_mean(F.col('tile')).alias("local_mean")).select(rf_render_matrix("local_mean")).show(truncate=False)
+```python, local_mean
+t = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
+    .collect()[0]['local_mean']
+print(t.cells)
 ```
 
 ## Cell Counts Example
 
-We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are 3,842,290 data cells and 1,941,734 NoData cells in this DataFrame. See section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
+We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are ~3.8 million data cells and ~1.9 million NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.
 
 ```python, cell_counts, results='raw'
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
@@ -86,7 +89,7 @@ rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
 .show()
 ```
 
-The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks, has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
+The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
 
 ```python, agg_local_stats
 rf = spark.sql("""
@@ -106,7 +109,7 @@ for r in agg_local_stats:
 
 ## Histogram
 
-The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted `value` on the x-axis and `count` on the y-axis to create the histogram. There are 100 rows of _tile_ in this DataFrame, but this histogram is just computed for the _tile_ in the first row.
+The @ref:[`rf_tile_histogram`](reference.md#rf-tile-histogram) function computes a count of cell values within each row of _tile_ and outputs a `bins` array with the schema below. In the graph below, we have plotted each bin's `value` on the x-axis and `count` on the y-axis for the _tile_ in the first row of the DataFrame.
 
 
 ```python, tile_histogram
@@ -118,8 +121,8 @@ hist_df = rf.select(rf_tile_histogram('proj_raster')['bins'].alias('bins'))
 hist_df.printSchema()
 
 bins_row = hist_df.first()
-values = [int(row['value']) for row in bins_row.bins]
-counts = [int(row['count']) for row in bins_row.bins]
+values = [int(bin['value']) for bin in bins_row.bins]
+counts = [int(bin['count']) for bin in bins_row.bins]
 
 plt.hist(values, weights=counts, bins=100)
 plt.show()
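
Below is a minimal sketch, not part of this commit, of what the revised `local_mean` chunk prints. It assumes `create_rf_spark_session` lives in `pyrasterframes.utils` as in the docs' setup, that a collected `Tile` exposes its values as a NumPy array via `.cells` (which the new chunks rely on), and that the toy DataFrame's first `SELECT` builds the all-ones tile described in the prose above.

```python
import numpy as np
import pyspark.sql.functions as F
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_agg_local_mean

spark = create_rf_spark_session()

# Toy DataFrame from the doc: one tile of 1.0s and one tile of 3.0s.
# The first SELECT is an assumption reconstructed from the surrounding prose.
rf = spark.sql("""
SELECT 1 as id, rf_make_ones_tile(5, 5, 'float32') as tile
UNION
SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
""")

# Element-wise local mean of the two tiles: every cell should be 2.0.
local_mean = rf.agg(rf_agg_local_mean(F.col('tile')).alias('local_mean')) \
    .collect()[0]['local_mean']

expected = np.full((5, 5), 2.0)
assert np.allclose(np.asarray(local_mean.cells), expected)
```

Printing `t.cells` in the doc's chunk shows the same 5x5 array of 2.0 values that the assertion checks here.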

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 8 additions & 9 deletions
@@ -20,7 +20,7 @@ spark = create_rf_spark_session()
 
 ## Create and Read Raster Catalog
 
-The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
+The first step is to create a Spark DataFrame containing our imagery data. To achieve that we will create @ref:[a catalog DataFrame](raster-catalogs.md#creating-a-catalog). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. In this example our catalog just has one row. After reading the catalog, the resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
 The imagery for feature data will come from [eleven bands of 60 meter resolution Sentinel-2](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/spatial) imagery. We also will use the [scene classification (SCL)](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm) data to identify high quality, non-cloudy pixels.
 
@@ -65,7 +65,7 @@ The land classification labels are based on a small set of hand drawn polygons i
 
 We will create a very small Spark DataFrame of the label shapes and then join it to the raster DataFrame. Such joins are typically expensive, but in this case both datasets are quite small. To speed up the join for the small vector DataFrame, we put the `broadcast` hint on it, which will tell Spark to put a copy of it on each Spark executor.
 
-After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tile_ cells are the `id` property of the GeoJSON; which we will use as labels in our supervised learning task. In areas where the geometry does not intersect, the cells will contain a NoData.
+After the raster and vector data are joined, we will convert the vector shapes into _tiles_ using the @ref:[`rf_rasterize`](reference.md#rf-rasterize) function. This procedure is sometimes called "burning in" a geometry into a raster. The values in the resulting _tile_ cells are the `id` property of the GeoJSON, which we will use as labels in our supervised learning task. In areas where the geometry does not intersect, the cells will contain NoData.
 
 ```python join_and_rasterize
 crses = df.select('crs.crsProj4').distinct().collect()
@@ -87,7 +87,7 @@ FROM df_joined
 
 ## Masking Poor Quality Cells
 
-To filter only for good quality pixels, we follow roughly the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually setting NoData values in the unwanted cells of any of the imagery bands, we will just on filter out the mask cell values later in the process.
+To filter only for good quality pixels, we follow roughly the same procedure as demonstrated in the @ref:[quality masking](nodata-handling.md#masking) section of the chapter on NoData. Instead of actually setting NoData values in the unwanted cells of any of the imagery bands, we will just filter out the mask cell values later in the process.
 
 ```python, make_mask
 from pyspark.sql.functions import lit
@@ -126,9 +126,9 @@ from pyspark.ml.evaluation import MulticlassClassificationEvaluator
 from pyspark.ml import Pipeline
 ```
 
-SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the _tiles_ into a single row per cell or pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). If a _tile_ cell contains a NoData it will become a null value after the exploder stage. Then we filter out any rows that missing or null values, which will cause an error during training. Finally we use the SparkML `VectorAssembler` to create that `Vector`.
+SparkML requires that each observation be in its own row, and those observations be packed into a single [`Vector`](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#module-pyspark.ml.linalg) object. The first step is to "explode" the _tiles_ into a single row per cell or pixel with the `TileExploder` (see also @ref:[`rf_explode_tiles`](reference.md#rf_explode_tiles)). If a _tile_ cell contains NoData it will become a null value after the exploder stage. Then we use the `NoDataFilter` to filter out any rows that have missing or null values, which would otherwise cause an error during training. Finally we use the SparkML `VectorAssembler` to create that `Vector`.
 
-It is worth discussing a couple of interesting things about the `NoDataFilter`. First, we filter out missing values in the mask column. Recall above we set undesirable pixels to NoData, so they will be removed at this stage. The other column for the `NoDataFilter` is the `label` column. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
+Recall that above we set undesirable pixels to NoData, so the `NoDataFilter` will remove them at this stage. We apply the filter to the `mask` column and the `label` column, the latter being used during training. When it is time to score the model, the pipeline will ignore the fact that there is no `label` column on the input DataFrame.
 
 ```python, transformers
 exploder = TileExploder()
@@ -156,7 +156,7 @@ pipeline.getStages()
 
 ## Train the Model
 
-Push the "go button"! This will actually run each step of the Pipeline we created including fitting the decision tree model. We filter the DataFrame for only _tiles_ intersecting the label raster, because the label shapes are relatively sparse over the imagery. It would be logically equivalent to either include or exclude this, but it is more efficient to do this filter because it will mean less data going into the pipeline.
+The next step is to actually run each step of the Pipeline we created, including fitting the decision tree model. We filter the DataFrame for only _tiles_ intersecting the label raster because the label shapes are relatively sparse over the imagery. It would be logically equivalent to either include or exclude this step, but it is more efficient to filter because it will mean less data going into the pipeline.
 
 ```python, train
 model = pipeline.fit(df_mask.filter(rf_tile_sum('label') > 0).cache())
@@ -173,14 +173,13 @@ prediction_df.printSchema()
 
 eval = MulticlassClassificationEvaluator(predictionCol=classifier.getPredictionCol(),
 labelCol=classifier.getLabelCol(),
-metricName='accuracy',
-)
+metricName='accuracy')
 
 accuracy = eval.evaluate(prediction_df)
 print("\nAccuracy:", accuracy)
 ```
 
-As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix. The categories down the rows are the predictions, and the truth labels are across the columns.
+As an example of using the flexibility provided by DataFrames, the code below computes and displays the confusion matrix.
 
 ```python, confusion_mtrx, results='raw'
 prediction_df.groupBy(classifier.getPredictionCol()) \
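
The hunks above describe the `TileExploder` -> `NoDataFilter` -> `VectorAssembler` -> `DecisionTreeClassifier` pipeline but only show the first line of the `transformers` chunk. Below is a hedged sketch of how those stages could be wired; it is not the commit's code, the band column names are hypothetical, and it assumes `NoDataFilter` is importable from `pyrasterframes` and exposes an `inputCols` param, as the prose implies.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyrasterframes import TileExploder, NoDataFilter  # NoDataFilter import is an assumption

band_cols = ['b' + str(i) for i in range(1, 12)]        # hypothetical exploded band column names

exploder = TileExploder()                               # one row per cell/pixel
no_data_filter = NoDataFilter() \
    .setInputCols(['mask', 'label'])                    # drop rows where mask or label is null
assembler = VectorAssembler() \
    .setInputCols(band_cols) \
    .setOutputCol('features')                           # pack band values into a single Vector
classifier = DecisionTreeClassifier() \
    .setLabelCol('label') \
    .setFeaturesCol(assembler.getOutputCol())

pipeline = Pipeline().setStages([exploder, no_data_filter, assembler, classifier])
```

As in the doc's `train` chunk, calling `pipeline.fit(...)` on the joined, masked DataFrame would then run every stage, including fitting the decision tree.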

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 3 additions & 3 deletions
@@ -95,8 +95,8 @@ time_series = rf_ndvi \
 .withColumn('ndvi_wt', rf_tile_sum('ndvi_masked')) \
 .withColumn('wt', rf_data_cells('ndvi_masked')) \
 .groupby(year('acquisition_date').alias('year'), weekofyear('acquisition_date').alias('week')) \
-.agg(sql_sum('ndvi_wt').alias('ndvi_wt'), sql_sum('wt').alias('wt')) \
-.withColumn('ndvi', col('ndvi_wt') / col('wt'))
+.agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
+.withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))
 time_series.printSchema()
 time_series.persist()
 ```
@@ -114,4 +114,4 @@ plt.ylabel('NDVI')
 plt.title('Cuyahoga Valley NP Green-up')
 ```
 
-We can see two fairly clear elbows in the curve at week 17 and week 21, indicating the start and end of the green up period. Estimation of such parameters is one technique [phenology](https://en.wikipedia.org/wiki/Phenology) researchers use to monitor changes in climate and environment!
+We can see two fairly clear elbows in the curve at week 17 and week 21, indicating the start and end of the green-up period. Estimation of such parameters is one technique [phenology](https://en.wikipedia.org/wiki/Phenology) researchers use to monitor changes in climate and environment.
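
The renamed aggregate columns make the weighted weekly mean easier to follow: per-tile NDVI sums and data-cell counts are summed within each week, then divided. Below is a standalone toy sketch of that pattern in plain PySpark; it is not from the commit, and the data and session setup are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sql_sum, col

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Per-tile NDVI sum and data-cell count, tagged with the week it was acquired.
per_tile = spark.createDataFrame(
    [(17, 12.0, 20), (17, 6.0, 10), (18, 9.0, 10)],
    ['week', 'ndvi_wt', 'wt'])

# Weighted weekly mean: sum the per-tile sums and counts, then divide.
weekly = per_tile.groupBy('week') \
    .agg(sql_sum('ndvi_wt').alias('ndvi_wt_wk'), sql_sum('wt').alias('wt_wk')) \
    .withColumn('ndvi', col('ndvi_wt_wk') / col('wt_wk'))

weekly.show()   # week 17 -> (12 + 6) / (20 + 10) = 0.6
```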

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 4 additions & 4 deletions
@@ -14,12 +14,12 @@ import os
 
 spark = create_rf_spark_session()
 
-import pandas as pd
 ```
 
-We import various Spark components that we need to construct our `Pipeline`.
+We import various Spark components needed to construct our `Pipeline`.
 
 ```python, imports, echo=True
+import pandas as pd
 from pyrasterframes import TileExploder
 from pyrasterframes.rasterfunctions import rf_assemble_tile, rf_crs, rf_extent, rf_tile, rf_dimensions
 
@@ -37,7 +37,7 @@ filenamePattern = "L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
 {'b' + str(b): os.path.join(resource_dir_uri(), filenamePattern.format(b)) for b in range(1, 8)}
 ])
-df = spark.read.raster(catalog=catalog_df.to_csv(index=None), catalog_col_names=catalog_df.columns)
+df = spark.read.raster(catalog=catalog_df, catalog_col_names=catalog_df.columns)
 df = df.select(
 rf_crs(df.b1).alias('crs'),
 rf_extent(df.b1).alias('extent'),
@@ -95,7 +95,7 @@ clustered = model.transform(df)
 clustered.show(8)
 ```
 
-If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion:
+If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion to access the clustering results:
 
 ```python, cluster_stats
 cluster_stage = model.stages[2]
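
The `cluster_stats` hunk pulls the fitted clustering stage out of the `PipelineModel` with `model.stages[2]`. Below is a standalone sketch of that kind of "contortion" using plain SparkML; it is not from the commit, and the toy data and stage index are illustrative only.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Two obvious clusters of 2-D points.
df = spark.createDataFrame(
    [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 8.9)], ['x', 'y'])

pipeline = Pipeline(stages=[
    VectorAssembler(inputCols=['x', 'y'], outputCol='features'),
    KMeans(k=2, featuresCol='features')])
model = pipeline.fit(df)

# The fitted KMeansModel is buried in the PipelineModel's stages list;
# here it is the last stage, in the doc it happens to be stages[2].
kmeans_model = model.stages[-1]
print(kmeans_model.clusterCenters())
if kmeans_model.hasSummary:                  # training summary kept on the fitted model
    print(kmeans_model.summary.clusterSizes)
```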

pyrasterframes/src/main/python/setup.py

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ def run(self):
 import traceback
 import pweave
 bad_words = ["Error"]
-pweave.rcParams["chunk"]["defaultoptions"].update({'wrap': False, 'dpi': 100})
+pweave.rcParams["chunk"]["defaultoptions"].update({'wrap': False, 'dpi': 175})
 if self.format == 'markdown':
 pweave.PwebFormats.formats['markdown'] = {
 'class': PegdownMarkdownFormatter,
