
Commit 631d1b9

Misc tweaks to address PR feedback.
1 parent 8d91f0f commit 631d1b9

12 files changed (+58, -41 lines)


core/src/main/scala/org/locationtech/rasterframes/util/package.scala

Lines changed: 14 additions & 5 deletions
@@ -204,14 +204,19 @@ package object util {
       val header = cols.map(_.name).mkString("| ", " | ", " |") + "\n" + ("|---" * cols.length) + "|\n"
       val stringifiers = stringifyRowElements(cols, truncate)
       val cat = concat_ws(" | ", stringifiers: _*)
-      val body = df
-        .select(cat).limit(numRows)
+      val rows = df
+        .select(cat)
+        .limit(numRows)
         .as[String]
         .collect()
         .map(_.replaceAll("\\[", "\\\\["))
         .map(_.replace('\n', '↩'))
+
+      val body = rows
         .mkString("| ", " |\n| ", " |")
-      header + body
+
+      val caption = if (rows.length >= numRows) s"\n_Showing only top $numRows rows_.\n\n" else ""
+      caption + header + body
     }
 
     def toHTML(numRows: Int = 5, truncate: Boolean = false): String = {
@@ -220,13 +225,17 @@
       val header = "<thead>\n" + cols.map(_.name).mkString("<tr><th>", "</th><th>", "</th></tr>\n") + "</thead>\n"
       val stringifiers = stringifyRowElements(cols, truncate)
       val cat = concat_ws("</td><td>", stringifiers: _*)
-      val body = df
+      val rows = df
         .select(cat).limit(numRows)
         .as[String]
         .collect()
+
+      val body = rows
         .mkString("<tr><td>", "</td></tr>\n<tr><td>", "</td></tr>\n")
 
-      "<table>\n" + header + "<tbody>\n" + body + "</tbody>\n" + "</table>"
+      val caption = if (rows.length >= numRows) s"<caption>Showing only top $numRows rows</caption>\n" else ""
+
+      "<table>\n" + caption + header + "<tbody>\n" + body + "</tbody>\n" + "</table>"
     }
   }
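
The substance of this change to `toMarkdown`/`toHTML` is a truncation hint: when the number of collected rows reaches `numRows`, a caption ("Showing only top N rows") is prepended to the rendered table. A minimal standalone sketch of that rule in plain Python (not RasterFrames code; the single-column table is purely illustrative):

```python
# Illustration of the caption rule introduced above: if the collected row count
# hits the limit, the output may be truncated, so emit a caption.
def to_markdown(values, num_rows=5):
    rows = values[:num_rows]
    caption = "\n_Showing only top {} rows_.\n\n".format(num_rows) if len(rows) >= num_rows else ""
    header = "| value |\n|---|\n"
    body = "\n".join("| {} |".format(v) for v in rows)
    return caption + header + body

print(to_markdown(list(range(10))))  # hits the limit -> caption emitted
print(to_markdown([1, 2, 3]))        # fewer than num_rows -> no caption
```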

docs/src/main/paradox/_template/page.st

Lines changed: 3 additions & 0 deletions
@@ -33,6 +33,9 @@
     .md-clear { clear: both; }
     table { font-size: 80%; }
     code { font-size: 0.75em !important; }
+    table a {
+      word-break: break-all;
+    }
   </style>
 </head>
 
pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 5 additions & 5 deletions
@@ -35,14 +35,14 @@ We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute
 
 ```python, tile_mean
 means = rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
-display(means)
+means
 ```
 
 We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
 
 ```python, agg_mean
 mean = rf.agg(rf_agg_mean(F.col('tile')))
-display(mean)
+mean
 ```
 
 We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
@@ -62,7 +62,7 @@ We can also count the total number of data and NoData cells over all the _tiles_
 ```python, cell_counts
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
 stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
-display(stats)
+stats
 ```
 
 ## Statistical Summaries
@@ -79,15 +79,15 @@ stats.printSchema()
 ```
 
 ```python, show_stats
-display(stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance'))
+stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
 ```
 
 The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.
 
 ```python, agg_stats
 stats = rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
     .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
-display(stats)
+stats
 ```
 
 The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.
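
For reference, the `rf` DataFrame these hunks operate on is described in the surrounding prose as two rows of 25-cell _tiles_ holding 1.0 and 3.0 respectively. A hedged sketch of how such a frame could be constructed, reusing the `Tile`/`CellType`/`Row` pattern that appears later in this commit (nodata-handling.pymd); the import path and column names are assumptions, not taken from aggregation.pymd itself:

```python
# Assumed setup matching the description above: two rows, one 5x5 tile of 1.0
# and one 5x5 tile of 3.0, so rf_tile_mean yields 1.0 and 3.0 per row while
# rf_agg_mean over the whole frame yields 2.0.
import numpy as np
from pyspark.sql import Row
from pyrasterframes.rf_types import Tile, CellType

rf = spark.createDataFrame([
    Row(id=1, tile=Tile(np.ones((5, 5)), CellType.float64())),
    Row(id=2, tile=Tile(np.ones((5, 5)) * 3, CellType.float64())),
])
```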

pyrasterframes/src/main/python/docs/getting-started.pymd

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/201
 # Add 3 element-wise, show some rows of the DataFrame
 sample = df.withColumn('added', rf_local_add(df.proj_raster, lit(3))) \
     .select(rf_crs('added'), rf_extent('added'), rf_tile('added'))
-display(sample)
+sample
 ```
 
 This example is extended in the [getting started Jupyter notebook](https://nbviewer.jupyter.org/github/locationtech/rasterframes/blob/develop/rf-notebook/src/main/notebooks/Getting%20Started.ipynb).

pyrasterframes/src/main/python/docs/languages.pymd

Lines changed: 3 additions & 3 deletions
@@ -60,7 +60,7 @@ result = red_nir_tiles_monthly_2017 \
     .agg(rf_agg_stats(rf_normalized_difference(col('nir'), col('red'))).alias('ndvi_stats')) \
     .orderBy(col('month')) \
     .select('month', 'ndvi_stats.*')
-display(result)
+result
 ```
 
 ## SQL
@@ -87,7 +87,7 @@ SELECT granule_id, month(acquisition_date) as month, B01 as red, B02 as nir
 FROM modis
 WHERE year(acquisition_date) = 2017 AND day(acquisition_date) = 15 AND granule_id = 'h21v09'
 """)
-display(sql('DESCRIBE red_nir_monthly_2017'))
+sql('DESCRIBE red_nir_monthly_2017')
 ```
 
 ### Step 3: Read tiles
@@ -116,7 +116,7 @@ SELECT month, ndvi_stats.* FROM (
   ORDER BY month
 )
 """)
-display(grouped)
+grouped
 ```
 
 ## Scala
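
The SQL hunks above assume a `modis` temporary view and a `sql` helper that are set up earlier in languages.pymd and do not appear in this diff. A hedged sketch of the kind of setup that makes `sql('DESCRIBE red_nir_monthly_2017')` work; `modis_catalog` is a placeholder name:

```python
# Assumed setup (not shown in this diff): bind `sql` to the active session and
# register the catalog DataFrame as the `modis` view the queries above reference.
sql = spark.sql
modis_catalog.createOrReplaceTempView('modis')

# After the query above that defines red_nir_monthly_2017 has run:
sql('DESCRIBE red_nir_monthly_2017')
```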

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 10 additions & 10 deletions
@@ -41,7 +41,7 @@ We can also inspect the cell type of a given _tile_ or `proj_raster` column.
 ```python, ct_from_sen
 cell_types = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
     .select(rf_cell_type('proj_raster')).distinct()
-display(cell_types)
+cell_types
 ```
 
 ### Understanding Cell Types and NoData
@@ -96,7 +96,7 @@ unmasked.printSchema()
 
 ```python, show_cell_types
 cell_types = unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
-display(cell_types)
+cell_types
 ```
 
 Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create new _tile_ columns that are indicators of unwanted pixels, as defined above. Since the mask column is an integer type, the addition is equivalent to a logical or, so the boolean true values are 1.
@@ -116,14 +116,14 @@ one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
     .withColumn('mask', rf_local_add('mask', 'cirrus'))
 
 cell_types = one_mask.select(rf_cell_type('mask')).distinct()
-display(cell_types)
+cell_types
 ```
 
 Because there is not a NoData already defined, we will choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value.
 
 ```python, pick_nd
 blue_min = one_mask.agg(rf_agg_stats('blue').min.alias('blue_min'))
-display(blue_min)
+blue_min
 ```
 
 We can now construct the cell type string for our blue band's cell type, designating 0 as NoData.
@@ -147,7 +147,7 @@ We can verify that the number of NoData cells in the resulting `blue_masked` col
 
 ```python, show_masked
 counts = masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask'))
-display(counts)
+counts
 ```
 
 It's also nice to view a sample. The white regions are areas of NoData.
@@ -258,7 +258,7 @@ y = Tile((np.ones((100, 100))*3), CellType.int32())
 rf = spark.createDataFrame([Row(x=x, y=y)])
 
 cell_types = rf.select(rf_cell_type('x'), rf_cell_type('y')).distinct()
-display(cell_types)
+cell_types
 ```
 
 When performing a local operation between _tile_ columns with cell types `int` and `float`, the resulting _tile_ cell type will be `float`. In local algebra over two _tiles_ of different "sized" cell types, the resulting cell type will be the larger of the two input _tiles'_ cell types.
@@ -269,7 +269,7 @@ sums = rf.select(
     rf_cell_type('y'),
     rf_cell_type(rf_local_add('x', 'y')).alias('xy_sum'),
 )
-display(sums)
+sums
 ```
 
 Combining _tile_ columns of different cell types gets a little trickier when user defined NoData cell types are involved. Let's create two _tile_ columns: one with a NoData value of 1, and one with a NoData value of 2 (using our previously defined `get_nodata_ct` function).
@@ -285,15 +285,15 @@ Let's try adding the _tile_ columns with different NoData values. When there is
 ```python, show_3
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_2', 'x_nd_1'))
 cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
-display(cell_types)
+cell_types
 ```
 
 Reversing the order of the sum changes the NoData value of the resulting column to 2.
 
 ```python, show_4
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_1', 'x_nd_2'))
 cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
-display(cell_types)
+cell_types
 ```
 
 ## NoData Values in Aggregation
@@ -324,5 +324,5 @@ The results of `rf_tile_sum` vary on the _tiles_ that were masked. This is becau
 
 ```python, show_5
 sums = masked_rf.select(rf_tile_sum('tile'), rf_tile_sum('tile_nd_1'), rf_tile_sum('tile_nd_2'))
-display(sums)
+sums
 ```
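
The `get_nodata_ct` helper referenced in these hunks is defined earlier in nodata-handling.pymd and is not part of this diff. Purely as an illustration of the idea (not the document's actual definition), a helper along these lines could build a cell type with a user-defined NoData value using the cell type string convention mentioned above, where e.g. `uint16ud1` designates 1 as NoData:

```python
# Hypothetical helper (NOT the definition used in the doc): construct a cell
# type whose NoData value is `nd`, via the '<base type>ud<value>' string form.
from pyrasterframes.rf_types import CellType

def get_nodata_ct(nd, base='uint16'):
    return CellType('{}ud{}'.format(base, nd))
```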

pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 8 additions & 8 deletions
@@ -36,15 +36,15 @@ parts = rf.select(
     rf_extent("proj_raster").alias("extent"),
     rf_tile("proj_raster").alias("tile")
 )
-display(parts)
+parts
 ```
 
 
 You can also see that the single raster has been broken out into many arbitrary non-overlapping regions. Doing so takes advantage of parallel in-memory reads from the cloud hosted data source and allows Spark to work on manageable amounts of data per task. The following code fragment shows us how many subtiles were created from a single source image.
 
 ```python, count_by_uri
 counts = rf.groupby(rf.proj_raster_path).count()
-display(counts)
+counts
 ```
 
 Let's select a single _tile_ and view it. The _tile_ preview image as well as the string representation provide some basic information about the _tile_: its dimensions as numbers of columns and rows and the cell type, or data type of all the cells in the _tile_. For more about cell types, refer to @ref:[this discussion](nodata-handling.md#cell-types).
@@ -106,7 +106,7 @@ print("Available scenes: ", modis_catalog.count())
 ```
 
 ```python, show_catalog
-display(modis_catalog)
+modis_catalog
 ```
 
 MODIS data products are delivered on a regular, consistent grid, making identification of a specific area over time easy using [`(h,v)`](https://modis-land.gsfc.nasa.gov/MODLAND_grid.html) grid coordinates (see below).
@@ -117,7 +117,7 @@ For example, MODIS data right above the equator is all grid coordinates with `v0
 
 ```python, catalog_filtering
 equator = modis_catalog.where(F.col('gid').like('%v07%'))
-display(equator.select('date', 'gid'))
+equator.select('date', 'gid')
 ```
 
 Now that we have prepared our catalog, we simply pass the DataFrame or CSV string to the `raster` DataSource to load the imagery. The `catalog_col_names` parameter gives the columns that contain the URI's to be read.
@@ -134,7 +134,7 @@ Observe the schema of the resulting DataFrame has a projected raster struct for
 
 ```python, cat_read_sample
 sample = rf.select('gid', rf_extent('red'), rf_extent('nir'), rf_tile('red'), rf_tile('nir'))
-display(sample.limit(3))
+sample.limit(3)
 ```
 
 ## Lazy Raster Reading
@@ -145,13 +145,13 @@ Consider the following two reads of the same data source. In the first, the lazy
 
 ```python, lazy_demo_1
 uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
-lazy = spark.read.raster(uri).select('proj_raster.tile').limit(1)
-display(lazy)
+lazy = spark.read.raster(uri).select('proj_raster.tile')
+lazy
 ```
 
 ```python, lazy_demo_2
 non_lazy = spark.read.raster(uri, lazy_tiles=False).select('proj_raster.tile')
-display(non_lazy)
+non_lazy
 ```
 
 In the initial examples on this page, you may have noticed that the realized (non-lazy) _tiles_ are shown, but we did not change `lazy_tiles`. Instead, we used @ref:[`rf_tile`](reference.md#rf-tile) to explicitly request the realized _tile_ from the lazy representation.
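
As the catalog hunks above describe, the prepared catalog DataFrame (or its CSV rendering) is handed to the `raster` reader together with `catalog_col_names`. A hedged sketch of that call; passing the DataFrame positionally and the URI column names (`B01`, `B02`) are assumptions, not necessarily what raster-read.pymd uses:

```python
# Assumed catalog read: each row of `equator` supplies one URI per listed
# column, and the reader yields a tile column per catalog column.
rf = spark.read.raster(equator, catalog_col_names=['B01', 'B02'])
rf.printSchema()
```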

pyrasterframes/src/main/python/docs/supervised-learning.pymd

Lines changed: 2 additions & 2 deletions
@@ -114,7 +114,7 @@ df_mask.printSchema()
 
 ## Create ML Pipeline
 
-We import various Spark components that we need to construct our [Pipeline](https://spark.apache.org/docs/latest/ml-pipeline.html). These are the objects that will work in sequence to conduct the data preparation and modeling.
+We import various Spark components that we need to construct our [`Pipeline`](https://spark.apache.org/docs/latest/ml-pipeline.html). These are the objects that will work in sequence to conduct the data preparation and modeling.
 
 ```python, imports, echo=True
 from pyrasterframes import TileExploder
@@ -186,7 +186,7 @@ cnf_mtrx = prediction_df.groupBy(classifier.getPredictionCol()) \
     .pivot(classifier.getLabelCol()) \
     .count() \
     .sort(classifier.getPredictionCol())
-display(cnf_mtrx)
+cnf_mtrx
 ```
 
 ## Visualize Prediction
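
As a companion to the confusion-matrix hunk above, overall accuracy can be computed from the same `prediction_df` with Spark's built-in evaluator. A hedged sketch (column names are pulled from the `classifier` object exactly as in the hunk; whether the doc does this elsewhere is not shown in this diff):

```python
# Assumed companion to the confusion matrix: score overall accuracy using the
# prediction and label columns the classifier was configured with.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    predictionCol=classifier.getPredictionCol(),
    labelCol=classifier.getLabelCol(),
    metricName='accuracy')
print(evaluator.evaluate(prediction_df))
```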

pyrasterframes/src/main/python/docs/time-series.pymd

Lines changed: 1 addition & 1 deletion
@@ -54,7 +54,7 @@ m.save(temp_folium)
 with open(temp_folium, 'rb') as f:
     b64 = base64.b64encode(f.read())
 with open('docs/static/cuya.md', 'w') as md:
-    md.write('<iframe src="data:text/html;charset=utf-8;base64,{}" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" style="position:relative;width:100%;height:500"></iframe>'.format(b64.decode('utf-8')))
+    md.write('<iframe src="data:text/html;charset=utf-8;base64,{}" allowfullscreen="" webkitallowfullscreen="" mozallowfullscreen="" style="position:relative;width:100%;height:500px"></iframe>'.format(b64.decode('utf-8')))
 # seems that the height is not correct?
 ```

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 8 additions & 3 deletions
@@ -74,7 +74,7 @@ For this problem, we will use the K-means clustering algorithm and configure our
 kmeans = KMeans().setK(5).setFeaturesCol('features')
 ```
 
-We can combine the above stages into a single _pipeline_.
+We can combine the above stages into a single [`Pipeline`](https://spark.apache.org/docs/latest/ml-pipeline.html).
 
 ```python, pipeline
 pipeline = Pipeline().setStages([exploder, assembler, kmeans])
@@ -92,7 +92,12 @@ We can use the `transform` function to score the training data in the fitted _pi
 
 ```python, transform
 clustered = model.transform(df)
-display(clustered)
+```
+
+Now let's take a look at some sample output.
+
+```python, view_predictions
+clustered.select('prediction', 'extent', 'column_index', 'row_index', 'features')
 ```
 
 If we want to inspect the model statistics, the SparkML API requires us to go through this unfortunate contortion to access the clustering results:
@@ -126,7 +131,7 @@ retiled.printSchema()
 ```
 
 ```python, display
-display(retiled)
+retiled
 ```
 
 The resulting output is shown below.
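
Two small, hedged follow-ups to the scoring hunk above, assuming the `clustered` DataFrame and fitted `model` from that hunk: cluster sizes can be read straight off the scored DataFrame, and the fitted K-means stage can be pulled out of the pipeline model to inspect its centers.

```python
# How many exploded cells landed in each cluster (column name from the hunk above).
clustered.groupBy('prediction').count().orderBy('prediction')

# The fitted KMeansModel is the last stage of the fitted pipeline model.
kmeans_model = model.stages[-1]
print(kmeans_model.clusterCenters())
```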
