Commit e627024

Reworked rendering of DataFrames in IPython to use display and the formatter API.
1 parent 3541cd6 commit e627024

16 files changed: 157 additions, 101 deletions
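The commit message references IPython's formatter API, but the Python-side registration happens in files not shown in this view. As a rough sketch of the pattern (the helper below is hypothetical, not the actual pyrasterframes source), registering an HTML formatter for Spark DataFrames looks like this:

```python
# Sketch of the IPython formatter-API pattern; `_df_to_html` is a hypothetical
# stand-in that mirrors the Scala `toHTML` method added in this commit.
from IPython import get_ipython
from pyspark.sql import DataFrame

def _df_to_html(df, num_rows=5):
    # Build a simple HTML table from the first `num_rows` rows.
    header = ''.join('<th>{}</th>'.format(c) for c in df.columns)
    body = ''.join(
        '<tr>' + ''.join('<td>{}</td>'.format(v) for v in row) + '</tr>'
        for row in df.limit(num_rows).collect())
    return '<table><thead><tr>{}</tr></thead><tbody>{}</tbody></table>'.format(header, body)

ip = get_ipython()
if ip is not None:
    # Once registered, display(df) and a bare `df` at the prompt render as HTML.
    ip.display_formatter.formatters['text/html'].for_type(DataFrame, _df_to_html)
```

With such a formatter in place, the documentation changes below can simply call `display(...)` instead of `.show()`.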

core/src/main/scala/org/locationtech/rasterframes/util/package.scala
Lines changed: 19 additions & 0 deletions

@@ -203,6 +203,25 @@ package object util {
       .mkString("| ", " |\n| ", " |")
     header + body
   }
+
+  def toHTML(numRows: Int = 5, truncate: Boolean = false): String = {
+    import df.sqlContext.implicits._
+    val cols = df.columns
+    val header = "<thead>\n" + cols.mkString("<tr><th>", "</th><th>", "</th></tr>\n") + "</thead>\n"
+    val stringifiers = cols
+      .map(c => s"`$c`")
+      .map(c => df.col(c).cast(StringType))
+      .map(c => if (truncate) substring(c, 1, 40) else c)
+    val cat = concat_ws("</td><td>", stringifiers: _*)
+    val body = df
+      .select(cat).limit(numRows)
+      .as[String]
+      .collect()
+      .mkString("<tr><td>", "</td></tr>\n<tr><td>", "</td></tr>\n")
+
+
+    "<table>\n" + header + "<tbody>\n" + body + "</tbody>\n" + "</table>"
+  }
 }

 object Shims {
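On the notebook side, a string like this method's output can be handed directly to IPython for rendering; a minimal sketch, assuming `html_str` stands in for the generated markup:

```python
# Render a pre-built HTML table string in IPython; `html_str` is a placeholder
# for the markup produced by a toHTML-style renderer like the one above.
from IPython.display import HTML, display

html_str = '<table>\n<thead>\n<tr><th>id</th></tr>\n</thead>\n<tbody>\n<tr><td>1</td></tr>\n</tbody>\n</table>'
display(HTML(html_str))
```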

core/src/test/scala/org/locationtech/rasterframes/ExtensionMethodSpec.scala
Lines changed: 10 additions & 0 deletions

@@ -28,6 +28,8 @@ import geotrellis.spark.{KeyBounds, SpatialKey, TileLayerMetadata}
 import org.apache.spark.sql.Encoders
 import org.locationtech.rasterframes.util.SubdivideSupport

+import scala.xml.parsing.XhtmlParser
+
 /**
  * Tests miscellaneous extension methods.
  *
@@ -114,5 +116,13 @@ class ExtensionMethodSpec extends TestEnvironment with TestData with SubdivideSu
     import org.locationtech.rasterframes.util._
     rf.toMarkdown().count(_ == '|') shouldBe >=(3 * 5)
   }
+
+  it("should render HTML") {
+    import org.locationtech.rasterframes.util._
+
+    noException shouldBe thrownBy {
+      XhtmlParser(scala.io.Source.fromString(rf.toHTML()))
+    }
+  }
   }
 }

pyrasterframes/src/main/python/docs/aggregation.pymd
Lines changed: 14 additions & 13 deletions

@@ -33,14 +33,16 @@ print(tiles[1]['tile'].cells)

 We use the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the _tile_ aggregate mean of cells in each row of column `tile`. The mean of each _tile_ is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.

-```python, tile_mean, results='raw'
-rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show()
+```python, tile_mean
+means = rf.select(F.col('id'), rf_tile_mean(F.col('tile')))
+display(means)
 ```

 We use the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.

-```python, agg_mean, results='raw'
-rf.agg(rf_agg_mean(F.col('tile'))).show()
+```python, agg_mean
+mean = rf.agg(rf_agg_mean(F.col('tile')))
+display(mean)
 ```

 We use the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. For this aggregation, we are computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the _tile_.
@@ -57,11 +59,10 @@ print(t.cells)

 We can also count the total number of data and NoData cells over all the _tiles_ in a DataFrame using @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells) and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells). There are ~3.8 million data cells and ~1.9 million NoData cells in this DataFrame. See the section on @ref:["NoData" handling](nodata-handling.md) for additional discussion on handling missing data.

-```python, cell_counts, results='raw'
+```python, cell_counts
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/MCD43A4.006/11/05/2018233/MCD43A4.A2018233.h11v05.006.2018242035530_B02.TIF')
 stats = rf.agg(rf_agg_data_cells('proj_raster'), rf_agg_no_data_cells('proj_raster'))
-
-stats.show()
+display(stats)
 ```

 ## Statistical Summaries
@@ -77,16 +78,16 @@ stats = rf.select(rf_tile_stats('proj_raster').alias('stats'))
 stats.printSchema()
 ```

-```python, show_stats, results='raw'
-stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance').show(10, truncate=False)
+```python, show_stats
+display(stats.select('stats.min', 'stats.max', 'stats.mean', 'stats.variance'))
 ```

 The @ref:[`rf_agg_stats`](reference.md#rf-agg-stats) function aggregates over all of the _tiles_ in a DataFrame and returns a statistical summary of all cell values as shown below.

-```python, agg_stats, results='raw'
-rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
-    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance') \
-    .show()
+```python, agg_stats
+stats = rf.agg(rf_agg_stats('proj_raster').alias('stats')) \
+    .select('stats.min', 'stats.max', 'stats.mean', 'stats.variance')
+display(stats)
 ```

 The @ref:[`rf_agg_local_stats`](reference.md#rf-agg-local-stats) function computes the element-wise local aggregate statistical summary as shown below. The DataFrame used in the previous two code blocks has unequal _tile_ dimensions, so a different DataFrame is used in this code block to avoid a runtime error.

pyrasterframes/src/main/python/docs/getting-started.pymd
Lines changed: 4 additions & 4 deletions

@@ -34,17 +34,17 @@ spark = pyrasterframes.get_spark_session()

 Then, you can read a raster and work with it in a Spark DataFrame.

-```python, local_add, results='raw'
+```python, local_add
 from pyrasterframes.rasterfunctions import *
 from pyspark.sql.functions import lit

 # Read a MODIS surface reflectance granule
 df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF')

 # Add 3 element-wise, show some rows of the DataFrame
-df.withColumn('added', rf_local_add(df.proj_raster, lit(3))) \
-    .select(rf_crs('added'), rf_extent('added'), rf_tile('added')) \
-    .show(3)
+sample = df.withColumn('added', rf_local_add(df.proj_raster, lit(3))) \
+    .select(rf_crs('added'), rf_extent('added'), rf_tile('added'))
+display(sample)
 ```

 This example is extended in the [getting started Jupyter notebook](https://nbviewer.jupyter.org/github/locationtech/rasterframes/blob/develop/rf-notebook/src/main/notebooks/Getting%20Started.ipynb).

pyrasterframes/src/main/python/docs/index.md
Lines changed: 2 additions & 0 deletions

@@ -10,6 +10,8 @@ The source code can be found on GitHub at [locationtech/rasterframes](https://gi

 <img src="RasterFramePipeline.png" width="600px"/>

+RasterFrames is released under the [Apache 2.0 License](https://github.com/locationtech/rasterframes/blob/develop/LICENSE).
+
 <hr/>

 @@@ div { .md-left}

pyrasterframes/src/main/python/docs/languages.pymd
Lines changed: 8 additions & 7 deletions

@@ -50,7 +50,7 @@ red_nir_tiles_monthly_2017 = spark.read.raster(

 ### Step 4: Compute aggregates

-```python, step_4_python, results='raw'
+```python, step_4_python
 result = red_nir_tiles_monthly_2017 \
     .where(st_intersects(
         st_reproject(rf_geometry(col('red')), rf_crs(col('red')).crsProj4, rf_mk_crs('EPSG:4326')),
@@ -60,7 +60,7 @@ result = red_nir_tiles_monthly_2017 \
     .agg(rf_agg_stats(rf_normalized_difference(col('nir'), col('red'))).alias('ndvi_stats')) \
     .orderBy(col('month')) \
     .select('month', 'ndvi_stats.*')
-result.show()
+display(result)
 ```

 ## SQL
@@ -80,14 +80,14 @@ sql("CREATE OR REPLACE TEMPORARY VIEW modis USING `aws-pds-modis-catalog`")

 ### Step 2: Down-select data by month

-```python, step_2_sql, results='raw'
+```python, step_2_sql
 sql("""
 CREATE OR REPLACE TEMPORARY VIEW red_nir_monthly_2017 AS
 SELECT granule_id, month(acquisition_date) as month, B01 as red, B02 as nir
 FROM modis
 WHERE year(acquisition_date) = 2017 AND day(acquisition_date) = 15 AND granule_id = 'h21v09'
 """)
-sql('DESCRIBE red_nir_monthly_2017').show()
+display(sql('DESCRIBE red_nir_monthly_2017'))
 ```

 ### Step 3: Read tiles
@@ -106,16 +106,17 @@ OPTIONS (

 ### Step 4: Compute aggregates

-```python, step_4_sql, results='raw'
-sql("""
+```python, step_4_sql
+grouped = sql("""
 SELECT month, ndvi_stats.* FROM (
     SELECT month, rf_agg_stats(rf_normalized_difference(nir, red)) as ndvi_stats
     FROM red_nir_tiles_monthly_2017
     WHERE st_intersects(st_reproject(rf_geometry(red), rf_crs(red), 'EPSG:4326'), st_makePoint(34.870605, -4.729727))
     GROUP BY month
     ORDER BY month
 )
-""").show()
+""")
+display(grouped)
 ```

 ## Scala

pyrasterframes/src/main/python/docs/nodata-handling.pymd
Lines changed: 34 additions & 24 deletions

@@ -38,9 +38,10 @@ CellType.float64()

 We can also inspect the cell type of a given _tile_ or `proj_raster` column.

-```python, ct_from_sen, results='raw'
-spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
-    .select(rf_cell_type('proj_raster')).distinct().show()
+```python, ct_from_sen
+cell_types = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
+    .select(rf_cell_type('proj_raster')).distinct()
+display(cell_types)
 ```

 ### Understanding Cell Types and NoData
@@ -93,13 +94,14 @@ unmasked = spark.read.raster(catalog=cat, catalog_col_names=['blue', 'scl'])
 unmasked.printSchema()
 ```

-```python, show_cell_types, results='raw'
-unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct().show()
+```python, show_cell_types
+cell_types = unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
+display(cell_types)
 ```

 Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create new _tile_ columns that are indicators of unwanted pixels, as defined above. Since the mask column is an integer type, the addition is equivalent to a logical or, so the boolean true values are 1.

-```python, def_mask, results='raw'
+```python, def_mask
 from pyspark.sql.functions import lit

 mask_part = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0))) \
@@ -113,13 +115,15 @@ one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
     .withColumn('mask', rf_local_add('mask', 'cloud9')) \
     .withColumn('mask', rf_local_add('mask', 'cirrus'))

-one_mask.select(rf_cell_type('mask')).distinct().show()
+cell_types = one_mask.select(rf_cell_type('mask')).distinct()
+display(cell_types)
 ```

 Because there is not a NoData already defined, we will choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value.

-```python, pick_nd, results='raw'
-one_mask.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
+```python, pick_nd
+blue_min = one_mask.agg(rf_agg_stats('blue').min.alias('blue_min'))
+display(blue_min)
 ```

 We can now construct the cell type string for our blue band's cell type, designating 0 as NoData.
@@ -135,14 +139,15 @@ Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) to
 ```python, mask_blu
 with_nd = rf_convert_cell_type('blue', masked_blue_ct)
 masked = one_mask.withColumn('blue_masked',
-                    rf_mask_by_value(with_nd, 'mask', lit(1))) \
-                .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
+                             rf_mask_by_value(with_nd, 'mask', lit(1))) \
+                 .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
 ```

 We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.

-```python, show_masked, results='raw'
-masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask')).show(10)
+```python, show_masked
+counts = masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask'))
+display(counts)
 ```

 It's also nice to view a sample. The white regions are areas of NoData.
@@ -247,22 +252,24 @@ RasterFrames supports having _tile_ columns with different cell types in a singl

 Let's first create a RasterFrame that has columns of `float` and `int` cell type.

-```python, show_1, results='raw'
+```python, show_1
 x = Tile((np.ones((100, 100))*2), CellType.float64())
 y = Tile((np.ones((100, 100))*3), CellType.int32())
 rf = spark.createDataFrame([Row(x=x, y=y)])

-rf.select(rf_cell_type('x'), rf_cell_type('y')).distinct().show()
+cell_types = rf.select(rf_cell_type('x'), rf_cell_type('y')).distinct()
+display(cell_types)
 ```

 When performing a local operation between _tile_ columns with cell types `int` and `float`, the resulting _tile_ cell type will be `float`. In local algebra over two _tiles_ of different "sized" cell types, the resulting cell type will be the larger of the two input _tiles'_ cell types.

-```python, show_2, results='raw'
-rf.select(
+```python, show_2
+sums = rf.select(
     rf_cell_type('x'),
     rf_cell_type('y'),
     rf_cell_type(rf_local_add('x', 'y')).alias('xy_sum'),
-).show(1)
+)
+display(sums)
 ```

 Combining _tile_ columns of different cell types gets a little trickier when user defined NoData cell types are involved. Let's create two _tile_ columns: one with a NoData value of 1, and one with a NoData value of 2 (using our previously defined `get_nodata_ct` function).
@@ -275,16 +282,18 @@ rf_nd = spark.createDataFrame([Row(x_nd_1=x_nd_1, x_nd_2=x_nd_2)])

 Let's try adding the _tile_ columns with different NoData values. When there is an inconsistent NoData value in the two columns, the NoData value of the right-hand side of the sum is kept. In this case, this means the result has a NoData value of 1.

-```python, show_3, results='raw'
+```python, show_3
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_2', 'x_nd_1'))
-rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct().show()
+cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
+display(cell_types)
 ```

 Reversing the order of the sum changes the NoData value of the resulting column to 2.

-```python, show_4, results='raw'
+```python, show_4
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_1', 'x_nd_2'))
-rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct().show()
+cell_types = rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct()
+display(cell_types)
 ```

 ## NoData Values in Aggregation
@@ -313,6 +322,7 @@ masked_rf = rf.withColumn('tile_nd_1',

 The results of `rf_tile_sum` vary on the _tiles_ that were masked. This is because any cells with NoData values are ignored in the aggregation. Note that `tile_nd_2` has the lowest sum, since it has the fewest amount of data cells.

-```python, show_5, results='raw'
-masked_rf.select(rf_tile_sum('tile'), rf_tile_sum('tile_nd_1'), rf_tile_sum('tile_nd_2')).show()
+```python, show_5
+sums = masked_rf.select(rf_tile_sum('tile'), rf_tile_sum('tile_nd_1'), rf_tile_sum('tile_nd_2'))
+display(sums)
 ```

pyrasterframes/src/main/python/docs/raster-catalogs.pymd
Lines changed: 4 additions & 4 deletions

@@ -94,7 +94,7 @@ two_d_cat_df.show(truncate=False)

 The concept of a _catalog_ is much more powerful when we consider examples beyond constructing the DataFrame, and instead read the data from an external source. Here's an extended example of reading a cloud-hosted CSV file containing MODIS scene metadata and transforming it into a _catalog_. The metadata describing the content of each URL is an important aspect of processing raster data.

-```python, remote_csv, results='raw'
+```python, remote_csv
 from pyspark import SparkFiles
 from pyspark.sql import functions as F

@@ -104,20 +104,20 @@ scene_list = spark.read \
     .format("csv") \
     .option("header", "true") \
     .load(SparkFiles.get("2018-07-04_scenes.txt"))
-scene_list.show(4, truncate=False)
+display(scene_list)
 ```

 Observe the scenes list file has URIs to `index.html` files in the download_url column. The image URI's are in the same directory. The filenames are of the form `${gid}_B${band}.TIF`. The next code chunk builds these URIs, which completes our catalog.

-```python, show_remote_catalog, results='raw'
+```python, show_remote_catalog
 modis_catalog = scene_list \
     .withColumn('base_url',
         F.concat(F.regexp_replace('download_url', 'index.html$', ''), 'gid',)
     ) \
     .withColumn('B01' , F.concat('base_url', F.lit("_B01.TIF"))) \
     .withColumn('B02' , F.concat('base_url', F.lit("_B02.TIF"))) \
     .withColumn('B03' , F.concat('base_url', F.lit("_B03.TIF")))
-modis_catalog.show(4, truncate=True)
+display(modis_catalog)
 ```

 ## Using Built-in Catalogs
