Commit 0da9aa9

Merge pull request #540 from s22s/docs/write-geotiff-in-ML-examples
update docs on geotiff writing
2 parents 1c1995f + b903103 commit 0da9aa9

File tree

2 files changed: +43 −28 lines changed

pyrasterframes/src/main/python/docs/raster-write.pymd

Lines changed: 22 additions & 9 deletions
@@ -53,11 +53,29 @@ ov
 ## GeoTIFFs
 
-GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF.
+GeoTIFF is one of the most common file formats for spatial data, providing flexibility in data encoding, representation, and storage. RasterFrames provides a specialized Spark DataFrame writer for rendering a RasterFrame to a GeoTIFF. It is accessed by calling `dataframe.write.geotiff`.
 
-One downside to GeoTIFF is that it is <b><u>not</u></b> a big-data native format. To create a GeoTIFF, all the data to be encoded has to be in the memory of one computer (in Spark parlance, this is a "collect"), limiting its maximum size substantially compared to that of a full cluster environment. When rendering GeoTIFFs in RasterFrames, you must either specify the dimensions of the output raster, or deliberately limit the size of the collected data.
+### Limitations and mitigations
 
-Fortunately, we can use the cluster computing capability to downsample the data into a more manageable size. For sake of example, let's render an overview of a scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
+One downside to GeoTIFF is that it is <b><u>not</u></b> a big-data native format. To create a GeoTIFF, all the data to be written must be `collect`ed in the memory of the Spark driver. This means you must actively limit the size of the data to be written: it is easy to lazily read a set of inputs that cannot feasibly be written to a GeoTIFF in the same environment.
+
+When writing GeoTIFFs in RasterFrames, you should limit the size of the collected data. Consider filtering the dataframe by time or @ref:[spatial filters](vector-data.md#geomesa-functions-and-spatial-relations).
+
+You can also specify the dimensions of the GeoTIFF file to be written using the `raster_dimensions` parameter as described below.
+
+### Parameters
+
+If there are many _tile_ or projected raster columns in the DataFrame, the GeoTIFF writer will write each one as a separate band in the file. Each band in the output will be tagged with the input column names for reference.
+
+* `path`: the path, local to the driver, where the file will be written
+* `crs`: the PROJ4 string of the CRS the GeoTIFF is to be written in
+* `raster_dimensions`: optional; a tuple of two ints giving the size of the resulting file. If specified, RasterFrames will downsample the data in a distributed fashion using bilinear resampling. If not specified, the default is to write the dataframe at full resolution, which can result in an `OutOfMemoryError`.
+
+### Example
+
+See also the example in the @ref:[unsupervised learning page](unsupervised-learning.md).
+
+Let's render an overview of a scene's red band as a small raster, reprojecting it to latitude and longitude coordinates on the [WGS84](https://en.wikipedia.org/wiki/World_Geodetic_System) reference ellipsoid (aka [EPSG:4326](https://spatialreference.org/ref/epsg/4326/)).
 
 ```python write_geotiff
 outfile = os.path.join('/tmp', 'geotiff-overview.tif')
@@ -78,14 +96,9 @@ with rasterio.open(outfile) as src:
              histtype='stepfilled', title="Overview Histogram")
 ```
 
-If there are many _tile_ or projected raster columns in the DataFrame, the GeoTIFF writer will write each one as a separate band in the file. Each band in the output will be tagged the input column names for reference.
-
-@@@ note
-If no `raster_dimensions` column is specified the DataFrame contents are written at full resolution. As shown in the example above, you can also specify the size of the output GeoTIFF. Bilinear resampling is used.
-@@@
 
 @@@ warning
-Attempting to write a full resolution GeoTIFF constructed from multiple scenes is likely to result in an out of memory error. The `raster_dimensions` parameter needs to be used in these cases.
+Attempting to write a full resolution GeoTIFF constructed from multiple scenes is likely to result in an out of memory error. Consider filtering the dataframe more aggressively and using a smaller value for the `raster_dimensions` parameter.
 @@@
 
 ### Color Composites
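To make the `raster_dimensions` guidance above concrete, here is a small, hypothetical helper (not part of RasterFrames) that computes an output size capped at a maximum dimension while preserving the source aspect ratio. The scene dimensions and the commented `write.geotiff` call are illustrative only, echoing the parameters documented in the diff above.

```python
def overview_dimensions(src_cols, src_rows, max_dim=1024):
    """Cap an output raster at `max_dim` pixels on its longest side,
    preserving the source aspect ratio. Returns a (cols, rows) tuple
    suitable for a `raster_dimensions`-style parameter."""
    scale = min(1.0, max_dim / max(src_cols, src_rows))
    return (max(1, round(src_cols * scale)), max(1, round(src_rows * scale)))

# A nominal 7761 x 7871 scene capped at 1024 pixels on the long side:
dims = overview_dimensions(7761, 7871, 1024)
print(dims)  # → (1010, 1024)

# Hypothetical usage with the writer described above:
# df.write.geotiff('/tmp/overview.tif', crs='EPSG:4326', raster_dimensions=dims)
```

Only the target size is chosen here; per the docs above, RasterFrames performs the actual downsampling in a distributed fashion using bilinear resampling.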

pyrasterframes/src/main/python/docs/unsupervised-learning.pymd

Lines changed: 21 additions & 19 deletions
@@ -30,19 +30,28 @@ from pyspark.ml import Pipeline
 
 The first step is to create a Spark DataFrame of our imagery data. To achieve that we will create a catalog DataFrame using the pattern from [the I/O page](raster-io.html#Single-Scene--Multiple-Bands). In the catalog, each row represents a distinct area and time, and each column is the URI to a band's image product. The resulting Spark DataFrame may have many rows per URI, with a column corresponding to each band.
 
-
 ```python, catalog
 filenamePattern = "https://rasterframes.s3.amazonaws.com/samples/elkton/L8-B{}-Elkton-VA.tiff"
 catalog_df = pd.DataFrame([
     {'b' + str(b): filenamePattern.format(b) for b in range(1, 8)}
 ])
 
-df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns)
+tile_size = 256
+df = spark.read.raster(catalog_df, catalog_col_names=catalog_df.columns, tile_size=tile_size)
 df = df.withColumn('crs', rf_crs(df.b1)) \
        .withColumn('extent', rf_extent(df.b1))
 df.printSchema()
 ```
 
+In this small example, all the images in our `catalog_df` have the same @ref:[CRS](concepts.md#coordinate-reference-system-crs-), which we verify in the code snippet below. The `crs` object will be useful for visualization later.
+
+```python, crses
+crses = df.select('crs.crsProj4').distinct().collect()
+print('Found', len(crses), 'distinct CRS:', crses)
+assert len(crses) == 1
+crs = crses[0]['crsProj4']
+```
+
 ## Create ML Pipeline
 
 SparkML requires that each observation be in its own row, and features for each observation be packed into a single `Vector`. For this unsupervised learning problem, we will treat each _pixel_ as an observation and each band as a feature. The first step is to "explode" the _tiles_ into a single row per pixel. In RasterFrames, generally a pixel is called a @ref:[`cell`](concepts.md#cell).
@@ -51,7 +60,7 @@ SparkML requires that each observation be in its own row, and features for each
 exploder = TileExploder()
 ```
 
-To "vectorize" the the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
+To "vectorize" the band columns, we use the SparkML `VectorAssembler`. Each of the seven bands is a different feature.
 
 ```python, assembler
 assembler = VectorAssembler() \
@@ -111,29 +120,22 @@ We can recreate the tiled data structure using the metadata added by the `TileEx
 ```python, assemble
 from pyrasterframes.rf_types import CellType
 
-tile_dims = df.select(rf_dimensions(df.b1).alias('dims')).first()['dims']
 retiled = clustered.groupBy('extent', 'crs') \
     .agg(
         rf_assemble_tile('column_index', 'row_index', 'prediction',
-                         tile_dims['cols'], tile_dims['rows'], CellType.int8()).alias('prediction')
+                         tile_size, tile_size, CellType.int8())
     )
 ```
 
-The resulting output is shown below.
+Next we will @ref:[write the output to a GeoTIFF file](raster-write.md#geotiffs). Doing so works quickly and well here, for a few specific reasons that may not hold in general. We can write the data at full resolution, omitting the `raster_dimensions` argument, because we know the input raster dimensions are small. Also, the data is all in a single CRS, as we demonstrated above. Because the `catalog_df` has only a single row, we know the output GeoTIFF value at a given location corresponds to a single input. Finally, the `retiled` `DataFrame` has only a single `Tile` column, so the band interpretation is trivial.
 
 ```python, viz
-from pyrasterframes.rf_types import Extent
-aoi = Extent.from_row(
-    retiled.agg(rf_agg_reprojected_extent('extent', 'crs', 'epsg:3857')) \
-        .first()[0]
-)
-
-retiled.select(rf_agg_overview_raster('prediction', 558, 507, aoi, 'extent', 'crs'))
-```
+output_tif = 'unsupervised.tif'
 
+retiled.write.geotiff(output_tif, crs=crs)
 
-```python, viz-true-color, evaluate=False, echo=False
-#For comparison, the true color composite of the original data.
-# this is really dark
-df.select(rf_render_png('b4', 'b3', 'b2'))
-```
+with rasterio.open(output_tif) as src:
+    for b in range(1, src.count + 1):
+        print("Tags on band", b, src.tags(b))
+    show(src)
+```
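The explode/assemble round trip in the diff above can be sketched in plain NumPy: `TileExploder` emits one row per cell carrying `column_index` and `row_index`, and `rf_assemble_tile` inverts that per `(extent, crs)` group. This toy function only illustrates the indexing convention as I understand it; it is not the RasterFrames implementation.

```python
import numpy as np

def assemble_tile(cells, cols, rows):
    """Invert a tile 'explode': place (column_index, row_index, value)
    records back into a 2-D array, analogous to what rf_assemble_tile
    does for each (extent, crs) group."""
    tile = np.zeros((rows, cols), dtype=np.int8)
    for col, row, value in cells:
        tile[row, col] = value
    return tile

# Exploded records for a 2x2 tile of cluster predictions:
cells = [(0, 0, 1), (1, 0, 2), (0, 1, 3), (1, 1, 4)]
tile = assemble_tile(cells, cols=2, rows=2)
print(tile.tolist())  # → [[1, 2], [3, 4]]
```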
