Update documentation on masking

vpipkt · vpipkt · commit bf6326a26c38 · 2019-11-05T16:24:15.000-05:00
Signed-off-by: Jason T. Brown &lt;jason@astraea.earth&gt;
diff --git a/docs/src/main/paradox/reference.md b/docs/src/main/paradox/reference.md
@@ -207,7 +207,7 @@ SQL implementation does not accept a cell_type argument. It returns a float64 ce
 
 ## Masking and NoData
 
-See @ref:[NoData handling](nodata-handling.md) for conceptual discussion of cell types and NoData.
+See the @ref:[masking](masking.md) page for conceptual discussion of masking operations.
 
 There are statistical functions of the count of data and NoData values per `tile` and aggregate over a `tile` column: @ref:[`rf_data_cells`](reference.md#rf-data-cells), @ref:[`rf_no_data_cells`](reference.md#rf-no-data-cells), @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells), and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells).
 
diff --git a/pyrasterframes/src/main/python/docs/masking.pymd b/pyrasterframes/src/main/python/docs/masking.pymd
@@ -0,0 +1,135 @@
+# Masking
+
+```python setup, echo=False
+import pyrasterframes
+from pyrasterframes.utils import create_rf_spark_session
+from pyrasterframes.rasterfunctions import *
+import pyrasterframes.rf_ipython
+from IPython.display import display
+import pandas as pd
+import numpy as np
+from pyrasterframes.rf_types import Tile
+
+spark = create_rf_spark_session()
+```
+
+Masking is a common operation in raster processing. It is setting certain cells to the @ref:[NoData value](nodata-handling.md). This is usually done to remove low-quality observations from the raster processing. Another related use case is to @ref:["clip"](masking.md#clipping) a raster to a given polygon.  
+ 
+## Masking Example
+
+Let's demonstrate masking with a pair of bands of Sentinel-2 data. The measurement bands we will use, blue and green, have no defined NoData. They share quality information from a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
+
+![Sentinel-2 Scene Classification Values](static/sentinel-2-scene-classification-labels.png)
+
+Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
+
+The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so all _tiles_ are aligned across rows.
+
+```python, blue_scl_cat
+from pyspark.sql import Row
+
+blue_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
+green_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B03.tif'
+scl_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/SCL.tif'
+cat = spark.createDataFrame([Row(blue=blue_uri, green=green_uri, scl=scl_uri),])
+unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'green', 'scl'])
+unmasked.printSchema()
+```
+
+```python, show_cell_types
+unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
+```
+
+## Define CellType for Masked Tile
+
+Because there is not a NoData already defined for the blue band, we must choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value. We will construct a new `CellType` object to represent this. 
+
+```python, pick_nd
+blue_min = unmasked.agg(rf_agg_stats('blue').min.alias('blue_min'))
+print('Nonzero minimum value in the blue band:', blue_min.first())
+
+blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
+masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
+masked_blue_ct.cell_type_name
+```
+
+We next convert the blue band to this cell type.
+
+```python, convert_blue
+converted = unmasked.select('scl', 'green', rf_convert_cell_type('blue', masked_blue_ct).alias('blue'))
+```
+
+## Apply Mask from Quality Band
+
+Now we set cells of our `blue` column to NoData for all locations where the `scl` tile is in our set of undesirable values. This is the actual _masking_ operation.
+
+```python, apply_mask_blue
+from pyspark.sql.functions import lit
+
+masked = converted.withColumn('blue_masked', rf_mask_by_values('blue', 'scl', [0, 1, 8, 9, 10]))
+masked
+```
+
+We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.
+
+```python, show_masked_counts
+masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum(rf_local_is_in('scl', [0, 1, 8, 9, 10])))
+```
+
+It's also nice to view a sample. The white regions are areas of NoData.
+
+```python, display_blu, caption='Blue band masked against selected SCL values'
+sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
+display(sample[0])
+```
+
+And the original SCL data. The bright yellow is a cloudy region in the original image.
+
+```python, display_scl, caption='SCL tile for above'
+display(sample[1])
+```
+
+## Transferring Mask
+
+We can now apply the same mask from the blue column to the green column. Note here we have supressed the step of explicitly checking what a "safe" NoData value for the green band should be.  
+
+```python, mask_green
+masked.withColumn('green_masked', rf_mask(rf_convert_cell_type('green', masked_blue_ct), 'blue_masked'))  \
+      .orderBy(-rf_no_data_cells('blue_masked'))
+```
+
+## Clipping
+
+Clipping is the use of a polygon to determine the areas to mask in a raster. Typically the areas inside a polygon are retained and the cells outside are set to NoData. Given a geometry column on our DataFrame, we have to carry out three basic steps. First we have to ensure the vector geometry is correctly projected to the same @ref:[CRS](concepts.md#coordinate-reference-system-crs) as the raster. We'll continue with our example creating a simple polygon. Buffering a point will create an approximate circle.
+
+
+```python, reproject_geom
+to_rasterize = masked.withColumn('geom_4326', 
+                            st_bufferPoint(
+                                st_point(lit(-78.0783132), lit(38.3184340)), 
+                                lit(15000))) \
+                .withColumn('geom_native', st_reproject('geom_4326', rf_mk_crs('epsg:4326'), rf_crs('blue_masked')))
+```
+
+Second, we will rasterize the geometry, or burn-in the geometry into the same grid as the raster.
+
+```python, rasterize
+to_clip = to_rasterize.withColumn('clip_raster', 
+                                 rf_rasterize('geom_native', rf_geometry('blue_masked'), lit(1), rf_dimensions('blue_masked').cols, rf_dimensions('blue_masked').rows))
+
+# visualize some of the edges of our circle
+to_clip.select('clip_raster', 'blue_masked') \
+    .filter(rf_data_cells('clip_raster') > 20) \
+    .orderBy(rf_data_cells('clip_raster'))
+```
+
+Finally, we create a new _tile_ column with the blue band clipped to our circle. Again we will use the `rf_mask` function to pass the NoData regions along from the rasterized geometry.
+
+clipped = to_clip.select('blue_masked', 
+                         'clip_raster',
+                         rf_mask('blue_masked', 'clip_raster').alias('blue_clipped')) \
+                 .filter(rf_data_cells('clip_raster') > 20) \
+                 .orderBy(rf_data_cells('clip_raster')) 
+
+
+This kind of clipping technique is further used in @ref:[zonal statistics](zonal-algebra.pymd).
diff --git a/pyrasterframes/src/main/python/docs/nodata-handling.pymd b/pyrasterframes/src/main/python/docs/nodata-handling.pymd
@@ -2,7 +2,7 @@
 
 ## What is NoData?
 
-In raster operations, the preservation and correct processing of missing observations is very important. In [most DataFrames and in scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. However, a great deal of raster data is stored for space efficiency, which typically leads to use of integral values with a ["sentinel" value](https://en.wikipedia.org/wiki/Sentinel_value) designated to represent missing observations. This sentinel value varies across data products and is usually called the "NoData" value.
+In raster operations, the preservation and correct processing of missing observations is very important. In [most DataFrames and in scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. However, a great deal of raster data is stored for space efficiency, which typically leads to use of integral values with a ["sentinel" value](https://en.wikipedia.org/wiki/Sentinel_value) designated to represent missing observations. This sentinel value varies across data products and bands. In a generic sense, it is usually called the "NoData" value.
 
 RasterFrames provides a variety of functions to inspect and manage NoData within _tiles_.
 
@@ -75,85 +75,6 @@ print(CellType.float32().no_data_value())
 print(CellType.float32().with_no_data_value(-99.9).no_data_value())
 ```
 
-## Masking
-
-Let's continue the example above with Sentinel-2 data. Band 2 is blue and has no defined NoData. The quality information is in a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
-
-![Sentinel-2 Scene Classification Values](static/sentinel-2-scene-classification-labels.png)
-
-Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
-
-The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so the blue band and SCL _tiles_ are aligned across rows.
-
-```python, blue_scl_cat
-from pyspark.sql import Row
-
-blue_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
-scl_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/SCL.tif'
-cat = spark.createDataFrame([Row(blue=blue_uri, scl=scl_uri),])
-unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'scl'])
-unmasked.printSchema()
-```
-
-```python, show_cell_types
-cell_types = unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
-cell_types
-```
-
-Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create new _tile_ columns that are indicators of unwanted pixels, as defined above. Since the mask column is an integer type, the addition is equivalent to a logical or, so the boolean true values are 1.
-
-```python, def_mask
-from pyspark.sql.functions import lit
-
-mask = unmasked.withColumn('mask', rf_local_is_in('scl', [0, 1, 8, 9, 10]))
-
-cell_types = mask.select(rf_cell_type('mask')).distinct()
-cell_types
-```
-
-Because there is not a NoData already defined, we will choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value.
-
-```python, pick_nd
-blue_min = mask.agg(rf_agg_stats('blue').min.alias('blue_min'))
-blue_min
-```
-
-We can now construct the cell type string for our blue band's cell type, designating 0 as NoData.
-
-```python, get_ct_string
-blue_ct = mask.select(rf_cell_type('blue')).distinct().first()[0][0]
-masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
-masked_blue_ct.cell_type_name
-```
-
-Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) to designate the cloudy and other unwanted pixels as NoData in the blue column by converting the cell type and applying the mask.
-
-```python, mask_blu
-with_nd = rf_convert_cell_type('blue', masked_blue_ct)
-masked = mask.withColumn('blue_masked',
-    rf_mask_by_value(with_nd, 'mask', lit(1)))
-```
-
-We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.
-
-```python, show_masked
-counts = masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask'))
-counts
-```
-
-It's also nice to view a sample. The white regions are areas of NoData.
-
-```python, display_blu, caption='Blue band masked against selected SCL values'
-sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
-display(sample[0])
-```
-
-And the original SCL data. The bright yellow is a cloudy region in the original image.
-
-```python, display_scl, caption='SCL tile for above'
-display(sample[1])
-```
-
 ## NoData and Local Arithmetic
 
 Let's now explore how the presence of NoData affects @ref:[local map algebra](local-algebra.md) operations. To demonstrate the behavior, lets create two _tiles_. One _tile_ will have values of 0 and 1, and the other will have values of just 0.
diff --git a/pyrasterframes/src/main/python/docs/raster-processing.md b/pyrasterframes/src/main/python/docs/raster-processing.md
@@ -4,6 +4,7 @@
 
 * @ref:[Local Map Algebra](local-algebra.md)
 * @ref:["NoData" Handling](nodata-handling.md)
+* @ref:[Masking](masking.md)
 * @ref:[Zonal Map Algebra](zonal-algebra.md)
 * @ref:[Aggregation](aggregation.md)
 * @ref:[Time Series](time-series.md)