Skip to content

Commit bf6326a

Browse files
committed
Update documentation on masking
Signed-off-by: Jason T. Brown <[email protected]>
1 parent 103dc6c commit bf6326a

File tree

4 files changed

+138
-81
lines changed

4 files changed

+138
-81
lines changed

docs/src/main/paradox/reference.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -207,7 +207,7 @@ SQL implementation does not accept a cell_type argument. It returns a float64 ce
207207

208208
## Masking and NoData
209209

210-
See @ref:[NoData handling](nodata-handling.md) for conceptual discussion of cell types and NoData.
210+
See the @ref:[masking](masking.md) page for conceptual discussion of masking operations.
211211

212212
There are statistical functions of the count of data and NoData values per `tile` and aggregate over a `tile` column: @ref:[`rf_data_cells`](reference.md#rf-data-cells), @ref:[`rf_no_data_cells`](reference.md#rf-no-data-cells), @ref:[`rf_agg_data_cells`](reference.md#rf-agg-data-cells), and @ref:[`rf_agg_no_data_cells`](reference.md#rf-agg-no-data-cells).
213213

Lines changed: 135 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,135 @@
1+
# Masking
2+
3+
```python setup, echo=False
4+
import pyrasterframes
5+
from pyrasterframes.utils import create_rf_spark_session
6+
from pyrasterframes.rasterfunctions import *
7+
import pyrasterframes.rf_ipython
8+
from IPython.display import display
9+
import pandas as pd
10+
import numpy as np
11+
from pyrasterframes.rf_types import Tile
12+
13+
spark = create_rf_spark_session()
14+
```
15+
16+
Masking is a common operation in raster processing. It is setting certain cells to the @ref:[NoData value](nodata-handling.md). This is usually done to remove low-quality observations from the raster processing. Another related use case is to @ref:["clip"](masking.md#clipping) a raster to a given polygon.
17+
18+
## Masking Example
19+
20+
Let's demonstrate masking with a pair of bands of Sentinel-2 data. The measurement bands we will use, blue and green, have no defined NoData. They share quality information from a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
21+
22+
![Sentinel-2 Scene Classification Values](static/sentinel-2-scene-classification-labels.png)
23+
24+
Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
25+
26+
The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so all _tiles_ are aligned across rows.
27+
28+
```python, blue_scl_cat
29+
from pyspark.sql import Row
30+
31+
blue_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
32+
green_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B03.tif'
33+
scl_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/SCL.tif'
34+
cat = spark.createDataFrame([Row(blue=blue_uri, green=green_uri, scl=scl_uri),])
35+
unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'green', 'scl'])
36+
unmasked.printSchema()
37+
```
38+
39+
```python, show_cell_types
40+
unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
41+
```
42+
43+
## Define CellType for Masked Tile
44+
45+
Because there is not a NoData already defined for the blue band, we must choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value. We will construct a new `CellType` object to represent this.
46+
47+
```python, pick_nd
48+
blue_min = unmasked.agg(rf_agg_stats('blue').min.alias('blue_min'))
49+
print('Nonzero minimum value in the blue band:', blue_min.first())
50+
51+
blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
52+
masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
53+
masked_blue_ct.cell_type_name
54+
```
55+
56+
We next convert the blue band to this cell type.
57+
58+
```python, convert_blue
59+
converted = unmasked.select('scl', 'green', rf_convert_cell_type('blue', masked_blue_ct).alias('blue'))
60+
```
61+
62+
## Apply Mask from Quality Band
63+
64+
Now we set cells of our `blue` column to NoData for all locations where the `scl` tile is in our set of undesirable values. This is the actual _masking_ operation.
65+
66+
```python, apply_mask_blue
67+
from pyspark.sql.functions import lit
68+
69+
masked = converted.withColumn('blue_masked', rf_mask_by_values('blue', 'scl', [0, 1, 8, 9, 10]))
70+
masked
71+
```
72+
73+
We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.
74+
75+
```python, show_masked_counts
76+
masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum(rf_local_is_in('scl', [0, 1, 8, 9, 10])))
77+
```
78+
79+
It's also nice to view a sample. The white regions are areas of NoData.
80+
81+
```python, display_blu, caption='Blue band masked against selected SCL values'
82+
sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
83+
display(sample[0])
84+
```
85+
86+
And the original SCL data. The bright yellow is a cloudy region in the original image.
87+
88+
```python, display_scl, caption='SCL tile for above'
89+
display(sample[1])
90+
```
91+
92+
## Transferring Mask
93+
94+
We can now apply the same mask from the blue column to the green column. Note here we have supressed the step of explicitly checking what a "safe" NoData value for the green band should be.
95+
96+
```python, mask_green
97+
masked.withColumn('green_masked', rf_mask(rf_convert_cell_type('green', masked_blue_ct), 'blue_masked')) \
98+
.orderBy(-rf_no_data_cells('blue_masked'))
99+
```
100+
101+
## Clipping
102+
103+
Clipping is the use of a polygon to determine the areas to mask in a raster. Typically the areas inside a polygon are retained and the cells outside are set to NoData. Given a geometry column on our DataFrame, we have to carry out three basic steps. First we have to ensure the vector geometry is correctly projected to the same @ref:[CRS](concepts.md#coordinate-reference-system-crs) as the raster. We'll continue with our example creating a simple polygon. Buffering a point will create an approximate circle.
104+
105+
106+
```python, reproject_geom
107+
to_rasterize = masked.withColumn('geom_4326',
108+
st_bufferPoint(
109+
st_point(lit(-78.0783132), lit(38.3184340)),
110+
lit(15000))) \
111+
.withColumn('geom_native', st_reproject('geom_4326', rf_mk_crs('epsg:4326'), rf_crs('blue_masked')))
112+
```
113+
114+
Second, we will rasterize the geometry, or burn-in the geometry into the same grid as the raster.
115+
116+
```python, rasterize
117+
to_clip = to_rasterize.withColumn('clip_raster',
118+
rf_rasterize('geom_native', rf_geometry('blue_masked'), lit(1), rf_dimensions('blue_masked').cols, rf_dimensions('blue_masked').rows))
119+
120+
# visualize some of the edges of our circle
121+
to_clip.select('clip_raster', 'blue_masked') \
122+
.filter(rf_data_cells('clip_raster') > 20) \
123+
.orderBy(rf_data_cells('clip_raster'))
124+
```
125+
126+
Finally, we create a new _tile_ column with the blue band clipped to our circle. Again we will use the `rf_mask` function to pass the NoData regions along from the rasterized geometry.
127+
128+
clipped = to_clip.select('blue_masked',
129+
'clip_raster',
130+
rf_mask('blue_masked', 'clip_raster').alias('blue_clipped')) \
131+
.filter(rf_data_cells('clip_raster') > 20) \
132+
.orderBy(rf_data_cells('clip_raster'))
133+
134+
135+
This kind of clipping technique is further used in @ref:[zonal statistics](zonal-algebra.pymd).

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 1 addition & 80 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
## What is NoData?
44

5-
In raster operations, the preservation and correct processing of missing observations is very important. In [most DataFrames and in scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. However, a great deal of raster data is stored for space efficiency, which typically leads to use of integral values with a ["sentinel" value](https://en.wikipedia.org/wiki/Sentinel_value) designated to represent missing observations. This sentinel value varies across data products and is usually called the "NoData" value.
5+
In raster operations, the preservation and correct processing of missing observations is very important. In [most DataFrames and in scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. However, a great deal of raster data is stored for space efficiency, which typically leads to use of integral values with a ["sentinel" value](https://en.wikipedia.org/wiki/Sentinel_value) designated to represent missing observations. This sentinel value varies across data products and bands. In a generic sense, it is usually called the "NoData" value.
66

77
RasterFrames provides a variety of functions to inspect and manage NoData within _tiles_.
88

@@ -75,85 +75,6 @@ print(CellType.float32().no_data_value())
7575
print(CellType.float32().with_no_data_value(-99.9).no_data_value())
7676
```
7777

78-
## Masking
79-
80-
Let's continue the example above with Sentinel-2 data. Band 2 is blue and has no defined NoData. The quality information is in a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on this, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
81-
82-
![Sentinel-2 Scene Classification Values](static/sentinel-2-scene-classification-labels.png)
83-
84-
Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
85-
86-
The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so the blue band and SCL _tiles_ are aligned across rows.
87-
88-
```python, blue_scl_cat
89-
from pyspark.sql import Row
90-
91-
blue_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif'
92-
scl_uri = 'https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/SCL.tif'
93-
cat = spark.createDataFrame([Row(blue=blue_uri, scl=scl_uri),])
94-
unmasked = spark.read.raster(cat, catalog_col_names=['blue', 'scl'])
95-
unmasked.printSchema()
96-
```
97-
98-
```python, show_cell_types
99-
cell_types = unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct()
100-
cell_types
101-
```
102-
103-
Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create new _tile_ columns that are indicators of unwanted pixels, as defined above. Since the mask column is an integer type, the addition is equivalent to a logical or, so the boolean true values are 1.
104-
105-
```python, def_mask
106-
from pyspark.sql.functions import lit
107-
108-
mask = unmasked.withColumn('mask', rf_local_is_in('scl', [0, 1, 8, 9, 10]))
109-
110-
cell_types = mask.select(rf_cell_type('mask')).distinct()
111-
cell_types
112-
```
113-
114-
Because there is not a NoData already defined, we will choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value.
115-
116-
```python, pick_nd
117-
blue_min = mask.agg(rf_agg_stats('blue').min.alias('blue_min'))
118-
blue_min
119-
```
120-
121-
We can now construct the cell type string for our blue band's cell type, designating 0 as NoData.
122-
123-
```python, get_ct_string
124-
blue_ct = mask.select(rf_cell_type('blue')).distinct().first()[0][0]
125-
masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
126-
masked_blue_ct.cell_type_name
127-
```
128-
129-
Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) to designate the cloudy and other unwanted pixels as NoData in the blue column by converting the cell type and applying the mask.
130-
131-
```python, mask_blu
132-
with_nd = rf_convert_cell_type('blue', masked_blue_ct)
133-
masked = mask.withColumn('blue_masked',
134-
rf_mask_by_value(with_nd, 'mask', lit(1)))
135-
```
136-
137-
We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the boolean `mask` _tile_ to ensure our logic is correct.
138-
139-
```python, show_masked
140-
counts = masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask'))
141-
counts
142-
```
143-
144-
It's also nice to view a sample. The white regions are areas of NoData.
145-
146-
```python, display_blu, caption='Blue band masked against selected SCL values'
147-
sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
148-
display(sample[0])
149-
```
150-
151-
And the original SCL data. The bright yellow is a cloudy region in the original image.
152-
153-
```python, display_scl, caption='SCL tile for above'
154-
display(sample[1])
155-
```
156-
15778
## NoData and Local Arithmetic
15879

15980
Let's now explore how the presence of NoData affects @ref:[local map algebra](local-algebra.md) operations. To demonstrate the behavior, lets create two _tiles_. One _tile_ will have values of 0 and 1, and the other will have values of just 0.

pyrasterframes/src/main/python/docs/raster-processing.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44

55
* @ref:[Local Map Algebra](local-algebra.md)
66
* @ref:["NoData" Handling](nodata-handling.md)
7+
* @ref:[Masking](masking.md)
78
* @ref:[Zonal Map Algebra](zonal-algebra.md)
89
* @ref:[Aggregation](aggregation.md)
910
* @ref:[Time Series](time-series.md)

0 commit comments

Comments
 (0)