Commit 3426ac6

Merge pull request #230 from s22s/feature/nodata-add-doc
update nodata docs to show cell type conversion behavior and aggregations
2 parents 6bf41f2 + c19a2a0 commit 3426ac6

2 files changed

+168
-14
lines changed

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 158 additions & 11 deletions
@@ -15,6 +15,9 @@ import pyrasterframes
1515
from pyrasterframes.rasterfunctions import *
1616
import pyrasterframes.rf_ipython
1717
from IPython.display import display
18+
import pandas as pd
19+
import numpy as np
20+
from pyrasterframes.rf_types import Tile
1821

1922
spark = pyrasterframes.get_spark_session()
2023
```
@@ -130,31 +133,175 @@ We can verify that the number of NoData cells in the resulting `blue_masked` col
130133
masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask')).show(10)
131134
```
132135

133-
It's also nice to view a sample.
136+
It's also nice to view a sample. The white regions are areas of NoData.
134137

135-
```python show_masked
138+
```python, caption='Blue band masked against selected SCL values'
136139
sample = masked.orderBy(-rf_no_data_cells('blue_masked')).select(rf_tile('blue_masked'), rf_tile('scl')).first()
137140
display(sample[0])
138141
```
139142

140-
And the original SCL data.
143+
And the original SCL data. The bright yellow is a cloudy region in the original image.
141144

142-
```python show_scl
145+
```python, caption='SCL tile for above'
143146
display(sample[1])
144147
```
145148

146-
## NoData in Arithmetic Operations
149+
## NoData and Local Arithmetic
147150

148-
local algebra example; same celltype what happens to nodata
149-
Possibly use st_geomFromWkt and rf_rasterize to create something to work from
151+
Let's now explore how the presence of NoData affects @ref:[local map algebra](local-algebra.md) operations. To demonstrate the behavior, let's create two tiles. One tile will have values of 0 and 1, and the other will have values of just 0.
150152

151-
agg
153+
154+
```python
155+
tile_size = 100
156+
x = np.zeros((tile_size, tile_size), dtype='int16')
157+
x[:,tile_size//2:] = 1
158+
x = Tile(x)
159+
y = Tile(np.zeros((tile_size, tile_size), dtype='int16'))
160+
161+
rf = spark.createDataFrame([Row(x=x, y=y)])
162+
print('x')
163+
display(x)
164+
```
165+
166+
```python
167+
print('y')
168+
display(y)
169+
```
170+
171+
Now, let's create a new column from `x` with the value of 1 changed to NoData. Then, we will add this new column with NoData to the `y` column. As shown below, the result of the sum also has NoData (represented in white). In general for local algebra operations, Data + NoData = NoData.
172+
173+
```python
174+
masked_rf = rf.withColumn('x_nd', rf_mask_by_value('x', 'x', lit(1)) )
175+
masked_rf = masked_rf.withColumn('x_nd_y_sum', rf_local_add('x_nd', 'y'))
176+
row = masked_rf.collect()[0]
177+
print('x with NoData')
178+
display(row.x_nd)
179+
```
180+
181+
```python
182+
print('x with NoData plus y')
183+
display(row.x_nd_y_sum)
184+
```
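The rule Data + NoData = NoData can be sketched in plain Python without Spark. This is only a toy model of the behavior described above, not how RasterFrames implements it: `NODATA`, `mask_by_value`, and `local_add` are hypothetical names, and a tile is reduced to a flat list of cells.

```python
NODATA = None  # assumption: None stands in for a NoData cell in this toy model


def mask_by_value(cells, mask_value):
    """Replace every cell equal to mask_value with NoData."""
    return [NODATA if c == mask_value else c for c in cells]


def local_add(left, right):
    """Cell-wise sum; the result is NoData wherever either input is NoData."""
    return [
        NODATA if a is NODATA or b is NODATA else a + b
        for a, b in zip(left, right)
    ]


x = [0, 0, 1, 1]  # left half 0, right half 1, like the `x` tile
y = [0, 0, 0, 0]  # all zeros, like the `y` tile

x_nd = mask_by_value(x, 1)
x_nd_y_sum = local_add(x_nd, y)
print(x_nd_y_sum)  # [0, 0, None, None]
```

The cells that were masked in `x_nd` remain NoData in the sum, even though the corresponding cells in `y` hold valid data.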
185+
For more information about possible operations on Tile columns, see the @ref:[local map algebra](local-algebra.md) page and the @ref:[function reference](reference.md#local-map-algebra).
186+
187+
## Changing a Tile's NoData Values
188+
189+
One way to mask a tile is to make a new tile with a user-defined NoData value. We will explore this method below. First, let's create a DataFrame from a tile with values of 0, 1, 2, and 3. We will use NumPy to create a 100x100 tile with vertical bands containing these values.
190+
191+
```python create_dummy_tile, caption='Dummy Tile'
192+
tile_size = 100
193+
x = np.zeros((tile_size, tile_size), dtype='int16')
194+
195+
# setting the values of the columns
196+
for i in range(4):
197+
    x[:, i*tile_size//4:(i+1)*tile_size//4] = i
198+
x = Tile(x)
199+
200+
rf = spark.createDataFrame([Row(tile=x)])
201+
display(x)
202+
```
203+
204+
First, we mask the value of 1 by making a new column with the user-defined cell type 'uint16ud1'. Then, we mask out the value of 2 by making a tile with the cell type 'uint16ud2'.
205+
206+
```python
207+
def get_nodata_ct(nd_val):
208+
    return CellType('uint16').with_no_data_value(nd_val)
209+
210+
masked_rf = rf.withColumn('tile_nd_1',
211+
    rf_convert_cell_type('tile', get_nodata_ct(1))) \
212+
    .withColumn('tile_nd_2',
213+
    rf_convert_cell_type('tile_nd_1', get_nodata_ct(2)))
214+
```
215+
216+
```python
217+
collected = masked_rf.collect()
218+
```
219+
220+
Let's look at the new Tiles we created. The tile named `tile_nd_1` has the 1 values masked out as expected.
221+
222+
```python
223+
display(collected[0].tile_nd_1)
224+
```
225+
226+
And the tile named `tile_nd_2` has the values of 1 and 2 masked out. This is because we created it by setting a new user-defined NoData value on `tile_nd_1`; the values previously masked out in `tile_nd_1` stayed masked when creating `tile_nd_2`.
227+
228+
```python
229+
display(collected[0].tile_nd_2)
230+
```
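The cumulative masking described above can be modeled in a few lines of plain Python. This is a sketch of the documented behavior only, under the assumption that a conversion keeps existing NoData cells and additionally masks cells equal to the new NoData value; `convert_no_data` is a hypothetical helper, not a RasterFrames function.

```python
def convert_no_data(cells, nd_value):
    """Toy model of converting to a cell type with a user-defined NoData value.

    Cells that are already NoData (None) stay NoData, and cells equal to
    nd_value become NoData.
    """
    return [None if c is None or c == nd_value else c for c in cells]


tile = [0, 1, 2, 3]  # one cell per vertical band of the dummy tile

tile_nd_1 = convert_no_data(tile, 1)
tile_nd_2 = convert_no_data(tile_nd_1, 2)

print(tile_nd_1)  # [0, None, 2, 3]
print(tile_nd_2)  # [0, None, None, 3]
```

Converting `tile` directly with a NoData value of 2 would mask only the 2 band in this model; the 1 band is masked in `tile_nd_2` because the conversion started from `tile_nd_1`.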
152231

153232

154-
## Dealing with Multiple Cell Types
233+
## Combining Tiles with Different Data Types
155234

156-
Quick demo of one ND tile one raw tile
235+
RasterFrames supports having Tile columns with multiple cell types in a single DataFrame. It is important to understand how these different cell types interact.
157236

158-
Quick demo of ND in two different cell types
237+
Let's first create a RasterFrame that has columns of `float` and `int` cell types.
159238

239+
```python
240+
x = Tile((np.ones((100, 100))*2).astype('float'))
241+
y = Tile((np.ones((100, 100))*3.0).astype('int32'))
242+
rf = spark.createDataFrame([Row(x=x, y=y)])
243+
244+
rf.select(rf_cell_type('x'), rf_cell_type('y')).distinct().show()
245+
```
246+
247+
When performing a local operation between tile columns with cell types `int` and `float`, the resulting tile cell type will be `float`. In local algebra over two tiles of different "sized" cell types, the resulting cell type will be the larger of the two input tiles' cell types.
248+
249+
```python
250+
rf.select(
251+
rf_cell_type('x'),
252+
rf_cell_type('y'),
253+
rf_cell_type(rf_local_add('x', 'y').alias('xy_sum')),
254+
).show(1)
255+
```
256+
257+
Combining tile columns of different cell types gets a little trickier when user-defined NoData cell types are involved. Let's create two tile columns: one with a NoData value of 1, and one with a NoData value of 2.
258+
259+
```python
260+
x_nd_1 = Tile((np.ones((100, 100))*3), get_nodata_ct(1))
261+
x_nd_2 = Tile((np.ones((100, 100))*3), get_nodata_ct(2))
262+
rf_nd = spark.createDataFrame([Row(x_nd_1=x_nd_1, x_nd_2=x_nd_2)])
263+
```
264+
265+
Let's try adding the tile columns with different NoData values. When the NoData values of the two columns differ, the NoData value of the right-hand side of the sum is kept. In this case, the result has a NoData value of 1.
266+
267+
```python
268+
rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_2', 'x_nd_1'))
269+
rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct().show()
270+
```
160271

272+
Reversing the order of the sum changes the NoData value of the resulting column to 2.
273+
274+
```python
275+
rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_1', 'x_nd_2'))
276+
rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct().show()
277+
```
278+
279+
## NoData Values in Aggregation
280+
281+
Let's use the same tile as before to demonstrate how NoData values affect tile aggregations.
282+
283+
```python
284+
tile_size = 100
285+
x = np.zeros((tile_size, tile_size), dtype='int16')
286+
for i in range(4):
287+
    x[:, i*tile_size//4:(i+1)*tile_size//4] = i
288+
x = Tile(x)
289+
290+
rf = spark.createDataFrame([Row(tile=x)])
291+
display(x)
292+
```
293+
294+
First, we create the two new masked tile columns as before: one with only the value of 1 masked, and the other with the values of 1 and 2 masked.
295+
296+
```python
297+
masked_rf = rf.withColumn('tile_nd_1',
298+
rf_convert_cell_type('tile', get_nodata_ct(1))) \
299+
.withColumn('tile_nd_2',
300+
rf_convert_cell_type('tile_nd_1', get_nodata_ct(2)))
301+
```
302+
303+
The results of `rf_tile_sum` vary depending on which cells were masked. This is because any cells with NoData values are ignored in the aggregation. Note that `tile_nd_2` has the lowest sum, since it has the fewest data cells.
304+
305+
```python
306+
masked_rf.select(rf_tile_sum('tile'), rf_tile_sum('tile_nd_1'), rf_tile_sum('tile_nd_2')).show()
307+
```
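The arithmetic behind these sums can be checked by hand with plain Python. The sketch below rebuilds the 100x100 banded tile as a flat list and skips masked values when summing; `tile_sum` is a hypothetical helper that models "NoData cells are ignored" and does not call RasterFrames.

```python
tile_size = 100
band_width = tile_size // 4

# One row holds 25 cells each of 0, 1, 2 and 3; the tile repeats it 100 times.
row = [value for value in range(4) for _ in range(band_width)]
cells = row * tile_size


def tile_sum(cells, masked_values=()):
    """Sum the cells, ignoring any whose value is treated as NoData."""
    return sum(c for c in cells if c not in masked_values)


sum_tile = tile_sum(cells)          # nothing masked
sum_nd_1 = tile_sum(cells, {1})     # value 1 masked
sum_nd_2 = tile_sum(cells, {1, 2})  # values 1 and 2 masked

print(sum_tile, sum_nd_1, sum_nd_2)  # 15000 12500 7500
```

Each band contributes 2500 cells, so masking the 1 band removes 2500 from the total and masking the 2 band removes a further 5000.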

pyrasterframes/src/main/python/pyrasterframes/rasterfunctions.py

Lines changed: 10 additions & 3 deletions
@@ -36,9 +36,16 @@ def _context_call(name, *args):
3636
    return f(*args)
3737

3838

39-
def _parse_cell_type(cell_type_str):
40-
    """ Convert the string cell type to the expected CellType object."""
41-
    return _context_call('_parse_cell_type', cell_type_str)
39+
def _parse_cell_type(cell_type_arg):
40+
    """ Convert the cell type representation to the expected JVM CellType object."""
41+
42+
    def to_jvm(ct):
43+
        return _context_call('_parse_cell_type', ct)
44+
45+
    if isinstance(cell_type_arg, str):
46+
        return to_jvm(cell_type_arg)
47+
    elif isinstance(cell_type_arg, CellType):
48+
        return to_jvm(cell_type_arg.cell_type_name)
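As written in this hunk, the new `_parse_cell_type` has no branch for arguments that are neither a `str` nor a `CellType`, so it would implicitly return `None` for them. A minimal sketch of one way to make that case explicit follows. The `CellType` class here is a hypothetical stand-in for `pyrasterframes.rf_types.CellType`, `normalize_cell_type` is a hypothetical helper that is not part of this commit, and the JVM call is left out so the example runs on its own.

```python
class CellType:
    """Hypothetical stand-in exposing only the attribute used in the diff."""

    def __init__(self, cell_type_name):
        self.cell_type_name = cell_type_name


def normalize_cell_type(cell_type_arg):
    """Return the cell type name for a str or CellType; reject anything else."""
    if isinstance(cell_type_arg, str):
        return cell_type_arg
    if isinstance(cell_type_arg, CellType):
        return cell_type_arg.cell_type_name
    raise TypeError(
        "expected str or CellType, got {}".format(type(cell_type_arg).__name__)
    )


print(normalize_cell_type('uint16ud1'))           # uint16ud1
print(normalize_cell_type(CellType('uint16ud2')))  # uint16ud2
```

Whether raising is preferable to returning `None` depends on how callers use the result; this is a design suggestion, not a description of the library's behavior.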
4249

4350

4451
def rf_cell_types():
