
Commit edee965

Courtney Whalen authored and vpipkt committed
squashed changes
Signed-off-by: Courtney Whalen <[email protected]>
1 parent 867d309 commit edee965

10 files changed: +110 -107 lines changed

pyrasterframes/src/main/python/docs/aggregation.pymd

Lines changed: 3 additions & 3 deletions
@@ -28,19 +28,19 @@ SELECT 2 as id, rf_local_multiply(rf_make_ones_tile(5, 5, 'float32'), 3) as tile
 rf.select("id", rf_render_matrix("tile")).show(10, False)
 ```
 
-In this code block we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the tile aggregate mean of cells in each row of column `tile`. The mean of each tile is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
+In this code block, we are using the @ref:[`rf_tile_mean`](reference.md#rf-tile-mean) function to compute the tile aggregate mean of cells in each row of column `tile`. The mean of each tile is computed separately, so the first mean is 1.0 and the second mean is 3.0. Notice that the number of rows in the DataFrame is the same before and after the aggregation.
 
 ```python
 rf.select(F.col('id'), rf_tile_mean(F.col('tile'))).show(10, False)
 ```
 
-In this code block we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
+In this code block, we are using the @ref:[`rf_agg_mean`](reference.md#rf-agg-mean) function to compute the DataFrame aggregate, which averages 25 values of 1.0 and 25 values of 3.0, across the fifty cells in two rows. Note that only a single row is returned since the average is computed over the full DataFrame.
 
 ```python
 rf.agg(rf_agg_mean(F.col('tile'))).show(10, False)
 ```
 
-In this code block we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
+In this code block, we are using the @ref:[`rf_agg_local_mean`](reference.md#rf-agg-local-mean) function to compute the element-wise local aggregate mean across the two rows. In this example it is computing the mean of one value of 1.0 and one value of 3.0 to arrive at the element-wise mean, but doing so twenty-five times, one for each position in the `tile`.
 
 To compute an element-wise local aggregate, tiles need have the same dimensions as in the example below where both tiles have 5 rows and 5 columns. If we tried to compute an element-wise local aggregate over the DataFrame without equal tile dimensions, we would get a runtime error.
 
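For reference, a minimal sketch (not part of this commit) contrasting the three aggregation flavors described in this hunk; it assumes a SparkSession `spark` with RasterFrames enabled and the two-row `rf` DataFrame of 5x5 tiles built by the SQL shown in the hunk header:

```python
# Sketch only: tile-level, DataFrame-level, and element-wise (local) aggregate means.
# Assumes `rf` has an 'id' column and a 'tile' column of 5x5 tiles (all 1.0 and all 3.0).
import pyspark.sql.functions as F
from pyrasterframes.rasterfunctions import rf_tile_mean, rf_agg_mean, rf_agg_local_mean

rf.select(F.col('id'), rf_tile_mean('tile')).show()   # one row per tile: 1.0 and 3.0
rf.agg(rf_agg_mean('tile')).show()                    # one row for the whole DataFrame: 2.0
rf.agg(rf_agg_local_mean('tile')).show(1, False)      # one 5x5 tile of cell-wise means, all 2.0
```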

pyrasterframes/src/main/python/docs/getting-started.pymd

Lines changed: 9 additions & 7 deletions
@@ -28,7 +28,7 @@ from pyspark.sql.functions import lit
 # Read a MODIS surface reflectance granule
 df = spark.read.raster('https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF')
 
-# Add 3 element-wise, show some rows of the dataframe
+# Add 3 element-wise, show some rows of the DataFrame
 df.select(rf_local_add(df.proj_raster, lit(3))).show(5, False)
 ```
 
@@ -52,9 +52,7 @@ You can also use RasterFrames in the following environments:
 1. Install [docker](https://docs.docker.com/install/)
 1. Pull the image: `docker pull s22s/rasterframes-notebook`
 1. Run a container with the image, for example:
-
-docker run -p 8808:8888 -p 44040:4040 -v /path/to/notebooks:/home/notebooks rasterframes-notebook:latest
-
+`docker run -p 8808:8888 -p 44040:4040 -v /path/to/notebooks:/home/notebooks rasterframes-notebook:latest`
 1. In a browser, open `localhost:8808` in the example above.
 
 See [RasterFrames Notebook README](https://github.com/locationtech/rasterframes/blob/develop/rf-notebook/README.md) for instructions on building the Docker image for this Jupyter notebook server.
@@ -94,7 +92,11 @@ SparkSession available as 'spark'.
 
 Now you have the configured SparkSession with RasterFrames enabled.
 
-## Installing GDAL
+```python, echo=False
+spark.stop()
+```
+
+## Installing GDAL
 
 GDAL provides a wide variety of drivers to read data from many different raster formats. If GDAL is installed in the environment, RasterFrames will be able to @ref:[read](raster-read.md) those formats. If you are using the @ref:[Jupyter Notebook image](getting-started.md#jupyter-notebook), GDAL is already installed for you. Otherwise follow the instructions below. Version 2.4.1 or greater is required.
 
@@ -111,7 +113,7 @@ brew install gdal
 Using [`apt-get`](https://wiki.debian.org/Apt):
 
 ```bash
-sudo apt-get update
+sudo apt-get update
 sudo apt-get install gdal-bin
 ```
 
@@ -133,4 +135,4 @@ from pyrasterframes.utils import gdal_version
 print(gdal_version())
 ```
 
-This will print out something like "GDAL x.y.z, released 20yy/mm/dd". If it reports `not available`, then GDAL isn't installed in a place where the RasterFrames runtime was able to find it. Please [file an issue](https://github.com/locationtech/rasterframes/issues) to get help resolving it.
+This will print out something like "GDAL x.y.z, released 20yy/mm/dd". If it reports `not available`, then GDAL isn't installed in a place where the RasterFrames runtime was able to find it. Please [file an issue](https://github.com/locationtech/rasterframes/issues) to get help resolving it.
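
As an illustrative aside (not part of this commit), the snippets above combine into a quick smoke test; it assumes a running SparkSession `spark` that already has RasterFrames enabled:

```python
# Sketch only: check GDAL visibility, then repeat the element-wise add from the first hunk.
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import rf_local_add
from pyrasterframes.utils import gdal_version

print(gdal_version())  # e.g. "GDAL x.y.z, released 20yy/mm/dd", or "not available"

uri = 'https://modis-pds.s3.amazonaws.com/MCD43A4.006/11/08/2019059/MCD43A4.A2019059.h11v08.006.2019072203257_B02.TIF'
df = spark.read.raster(uri)
df.select(rf_local_add(df.proj_raster, lit(3))).show(5, False)
```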

pyrasterframes/src/main/python/docs/local-algebra.pymd

Lines changed: 2 additions & 2 deletions
@@ -51,7 +51,7 @@ RasterFrames provides a wide variety of local map algebra functions. There are s
 * A function on a Tile and a scalar is a binary operation; example: @ref:[rf_local_less](reference.md#rf-local-less); or
 * A function on many Tiles is a n-ary operation; example: @ref:[rf_agg_local_min](reference.md#rf-agg-local-min)
 
-We can express the normalized difference with a combination of `rf_local_divide`, `rf_local_subtract`, and `rf_local_add`. Since the normalized difference is so common there is a convenience method `rf_normalized_difference` which we use in this example. We will append a new column to the DataFrame, which will apply the map alegbra function to each row.
+We can express the normalized difference with a combination of `rf_local_divide`, `rf_local_subtract`, and `rf_local_add`. Since the normalized difference is so common, there is a convenience method `rf_normalized_difference`, which we use in this example. We will append a new column to the DataFrame, which will apply the map alegbra function to each row.
 
 ```python
 df = df.withColumn('ndvi', rf_normalized_difference(df.nir, df.red))
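
To make the relationship to the local primitives concrete, here is a hypothetical sketch of the same computation written out with `rf_local_subtract`, `rf_local_add`, and `rf_local_divide` instead of the convenience function. The column name `ndvi_manual` is made up, and depending on the bands' cell type a conversion to a floating-point cell type may be needed first so the division is not integer division:

```python
# Sketch only: NDVI = (NIR - Red) / (NIR + Red) composed from local primitives.
# Assumes `df` has `nir` and `red` tile columns as in the surrounding docs.
from pyrasterframes.rasterfunctions import (
    rf_convert_cell_type, rf_local_add, rf_local_subtract, rf_local_divide)

nir = rf_convert_cell_type(df.nir, 'float32')  # avoid integer division
red = rf_convert_cell_type(df.red, 'float32')
df = df.withColumn('ndvi_manual',
                   rf_local_divide(rf_local_subtract(nir, red),
                                   rf_local_add(nir, red)))
```
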
@@ -70,4 +70,4 @@ We continue examining NDVI in the @ref:[time series](time-series.md) section.
 
 ```python, echo=False
 spark.stop()
-```
+```

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 32 additions & 28 deletions
@@ -2,9 +2,9 @@
 
 ## What is NoData?
 
-In raster operations, the preservation and correct processing of missing observations is very important. In [most dataframes and scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. A great deal of raster data is stored for space efficiency. This typically leads to use of integral values and a "sentinel" value to represent missing observations. This sentinel value varies across data products and is usually called the "NoData" value.
+In raster operations, the preservation and correct processing of missing observations is very important. In [most DataFrames and scientific computing](https://www.oreilly.com/learning/handling-missing-data), the idea of missing data is expressed as a `null` or `NaN` value. A great deal of raster data is stored for space efficiency. This typically leads to use of integral values and a "sentinel" value to represent missing observations. This sentinel value varies across data products and is usually called the "NoData" value.
 
-RasterFrames provides a variety of functions to inspect and manage NoData within `tile`s.
+RasterFrames provides a variety of functions to inspect and manage NoData within `tile`s.
 
 ## Cell Types
 
@@ -40,7 +40,7 @@ spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif
 
 ### Understanding Cell Types and NoData
 
-Use the methods on the `CellType` class to learn more about a specific cell type. Take for example the cell type of our sample data above.
+We can use the methods on the `CellType` class to learn more about a specific cell type. Let's consider the cell type of our sample data above.
 
 ```python
 ct = CellType('uint16raw')
@@ -55,13 +55,13 @@ ct = CellType('uint16')
 ct, ct.is_floating_point(), ct.has_no_data(), ct.no_data_value()
 ```
 
-In this case, the minimum value of 0 is designated as the NoData value. For integral valued cell types, the NoData is typically zero, the maximum, or the minimum value for the underlying data type. The NoData value can also be a user-defined value. In that case the value is designated with a `ud`.
+In this case, the minimum value of 0 is designated as the NoData value. For integral-valued cell types, the NoData is typically zero, the maximum, or the minimum value for the underlying data type. The NoData value can also be a user-defined value. In that case the value is designated with a `ud`.
 
 ```python
 CellType.uint16().with_no_data_value(99).cell_type_name
 ```
 
-Floating point types by default have `NaN` as the NoData value. However a user-defined NoData can be set.
+Floating point types have `NaN` as the NoData value by default. However, a user-defined NoData can be set.
 
 ```python float_ud
 print(CellType.float32().no_data_value())
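
A small, hypothetical round-trip of the naming rules described above, using only the `CellType` methods that appear in this file (the import path `pyrasterframes.rf_types` is assumed):

```python
# Sketch only: how NoData shows up in cell type names.
from pyrasterframes.rf_types import CellType

print(CellType('uint16raw').has_no_data())    # False: no NoData defined
print(CellType('uint16').no_data_value())     # 0, the default NoData for uint16
print(CellType.uint16().with_no_data_value(99).cell_type_name)   # user-defined: 'uint16ud99'
print(CellType.float32().with_no_data_value(-99.9).cell_type_name)
```
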
@@ -70,9 +70,13 @@ print(CellType.float32().with_no_data_value(-99.9).no_data_value())
 
 ## Masking
 
-Let's continue the example above with Sentinel-2 data. Band 2 is blue and has no defined NoData. The quality information is in a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For much more information on that, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
+Let's continue the example above with Sentinel-2 data. Band 2 is blue and has no defined NoData. The quality information is in a separate file called the scene classification (SCL), which delineates areas of missing data and probable clouds. For more information on that, see the [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm). Figure 3 tells us how to interpret the scene classification. For this example, we will exclude NoData, defective pixels, probable clouds, and cirrus clouds: values 0, 1, 8, 9, and 10.
 
-The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog and now the blue band and SCL tiles are aligned across rows.
+![Sentinel-2 Scene Classification Values](static/sentinel-2-scene-classification-labels.png)
+
+Credit: [Sentinel-2 algorithm overview](https://earth.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-2a/algorithm)
+
+The first step is to create a catalog with our band of interest and the SCL band. We read the data from the catalog, so the blue band and SCL tiles are aligned across rows.
 
 ```python blue_scl_cat
 from pyspark.sql import Row
@@ -85,7 +89,7 @@ unmasked.printSchema()
 unmasked.select(rf_cell_type('blue'), rf_cell_type('scl')).distinct().show()
 ```
 
-Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create a new tile column containing our indicator of unwanted pixels, as defined above.
+Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create new tile columns that are indicators of unwanted pixels, as defined above. Since the mask column is bit type, the addition is equivalent to a logical or, so the true values are 1.
 
 ```python def_mask
 from pyspark.sql.functions import lit
@@ -94,7 +98,7 @@ mask_part = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0))) \
     .withColumn('defect', rf_local_equal('scl', lit(1))) \
     .withColumn('cloud8', rf_local_equal('scl', lit(8))) \
     .withColumn('cloud9', rf_local_equal('scl', lit(9))) \
-    .withColumn('cirrus', rf_local_equal('scl', lit(10)))
+    .withColumn('cirrus', rf_local_equal('scl', lit(10)))
 
 one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
     .withColumn('mask', rf_local_add('mask', 'cloud8')) \
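
An equivalent, more compact way to build the same indicator mask is to fold the addition over the unwanted SCL values. This is a hypothetical sketch (the names `unwanted` and `scl_mask` are made up) that relies on the same bit-type "addition as logical or" behavior noted above and assumes the `unmasked` DataFrame from the hunk above:

```python
# Sketch only: fold rf_local_equal indicators for the unwanted SCL values
# (0, 1, 8, 9, 10) into a single mask column with rf_local_add.
from functools import reduce
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import rf_cell_type, rf_local_add, rf_local_equal

unwanted = [0, 1, 8, 9, 10]
scl_mask = reduce(rf_local_add, [rf_local_equal('scl', lit(v)) for v in unwanted])
one_mask = unmasked.withColumn('mask', scl_mask)
one_mask.select(rf_cell_type('mask')).distinct().show()
```
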
@@ -104,30 +108,30 @@ one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
 one_mask.select(rf_cell_type('mask')).distinct().show()
 ```
 
-Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) to designate the cloudy and other unwanted pixels as NoData in the blue column. Because there is not a NoData already defined, we will choose one. Note that in this particular example the minimum value is greater than zero, so we can use 0 as the NoData value.
+Because there is not a NoData already defined, we will choose one. In this particular example, the minimum value is greater than zero, so we can use 0 as the NoData value.
 
 ```python pick_nd
 one_mask.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
 ```
 
-We can now construct the cell type string for our blue band's cell type, but designating 0 as NoData.
+We can now construct the cell type string for our blue band's cell type, designating 0 as NoData.
 
 ```python get_ct_string
 blue_ct = one_mask.select(rf_cell_type('blue')).distinct().first()[0][0]
 masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
 masked_blue_ct.cell_type_name
 ```
 
-Convert the cell type and apply the mask. Since the mask column is bit type, the addition done above was equivalent to a logical or. So the true values are 1.
+Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) to designate the cloudy and other unwanted pixels as NoData in the blue column by converting the cell type and applying the mask.
 
-```python mask_blu
-with_nd = rf_convert_cell_type('blue', masked_blue_ct.cell_type_name)
-masked = one_mask.withColumn('blue_masked',
+```python mask_blu
+with_nd = rf_convert_cell_type('blue', masked_blue_ct)
+masked = one_mask.withColumn('blue_masked',
                              rf_mask_by_value(with_nd, 'mask', lit(1))) \
     .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
 ```
 
-We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the bit-type `mask` tile.
+We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the bit-type `mask` tile to ensure our logic is correct.
 
 ```python
 masked.select(rf_no_data_cells('blue_masked'), rf_tile_sum('mask')).show(10)
@@ -148,7 +152,7 @@ display(sample[1])
 
 ## NoData and Local Arithmatic
 
-Let's now explore how the presence of NoData affects @ref:[local map algebra](local-algebra.md) operations. To demonstrate the behaviour, lets create two tiles. One tile will have values of 0 and 1, and the other will have values of just 0.
+Let's now explore how the presence of NoData affects @ref:[local map algebra](local-algebra.md) operations. To demonstrate the behaviour, lets create two tiles. One tile will have values of 0 and 1, and the other will have values of just 0.
 
 
 ```python
@@ -168,7 +172,7 @@ print('y')
 display(y)
 ```
 
-Now, let's create a new column from `x` with the value of 1 changed to NoData. Then, we will add this new column with NoData to the `y` column. As shown below, the result of the sum also has NoData (represented in white). In general for local algebra operations, Data + NoData = NoData.
+Now, let's create a new column from `x` with the value of 1 changed to NoData. Then, we will add this new column with NoData to the `y` column. As shown below, the result of the sum also has NoData (represented in white). In general for local algebra operations, Data + NoData = NoData.
 
 ```python
 masked_rf = rf.withColumn('x_nd', rf_mask_by_value('x', 'x', lit(1)) )
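
For a self-contained illustration of the "Data + NoData = NoData" rule (not part of this commit; the tile values and names below are made up), a tiny DataFrame is enough. It assumes a SparkSession `spark` with RasterFrames enabled and the `Tile` class from `pyrasterframes.rf_types`:

```python
# Sketch only: masking a cell turns it to NoData, and NoData propagates through local addition.
import numpy as np
from pyspark.sql import Row
from pyspark.sql.functions import lit
from pyrasterframes.rf_types import Tile
from pyrasterframes.rasterfunctions import rf_local_add, rf_mask_by_value, rf_no_data_cells

x = Tile(np.array([[0.0, 1.0], [1.0, 0.0]]))  # float cells, so NaN serves as NoData
y = Tile(np.zeros((2, 2)))
tiny = spark.createDataFrame([Row(x=x, y=y)])

summed = tiny.withColumn('x_nd', rf_mask_by_value('x', 'x', lit(1))) \
             .withColumn('x_nd_plus_y', rf_local_add('x_nd', 'y'))
summed.select(rf_no_data_cells('x_nd_plus_y')).show()  # expect 2: the cells that were 1 in x
```
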
@@ -207,7 +211,7 @@ First, we mask the value of 1 by making a new column with the user defined cell
 def get_nodata_ct(nd_val):
     return CellType('uint16').with_no_data_value(nd_val)
 
-masked_rf = rf.withColumn('tile_nd_1',
+masked_rf = rf.withColumn('tile_nd_1',
                           rf_convert_cell_type('tile', get_nodata_ct(1))) \
     .withColumn('tile_nd_2',
                 rf_convert_cell_type('tile_nd_1', get_nodata_ct(2))) \
@@ -217,7 +221,7 @@ masked_rf = rf.withColumn('tile_nd_1',
 collected = masked_rf.collect()
 ```
 
-Let's look at the new Tiles we created. The tile named `tile_nd_1` has the 1 values masked out as expected.
+Let's look at the new Tiles we created. The tile named `tile_nd_1` has the 1 values masked out as expected.
 
 ```python
 display(collected[0].tile_nd_1)
@@ -232,9 +236,9 @@ display(collected[0].tile_nd_2)
 
 ## Combining Tiles with Different Data Types
 
-RasterFrames supports having Tile columns with multiple cell types in a single DataFrame. It is important to understand how these different cell types interact.
+RasterFrames supports having Tile columns with multiple cell types in a single DataFrame. It is important to understand how these different cell types interact.
 
-Let's first create a RasterFrame that has columns of `float` and `int` cell type.
+Let's first create a RasterFrame that has columns of `float` and `int` cell type.
 
 ```python
 x = Tile((np.ones((100, 100))*2).astype('float'))
@@ -248,9 +252,9 @@ When performing a local operation between tile columns with cell types `int` and
 
 ```python
 rf.select(
-    rf_cell_type('x'),
+    rf_cell_type('x'),
     rf_cell_type('y'),
-    rf_cell_type(rf_local_add('x', 'y').alias('xy_sum')),
+    rf_cell_type(rf_local_add('x', 'y').alias('xy_sum')),
 ).show(1)
 ```
 
@@ -262,14 +266,14 @@ x_nd_2 = Tile((np.ones((100, 100))*3), get_nodata_ct(2))
 rf_nd = spark.createDataFrame([Row(x_nd_1=x_nd_1, x_nd_2=x_nd_2)])
 ```
 
-Let's try adding the tile columns with different NoData values. When there is an inconsistent NoData value in the two columns, the NoData value of the right-hand side of the sum is kept. In this case, this means the result has a NoData value of 1.
+Let's try adding the tile columns with different NoData values. When there is an inconsistent NoData value in the two columns, the NoData value of the right-hand side of the sum is kept. In this case, this means the result has a NoData value of 1.
 
 ```python
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_2', 'x_nd_1'))
 rf_nd_sum.select(rf_cell_type('x_nd_sum')).distinct().show()
 ```
 
-Reversing the order of the sum changes the NoData value of the resulting column to 2.
+Reversing the order of the sum changes the NoData value of the resulting column to 2.
 
 ```python
 rf_nd_sum = rf_nd.withColumn('x_nd_sum', rf_local_add('x_nd_1', 'x_nd_2'))
@@ -291,10 +295,10 @@ rf = spark.createDataFrame([Row(tile=x)])
 display(x)
 ```
 
-First we create the two new masked tile columns as before. One with only the value of 1 masked, and the other with and values of 1 and 2 masked.
+First we create the two new masked tile columns as before. One with only the value of 1 masked, and the other with and values of 1 and 2 masked.
 
 ```python
-masked_rf = rf.withColumn('tile_nd_1',
+masked_rf = rf.withColumn('tile_nd_1',
                           rf_convert_cell_type('tile', get_nodata_ct(1))) \
     .withColumn('tile_nd_2',
                 rf_convert_cell_type('tile_nd_1', get_nodata_ct(2)))
