
Commit a3038ab

Merge branch 'develop' into feature/travis-set-python3
* develop:
  Docs on raster write: samples, parquet
  Move numpy-pandas doc to pymd format; initial content
  Small docs tweaks lingering from PR 187
  raster reader can take a Pandas DataFrame or geopandas GeoDF
  change link on line 61 to go to nodata-handling.md
  fix tile stats and histogram functions; promote headings up 1 level; remove rf_cell_types convert function
  ref code chunks to running code
2 parents fca70aa + d2be387

10 files changed, +527 -314 lines

pyrasterframes/src/main/python/docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ The source code can be found on GitHub at [locationtech/rasterframes](https://gi
 * [Raster Data I/O](raster-io.md)
 * [Vector Data](vector-data.md)
 * [Raster Processing](raster-processing.md)
-* [Pandas, NumPy & RasterFrames](pandas-numpy.md)
+* [NumPy, Pandas, & RasterFrames](numpy-pandas.md)
 * [UDF Reference](reference.md)
 * [Release Notes](release-notes.md)

pyrasterframes/src/main/python/docs/nodata-handling.pymd

Lines changed: 27 additions & 25 deletions
@@ -10,8 +10,6 @@ RasterFrames provides a variety of functions to inspect and manage NoData within
 
 To understand how NoData is handled in RasterFrames, we first need to understand the different underlying types of data called cell types. The cell types are GeoTrellis `CellType`s, so the [GeoTrellis documentation](https://geotrellis.readthedocs.io/en/latest/guide/core-concepts.html?#working-with-cell-values) is a valuable resource on how these are defined.
 
-Use the function `rf_cell_type` to find the cell type of a specific set of raster data.
-
 ```python setup, echo=False
 import pyrasterframes
 from pyrasterframes.rasterfunctions import *
@@ -21,23 +19,27 @@ from IPython.display import display
 spark = pyrasterframes.get_spark_session()
 ```
 
-```python ct_from_sen
-spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
-    .select(rf_cell_type('proj_raster')).distinct().show()
+The `CellType` class from the `rf_types` submodule allows us to create a representation of any valid cell type. There are convenience methods to create instances for a variety of basic types.
+
+```python celltype_ctors
+from pyrasterframes.rf_types import CellType
+import inspect
+
+[c[0] for c in inspect.getmembers(CellType, inspect.ismethod)]
 ```
 
-The function `rf_cell_types` provides a convenient list of basic cell types. Note that this list is not exhaustive.
+We can also inspect the cell type of a given _tile_ or `proj_raster` column.
 
-```python
-rf_cell_types()
+```python ct_from_sen
+spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif') \
+    .select(rf_cell_type('proj_raster')).distinct().show()
 ```
 
 ### Understanding Cell Types and NoData
 
-Use the `CellType` class to learn more about a specific cell type.
+Use the methods on the `CellType` class to learn more about a specific cell type. Take for example the cell type of our sample data above.
 
 ```python
-from pyrasterframes.rf_types import CellType
 ct = CellType('uint16raw')
 ct, ct.is_floating_point(), ct.has_no_data()
 ```
@@ -85,30 +87,30 @@ Drawing on @ref:[local map algebra](local-algebra.md) techniques, we will create
 ```python def_mask
 from pyspark.sql.functions import lit
 
-unmasked = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0)))
-unmasked = unmasked.withColumn('defect', rf_local_equal('scl', lit(1)))
-unmasked = unmasked.withColumn('cloud8', rf_local_equal('scl', lit(8)))
-unmasked = unmasked.withColumn('cloud9', rf_local_equal('scl', lit(9)))
-unmasked = unmasked.withColumn('cirrus', rf_local_equal('scl', lit(10)))
+mask_part = unmasked.withColumn('nodata', rf_local_equal('scl', lit(0))) \
+    .withColumn('defect', rf_local_equal('scl', lit(1))) \
+    .withColumn('cloud8', rf_local_equal('scl', lit(8))) \
+    .withColumn('cloud9', rf_local_equal('scl', lit(9))) \
+    .withColumn('cirrus', rf_local_equal('scl', lit(10)))
 
-unmasked = unmasked.withColumn('mask', rf_local_add('nodata', 'defect'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cloud8'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cloud9'))
-unmasked = unmasked.withColumn('mask', rf_local_add('mask', 'cirrus'))
+one_mask = mask_part.withColumn('mask', rf_local_add('nodata', 'defect')) \
+    .withColumn('mask', rf_local_add('mask', 'cloud8')) \
+    .withColumn('mask', rf_local_add('mask', 'cloud9')) \
+    .withColumn('mask', rf_local_add('mask', 'cirrus'))
 
-unmasked.select(rf_cell_type('mask')).distinct().show()
+one_mask.select(rf_cell_type('mask')).distinct().show()
 ```
 
 Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) function to designate the cloudy and other unwanted pixels as NoData in the blue column. Because there is no NoData value already defined, we will choose one. Note that in this particular example the minimum value is greater than zero, so we can use 0 as the NoData value.
 
 ```python pick_nd
-unmasked.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
+one_mask.agg(rf_agg_stats('blue').min.alias('blue_min')).show()
 ```
 
 We can now construct the cell type string for our blue band, designating 0 as NoData.
 
 ```python get_ct_string
-blue_ct = unmasked.select(rf_cell_type('blue')).distinct().first()[0][0]
+blue_ct = one_mask.select(rf_cell_type('blue')).distinct().first()[0][0]
 masked_blue_ct = CellType(blue_ct).with_no_data_value(0)
 masked_blue_ct.cell_type_name
 ```
@@ -117,9 +119,9 @@ Convert the cell type and apply the mask. Since the mask column is bit type, the
 
 ```python mask_blu
 with_nd = rf_convert_cell_type('blue', masked_blue_ct.cell_type_name)
-masked = unmasked.withColumn('blue_masked',
-                             rf_mask_by_value(with_nd, 'mask', lit(1))) \
-    .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
+masked = one_mask.withColumn('blue_masked',
+                             rf_mask_by_value(with_nd, 'mask', lit(1))) \
+    .drop('nodata', 'defect', 'cloud8', 'cloud9', 'cirrus', 'blue')
 ```
 
 We can verify that the number of NoData cells in the resulting `blue_masked` column matches the total of the bit-type `mask` tile.
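
A hedged sketch of one such check, assuming the `masked` DataFrame above and the aggregate functions `rf_agg_no_data_cells` and `rf_tile_sum` from `pyrasterframes.rasterfunctions`:

```python
from pyspark.sql import functions as F

# The NoData count in the masked band should equal the number of
# 1-valued cells in the bit-type mask.
masked.agg(
    rf_agg_no_data_cells('blue_masked').alias('blue_masked_nodata'),
    F.sum(rf_tile_sum('mask')).alias('mask_total')
).show()
```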
pyrasterframes/src/main/python/docs/numpy-pandas.pymd

Lines changed: 119 additions & 0 deletions
@@ -0,0 +1,119 @@
# NumPy and Pandas Interoperability

In the Python Spark API, the work of distributed computing over the DataFrame is done on many executors (the Spark term for workers) inside Java virtual machines (JVMs). Most calls to `pyspark` are passed to a Java process via the `py4j` library. The user can also ask for data inside the JVM to be brought over to the Python driver (the Spark term for the client application). When dealing with _tiles_, the driver will receive this data as a lightweight wrapper object around a NumPy `ndarray`. It is also possible to write lambda functions against NumPy arrays and evaluate them in the Spark DataFrame.
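
As a minimal sketch of that round trip, assuming `spark` is a RasterFrames-enabled session and using a sample COG from the RasterFrames test data:

```python
from pyrasterframes.rasterfunctions import rf_tile

df = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
# .first() is an action: one row crosses from the JVM to the Python driver.
row = df.select(rf_tile('proj_raster').alias('tile')).first()
type(row['tile'])  # pyrasterframes.rf_types.Tile, wrapping a NumPy ndarray
```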

## Performance Considerations

When working with large, distributed datasets in Spark, care is required when invoking _actions_ on the data. In general, _transformations_ are lazily evaluated in Spark, meaning the code runs quickly and doesn't move any data around. But _actions_ cause the evaluation to happen: all the lazily planned _transformations_ are computed, and data is processed and moved around. In general, if a [`pyspark` function](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html) returns a DataFrame, it is probably a _transformation_; otherwise, it is an _action_.
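
A rough sketch of the distinction, reusing the `df` from the sketch above:

```python
lazy = df.select(rf_tile('proj_raster').alias('tile'))  # transformation: returns a DataFrame, no data moves
lazy.count()  # action: triggers evaluation and returns a result to the driver
```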

When many _actions_ are invoked, a lot of data can flow from executors to the driver. In `pyspark`, the data then has to move from the driver JVM to the Python process running the driver. When that happens, any _tiles_ in the data will be converted to a Python [`Tile`](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/pyrasterframes/rf_types.py) object. In practical work with Earth observation data, the _tiles_ are frequently 256 by 256 arrays, which may be 100 kB or more each. While each tile is individually small, a DataFrame can easily have dozens of such tile columns and millions of rows.

All of this reinforces an important principle for working with Spark: understand the cost of an _action_, and use @ref:[aggregates](aggregation.md), summaries, or samples to manage that cost.
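
For instance, a distributed aggregate returns a single small summary row to the driver rather than the tiles themselves (a sketch, assuming the `lazy` DataFrame above with its `tile` column):

```python
from pyrasterframes.rasterfunctions import rf_agg_stats

# Statistics are computed on the executors; only the summary crosses over.
lazy.agg(rf_agg_stats('tile').alias('stats')).show()
```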

## The `Tile` Class

In Python, _tiles_ are represented with the `rf_types.Tile` class. This is a NumPy `ndarray` with two dimensions, along with some additional metadata allowing correct conversion to the GeoTrellis @ref:[cell type](nodata-handling.md#cell-types).

```python tile_intro
from pyrasterframes.rf_types import Tile
import numpy as np

t = Tile(np.random.randn(4, 4))
print(str(t))
```

You can access the NumPy array with the `cells` member of `Tile`.

```python tile_cells
t.cells.shape, t.cells.nbytes
```

## DataFrame `toPandas`

As discussed in the @ref:[raster writing chapter](raster-write.md#dataframe-samples), a pretty display of a Pandas DataFrame containing _tiles_ is available by importing the `rf_ipython` submodule. In addition, as discussed in the @ref:[vector data chapter](vector-data.md), any geometry type in the Spark DataFrame will be converted into a Shapely geometry. Taken together, we can easily get the spatial information and raster data as a NumPy array, all within a Pandas DataFrame.

```python spark_session, echo=False
import pyrasterframes
from pyrasterframes.rasterfunctions import *
from IPython.display import display
spark = pyrasterframes.get_spark_session()
```

```python toPandas
import pyrasterframes.rf_ipython
from pyspark.sql.functions import lit, col

cat = spark.read.format('aws-pds-modis-catalog').load() \
    .filter(
        (col('granule_id') == 'h11v04') &
        (col('acquisition_date') > lit('2018-02-19')) &
        (col('acquisition_date') < lit('2018-02-22'))
    )

spark_df = spark.read.raster(catalog=cat, catalog_col_names=['B01']) \
    .select(
        'acquisition_date',
        'granule_id',
        rf_tile('B01').alias('tile'),
        rf_geometry('B01').alias('tile_geom')
    )

pandas_df = spark_df.limit(10).toPandas()
pandas_df.iloc[0].apply(lambda v: type(v))
```

## User Defined Functions

As we demonstrated with @ref:[vector data](vector-data.md#shapely-geometry-support), we can also make use of the `Tile` type to create [user-defined functions (UDFs)](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.functions.udf) that can take a _tile_ as input, return a _tile_ as output, or both. Here is a trivial and **inefficient** example of doing both. A serious performance implication of user-defined functions in Python is that all the executors must move the Java objects to Python, evaluate the function, and then move the Python objects back to Java. Use the many @ref:[built-in functions](reference.md) wherever possible, and ask the [community](https://gitter.im/s22s/raster-frames) if you have an idea for a function that should be included.

We will demonstrate creating a UDF that is logically equivalent to a built-in function. We'll quickly show that the resulting _tiles_ are approximately equivalent. The reason they are not exactly the same is that one is computed in Python and the other is computed in Java.
```python udf
from pyrasterframes.rf_types import TileUDT
from pyspark.sql.functions import udf

@udf(TileUDT())
def my_udf(t):
    # Import inside the function so NumPy is available on the executors.
    import numpy as np
    return Tile(np.log1p(t.cells))

udf_df = spark_df.limit(1).select(
    my_udf('tile').alias('udf_result'),
    rf_log1p('tile').alias('built_in_result')
).toPandas()

row = udf_df.iloc[0]
diff = row['udf_result'] - row['built_in_result']
print(type(diff))
np.abs(diff.cells).max()
```

We can also inspect an image of the difference between the two _tiles_; it is just random noise. Both tiles have the same structure of NoData, shown as the white areas.

```python udf_diff_noise_tile
display(diff)
```

## Creating a Spark DataFrame

You can also create a Spark DataFrame with a column full of `Tile` objects or Shapely geometry objects.

The example below will create a Pandas DataFrame with ten rows of noise tiles and random Points. We will then create a Spark DataFrame from it.
```python create_spark_df
import pandas as pd
from shapely.geometry import Point

pandas_df = pd.DataFrame([
    {
        'tile': Tile(np.random.randn(100, 100)),
        'geom': Point(-90 + 90 * np.random.random((2, 1)))
    } for _ in range(10)
])

spark_df = spark.createDataFrame(pandas_df)

spark_df.printSchema()
spark_df.count()
```

pyrasterframes/src/main/python/docs/pandas-numpy.md

Lines changed: 0 additions & 10 deletions
This file was deleted.

pyrasterframes/src/main/python/docs/raster-io.md

Lines changed: 3 additions & 4 deletions
@@ -2,19 +2,18 @@
 
 The standard mechanism by which any data is brought in and out of a Spark DataFrame is the [Spark SQL DataSource][DS]. RasterFrames provides specialized DataSources for geospatial raster data and maintains compatibility with existing general purpose DataSources, such as Parquet.
 
-Three types of DataSources will be introduced:
-
 * @ref:[Catalog Readers](raster-catalogs.md)
     - `aws-pds-l8-catalog`: built-in catalog over [Landsat on AWS][Landsat]
     - `aws-pds-modis-catalog`: built-in catalog over [MODIS on AWS][MODIS]
     - `geotrellis-catalog`: for enumerating [GeoTrellis layers][GTLayer]
 * @ref:[Raster Readers](raster-read.md)
-    - `raster`: the standard reader for most raster data
+    - `raster`: the standard reader for most raster data, including single raster files or catalogs
     - `geotiff`: a simplified reader for reading a single GeoTIFF file
     - `geotrellis`: for reading a [GeoTrellis layer][GTLayer]
 * @ref:[Raster Writers](raster-write.md)
-    - `geotrellis`: for creating a [GeoTrellis layer][GTLayer]
+    - @ref:[Tile](raster-write.md#tile-samples) and @ref:[DataFrame](raster-write.md#dataframe-samples) samples
     - `geotiff`: beta writer to GeoTiff file
+    - `geotrellis`: creating a [GeoTrellis layer][GTLayer]
     - [`parquet`][Parquet]: general purpose writer
 
 There is also support for @ref:[vector data](vector-data.md) for masking and data labeling.

pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 3 additions & 1 deletion
@@ -12,13 +12,15 @@ RasterFrames registers a DataSource named `raster` that enables reading of GeoTI
 
 ## Single Raster
 
-The simplest form is reading a single raster from a single URI:
+The simplest form is reading a single raster from a single URI.
 
 ```python read_one_uri
 rf = spark.read.raster('https://s22s-test-geotiffs.s3.amazonaws.com/luray_snp/B02.tif')
 rf.printSchema()
 ```
 
+The file at the address above is a valid [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/), which RasterFrames fully supports. RasterFrames takes advantage of the optimizations in the COG format to read more efficiently than it can from vanilla GeoTIFFs.
+
 Let's unpack the `proj_raster` column and look at the contents in more detail. It contains a [_CRS_][CRS], a spatial _extent_ measured in that CRS, and a two-dimensional array of numeric values called a _tile_.
 
 ```python unpack_schema