Merge branch 'develop' into feature/travis-set-python3
* develop:
Docs on raster write: samples, parquet
Move numpy-pandas doc to pymd format; initial content
Small docs tweaks lingering from PR 187
raster reader can take a Pandas DataFrame or geopandas GeoDF
change link on line 61 to go to nodata-handling.md
fix tile stats and histogram functions; promote headings up 1 level; remove rf_cell_types
convert function ref code chunks to running code
pyrasterframes/src/main/python/docs/nodata-handling.pymd
RasterFrames provides a variety of functions to inspect and manage NoData within raster data.
To understand how NoData is handled in RasterFrames, we first need to understand the different underlying types of data called cell types. The cell types are GeoTrellis `CellType`s, so the [GeoTrellis documentation](https://geotrellis.readthedocs.io/en/latest/guide/core-concepts.html?#working-with-cell-values) is a valuable resource on how these are defined.
Use the function `rf_cell_type` to find the cell type of a specific set of raster data.
```python setup, echo=False
import pyrasterframes
from pyrasterframes.rasterfunctions import *
from IPython.display import display
```
The `CellType` class from the `rf_types` submodule allows us to create a representation of any valid cell type. There are convenience methods to create instances for a variety of basic types.
```python celltype_ctors
from pyrasterframes.rf_types import CellType
import inspect

[c[0] for c in inspect.getmembers(CellType, inspect.ismethod)]
```
We can also inspect the cell type of a given _tile_ or `proj_raster` column.
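A minimal sketch of that kind of inspection; the DataFrame `df` and its `proj_raster` column name are assumptions, not taken from this excerpt.

```python
from pyrasterframes.rasterfunctions import rf_cell_type

# Report the distinct cell types present in the raster column.
df.select(rf_cell_type('proj_raster')).distinct().show()
```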
Now we will use the @ref:[`rf_mask_by_value`](reference.md#rf-mask-by-value) function to designate the cloudy and other unwanted pixels as NoData in the blue column. Because there is no NoData value already defined, we will choose one. Note that in this particular example the minimum value is greater than zero, so we can use 0 as the NoData value.
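As a hedged sketch of what that masking step might look like; the DataFrame `df`, the `blue` and `scene_class` column names, and the cloud flag value `8` are illustrative assumptions, not from this excerpt.

```python
from pyspark.sql.functions import lit
from pyrasterframes.rasterfunctions import rf_convert_cell_type, rf_mask_by_value

masked = df.withColumn(
    'blue_masked',
    rf_mask_by_value(
        # 'uint16ud0' is a cell type with a user-defined NoData value of 0
        rf_convert_cell_type('blue', 'uint16ud0'),
        'scene_class',   # hypothetical mask column flagging unwanted pixels
        lit(8)))         # hypothetical value marking cloudy pixels
```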
In the Python Spark API, the work of distributed computing over the DataFrame is done on many executors (the Spark term for workers) inside Java virtual machines (JVMs). Most calls to `pyspark` are passed to a Java process via the `py4j` library. The user can also ask for data inside the JVM to be brought over to the Python driver (the Spark term for the client application). When dealing with _tiles_, the driver will receive this data as a lightweight wrapper object around a NumPy ndarray. It is also possible to write lambda functions against NumPy arrays and evaluate them in the Spark DataFrame.
## Performance Considerations
When working with large, distributed datasets in Spark, care is required when invoking _actions_ on the data. In general, _transformations_ are lazily evaluated in Spark, meaning the code runs quickly and does not move any data around. _Actions_, on the other hand, cause the evaluation to happen: all of the lazily planned _transformations_ are computed, and data is processed and moved around. In general, if a [`pyspark` function](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html) returns a DataFrame, it is probably a _transformation_; otherwise, it is an _action_.
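An illustrative sketch of the distinction; the DataFrame `df` and its `tile` column are assumptions, not from this excerpt.

```python
from pyrasterframes.rasterfunctions import rf_tile_mean

# Transformation: returns a new DataFrame; nothing is computed or moved yet.
means = df.select(rf_tile_mean('tile').alias('tile_mean'))

# Action: triggers evaluation of the lazy plan and returns a result to the driver.
n = means.count()
```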
When many _actions_ are invoked, a lot of data can flow from executors to the driver. In `pyspark`, the data then has to move from the driver JVM to the Python process running the driver. When that happens, any _tiles_ in the data will be converted to a Python [`Tile`](https://github.com/locationtech/rasterframes/blob/develop/pyrasterframes/src/main/python/pyrasterframes/rf_types.py) object. In practical work with Earth observation data, the _tiles_ are frequently 256 by 256 arrays, which may be 100 KB or more each. While individually small, a DataFrame can easily have dozens of such tile columns and millions of rows.
All of this reinforces an important principle for working with Spark: understand the cost of an _action_, and use @ref:[aggregates](aggregation.md), summaries, or samples to manage that cost.
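For example, here is a minimal sketch of using an aggregate to keep the data returned to the driver small; the DataFrame `df` and its `tile` column are assumed.

```python
from pyrasterframes.rasterfunctions import rf_agg_stats

# A single aggregate row crosses the JVM-to-Python boundary
# instead of millions of full tiles.
stats_row = df.agg(rf_agg_stats('tile').alias('stats')).first()
print(stats_row['stats'])
```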
## The `Tile` Class
In Python, _tiles_ are represented with the `rf_types.Tile` class. It wraps a two-dimensional NumPy `ndarray`, along with some additional metadata allowing correct conversion to the GeoTrellis @ref:[cell type](nodata-handling.md#cell-types).
```python tile_intro
from pyrasterframes.rf_types import Tile
import numpy as np

t = Tile(np.random.randn(4, 4))
print(str(t))
```
You can access the NumPy array with the `cells` member of `Tile`.
```python tile_cells
t.cells.shape, t.cells.nbytes
```

## DataFrame `toPandas`
As discussed in the @ref:[raster writing chapter](raster-write.md#dataframe-samples), a pretty display of a Pandas DataFrame containing _tiles_ is available by importing the `rf_ipython` submodule. In addition, as discussed in the @ref:[vector data chapter](vector-data.md), any geometry type in the Spark DataFrame will be converted into a Shapely geometry. Taken together, we can easily get the spatial information and raster data as a NumPy array, all within a Pandas DataFrame.
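A minimal sketch of that workflow; the DataFrame `spark_df` and its `geometry` and `tile` column names are assumptions, not taken from this excerpt.

```python
import pyrasterframes.rf_ipython  # enables rich display of tiles in Pandas/Jupyter output

pandas_df = spark_df.limit(5).toPandas()
row = pandas_df.iloc[0]
row['geometry']    # a Shapely geometry object
row['tile'].cells  # the tile's cells as a NumPy array
```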
As we demonstrated with @ref:[vector data](vector-data.md#shapely-geometry-support), we can also make use of the `Tile` type to create [user-defined functions (UDFs)](https://spark.apache.org/docs/2.3.2/api/python/pyspark.sql.html#pyspark.sql.functions.udf) that can take a _tile_ as input, return a _tile_ as output, or both. Here is a trivial and **inefficient** example of doing both. A serious performance implication of user-defined functions in Python is that all the executors must move the Java objects to Python, evaluate the function, and then move the Python objects back to Java. Use the many @ref:[built-in functions](reference.md) wherever possible, and ask the [community](https://gitter.im/s22s/raster-frames) if you have an idea for a function that should be included.
We will demonstrate creating a UDF that is logically equivalent to a built-in function. We'll quickly show that the resulting _tiles_ are approximately equivalent. They are not exactly the same because one is computed in Python and the other in Java.
```python udf
from pyrasterframes.rf_types import TileUDT
from pyspark.sql.functions import udf

@udf(TileUDT())
def my_udf(t):
    import numpy as np
    return Tile(np.log1p(t.cells))

udf_df = spark_df.limit(1).select(
    my_udf('tile').alias('udf_result'),
    rf_log1p('tile').alias('built_in_result')
).toPandas()

row = udf_df.iloc[0]
diff = row['udf_result'] - row['built_in_result']
print(type(diff))
np.abs(diff.cells).max()
```
We can also inspect an image of the difference between the two _tiles_; it is just random noise. Both tiles have the same NoData structure, shown as the white areas.
```python udf_diff_noise_tile
display(diff)
```
## Creating a Spark DataFrame
You can also create a Spark DataFrame with a column full of `Tile` objects or Shapely geometry objects.
The example below will create a Pandas DataFrame with ten rows of noise tiles and random Points. We will then create a Spark DataFrame from it.
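The example itself is not shown in this excerpt; here is a sketch of what it might look like. The tile size, point ranges, and use of `spark.createDataFrame` with a RasterFrames-enabled `spark` session are assumptions.

```python
import pandas as pd
import numpy as np
from shapely.geometry import Point
from pyrasterframes.rf_types import Tile

# Ten rows of random noise tiles and random Points (illustrative sizes and ranges).
pandas_df = pd.DataFrame([{
    'tile': Tile(np.random.randn(100, 100)),
    'geometry': Point(np.random.uniform(-180, 180), np.random.uniform(-90, 90))
} for _ in range(10)])

spark_df = spark.createDataFrame(pandas_df)  # assumes a RasterFrames-enabled SparkSession
spark_df.printSchema()
```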
pyrasterframes/src/main/python/docs/raster-io.md
The standard mechanism by which any data is brought in and out of a Spark DataFrame is the [Spark SQL DataSource][DS]. RasterFrames provides specialized DataSources for geospatial raster data and maintains compatibility with existing general-purpose DataSources, such as Parquet.

Three types of DataSources will be introduced:

* @ref:[Catalog Readers](raster-catalogs.md)
    - `aws-pds-l8-catalog`: built-in catalog over [Landsat on AWS][Landsat]
    - `aws-pds-modis-catalog`: built-in catalog over [MODIS on AWS][MODIS]
    - `geotrellis-catalog`: for enumerating [GeoTrellis layers][GTLayer]
* @ref:[Raster Readers](raster-read.md)
    - `raster`: the standard reader for most raster data, including single raster files or catalogs
    - `geotiff`: a simplified reader for reading a single GeoTIFF file
    - `geotrellis`: for reading a [GeoTrellis layer][GTLayer]
* @ref:[Raster Writers](raster-write.md)
    - @ref:[Tile](raster-write.md#tile-samples) and @ref:[DataFrame](raster-write.md#dataframe-samples) samples
    - `geotiff`: beta writer to GeoTIFF files
    - `geotrellis`: for creating a [GeoTrellis layer][GTLayer]
    - [`parquet`][Parquet]: general-purpose writer
There is also support for @ref:[vector data](vector-data.md) for masking and data labeling.
The file at the address above is a valid [Cloud Optimized GeoTIFF (COG)](https://www.cogeo.org/), which RasterFrames fully supports. RasterFrames will take advantage of the optimizations in the COG format to enable more efficient reading compared to vanilla GeoTIFFs.
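A minimal sketch of reading such a file; the URL is a placeholder since the actual address is not shown in this excerpt, and `spark` is assumed to be a RasterFrames-enabled SparkSession.

```python
# Placeholder URL; substitute the COG address referenced above.
cog_uri = 'https://example.com/scene/B04.tif'
df = spark.read.raster(cog_uri)
df.printSchema()
```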
Let's unpack the `proj_raster` column and look at the contents in more detail. It contains a [_CRS_][CRS], a spatial _extent_ measured in that CRS, and a two-dimensional array of numeric values called a _tile_.
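One way to perform that unpacking, as a sketch assuming the `df` read above:

```python
from pyrasterframes.rasterfunctions import rf_crs, rf_extent, rf_tile

df.select(
    rf_crs('proj_raster').alias('crs'),
    rf_extent('proj_raster').alias('extent'),
    rf_tile('proj_raster').alias('tile')
).show(3, truncate=False)
```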