Skip to content

Commit e5be8cd

Browse files
authored
Merge pull request #437 from s22s/docs/spatial-join-expansion
Add raster join page to docs; expand spatial index discussion
2 parents 8b948fc + 2d033de commit e5be8cd

File tree

3 files changed

+79
-1
lines changed

3 files changed

+79
-1
lines changed

docs/src/main/paradox/raster-processing.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@
88
* @ref:[Zonal Map Algebra](zonal-algebra.md)
99
* @ref:[Aggregation](aggregation.md)
1010
* @ref:[Time Series](time-series.md)
11+
* @ref:[Raster Join](raster-join.md)
1112
* @ref:[Machine Learning](machine-learning.md)
1213

1314
@@@
Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# Raster Join
2+
3+
```python, init, echo=False
4+
from IPython.display import display
5+
import pyrasterframes.rf_ipython
6+
import pandas as pd
7+
from pyrasterframes.utils import create_rf_spark_session
8+
from pyrasterframes.rasterfunctions import *
9+
from pyspark.sql.functions import *
10+
spark = create_rf_spark_session()
11+
12+
```
13+
14+
## Description
15+
16+
A common operation for raster data is reprojecting or warping the data to a different @ref:[CRS][CRS] with a specific @link:[transform](https://gdal.org/user/raster_data_model.html#affine-geotransform) { open=new }. In many use cases, the particulars of the warp operation depend on another set of raster data. Furthermore, the warp is done to put both sets of raster data to a common set of grid to enable manipulation of the datasets together.
17+
18+
In RasterFrames, you can perform a **Raster Join** on two DataFrames containing raster data.
19+
The operation will perform a _spatial join_ based on the [CRS][CRS] and [extent][extent] data in each DataFrame. By default it is a left join and uses an intersection operator.
20+
For each candidate row, all _tile_ columns on the right hand side are warped to match the left hand side's [CRS][CRS], [extent][extent], and dimensions. Warping relies on GeoTrellis library code and uses nearest neighbor resampling method.
21+
The operation is also an aggregate, with multiple intersecting right-hand side tiles `merge`d into the result. There is no guarantee about the ordering of tiles used to select cell values in the case of overlapping tiles.
22+
When using the @ref:[`raster` DataSource](raster-join.md) you will automatically get the @ref:[CRS][CRS] and @ref:[extent][extent] information needed to do this operation.
23+
24+
25+
## Example Code
26+
27+
Because the raster join is a distributed spatial join, indexing of both DataFrames using the [spatial index][spatial-index] is crucial for performance.
28+
29+
```python, example_raster_join
30+
# Southern Mozambique December 29, 2016
31+
modis = spark.read.raster('s3://astraea-opendata/MCD43A4.006/21/11/2016297/MCD43A4.A2016297.h21v11.006.2016306075821_B01.TIF',
32+
spatial_index_partitions=True) \
33+
.withColumnRenamed('proj_raster', 'modis')
34+
35+
landsat8 = spark.read.raster('https://landsat-pds.s3.us-west-2.amazonaws.com/c1/L8/167/077/LC08_L1TP_167077_20161015_20170319_01_T1/LC08_L1TP_167077_20161015_20170319_01_T1_B4.TIF',
36+
spatial_index_partitions=True) \
37+
.withColumnRenamed('proj_raster', 'landsat')
38+
39+
rj = landsat8.raster_join(modis)
40+
41+
# Show some non-empty tiles
42+
rj.select('landsat', 'modis', 'crs', 'extent') \
43+
.filter(rf_data_cells('modis') > 0) \
44+
.filter(rf_tile_max('landsat') > 0)
45+
```
46+
47+
## Additional Options
48+
49+
The following optional arguments are allowed:
50+
51+
* `left_extent` - the column on the left-hand DataFrame giving the [extent][extent] of the tile columns
52+
* `left_crs` - the column on the left-hand DataFrame giving the [CRS][CRS] of the tile columns
53+
* `right_extent` - the column on the right-hand DataFrame giving the [extent][extent] of the tile columns
54+
* `right_crs` - the column on the right-hand DataFrame giving the [CRS][CRS] of the tile columns
55+
* `join_exprs` - a single column expression as would be used in the [`on` parameter of `join`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join)
56+
57+
58+
Note that the `join_exprs` will override the join behavior described above. By default the expression is equivalent to:
59+
60+
```python, join_expr, evaluate=False
61+
st_intersects(
62+
st_geometry(left[left_extent]),
63+
st_reproject(st_geometry(right[right_extent]), right[right_crs], left[left_crs])
64+
)
65+
```
66+
67+
68+
69+
[CRS]: concepts.md#coordinate-reference-system--crs
70+
[extent]: concepts.md#extent
71+
[spatial-index]:raster-read.md#spatial-indexing-and-partitioning

pyrasterframes/src/main/python/docs/raster-read.pymd

Lines changed: 7 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -217,13 +217,19 @@ In the initial examples on this page, you may have noticed that the realized (no
217217

218218
## Spatial Indexing and Partitioning
219219

220-
It's often desirable to take extra steps in ensuring your data is effectively distributed over your computing resources. One way of doing that is using something called a ["space filling curve"](https://en.wikipedia.org/wiki/Space-filling_curve), which turns an N-dimensional value into a one dimensional value, with properties that favor keeping entities near each other in N-space near each other in index space. To have RasterFrames add a spatial index based partitioning on a raster reads, use the `spatial_index_partitions` parameter:
220+
It's often desirable to take extra steps in ensuring your data is effectively distributed over your computing resources. One way of doing that is using something called a ["space filling curve"](https://en.wikipedia.org/wiki/Space-filling_curve), which turns an N-dimensional value into a one dimensional value, with properties that favor keeping entities near each other in N-space near each other in index space. In particular RasterFrames support space-filling curves mapping the geographic location of _tiles_ to a one-dimensional index space called [`xz2`](https://www.geomesa.org/documentation/user/datastores/index_overview.html). To have RasterFrames add a spatial index based partitioning on a raster reads, use the `spatial_index_partitions` parameter. By default it will use the same number of partitions as configured in [`spark.sql.shuffle.partitions`](https://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options).
221221

222222
```python, spatial_indexing
223223
df = spark.read.raster(uri, spatial_index_partitions=True)
224224
df
225225
```
226226

227+
You can also pass a positive integer to the parameter to specify the number of desired partitions.
228+
229+
```python, spatial_indexing
230+
df = spark.read.raster(uri, spatial_index_partitions=800)
231+
```
232+
227233
## GeoTrellis
228234

229235
### GeoTrellis Catalogs

0 commit comments

Comments
 (0)