Skip to content

Commit 745f0bd

Browse files
asinghvi17aplavin
andauthored
Spatial joins (#113)
* Hook spatial predicates up to FlexiJoins.jl by an extension * Add a tutorial on how to use spatial joins via FlexiJoin * Add tests * Allow `GO.equals` as a predicate * Improve docs significantly * Fix example name * Hide theming without causing global effects * Clean up the implementation * Implement tree based joins for speed on larger datasets * Apply suggestions from code review Co-authored-by: Alexander Plavin <[email protected]> * Add instructions on how to add custom predicates --------- Co-authored-by: Alexander Plavin <[email protected]>
1 parent 8f46d15 commit 745f0bd

File tree

7 files changed

+220
-1
lines changed

7 files changed

+220
-1
lines changed

Project.toml

Lines changed: 6 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -8,17 +8,21 @@ CoordinateTransformations = "150eb455-5306-5404-9cee-2592286d6298"
88
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
99
GeometryBasics = "5c1252a2-5f33-56bf-86c9-59e7332b4326"
1010
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
11+
SortTileRecursiveTree = "746ee33f-1797-42c2-866d-db2fce69d14d"
1112
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
1213
Tables = "bd369af6-aec1-5ad0-b16a-f7cc5008161c"
1314

1415
[weakdeps]
16+
FlexiJoins = "e37f2e79-19fa-4eb7-8510-b63b51fe0a37"
1517
Proj = "c94c279d-25a6-4763-9509-64d165bea63e"
1618

1719
[extensions]
20+
GeometryOpsFlexiJoinsExt = "FlexiJoins"
1821
GeometryOpsProjExt = "Proj"
1922

2023
[compat]
2124
CoordinateTransformations = "0.5, 0.6"
25+
FlexiJoins = "0.1.30"
2226
GeoInterface = "1.2"
2327
GeometryBasics = "0.4.7"
2428
LinearAlgebra = "1"
@@ -32,6 +36,7 @@ ArchGDAL = "c9ce4bd3-c3d5-55b8-8973-c0e20141b8c3"
3236
CoordinateTransformations = "150eb455-5306-5404-9cee-2592286d6298"
3337
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
3438
Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
39+
FlexiJoins = "e37f2e79-19fa-4eb7-8510-b63b51fe0a37"
3540
GeoFormatTypes = "68eda718-8dee-11e9-39e7-89f7f65f511f"
3641
GeoJSON = "61d90e0f-e114-555e-ac52-39dfb47a3ef9"
3742
JLD2 = "033835bb-8acc-5ee8-8aae-3f567f8a3819"
@@ -42,4 +47,4 @@ Shapefile = "8e980c4a-a4fe-5da2-b3a7-4b4b0353a2f4"
4247
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
4348

4449
[targets]
45-
test = ["ArchGDAL", "CoordinateTransformations", "DataFrames", "Distributions", "GeoFormatTypes", "GeoJSON", "Proj", "JLD2", "LibGEOS", "Random", "Shapefile", "Test"]
50+
test = ["ArchGDAL", "CoordinateTransformations", "DataFrames", "Distributions", "FlexiJoins", "GeoFormatTypes", "GeoJSON", "Proj", "JLD2", "LibGEOS", "Random", "Shapefile", "Test"]

docs/Project.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -12,6 +12,8 @@ Distributions = "31c24e10-a181-5473-b8eb-7969acd0382f"
1212
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
1313
DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365"
1414
DoubleFloats = "497a8b3b-efae-58df-a0af-a86822472b78"
15+
FlexiJoins = "e37f2e79-19fa-4eb7-8510-b63b51fe0a37"
16+
GADM = "a8dd9ffe-31dc-4cf5-a379-ea69100a8233"
1517
ExactPredicates = "429591f6-91af-11e9-00e2-59fbe8cec110"
1618
GeoDatasets = "ddc7317b-88db-5cb5-a849-8449e5df04f9"
1719
GeoInterface = "cf35fbd7-0cd7-5166-be24-54bfbe79505f"
@@ -23,6 +25,8 @@ LibGEOS = "a90b1aa1-3769-5649-ba7e-abc5a9d163eb"
2325
Literate = "98b081ad-f1c9-55d3-8b20-4c87d4299306"
2426
Makie = "ee78f7c6-11fb-53f2-987a-cfe4a2b5a57a"
2527
MakieThemes = "e296ed71-da82-5faf-88ab-0034a9761098"
28+
Measurements = "eff96d63-e80a-5855-80a2-b1b0885c5ab7"
29+
MonteCarloMeasurements = "0987c9cc-fe09-11e8-30f0-b96dd679fdca"
2630
MultiFloats = "bdf0d083-296b-4888-a5b6-7498122e68a5"
2731
Printf = "de0858da-6303-5e67-8744-51eddeeeb8d7"
2832
Proj = "c94c279d-25a6-4763-9509-64d165bea63e"

docs/make.jl

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -94,6 +94,9 @@ makedocs(;
9494
pages=[
9595
"Introduction" => "introduction.md",
9696
"API Reference" => "api.md",
97+
"Tutorials" => [
98+
"Spatial Joins" => "tutorials/spatial_joins.md",
99+
],
97100
"Explanations" => [
98101
"Paradigms" => "paradigms.md",
99102
"Peculiarities" => "peculiarities.md",

docs/src/tutorials/spatial_joins.md

Lines changed: 135 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,135 @@
1+
# Spatial joins
2+
3+
Spatial joins are [table joins](https://www.geeksforgeeks.org/sql-join-set-1-inner-left-right-and-full-joins/) which are based not on equality, but on some predicate ``p(x, y)``, which takes two geometries, and returns a value of either `true` or `false`. For geometries, the [`DE-9IM`](https://en.wikipedia.org/wiki/DE-9IM) spatial relationship model is used to determine the spatial relationship between two geometries.
4+
5+
Spatial joins can be done between any geometry types (from geometrycollections to points), just as geometrical predicates can be evaluated on any geometries.
6+
7+
In this tutorial, we will show how to perform a spatial join on first a toy dataset and then two Natural Earth datasets, to show how this can be used in the real world.
8+
9+
In order to perform the spatial join, we use **[FlexiJoins.jl](https://github.com/JuliaAPlavin/FlexiJoins.jl)** to perform the join, specifically using its `by_pred` joining method. This allows the user to specify a predicate in the following manner:
10+
```julia
11+
[inner/left/right/outer/...]join((table1, table1),
12+
by_pred(:table1_column, predicate_function, :table2_column) # & add other conditions here
13+
)
14+
```
15+
16+
We have enabled the use of all of GeometryOps' boolean comparisons here. These are:
17+
18+
```julia
19+
GO.contains, GO.within, GO.intersects, GO.touches, GO.crosses, GO.disjoint, GO.overlaps, GO.covers, GO.coveredby, GO.equals
20+
```
21+
22+
!!! tip
23+
Always place the dataframe with more complex geometries second, as that is the one which will be sorted into a tree.
24+
25+
## Simple example
26+
27+
This example demonstrates how to perform a spatial join between two datasets: a set of polygons and a set of randomly generated points.
28+
29+
The polygons are represented as a DataFrame with geometries and colors, while the points are stored in a separate DataFrame.
30+
31+
The spatial join is performed using the `contains` predicate from GeometryOps, which checks if each point is contained within any of the polygons. The resulting joined DataFrame is then used to plot the points, colored according to the containing polygon.
32+
33+
First, we generate our data. We create two triangle polygons which, together, span the rectangle (0, 0, 1, 1), and a set of points which are randomly distributed within this rectangle.
34+
35+
```@example spatialjoins
36+
import GeoInterface as GI, GeometryOps as GO
37+
using FlexiJoins, DataFrames
38+
39+
using CairoMakie, GeoInterfaceMakie
40+
41+
pl = GI.Polygon([GI.LinearRing([(0, 0), (1, 0), (1, 1), (0, 0)])])
42+
pu = GI.Polygon([GI.LinearRing([(0, 0), (0, 1), (1, 1), (0, 0)])])
43+
poly_df = DataFrame(geometry = [pl, pu], color = [:red, :blue])
44+
f, a, p = Makie.with_theme(Attributes(; Axis = (; aspect = DataAspect()))) do # hide
45+
f, a, p = poly(poly_df.geometry; color = tuple.(poly_df.color, 0.3))
46+
end # hide
47+
```
48+
49+
Here, the upper polygon is blue, and the lower polygon is red. Keep this in mind!
50+
51+
Now, we generate the points.
52+
53+
```@example spatialjoins
54+
points = tuple.(rand(1000), rand(1000))
55+
points_df = DataFrame(geometry = points)
56+
scatter!(points_df.geometry)
57+
f
58+
```
59+
60+
You can see that they are evenly distributed around the box. But how do we know which points are in which polygons?
61+
62+
We have to join the two dataframes based on which polygon (if any) each point lies within.
63+
64+
Now, we can perform the "spatial join" using FlexiJoins. We are performing an outer join here
65+
66+
```@example spatialjoins
67+
@time joined_df = FlexiJoins.innerjoin(
68+
(points_df, poly_df),
69+
by_pred(:geometry, GO.within, :geometry)
70+
)
71+
```
72+
73+
```@example spatialjoins
74+
scatter!(a, joined_df.geometry; color = joined_df.color)
75+
f
76+
```
77+
78+
Here, you can see that the colors were assigned appropriately to the scattered points!
79+
80+
## Real-world example
81+
82+
Suppose I have a list of polygons representing administrative regions (or mining sites, or what have you), and I have a list of polygons for each country. I want to find the country each region is in.
83+
84+
```julia real
85+
import GeoInterface as GI, GeometryOps as GO
86+
using FlexiJoins, DataFrames, GADM # GADM gives us country and sublevel geometry
87+
88+
using CairoMakie, GeoInterfaceMakie
89+
90+
country_df = GADM.get.(["JPN", "USA", "IND", "DEU", "FRA"]) |> DataFrame
91+
country_df.geometry = GI.GeometryCollection.(GO.tuples.(country_df.geom))
92+
93+
state_doublets = [
94+
("USA", "New York"),
95+
("USA", "California"),
96+
("IND", "Karnataka"),
97+
("DEU", "Berlin"),
98+
("FRA", "Grand Est"),
99+
("JPN", "Tokyo"),
100+
]
101+
102+
state_full_df = (x -> GADM.get(x...)).(state_doublets) |> DataFrame
103+
state_full_df.geom = GO.tuples.(only.(state_full_df.geom))
104+
state_compact_df = state_full_df[:, [:geom, :NAME_1]]
105+
```
106+
107+
```julia real
108+
innerjoin((state_compact_df, country_df), by_pred(:geom, GO.within, :geometry))
109+
innerjoin((state_compact_df, view(country_df, 1:1, :)), by_pred(:geom, GO.within, :geometry))
110+
```
111+
112+
!!! warning
113+
This is how you would do this, but it doesn't work yet, since the GeometryOps predicates are quite slow on large polygons. If you try this, the code will continue to run for a very, very long time (it took 12 hours on my laptop, but with minimal CPU usage).
114+
115+
## Enabling custom predicates
116+
117+
In case you want to use a custom predicate, you only need to define a method to tell FlexiJoins how to use it.
118+
119+
For example, let's suppose you wanted to perform a spatial join on geometries which are some distance away from each other:
120+
121+
```julia
122+
my_predicate_function = <(5) abs GO.distance
123+
```
124+
125+
You would need to define `FlexiJoins.supports_mode` on your predicate:
126+
127+
```julia{3}
128+
FlexiJoins.supports_mode(
129+
::FlexiJoins.Mode.NestedLoopFast,
130+
::FlexiJoins.ByPred{typeof(my_predicate_function)},
131+
datas
132+
) = true
133+
```
134+
135+
This will enable FlexiJoins to support your custom function, when it's passed to `by_pred(:geometry, my_predicate_function, :geometry)`.
Lines changed: 48 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,48 @@
1+
module GeometryOpsFlexiJoinsExt
2+
3+
using GeometryOps
4+
using FlexiJoins
5+
6+
import GeometryOps as GO, GeoInterface as GI
7+
using SortTileRecursiveTree, Tables
8+
9+
10+
# This module defines the FlexiJoins APIs for GeometryOps' boolean comparison functions, taken from DE-9IM.
11+
12+
# First, we define the joining modes (Tree, NestedLoopFast) that the GO DE-9IM functions support.
13+
const GO_DE9IM_FUNCS = Union{typeof(GO.contains), typeof(GO.within), typeof(GO.intersects), typeof(GO.disjoint), typeof(GO.touches), typeof(GO.crosses), typeof(GO.overlaps), typeof(GO.covers), typeof(GO.coveredby), typeof(GO.equals)}
14+
# NestedLoopFast is the naive fallback method
15+
FlexiJoins.supports_mode(::FlexiJoins.Mode.NestedLoopFast, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true
16+
# This method allows you to cache a tree, which we do by using an STRtree.
17+
# TODO: wrap GO predicate functions in a `TreeJoiner` struct or something, to indicate that we want to use trees,
18+
# since they can be slower in some situations.
19+
FlexiJoins.supports_mode(::FlexiJoins.Mode.Tree, ::FlexiJoins.ByPred{F}, datas) where F <: GO_DE9IM_FUNCS = true
20+
21+
# Nested loop support is simple, and needs no further support.
22+
# However, for trees, we need to define how the tree is prepared and how it is used.
23+
# This is done by defining the `prepare_for_join` function to return an STRTree,
24+
# and by defining the `findmatchix` function as querying that tree before checking
25+
# intersections.
26+
27+
# In theory, one could extract the tree from e.g a GeoPackage or some future GeoDataFrame.
28+
29+
FlexiJoins.prepare_for_join(::FlexiJoins.Mode.Tree, X, cond::FlexiJoins.ByPred{<: GO_DE9IM_FUNCS}) = (X, SortTileRecursiveTree.STRtree(map(cond.Rf, X)))
30+
function FlexiJoins.findmatchix(::FlexiJoins.Mode.Tree, cond::FlexiJoins.ByPred{F}, ix_a, a, (B, tree)::Tuple, multi::typeof(identity)) where F <: GO_DE9IM_FUNCS
31+
idxs = SortTileRecursiveTree.query(tree, cond.Lf(a))
32+
intersecting_idxs = filter!(idxs) do idx
33+
cond.pred(a, cond.Rf(B[idx]))
34+
end
35+
return intersecting_idxs
36+
end
37+
38+
# Finally, for completeness, we define the `swap_sides` function for those predicates which are defined as inversions.
39+
40+
FlexiJoins.swap_sides(::typeof(GO.contains)) = GO.within
41+
FlexiJoins.swap_sides(::typeof(GO.within)) = GO.contains
42+
FlexiJoins.swap_sides(::typeof(GO.coveredby)) = GO.covers
43+
FlexiJoins.swap_sides(::typeof(GO.covers)) = GO.coveredby
44+
45+
# That's a wrap, folks!
46+
47+
end
48+

test/extensions/flexijoins.jl

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
import GeometryOps as GO, GeoInterface as GI
2+
using FlexiJoins, DataFrames
3+
4+
pl = GI.Polygon([GI.LinearRing([(0, 0), (1, 0), (1, 1), (0, 0)])])
5+
pu = GI.Polygon([GI.LinearRing([(0, 0), (0, 1), (1, 1), (0, 0)])])
6+
poly_df = DataFrame(geometry = [pl, pu], color = [:red, :blue])
7+
8+
points = tuple.(rand(100), rand(100))
9+
points_df = DataFrame(geometry = points)
10+
11+
@testset "Basic integration" begin
12+
13+
@test_nowarn joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry))
14+
# Test that the join happened correctly
15+
joined_df = FlexiJoins.innerjoin((poly_df, points_df), by_pred(:geometry, GO.contains, :geometry))
16+
@test all(GO.contains.((pl,), joined_df.geometry_1[joined_df.color .== :red]))
17+
@test all(GO.contains.((pu,), joined_df.geometry_1[joined_df.color .== :blue]))
18+
# Test that within also works
19+
@test_nowarn joined_df = FlexiJoins.innerjoin((points_df, poly_df), by_pred(:geometry, GO.within, :geometry))
20+
21+
end
22+

test/runtests.jl

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -41,4 +41,6 @@ const GO = GeometryOps
4141
include("transformations/correction/closed_ring.jl")
4242
include("transformations/correction/intersecting_polygons.jl")
4343
end
44+
# # Extensions
45+
@testset "FlexiJoins" begin include("extensions/flexijoins.jl") end
4446
end

0 commit comments

Comments
 (0)