-
Hello!
DuckDB uses the GEOS library to determine validity. GEOS is in turn based on JTS. It seems like PostGIS also uses GEOS for this (https://postgis.net/docs/ST_IsValid.html), but there might be differences between the GEOS versions used by spatial and PostGIS (or it's a bug). Feel free to share a reproducible example and I can have a look.
I don't know. I would assume GEOS uses the Simple Features definition of validity. I don't think there is a specific algorithm for it; you just have to test all the required preconditions and determine whether they all hold. That said, in the algorithms I've implemented natively in DuckDB-spatial (not based on GEOS) I've tried to be resilient to the easy forms of degenerate geometries (e.g. linestrings with 1 point, polygons with fewer than 4 vertices). Also note that DuckDB doesn't enforce that stored geometries are valid.
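For example, something like this (table/column names are just illustrative, and the last query assumes the extension's GEOS-based ST_MakeValid):

```sql
INSTALL spatial;
LOAD spatial;

-- Count invalid geometries (illustrative table/column names)
SELECT count(*) FILTER (WHERE NOT ST_IsValid(geom)) AS invalid_count
FROM batiments_full;

-- A self-intersecting "bowtie" ring is invalid under Simple Features
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))')); -- false

-- Many such geometries can be repaired before validation
SELECT ST_IsValid(ST_MakeValid(
    ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'))); -- true
```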
Indexes: Similarly, the spatial extension implements an R-Tree index, but using it is not as much of a requirement as it is in PostGIS. It only sometimes helps when scanning data (NOT joins), and a lot of the time it results in much worse performance unless you have highly selective queries, very large/wide rows, or a slow disk. It's also single-threaded (for now). Basically, DuckDB is much faster at just scanning the entire dataset than at doing random-access reads as the result of first scanning an index. There are also some memory limits when it comes to indexes. I wrote more about it here.

Optimizations: There's a lot to write about here. In general I would recommend looking through the merged PRs in this repo; if they include any significant optimizations I usually try to write about it. I also did this talk where I dived into the internals a bit, though it's a bit outdated now. We recently got a dedicated spatial join operator which creates an R-Tree on-the-fly on the build side when performing a join, which makes spatial joins much faster in general. You can read more about it here, and there's a sketch of the syntax below.

The other main point of optimization is the serialized geometry format used in DuckDB. It's not well documented, but it's basically similar to how PostGIS stores geometries. This has the benefit of ensuring that if the geometry blob is aligned to 8 bytes (which is mostly, but not always, the case in DuckDB), all the vertex data (e.g. the coordinates) will also be aligned, which makes it possible to access the memory directly, e.g. as an array of doubles.

This is something I've generally spent a lot of time on. To really benefit from DuckDB's multi-threaded vectorized execution it is very important to keep the actual function execution loop lean and to avoid allocating/deallocating a bunch of memory. This is a big motivation for re-implementing a lot of the spatial functions natively (i.e. not relying on GEOS or some other library), as we can better control e.g. memory allocation. To illustrate: I recently did some benchmarking and found that almost half the time of executing a spatial join goes to deserializing and constructing geometries into GEOS objects, and only ~2% to actually computing the predicate. In comparison, our own geometry library makes heavy use of arena allocation, can reference coordinate data directly from the serialized buffers, and sometimes doesn't need to allocate anything dynamically at all. But this is still a work in progress, and most of the important (difficult, e.g. spatial predicates, overlay/clipping) algorithms are still based on GEOS.

There's also the topic of compression, which I haven't even begun working on but which is on my todo list. Basically, DuckDB supports pluggable per-column compression, and my theory is that large geometries compress very well (e.g. most points in a line string or polygon are close to each other, so you can just store the deltas).

There's a ton more to write, but basically those are the three things I've focused on.
And the best way to get a feel for this is to dig through the code/check out merged PRs.
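For reference, here's a rough sketch of what the R-Tree index and a spatial join look like in practice (table and column names are illustrative):

```sql
-- A persistent R-Tree index: only helps highly selective scans
CREATE INDEX buildings_rtree ON buildings USING RTREE (geom);

-- A selective window query that an index scan can accelerate
SELECT count(*)
FROM buildings
WHERE ST_Within(geom, ST_GeomFromText(
    'POLYGON((-7.7 33.5, -7.5 33.5, -7.5 33.6, -7.7 33.6, -7.7 33.5))'));

-- A spatial join: the dedicated join operator builds a transient
-- R-Tree on the build side by itself, no explicit index required
SELECT b.id, p.id
FROM buildings b
JOIN parcels p ON ST_Intersects(b.geom, p.geom);
```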
I guess most of the cool stuff related to this isn't really specific to spatial. I would recommend reading up on some of the internals/design decisions that make DuckDB fast for large data in general (e.g. vectorized execution, the push-based multi-threaded streaming query engine, lightweight compression, compressed execution, columnar storage).

Most spatial file support is provided by the GDAL library. DuckDB doesn't really do a lot to make the access faster besides using the columnar Arrow export functionality of GDAL. I guess the main non-GDAL-based format is GeoParquet, which uses DuckDB's highly optimized parquet extension. The only geo-specific part of this is that the parquet reader supports efficient predicate pushdown, which makes it possible to filter e.g. bounding-box ranges very efficiently and avoid scanning rows that obviously can't e.g. intersect with a geometry. See e.g. how to download Overture data to see this in action.

Please keep me updated if you end up writing a paper, would love to read it when it's done!
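To make the pushdown point concrete, a minimal sketch (the file path is illustrative, and the per-row bbox struct follows the GeoParquet/Overture convention):

```sql
-- Row groups whose bbox column statistics fall outside the filter
-- are skipped entirely by the parquet reader's predicate pushdown
SELECT count(*)
FROM read_parquet('buildings/*.parquet')
WHERE bbox.xmin > -8.1 AND bbox.xmax < -7.9
  AND bbox.ymin > 31.5 AND bbox.ymax < 31.7;
```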
-
Dear DuckDB Team,

We thank you again for your kind and insightful response to our previous message. We would like to share some more precise elements of our work and take this opportunity to ask a few additional technical and strategic questions.

🔍 1. Geometry Validation – ST_IsValid

    SELECT COUNT(*) FROM batiments_full WHERE ST_IsValid(geom);

📊 2. Benchmark Summary

| Query | PostGIS Time | DuckDB Time | PostGIS Result | DuckDB Result |
| --- | --- | --- | --- | --- |

❗ 3. Geometry vs. Geography – Observed Differences

We noticed that DuckDB Spatial currently does not support the geography type, unlike PostGIS, which allows precise geodetic calculations using spherical models. To perform spatial analysis in meters with DuckDB, we had to project our geometries from EPSG:4326 (latitude/longitude) to a projected CRS (EPSG:3857). This step introduced noticeable differences in the results, especially in:

- Proximity queries (ST_DWithin): DuckDB returned fewer buildings due to the planar approximation of distance.
- Area calculations: DuckDB returned larger average areas due to projection distortion.

A sketch of this reprojection step is given at the end of this message.

👉 We understand this tradeoff improves performance, but we would like to ask:

- Is there a plan to support the geography type or geodetic calculations in the future?
- In which contexts do you recommend using geometry-only processing, and when should we handle reprojection manually?

📨 4. Request for Feedback and Acknowledgment in Our Report

Your feedback would be of great value to us. Moreover, as part of our final-year engineering thesis, we would like to formally mention your name as an external contributor/supervisor in our academic report. This would greatly enhance the impact of our work and help us promote DuckDB Spatial in our academic and professional ecosystem in Morocco.

Thank you once again for your attention, generosity, and impressive work on DuckDB Spatial. We look forward to your response and remain available for any clarification or contribution.

Best regards,
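P.S. For reference, the reprojection step from section 3 looked roughly like the sketch below (our batiments_full table; the second query uses an illustrative reference point):

```sql
-- Average building area in m² after projecting EPSG:4326 -> EPSG:3857
-- (planar: areas are increasingly inflated away from the equator)
SELECT AVG(ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:3857'))) AS avg_area_m2
FROM batiments_full;

-- Planar proximity: buildings within 500 m of an illustrative point
-- (note: depending on axis order, ST_Transform's optional always_xy
-- argument may be needed for EPSG:4326 input)
SELECT count(*)
FROM batiments_full
WHERE ST_DWithin(
        ST_Transform(geom, 'EPSG:4326', 'EPSG:3857'),
        ST_Transform(ST_Point(-7.6, 33.58), 'EPSG:4326', 'EPSG:3857'),
        500.0);
```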
-
Dear DuckDB Team,
We are two engineering students, BAIDA Brahim and BOUCHANTIYA Mohamed,
currently completing our final year project (PFE) for the Engineering
Degree in Geomatics and Topographic Sciences at the Institut
Agronomique et Vétérinaire Hassan II in Rabat, Morocco.
Our PFE is entitled "Exploration of DuckDB Spatial for Analytics on
Massive Geospatial Data".
As part of this work, we are focusing on a benchmarking study between
PostGIS and DuckDB Spatial.
Specifically, we are comparing the performance of both engines on the
spatial analysis of a large building dataset across Morocco (~3GB,
approximately 24 million geometries).
During our experiments, we noticed an important difference:
In PostGIS, after validating the geometries, we obtain approximately
24 million valid geometries.
In DuckDB, after validation, we obtain approximately 12 million valid
geometries.
Our investigations suggest that PostGIS may be slightly more tolerant
of minor polygon-closure errors, while DuckDB seems to enforce
stricter validation rules.
➔ We would greatly appreciate it if you could clarify:
Which definition of validity and which validation rules does DuckDB
Spatial apply for spatial data?
Moreover, we are very interested in better understanding the internal
architecture of DuckDB regarding the processing of massive spatial
datasets, particularly:
How DuckDB manages spatial indexing and optimization.
How DuckDB achieves efficient handling of very large spatial files.
Any official documentation, technical notes, or research articles that
could enrich our thesis would be highly appreciated.
Thank you very much for your time and support.
We remain available for any further information or collaboration if needed.
Sincerely,
BAIDA Brahim & BOUCHANTIYA Mohamed
Institut Agronomique et Vétérinaire Hassan II
Rabat, Morocco