-
Hello!
DuckDB uses the GEOS library to determine validity. GEOS is in turn based on JTS. It seems like PostGIS also uses GEOS for this (https://postgis.net/docs/ST_IsValid.html), but there might be differences between the GEOS versions used by spatial and PostGIS (or it's a bug). Feel free to share a reproducible example and I can have a look.
I don't know. I would assume GEOS uses the Simple Features definition of validity. I don't think there is a specific algorithm for it; you just have to test all the required preconditions and determine whether they all hold. That said, in the algorithms I've implemented natively in DuckDB-spatial (not based on GEOS) I've tried to be resilient to the easy forms of degenerate geometries (e.g. linestrings with 1 point, polygons with fewer than 4 vertices). Also note that DuckDB doesn't enforce that stored geometries are valid.
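For example, something like this (table/column names are just illustrative, and the last query assumes the extension's GEOS-based ST_MakeValid):

```sql
INSTALL spatial;
LOAD spatial;

-- Count invalid geometries (illustrative table/column names)
SELECT count(*) FILTER (WHERE NOT ST_IsValid(geom)) AS invalid_count
FROM batiments_full;

-- A self-intersecting "bowtie" ring is invalid under Simple Features
SELECT ST_IsValid(ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))')); -- false

-- Many such geometries can be repaired before validation
SELECT ST_IsValid(ST_MakeValid(
    ST_GeomFromText('POLYGON((0 0, 1 1, 1 0, 0 1, 0 0))'))); -- true
```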
Indexes: Similarly, the spatial extension implements an R-Tree index, but using it is not as much of a requirement as it is in PostGIS. It only sometimes helps when scanning data (NOT joins), and a lot of the time it results in much worse performance unless you have highly selective queries, very large/wide rows, or a slow disk. It's also single-threaded (for now). Basically, DuckDB is much faster at just scanning the entire dataset than at doing random-access reads as the result of first scanning an index. There are also some memory limits when it comes to indexes. I wrote more about it here.

Optimizations: There's a lot to write about here. In general I would recommend looking through the merged PRs in this repo; if they include any significant optimizations I usually try to write about it. I also did this talk where I dived into the internals a bit, though it's a bit outdated now. We recently got a dedicated spatial join operator which creates an R-Tree on-the-fly on the build side when performing a join, which makes spatial joins much faster in general. You can read more about it here, and there's a sketch of the syntax below.

The other main point of optimization is the serialized geometry format used in DuckDB. It's not well documented, but it's basically similar to how PostGIS stores geometries. This has the benefit of ensuring that if the geometry blob is aligned to 8 bytes (which is mostly, but not always, the case in DuckDB), all the vertex data (e.g. the coordinates) will also be aligned, which makes it possible to access the memory directly, e.g. as an array of doubles.

This is something I've generally spent a lot of time on. To really benefit from DuckDB's multi-threaded vectorized execution it is very important to keep the actual function execution loop lean and to avoid allocating/deallocating a bunch of memory. This is a big motivation for re-implementing a lot of the spatial functions natively (i.e. not relying on GEOS or some other library), as we can better control e.g. memory allocation. To illustrate: I recently did some benchmarking and found that almost half the time of executing a spatial join goes to deserializing and constructing geometries into GEOS objects, and only ~2% to actually computing the predicate. In comparison, our own geometry library makes heavy use of arena allocation, can reference coordinate data directly from the serialized buffers, and sometimes doesn't need to allocate anything dynamically at all. But this is still a work in progress, and most of the important (difficult, e.g. spatial predicates, overlay/clipping) algorithms are still based on GEOS.

There's also the topic of compression, which I haven't even begun working on but which is on my todo list. Basically, DuckDB supports pluggable per-column compression, and my theory is that large geometries compress very well (e.g. most points in a line string or polygon are close to each other, so you can just store the deltas).

There's a ton more to write, but basically those are the three things I've focused on.
And the best way to get a feel for this is to dig through the code/check out merged PRs.
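For reference, here's a rough sketch of what the R-Tree index and a spatial join look like in practice (table and column names are illustrative):

```sql
-- A persistent R-Tree index: only helps highly selective scans
CREATE INDEX buildings_rtree ON buildings USING RTREE (geom);

-- A selective window query that an index scan can accelerate
SELECT count(*)
FROM buildings
WHERE ST_Within(geom, ST_GeomFromText(
    'POLYGON((-7.7 33.5, -7.5 33.5, -7.5 33.6, -7.7 33.6, -7.7 33.5))'));

-- A spatial join: the dedicated join operator builds a transient
-- R-Tree on the build side by itself, no explicit index required
SELECT b.id, p.id
FROM buildings b
JOIN parcels p ON ST_Intersects(b.geom, p.geom);
```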
I guess most of the cool stuff related to this isn't really specific to spatial. I would recommend reading up on some of the internals/design decisions that make DuckDB fast for large data in general (e.g. vectorized execution, the push-based multi-threaded streaming query engine, lightweight compression, compressed execution, columnar storage).

Most spatial file support is provided by the GDAL library. DuckDB doesn't really do a lot to make the access faster besides using the columnar Arrow export functionality of GDAL. I guess the main non-GDAL-based format is GeoParquet, which uses DuckDB's highly optimized parquet extension. The only geo-specific part of this is that the parquet reader supports efficient predicate pushdown, which makes it possible to filter e.g. bounding-box ranges very efficiently and avoid scanning rows that obviously can't e.g. intersect with a geometry. See e.g. how to download Overture data to see this in action.

Please keep me updated if you end up writing a paper, would love to read it when it's done!
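To make the pushdown point concrete, a minimal sketch (the file path is illustrative, and the per-row bbox struct follows the GeoParquet/Overture convention):

```sql
-- Row groups whose bbox column statistics fall outside the filter
-- are skipped entirely by the parquet reader's predicate pushdown
SELECT count(*)
FROM read_parquet('buildings/*.parquet')
WHERE bbox.xmin > -8.1 AND bbox.xmax < -7.9
  AND bbox.ymin > 31.5 AND bbox.ymax < 31.7;
```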
-
Dear DuckDB Team,

We thank you again for your kind and insightful response to our previous message. We would like to share some more precise elements of our work and take this opportunity to ask a few additional technical and strategic questions.

🔍 1. Geometry Validation – ST_IsValid

    SELECT COUNT(*) FROM batiments_full WHERE ST_IsValid(geom);

📊 2. Benchmark Summary

| Query | PostGIS Time | DuckDB Time | PostGIS Result | DuckDB Result |
| --- | --- | --- | --- | --- |

❗ 3. Geometry vs. Geography – Observed Differences

We noticed that DuckDB Spatial currently does not support the geography type, unlike PostGIS, which allows precise geodetic calculations using spherical models. To perform spatial analysis in meters with DuckDB, we had to project our geometries from EPSG:4326 (latitude/longitude) to a projected CRS (EPSG:3857). This step introduced noticeable differences in the results, especially in:

- Proximity queries (ST_DWithin): DuckDB returned fewer buildings due to the planar approximation of distance.
- Area calculations: DuckDB returned larger average areas due to projection distortion.

A sketch of this reprojection step is given at the end of this message.

👉 We understand this tradeoff improves performance, but we would like to ask:

- Is there a plan to support the geography type or geodetic calculations in the future?
- In which contexts do you recommend using geometry-only processing, and when should we handle reprojection manually?

📨 4. Request for Feedback and Acknowledgment in Our Report

Your feedback would be of great value to us. Moreover, as part of our final-year engineering thesis, we would like to formally mention your name as an external contributor/supervisor in our academic report. This would greatly enhance the impact of our work and help us promote DuckDB Spatial in our academic and professional ecosystem in Morocco.

Thank you once again for your attention, generosity, and impressive work on DuckDB Spatial. We look forward to your response and remain available for any clarification or contribution.

Best regards,
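P.S. For reference, the reprojection step from section 3 looked roughly like the sketch below (our batiments_full table; the second query uses an illustrative reference point):

```sql
-- Average building area in m² after projecting EPSG:4326 -> EPSG:3857
-- (planar: areas are increasingly inflated away from the equator)
SELECT AVG(ST_Area(ST_Transform(geom, 'EPSG:4326', 'EPSG:3857'))) AS avg_area_m2
FROM batiments_full;

-- Planar proximity: buildings within 500 m of an illustrative point
-- (note: depending on axis order, ST_Transform's optional always_xy
-- argument may be needed for EPSG:4326 input)
SELECT count(*)
FROM batiments_full
WHERE ST_DWithin(
        ST_Transform(geom, 'EPSG:4326', 'EPSG:3857'),
        ST_Transform(ST_Point(-7.6, 33.58), 'EPSG:4326', 'EPSG:3857'),
        500.0);
```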
-
Dear DuckDB Team,
We are two engineering students, BAIDA Brahim and BOUCHANTIYA Mohamed,
currently completing our final year project (PFE) for the Engineering
Degree in Geomatics and Topographic Sciences at the Institut
Agronomique et Vétérinaire Hassan II in Rabat, Morocco.
Our PFE is entitled "Exploration of DuckDB Spatial for Analytics on
Massive Geospatial Data".
As part of this work, we are focusing on a benchmarking study between
PostGIS and DuckDB Spatial.
Specifically, we are comparing the performance of both engines on the
spatial analysis of a large building dataset across Morocco (~3GB,
approximately 24 million geometries).
During our experiments, we noticed an important difference:
In PostGIS, after validating the geometries, we obtain approximately
24 million valid geometries.
In DuckDB, after validation, we obtain approximately 12 million valid
geometries.
Our investigations suggest that PostGIS may be slightly more tolerant
of minor polygon-closure errors, while DuckDB seems to enforce
stricter validation rules.
➔ We would greatly appreciate it if you could clarify:
Which definition of validity and which validation rules does DuckDB
Spatial apply for spatial data?
Moreover, we are very interested in better understanding the internal
architecture of DuckDB regarding the processing of massive spatial
datasets, particularly:
How DuckDB manages spatial indexing and optimization.
How DuckDB achieves efficient handling of very large spatial files.
Any official documentation, technical notes, or research articles that
could enrich our thesis would be highly appreciated.
Thank you very much for your time and support.
We remain available for any further information or collaboration if needed.
Sincerely,
BAIDA Brahim & BOUCHANTIYA Mohamed
Institut Agronomique et Vétérinaire Hassan II
Rabat, Morocco