## Summary
When Databricks tables contain geometry columns (e.g., geometry(4326)), the ADBC driver returns them as plain binary Arrow arrays without any extension metadata. This means downstream consumers like DuckDB cannot automatically recognize them as geometry.
Proposal: Tag geometry columns with geoarrow.wkb Arrow extension metadata so they flow as native geometry types through the Arrow ecosystem.
## Background
GeoArrow defines standard Arrow extension types for geospatial data. The geoarrow.wkb extension type wraps WKB-encoded geometry in a binary Arrow array with two metadata fields:
- `ARROW:extension:name` = `"geoarrow.wkb"`
- `ARROW:extension:metadata` = JSON with CRS info (e.g., `{"crs": "OGC:CRS84"}`)
DuckDB 1.5 has built-in GEOMETRY type support and can consume geoarrow.wkb Arrow arrays natively via register_geoarrow_extensions(). The DuckDB Snowflake extension already uses this pattern for geometry passthrough (iqea-ai/duckdb-snowflake#24).
## Current workaround
When using the Databricks ADBC driver with DuckDB's adbc_scanner, geometry requires explicit WKB conversion on both sides:
```sql
-- Databricks side: explicitly convert to WKB
SELECT *, ST_AsBinary(geom) AS geom_wkb FROM my_table
```

```sql
-- DuckDB side: explicitly convert back from WKB
SELECT ST_GeomFromWKB(geom_wkb) AS geom FROM adbc_scan(...)
```

## Proposed behavior
If the driver tagged geometry columns with geoarrow.wkb metadata:
```sql
-- Just works: geometry flows as native type
SELECT * FROM adbc_scan(...)
```

## Implementation sketch
In ipc_reader_adapter.go, after obtaining the Arrow schema from the IPC stream:
- Query Databricks column metadata (from `INFORMATION_SCHEMA.COLUMNS` or the Thrift response) to identify which columns have a `DATA_TYPE` matching `geometry(...)`
- For those columns, modify the Arrow schema field to include `ARROW:extension:name="geoarrow.wkb"` and `ARROW:extension:metadata={"crs": {"type": "authority_code", "value": "OGC:CRS84"}}` (or extract the SRID from the type definition)
- The underlying binary data is already WKB, so no data transformation is needed, just metadata annotation
## Use case
We're benchmarking geospatial data transfer between DuckDB and cloud warehouses (duckdb-warehouse-transfer). The Databricks ADBC export via adbc_scanner already achieves ~24,000 rows/sec (2x faster than the @databricks/sql JSON connector). Adding geoarrow.wkb metadata would eliminate the WKB conversion overhead and enable native geometry passthrough.