Add geoarrow.wkb Arrow extension metadata for geometry columns #339

@jatorre

Summary

When Databricks tables contain geometry columns (e.g., geometry(4326)), the ADBC driver returns them as plain binary Arrow arrays without any extension metadata. This means downstream consumers like DuckDB cannot automatically recognize them as geometry.

Proposal: Tag geometry columns with geoarrow.wkb Arrow extension metadata so they flow as native geometry types through the Arrow ecosystem.

Background

GeoArrow defines standard Arrow extension types for geospatial data. The geoarrow.wkb extension type wraps WKB-encoded geometry in a binary Arrow array with two metadata fields:

  • ARROW:extension:name = "geoarrow.wkb"
  • ARROW:extension:metadata = JSON with CRS info (e.g., {"crs": "OGC:CRS84"})

DuckDB 1.5 has built-in GEOMETRY type support and can consume geoarrow.wkb Arrow arrays natively via register_geoarrow_extensions(). The DuckDB Snowflake extension already uses this pattern for geometry passthrough (iqea-ai/duckdb-snowflake#24).

Current workaround

When using the Databricks ADBC driver with DuckDB's adbc_scanner, geometry requires explicit WKB conversion on both sides:

-- Databricks side: explicitly convert to WKB
SELECT *, ST_AsBinary(geom) as geom_wkb FROM my_table

-- DuckDB side: explicitly convert back from WKB
SELECT ST_GeomFromWKB(geom_wkb) as geom FROM adbc_scan(...)

Proposed behavior

If the driver tagged geometry columns with geoarrow.wkb metadata:

-- Just works — geometry flows as native type
SELECT * FROM adbc_scan(...)

Implementation sketch

In ipc_reader_adapter.go, after obtaining the Arrow schema from the IPC stream:

  1. Query Databricks column metadata (from INFORMATION_SCHEMA.COLUMNS or the Thrift response) to identify which columns have DATA_TYPE matching geometry(...)
  2. For those columns, modify the Arrow schema field to include:
    • ARROW:extension:name = "geoarrow.wkb"
    • ARROW:extension:metadata = {"crs": "OGC:CRS84", "crs_type": "authority_code"} (or derive the CRS from the SRID in the type definition)
  3. The underlying binary data is already WKB, so no data transformation is needed — just metadata annotation
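As a rough illustration of step 2, the sketch below derives the two metadata entries from a Databricks type string using only the Go standard library. Everything in it is hypothetical rather than taken from the driver: the function name geoArrowFieldMetadata, the regular expression for geometry(<srid>), the choice to map SRID 4326 to OGC:CRS84 and any other SRID to "EPSG:<srid>", and the treatment of SRID 0 as "no CRS". The JSON uses the GeoArrow crs and crs_type fields.

```go
package main

import (
	"encoding/json"
	"regexp"
	"strings"
)

// Arrow field-metadata keys that mark a field as an extension type.
const (
	extNameKey = "ARROW:extension:name"
	extMetaKey = "ARROW:extension:metadata"
)

// geometryTypeRE matches type strings such as "geometry(4326)".
// Assumption: the SRID is a plain integer inside the parentheses.
var geometryTypeRE = regexp.MustCompile(`(?i)^geometry\(\s*(\d+)\s*\)$`)

// geoArrowMeta is the JSON payload stored under ARROW:extension:metadata.
type geoArrowMeta struct {
	CRS     string `json:"crs,omitempty"`
	CRSType string `json:"crs_type,omitempty"`
}

// geoArrowFieldMetadata returns parallel key/value slices for a column whose
// Databricks DATA_TYPE is dataType. ok is false for non-geometry columns.
func geoArrowFieldMetadata(dataType string) (keys, values []string, ok bool) {
	m := geometryTypeRE.FindStringSubmatch(strings.TrimSpace(dataType))
	if m == nil {
		return nil, nil, false
	}

	var meta geoArrowMeta
	switch srid := m[1]; srid {
	case "0":
		// Assumption: SRID 0 means "unspecified", so no CRS is written.
	case "4326":
		// Follows the issue text, which uses OGC:CRS84 for geometry(4326).
		meta = geoArrowMeta{CRS: "OGC:CRS84", CRSType: "authority_code"}
	default:
		// Assumption: other SRIDs are EPSG codes.
		meta = geoArrowMeta{CRS: "EPSG:" + srid, CRSType: "authority_code"}
	}

	payload, err := json.Marshal(meta)
	if err != nil {
		return nil, nil, false
	}
	return []string{extNameKey, extMetaKey},
		[]string{"geoarrow.wkb", string(payload)}, true
}
```

The parallel key/value slices are shaped so they could be passed to the Arrow Go library's arrow.NewMetadata(keys, values) and attached to the matching schema field. Any metadata already on the field would need to be merged rather than replaced, and how the driver obtains dataType (INFORMATION_SCHEMA.COLUMNS or the Thrift response) is left open here.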

Use case

We're benchmarking geospatial data transfer between DuckDB and cloud warehouses (duckdb-warehouse-transfer). The Databricks ADBC export via adbc_scanner already achieves ~24,000 rows/sec (2x faster than the @databricks/sql JSON connector). Adding geoarrow.wkb metadata would eliminate the WKB conversion overhead and enable native geometry passthrough.
