## Summary
When PR #247 (staging + COPY INTO bulk ingest) lands, geometry columns sent via adbc_insert with geoarrow.wkb Arrow metadata will need special handling. Databricks doesn't support direct ingestion of geometry types via COPY INTO — the data must arrive as BINARY (WKB) and be converted server-side with ST_GeomFromWKB.
This is the same pattern already implemented in the Snowflake ADBC driver (adbc-drivers/snowflake#99) and proposed for Redshift (adbc-drivers/redshift#3).
## Proposed Solution
When the driver detects geoarrow.wkb or geoarrow.wkt in Arrow field extension metadata during ingest:
- Staging: Create the column as `BINARY` (for WKB) or `STRING` (for WKT) in the staging table
- COPY INTO: Load via Parquet → Volume → COPY INTO (PR #247's path — works fine for BINARY)
- CTAS: Convert to geometry on Databricks:

  ```sql
  CREATE TABLE target AS SELECT *, ST_GeomFromWKB(geom_col) AS geom FROM staging;
  ```

  Or for GEOGRAPHY: `ST_GeogFromWKB(geom_col)`
- Cleanup: Drop the staging table
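The first step above hinges on mapping the Arrow extension name to a staging column type. A minimal sketch of that mapping in Go — `stagingTypeFor` is a hypothetical helper name, not the driver's actual API:

```go
package main

import "fmt"

// stagingTypeFor maps a GeoArrow extension name (read from the field's
// "ARROW:extension:name" metadata) to the column type used in the staging
// table. Illustrative only; the real driver may structure this differently.
func stagingTypeFor(extName string) (string, bool) {
	switch extName {
	case "geoarrow.wkb":
		return "BINARY", true // WKB stages as raw bytes
	case "geoarrow.wkt":
		return "STRING", true // WKT stages as text
	default:
		return "", false // not a geo column; stage with the usual type mapping
	}
}

func main() {
	typ, ok := stagingTypeFor("geoarrow.wkb")
	fmt.Println(typ, ok)
}
```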
## Statement option

`adbc.databricks.statement.ingest_geo_type` = `"geometry"` (default) | `"geography"`

For GEOGRAPHY, the CTAS step uses `ST_GeogFromWKB(geom_col)` instead of `ST_GeomFromWKB`.
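The option boils down to choosing which conversion function the CTAS step emits. A sketch, assuming a hypothetical `conversionExpr` helper:

```go
package main

import "fmt"

// conversionExpr picks the server-side conversion applied in the CTAS step.
// geoType mirrors the proposed statement option
// adbc.databricks.statement.ingest_geo_type ("geometry" | "geography").
// Sketch only; the actual driver wiring may differ.
func conversionExpr(col, geoType string) string {
	if geoType == "geography" {
		return fmt.Sprintf("ST_GeogFromWKB(%s)", col)
	}
	// "geometry" is the proposed default.
	return fmt.Sprintf("ST_GeomFromWKB(%s)", col)
}

func main() {
	fmt.Println(conversionExpr("geom_col", "geometry"))
	fmt.Println(conversionExpr("geom_col", "geography"))
}
```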
## SRID from CRS metadata
The geoarrow.wkb field may carry CRS metadata (PROJJSON or EPSG:NNNN). This connects with PR #350 (geoarrow.wkb export) which already handles CRS on the export side — the import side should mirror that.
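Extracting an SRID from that metadata might look like the following sketch, which handles the two CRS shapes the issue mentions: an authority:code string like `EPSG:4326` and a PROJJSON object carrying an `id` with authority and code. The function name and metadata handling are illustrative; the real implementation should follow the GeoArrow spec and mirror PR #350's export-side handling:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// sridFromExtensionMetadata extracts a numeric SRID from GeoArrow field
// metadata ("ARROW:extension:metadata"). Illustrative sketch only.
func sridFromExtensionMetadata(meta string) (int, bool) {
	var doc struct {
		Crs json.RawMessage `json:"crs"`
	}
	if err := json.Unmarshal([]byte(meta), &doc); err != nil || doc.Crs == nil {
		return 0, false
	}
	// Case 1: "crs" is an authority:code string, e.g. "EPSG:4326".
	var s string
	if json.Unmarshal(doc.Crs, &s) == nil {
		if rest, ok := strings.CutPrefix(s, "EPSG:"); ok {
			if code, err := strconv.Atoi(rest); err == nil {
				return code, true
			}
		}
		return 0, false
	}
	// Case 2: "crs" is a PROJJSON object with an authority identifier.
	var projjson struct {
		ID struct {
			Authority string `json:"authority"`
			Code      int    `json:"code"`
		} `json:"id"`
	}
	if json.Unmarshal(doc.Crs, &projjson) == nil && projjson.ID.Authority == "EPSG" {
		return projjson.ID.Code, true
	}
	return 0, false
}

func main() {
	fmt.Println(sridFromExtensionMetadata(`{"crs":"EPSG:4326"}`))
}
```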
## Prior Art
| Driver | Import PR | Pattern |
|---|---|---|
| Snowflake | #99 (open) | geoarrow.wkb → BINARY via Parquet → PUT → COPY INTO → CTAS TO_GEOGRAPHY |
| Redshift | #3 (proposed) | geoarrow.wkb → VARBYTE via Parquet → S3 → COPY → CTAS ST_GeomFromWKB |
| Databricks | this issue | geoarrow.wkb → BINARY via Parquet → Volume → COPY INTO → CTAS ST_GeomFromWKB |
All three drivers follow the same three-step pattern: staging as binary → bulk load → server-side conversion. The details differ only in the SQL dialect and staging mechanism.
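For Databricks, the shared pattern could be sketched as an ordered statement plan. Table names, the Volume path, and the COPY INTO form below are placeholders, not the driver's actual SQL:

```go
package main

import "fmt"

// ingestPlan sketches the three-step shape for Databricks: stage as
// binary, bulk load from a Volume, convert server-side, then clean up.
// All identifiers and paths are illustrative placeholders.
func ingestPlan(staging, target, geomCol string) []string {
	return []string{
		fmt.Sprintf("CREATE TABLE %s (%s BINARY)", staging, geomCol),
		fmt.Sprintf("COPY INTO %s FROM '/Volumes/tmp/stage' FILEFORMAT = PARQUET", staging),
		fmt.Sprintf("CREATE TABLE %s AS SELECT *, ST_GeomFromWKB(%s) AS geom FROM %s", target, geomCol, staging),
		fmt.Sprintf("DROP TABLE %s", staging),
	}
}

func main() {
	for _, stmt := range ingestPlan("staging", "target", "geom_col") {
		fmt.Println(stmt)
	}
}
```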
## Current Workaround
Users must manually convert geometry to WKB before calling adbc_insert, then run CTAS on Databricks:

```sql
-- In DuckDB
CREATE TABLE _import AS SELECT * EXCLUDE (geom), ST_AsWKB(geom) AS geom_wkb FROM source;
-- adbc_insert sends geom_wkb as BINARY (works with PR #247)
-- Then on Databricks:
CREATE TABLE final AS SELECT *, ST_GeomFromWKB(geom_wkb) AS geom FROM staging;
```

This is what our benchmark scripts do today, and it works at ~15-23K rows/sec. Making it transparent in the driver would enable a unified adbc_insert API for geometry across all warehouses.
## Relationship to other PRs
- #247 — feat(go): implement staging + COPY INTO bulk ingest with `BulkIngestManager` (prerequisite — provides the transport layer)
- #350 — feat: expose Arrow-native geospatial option (`databricks.arrow.native_geospatial`) — `geoarrow.wkb` export (the export-side counterpart)
- databricks/databricks-sql-go#328 — feat: Arrow-native geospatial serialization (`geospatialAsArrow`) — Thrift flag that enables native geo Arrow transport