feat: GeoArrow-aware bulk ingest — WKB staging + CTAS for geometry columns #361

@jatorre

Summary

When PR #247 (staging + COPY INTO bulk ingest) lands, geometry columns sent via adbc_insert with geoarrow.wkb Arrow metadata will need special handling. Databricks doesn't support direct ingestion of geometry types via COPY INTO — the data must arrive as BINARY (WKB) and be converted server-side with ST_GeomFromWKB.

This is the same pattern already implemented in the Snowflake ADBC driver (adbc-drivers/snowflake#99) and proposed for Redshift (adbc-drivers/redshift#3).

Proposed Solution

When the driver detects geoarrow.wkb or geoarrow.wkt in Arrow field extension metadata during ingest:

  1. Staging: Create the column as BINARY (for WKB) or STRING (for WKT) in the staging table
  2. COPY INTO: Load via Parquet → Volume → COPY INTO (PR #247's path, which already handles BINARY columns)
  3. CTAS: Convert to geometry on Databricks, replacing the staged binary column:
    CREATE TABLE target AS
    SELECT * EXCEPT (geom_col), ST_GeomFromWKB(geom_col) AS geom_col
    FROM staging;
    Or for GEOGRAPHY: ST_GeogFromWKB(geom_col)
  4. Cleanup: Drop the staging table
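The staging-DDL and CTAS steps above can be sketched as SQL generation. This is a minimal illustration, not the driver's implementation: the `staging_ddl`/`ctas_sql` helpers and the table/column names are hypothetical, and the `* EXCEPT (...)` form assumes Databricks SQL's star-except syntax.

```python
# Sketch of steps 1 and 3 of the proposed ingest path.
# Helper names and identifiers are illustrative only.

GEO_EXTENSIONS = {
    "geoarrow.wkb": "BINARY",   # WKB bytes stage as BINARY
    "geoarrow.wkt": "STRING",   # WKT text stages as STRING
}

def staging_ddl(table, columns):
    """columns: list of (name, sql_type, arrow_extension_name_or_None)."""
    defs = []
    geo_cols = []
    for name, sql_type, ext in columns:
        if ext in GEO_EXTENSIONS:
            defs.append(f"{name} {GEO_EXTENSIONS[ext]}")
            geo_cols.append(name)
        else:
            defs.append(f"{name} {sql_type}")
    return f"CREATE TABLE {table} ({', '.join(defs)})", geo_cols

def ctas_sql(target, staging, geo_cols):
    """Step 3: convert staged WKB to geometry server-side, dropping
    the raw binary columns via Databricks' SELECT * EXCEPT."""
    exprs = ", ".join(f"ST_GeomFromWKB({c}) AS {c}" for c in geo_cols)
    select = f"* EXCEPT ({', '.join(geo_cols)}), {exprs}"
    return f"CREATE TABLE {target} AS SELECT {select} FROM {staging}"

cols = [("id", "BIGINT", None), ("geom", None, "geoarrow.wkb")]
ddl, geo = staging_ddl("_staging", cols)
print(ddl)  # CREATE TABLE _staging (id BIGINT, geom BINARY)
print(ctas_sql("target", "_staging", geo))
```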

Statement option

adbc.databricks.statement.ingest_geo_type = "geometry" (default) | "geography"

For GEOGRAPHY: ST_GeogFromWKB(geom_col) instead of ST_GeomFromWKB.
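A minimal sketch of how the option could select the conversion function. The option key is the one proposed above; the `conversion_expr` helper and its error handling are illustrative, not driver API.

```python
# Map the proposed statement option to the server-side converter.
CONVERTERS = {
    "geometry": "ST_GeomFromWKB",    # default
    "geography": "ST_GeogFromWKB",
}

def conversion_expr(col, options):
    """Build the CTAS select expression for one staged geo column."""
    geo_type = options.get(
        "adbc.databricks.statement.ingest_geo_type", "geometry")
    try:
        fn = CONVERTERS[geo_type]
    except KeyError:
        # Reject unknown values rather than emitting broken SQL.
        raise ValueError(f"invalid ingest_geo_type: {geo_type!r}")
    return f"{fn}({col}) AS {col}"

print(conversion_expr("geom", {}))  # ST_GeomFromWKB(geom) AS geom
print(conversion_expr(
    "geom", {"adbc.databricks.statement.ingest_geo_type": "geography"}))
```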

SRID from CRS metadata

The geoarrow.wkb field may carry CRS metadata (PROJJSON or EPSG:NNNN). This connects with PR #350 (geoarrow.wkb export) which already handles CRS on the export side — the import side should mirror that.
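A sketch of SRID extraction covering the two CRS shapes named above. The metadata layout assumed here (a JSON object with a `crs` key holding either an "EPSG:NNNN" string or a PROJJSON object whose `id` carries authority and code) follows the GeoArrow spec as I understand it; treat it as an assumption, and the helper as illustrative.

```python
import json

def srid_from_geoarrow_metadata(ext_metadata_json):
    """Extract an EPSG code from geoarrow.wkb extension metadata,
    or return None when no EPSG identifier is present."""
    meta = json.loads(ext_metadata_json) if ext_metadata_json else {}
    crs = meta.get("crs")
    # Shape 1: authority string, e.g. "EPSG:4326"
    if isinstance(crs, str) and crs.upper().startswith("EPSG:"):
        return int(crs.split(":", 1)[1])
    # Shape 2: PROJJSON object with an "id" member
    if isinstance(crs, dict):
        ident = crs.get("id", {})
        if ident.get("authority") == "EPSG":
            return int(ident["code"])
    return None  # unknown or missing CRS: leave SRID unset

print(srid_from_geoarrow_metadata('{"crs": "EPSG:4326"}'))  # 4326
projjson = '{"crs": {"type": "GeographicCRS", "id": {"authority": "EPSG", "code": 4326}}}'
print(srid_from_geoarrow_metadata(projjson))  # 4326
```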

Prior Art

| Driver | Import PR | Pattern |
| --- | --- | --- |
| Snowflake | #99 (open) | geoarrow.wkb → BINARY via Parquet → PUT → COPY INTO → CTAS TO_GEOGRAPHY |
| Redshift | #3 (proposed) | geoarrow.wkb → VARBYTE via Parquet → S3 → COPY → CTAS ST_GeomFromWKB |
| Databricks | this issue | geoarrow.wkb → BINARY via Parquet → Volume → COPY INTO → CTAS ST_GeomFromWKB |

All three drivers follow the same three-step pattern: staging as binary → bulk load → server-side conversion. The details differ only in the SQL dialect and staging mechanism.

Current Workaround

Users must manually convert geometry to WKB before calling adbc_insert, then run CTAS on Databricks:

-- In DuckDB (EXCLUDE must immediately follow the star):
CREATE TABLE _import AS SELECT * EXCLUDE (geom), ST_AsWKB(geom) AS geom_wkb FROM source;
-- adbc_insert sends geom_wkb as BINARY (works with PR #247)
-- Then on Databricks:
CREATE TABLE final AS SELECT * EXCEPT (geom_wkb), ST_GeomFromWKB(geom_wkb) AS geom FROM staging;

This is what our benchmark scripts do today and it works at ~15-23K rows/sec. Making it transparent in the driver would enable a unified adbc_insert API for geometry across all warehouses.
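To make this transparent, the driver would first detect the extension metadata on each Arrow field. A stdlib-only sketch, using plain dicts as a stand-in for an Arrow schema: the `ARROW:extension:name` key is the standard Arrow extension-type metadata convention, while the `detect_geo_fields` helper is hypothetical.

```python
# The metadata key the driver would inspect during ingest.
ARROW_EXT_NAME = b"ARROW:extension:name"

def detect_geo_fields(schema):
    """schema: list of (field_name, metadata_dict_or_None) pairs.
    Returns the fields tagged with a GeoArrow extension name."""
    geo = []
    for name, md in schema:
        ext = (md or {}).get(ARROW_EXT_NAME, b"").decode()
        if ext in ("geoarrow.wkb", "geoarrow.wkt"):
            geo.append((name, ext))
    return geo

schema = [
    ("id", None),
    ("geom", {ARROW_EXT_NAME: b"geoarrow.wkb"}),
]
print(detect_geo_fields(schema))  # [('geom', 'geoarrow.wkb')]
```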

Relationship to other PRs

Builds on PR #247 (staging + COPY INTO bulk ingest) and mirrors the CRS handling in PR #350 (geoarrow.wkb export); same pattern as Snowflake adbc-drivers/snowflake#99 and Redshift adbc-drivers/redshift#3.