feat: GeoArrow support for bulk ingestion (GEOGRAPHY/GEOMETRY) #99

Open
jatorre wants to merge 1 commit into adbc-drivers:main from jatorre:geoarrow-support

jatorre commented Mar 18, 2026

Summary

Adds geospatial column support to the Snowflake ADBC driver's bulk ingestion path. When Arrow columns carry geoarrow.wkb or geoarrow.wkt extension metadata, the driver automatically creates GEOGRAPHY or GEOMETRY columns in Snowflake and converts the data.

  • Detects geoarrow columns from the ARROW:extension:name field metadata, so detection still works when data arrives through the C Data Interface, where Go-level extension types are stripped
  • New statement option adbc.snowflake.statement.ingest_geo_type: "geography" (default, WGS84/4326) or "geometry" (any SRID)
  • Extracts SRID from geoarrow CRS metadata (PROJJSON or "EPSG:NNNN" format) for GEOMETRY columns
  • Unit tests for type mapping and SRID extraction
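The detection and option handling described above can be sketched as follows. This is a minimal illustration, not the driver's code: `Field` is a simplified stand-in for an Arrow schema field, and `isGeoArrow` and `snowflakeGeoType` are hypothetical names. Treating an empty option value as the "geography" default is also an assumption.

```go
package main

import "fmt"

// Field is a simplified stand-in for an Arrow schema field: a name plus
// its key/value metadata. It is NOT the arrow-go type.
type Field struct {
	Name     string
	Metadata map[string]string
}

// extensionNameKey is the metadata key under which Arrow records an
// extension type's name when a schema crosses the C Data Interface.
const extensionNameKey = "ARROW:extension:name"

// isGeoArrow (hypothetical helper) reports whether the field is tagged
// as geoarrow.wkb or geoarrow.wkt. A nil metadata map is safe to read.
func isGeoArrow(f Field) bool {
	switch f.Metadata[extensionNameKey] {
	case "geoarrow.wkb", "geoarrow.wkt":
		return true
	}
	return false
}

// snowflakeGeoType (hypothetical helper) maps the value of the
// ingest_geo_type option to a Snowflake column type. Treating "" as
// "geography" is an assumption about how an unset option is handled.
func snowflakeGeoType(option string) (string, error) {
	switch option {
	case "", "geography":
		return "GEOGRAPHY", nil
	case "geometry":
		return "GEOMETRY", nil
	}
	return "", fmt.Errorf("invalid ingest_geo_type %q", option)
}

func main() {
	fields := []Field{
		{Name: "id"},
		{Name: "geom", Metadata: map[string]string{extensionNameKey: "geoarrow.wkb"}},
	}
	for _, f := range fields {
		fmt.Println(f.Name, isGeoArrow(f))
	}
}
```

Reading the metadata key rather than the Go extension type is what lets the same check work for schemas that arrive with their extension types stripped.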

How it works

  1. Bulk ingest loads data as BINARY via the existing Parquet → PUT → COPY INTO pipeline (unchanged)
  2. After COPY, geoarrow columns are detected and converted via CTAS with TO_GEOGRAPHY/TO_GEOMETRY
  3. For GEOMETRY columns, SRID is applied via ST_SETSRID if present in geoarrow metadata
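To make the post-COPY conversion concrete, here is a rough sketch of how the three statements might be generated. The function name, the `Column` type, the staging-table suffix, and the identifier quoting are all assumptions for illustration; the actual SQL emitted by the driver may differ. Only `TO_GEOGRAPHY`, `TO_GEOMETRY`, and `ST_SETSRID` are taken from the description above.

```go
package main

import (
	"fmt"
	"strings"
)

// Column describes one column of the freshly loaded table (hypothetical type).
type Column struct {
	Name  string
	IsGeo bool // true if the Arrow field was tagged geoarrow.wkb/wkt
	SRID  int  // 0 means no SRID was found in the geoarrow metadata
}

// quoteIdent wraps an identifier in double quotes, doubling embedded quotes.
func quoteIdent(s string) string {
	return `"` + strings.ReplaceAll(s, `"`, `""`) + `"`
}

// conversionStatements (hypothetical helper) returns the rename, CTAS,
// and drop statements. geoType is "GEOGRAPHY" or "GEOMETRY".
func conversionStatements(table, geoType string, cols []Column) []string {
	staging := table + "_ADBC_GEO_STAGING" // assumed naming scheme
	exprs := make([]string, 0, len(cols))
	for _, c := range cols {
		q := quoteIdent(c.Name)
		switch {
		case !c.IsGeo:
			exprs = append(exprs, q) // non-geo columns pass through unchanged
		case geoType == "GEOMETRY" && c.SRID != 0:
			exprs = append(exprs, fmt.Sprintf("ST_SETSRID(TO_GEOMETRY(%s), %d) AS %s", q, c.SRID, q))
		case geoType == "GEOMETRY":
			exprs = append(exprs, fmt.Sprintf("TO_GEOMETRY(%s) AS %s", q, q))
		default:
			exprs = append(exprs, fmt.Sprintf("TO_GEOGRAPHY(%s) AS %s", q, q))
		}
	}
	return []string{
		fmt.Sprintf("ALTER TABLE %s RENAME TO %s", quoteIdent(table), quoteIdent(staging)),
		fmt.Sprintf("CREATE TABLE %s AS SELECT %s FROM %s",
			quoteIdent(table), strings.Join(exprs, ", "), quoteIdent(staging)),
		fmt.Sprintf("DROP TABLE %s", quoteIdent(staging)),
	}
}

func main() {
	cols := []Column{{Name: "id"}, {Name: "geom", IsGeo: true, SRID: 5514}}
	for _, stmt := range conversionStatements("ROADS", "GEOMETRY", cols) {
		fmt.Println(stmt)
	}
}
```

Because the conversion happens in a single `SELECT`, every non-geo column is carried over as-is and only the tagged columns are wrapped in a conversion function.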

Why CTAS instead of direct COPY INTO GEOGRAPHY?

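Snowflake's COPY INTO from Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns; only CSV and JSON/AVRO support direct geospatial loading from stages (see "Loading geospatial data from stages" in the Snowflake geospatial data types docs). The CTAS workaround (rename the loaded table to a staging name, CTAS with conversion back into the original name, drop the staging table) adds minimal overhead at scale.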

A future optimization could use COPY transforms (SELECT ... FROM @stage) to convert inline.

Benchmark results

Tested with Czech Republic OSM Geofabrik data (real-world geometries):

| Dataset   | Rows      | Throughput (rows/sec) | Geometry type |
|-----------|-----------|-----------------------|---------------|
| POIs      | 465,280   | 38,119                | Point         |
| Roads     | 1,885,651 | 56,804                | LineString    |
| Buildings | 5,014,886 | 68,611                | Polygon       |

Export (not in this PR)

Export/read-path geoarrow support is not included. Detecting GEOGRAPHY/GEOMETRY columns on the read path is non-trivial because:

  • With GEOGRAPHY_OUTPUT_FORMAT=EWKB, srcMeta.Type becomes "binary" (type info lost)
  • With default GeoJSON format, srcMeta.Type is "object" (same as VARIANT/OBJECT)

This is related to the broader SRID/CRS propagation discussion across ADBC drivers:

For GEOGRAPHY, the CRS is always EPSG:4326. For GEOMETRY, extracting the SRID requires buffering data, a common challenge that would benefit from a cross-driver solution.

Context

This is part of a broader effort to add GeoArrow support across ADBC drivers. Previously opened as apache/arrow-adbc#4114, moved here per maintainer request.

Test plan

  • Unit tests for toSnowflakeType with geoarrow extension types
  • Unit tests for extractSRIDFromMeta (PROJJSON, simple EPSG string, null, empty, invalid)
  • Existing TestIngestBatchedParquetWithFileLimit still passes
  • End-to-end tested against real Snowflake with points, lines, and polygons
  • Verified GEOGRAPHY column type created in Snowflake via INFORMATION_SCHEMA
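The SRID cases listed in the test plan (PROJJSON, simple EPSG string, null, empty, invalid) can be illustrated with a small sketch. `sridFromGeoArrowMeta` is a hypothetical name and is not the PR's `extractSRIDFromMeta`. The sketch assumes the geoarrow extension metadata is a JSON object with a `crs` member, that a PROJJSON CRS carries an `id` object with `authority` and an integer `code`, and that only EPSG codes are recognized.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// sridFromGeoArrowMeta (hypothetical helper) returns the EPSG code found
// in geoarrow extension metadata, or 0 when none can be determined.
func sridFromGeoArrowMeta(meta string) int {
	if strings.TrimSpace(meta) == "" {
		return 0
	}
	var doc struct {
		CRS json.RawMessage `json:"crs"`
	}
	if err := json.Unmarshal([]byte(meta), &doc); err != nil {
		return 0 // metadata is not valid JSON
	}

	// Case 1: the CRS is a plain string such as "EPSG:4326".
	var s string
	if err := json.Unmarshal(doc.CRS, &s); err == nil && s != "" {
		authority, code, ok := strings.Cut(s, ":")
		if !ok || !strings.EqualFold(authority, "EPSG") {
			return 0
		}
		n, err := strconv.Atoi(code)
		if err != nil || n <= 0 {
			return 0
		}
		return n
	}

	// Case 2: the CRS is a PROJJSON object with an "id" member.
	var pj struct {
		ID struct {
			Authority string `json:"authority"`
			Code      int    `json:"code"`
		} `json:"id"`
	}
	if err := json.Unmarshal(doc.CRS, &pj); err != nil {
		return 0 // missing or malformed crs
	}
	if strings.EqualFold(pj.ID.Authority, "EPSG") && pj.ID.Code > 0 {
		return pj.ID.Code
	}
	return 0
}

func main() {
	cases := []string{
		`{"crs":"EPSG:5514"}`,
		`{"crs":{"id":{"authority":"EPSG","code":4326}}}`,
		`{"crs":null}`,
		``,
		`not json`,
	}
	for _, c := range cases {
		fmt.Printf("%q -> %d\n", c, sridFromGeoArrowMeta(c))
	}
}
```

Returning 0 for every unrecognized input keeps the caller simple: a zero SRID just means no `ST_SETSRID` call is needed.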

🤖 Generated with Claude Code

Detect geoarrow.wkb/geoarrow.wkt columns during adbc_insert and create
GEOGRAPHY or GEOMETRY columns in Snowflake, with automatic WKB→geo
conversion and SRID support.

How it works:
1. Bulk ingest loads data as BINARY via existing Parquet→PUT→COPY INTO
2. After COPY, geoarrow columns are detected from Arrow field metadata
   (ARROW:extension:name) and converted via CTAS with TO_GEOGRAPHY or
   TO_GEOMETRY. SRID is extracted from geoarrow CRS metadata (PROJJSON
   or "EPSG:NNNN") and applied via ST_SETSRID for GEOMETRY columns.

The CTAS post-processing is needed because Snowflake's COPY INTO from
Parquet cannot load WKB directly into GEOGRAPHY/GEOMETRY columns — only
CSV and JSON/AVRO support direct geospatial loading from stages. See:
https://docs.snowflake.com/en/sql-reference/data-types-geospatial#loading-geospatial-data-from-stages

New statement option:
- adbc.snowflake.statement.ingest_geo_type: "geography" (default) or
  "geometry". GEOGRAPHY is WGS84/SRID 4326; GEOMETRY supports any SRID.

Benchmarked with Czech Republic OSM Geofabrik data against Snowflake:
- Points (465K):      38,119 rows/sec
- LineStrings (1.9M): 56,804 rows/sec
- Polygons (5M):      68,611 rows/sec

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>