
feat(go): implement staging + COPY INTO bulk ingest with BulkIngestManager#247

Open
krinart wants to merge 3 commits into adbc-drivers:main from spiceai:viktor/improve-databricks-bulk-insert

Conversation


@krinart krinart commented Feb 23, 2026

What's Changed

Replaces row-by-row INSERT-based bulk ingest with the Staging + COPY INTO pattern using driverbase.BulkIngestManager.

Changes:

  • Implement driverbase.BulkIngestImpl interface
  • Add staging_client.go for Databricks Files API HTTP operations
  • Add databricks.staging.volume_path and databricks.staging.prefix database options
  • Parse URI to extract hostname/token for staging client when using URI mode
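The staging + COPY INTO flow uploads Arrow data as Parquet files into a Unity Catalog volume, then issues one COPY INTO statement to load them. A minimal sketch of the two path/SQL-building steps, assuming the `/Volumes/<catalog>/<schema>/<volume>/...` path convention and hypothetical helper names (`stagingPath`, `copyIntoSQL` are illustrative, not the PR's actual functions):

```go
package main

import (
	"fmt"
	"strings"
)

// stagingPath builds the volume path for an uploaded Parquet chunk.
// volumePath is the three-part "catalog.schema.volume" name from the
// databricks.staging.volume_path option; prefix comes from
// databricks.staging.prefix.
func stagingPath(volumePath, prefix, fileName string) string {
	parts := strings.SplitN(volumePath, ".", 3)
	if len(parts) != 3 {
		return "" // malformed option; real code would return an error
	}
	return fmt.Sprintf("/Volumes/%s/%s/%s/%s/%s",
		parts[0], parts[1], parts[2], prefix, fileName)
}

// copyIntoSQL builds the COPY INTO statement that loads all staged
// Parquet files into the target table in a single bulk operation.
func copyIntoSQL(table, stagedDir string) string {
	return fmt.Sprintf("COPY INTO %s FROM '%s' FILEFORMAT = PARQUET",
		table, stagedDir)
}

func main() {
	dir := "/Volumes/main/adbc_testing/staging/ingest-123"
	fmt.Println(stagingPath("main.adbc_testing.staging", "ingest-123", "chunk-0.parquet"))
	fmt.Println(copyIntoSQL("main.adbc_testing.target", dir))
}
```

This replaces N row-by-row INSERTs with one upload per chunk plus one COPY INTO, which is where the speedup below comes from.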

Comment on lines +221 to +223
// The manager's Close() releases the bound stream, so nil it out
// to prevent double-release in statement.Close()
s.boundStream = nil
Contributor

This should be moved before any early returns, right after the defer above.
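The concern is ownership transfer: once the manager's Close() releases the stream, the statement must not also release it, even on an error path. A minimal sketch of the ordering the review asks for, with illustrative types (`stream`, `statement` are not the driver's actual names):

```go
package main

import "fmt"

type stream struct{ name string }

func (s *stream) Release() { fmt.Println("released", s.name) }

type statement struct{ boundStream *stream }

// executeBulkIngest hands the bound stream to an owner whose cleanup
// releases it, and nils out the statement's reference immediately —
// before any early return can leave a stale pointer for
// statement.Close() to double-release.
func (st *statement) executeBulkIngest(failEarly bool) error {
	owned := st.boundStream
	st.boundStream = nil // clear before any early returns
	defer owned.Release()

	if failEarly {
		// Early return: the deferred Release still runs exactly once,
		// and statement.Close() now sees a nil boundStream.
		return fmt.Errorf("upload failed")
	}
	return nil
}

func main() {
	st := &statement{boundStream: &stream{name: "s1"}}
	_ = st.executeBulkIngest(true)
	fmt.Println("boundStream nil:", st.boundStream == nil)
}
```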

@lidavidm
Contributor

Tests in progress: https://github.com/adbc-drivers/databricks/actions/runs/22333980000


Author

krinart commented Feb 24, 2026

Tests failed because bulk ingestion now has a required parameter: databricks.staging.volume_path

I think it needs to be set in the integration test setup, or given a default value.

Any suggestions?

@lidavidm
Contributor

It can be configured here:

setup = model.DriverSetup(
    database={
        "databricks.server_hostname": model.FromEnv("DATABRICKS_HOST"),
        # "databricks.access_token": model.FromEnv("DATABRICKS_ACCESSTOKEN"),
        "databricks.oauth.client_id": model.FromEnv("DATABRICKS_OAUTH_CLIENT_ID"),
        "databricks.oauth.client_secret": model.FromEnv(
            "DATABRICKS_OAUTH_CLIENT_SECRET"
        ),
        "databricks.http_path": model.FromEnv("DATABRICKS_HTTPPATH"),
        "databricks.catalog": "main",
        "databricks.schema": "adbc_testing",
    },
    connection={},
    statement={},
)

If we need server-side configuration, we can ask Databricks for help.

@lidavidm
Contributor

lidavidm commented Mar 4, 2026

Sorry, this slipped my attention...I will talk with Databricks when I get a chance to see if we can configure this in CI, and then get this merged

@lidavidm
Contributor

lidavidm commented Mar 5, 2026

Ok, let's see what they say!

@jatorre

jatorre commented Mar 15, 2026

Tested PR #247 — bulk ingest benchmarks

Built the branch locally and tested against Databricks SQL Warehouse (serverless). The staging + COPY INTO approach works great and is a massive improvement over the current row-by-row INSERT.

Setup

  • macOS arm64, DuckDB 1.5 adbc_scanner extension, adbc_insert() function
  • Databricks SQL Warehouse (serverless), Unity Catalog Managed Volume
  • databricks.staging.volume_path = catalog.schema.volume

Results

Synthetic data (2 columns: INT + VARCHAR):

| Rows | Old driver (row-by-row) | PR #247 (staging + COPY INTO) | Speedup |
|---|---|---|---|
| 100 | 134.65s (0.7 rows/sec) | 12.28s (8 rows/sec) | 11x |
| 1,000 | hung (>10 min) | 11.69s (86 rows/sec) | - |
| 10,000 | n/a | 7.26s (1,378 rows/sec) | - |
| 100,000 | n/a | 7.81s (12,804 rows/sec) | - |

Real geospatial data (multi-column with WKB BLOB geometry):

| Dataset | Rows | PR #247 insert time | Insert rate | Total (incl. geo conversion) |
|---|---|---|---|---|
| airports (points, ~40 cols) | 893 | 7.85s | 114 rows/sec | - |
| addresses (points, 6 cols) | 146,787 | 9.45s | 15,528 rows/sec | 10,729 rows/sec |
| albania buildings (polygons) | 425,300 | 18.10s | 23,493 rows/sec | 18,644 rows/sec |

The fixed overhead is ~5-7s (Parquet serialize + Volume upload + COPY INTO), so throughput scales well with dataset size.
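A fixed overhead plus a marginal per-row cost means effective throughput converges to the steady-state COPY INTO rate as datasets grow. A quick model of that claim — the constants are rough fits to the numbers above, not measured parameters:

```go
package main

import "fmt"

// effectiveRate models rows/sec for one bulk ingest with a fixed
// per-operation overhead (Parquet serialize + Volume upload + COPY INTO
// startup) plus a marginal per-row cost. Constants are assumptions.
func effectiveRate(rows float64) float64 {
	const fixedOverheadSec = 6.0       // ~5-7s observed above
	const marginalRowsPerSec = 25000.0 // assumed steady-state rate
	total := fixedOverheadSec + rows/marginalRowsPerSec
	return rows / total
}

func main() {
	// Small batches are dominated by the fixed overhead; large batches
	// approach marginalRowsPerSec, matching the trend in the tables.
	for _, n := range []float64{100, 10000, 1000000} {
		fmt.Printf("%8.0f rows -> %7.0f rows/sec\n", n, effectiveRate(n))
	}
}
```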

Notes

  • Volume must exist before first use — got 404 NOT_FOUND until I created it via the Unity Catalog REST API
  • The databricks.staging.volume_path option works well as a three-part name (catalog.schema.volume)
  • BLOB columns (WKB geometry) work fine through the Parquet staging path
  • Catalog names with hyphens (e.g. carto-dev-data) work correctly in the Files API URL path

Happy to help with CI configuration for testing — creating a Managed Volume is a single REST API call.
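For reference, that single call is POST /api/2.1/unity-catalog/volumes. A sketch that only builds the request (hostname and token are placeholders; sending it is left to the caller):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newCreateVolumeRequest builds the Unity Catalog REST call that creates
// a Managed Volume. Run the returned request against a real workspace
// with http.DefaultClient.Do(req) to execute it.
func newCreateVolumeRequest(host, token, catalog, schema, volume string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{
		"catalog_name": catalog,
		"schema_name":  schema,
		"name":         volume,
		"volume_type":  "MANAGED",
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		"https://"+host+"/api/2.1/unity-catalog/volumes", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateVolumeRequest("example.cloud.databricks.com", "TOKEN",
		"main", "adbc_testing", "staging")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```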

@lidavidm
Contributor

(Databricks maintains the CI config here - I've asked about creating the volume so we can test this)
