
feat(go): implement staging + COPY INTO bulk ingest with BulkIngestManager#247

Open
krinart wants to merge 3 commits into adbc-drivers:main from spiceai:viktor/improve-databricks-bulk-insert

Conversation


@krinart krinart commented Feb 23, 2026

What's Changed

Replaces row-by-row INSERT-based bulk ingest with the Staging + COPY INTO pattern using driverbase.BulkIngestManager.

Changes:

  • Implement driverbase.BulkIngestImpl interface
  • Add staging_client.go for Databricks Files API HTTP operations
  • Add databricks.staging.volume_path and databricks.staging.prefix database options
  • Parse URI to extract hostname/token for staging client when using URI mode
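The staging + COPY INTO flow uploads Arrow data as Parquet files into a Unity Catalog volume, then issues one COPY INTO statement to load them. A minimal sketch of the two path/SQL-building steps, assuming the `/Volumes/<catalog>/<schema>/<volume>/...` path convention and hypothetical helper names (`stagingPath`, `copyIntoSQL` are illustrative, not the PR's actual functions):

```go
package main

import (
	"fmt"
	"strings"
)

// stagingPath builds the volume path for an uploaded Parquet chunk.
// volumePath is the three-part "catalog.schema.volume" name from the
// databricks.staging.volume_path option; prefix comes from
// databricks.staging.prefix.
func stagingPath(volumePath, prefix, fileName string) string {
	parts := strings.SplitN(volumePath, ".", 3)
	if len(parts) != 3 {
		return "" // malformed option; real code would return an error
	}
	return fmt.Sprintf("/Volumes/%s/%s/%s/%s/%s",
		parts[0], parts[1], parts[2], prefix, fileName)
}

// copyIntoSQL builds the COPY INTO statement that loads all staged
// Parquet files into the target table in a single bulk operation.
func copyIntoSQL(table, stagedDir string) string {
	return fmt.Sprintf("COPY INTO %s FROM '%s' FILEFORMAT = PARQUET",
		table, stagedDir)
}

func main() {
	dir := "/Volumes/main/adbc_testing/staging/ingest-123"
	fmt.Println(stagingPath("main.adbc_testing.staging", "ingest-123", "chunk-0.parquet"))
	fmt.Println(copyIntoSQL("main.adbc_testing.target", dir))
}
```

This replaces N row-by-row INSERTs with one upload per chunk plus one COPY INTO, which is where the speedup below comes from.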

Comment on lines +221 to +223
// The manager's Close() releases the bound stream, so nil it out
// to prevent double-release in statement.Close()
s.boundStream = nil
Contributor

This should be moved before any early returns, right after the defer above.
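The concern is ownership transfer: once the manager's Close() releases the stream, the statement must not also release it, even on an error path. A minimal sketch of the ordering the review asks for, with illustrative types (`stream`, `statement` are not the driver's actual names):

```go
package main

import "fmt"

type stream struct{ name string }

func (s *stream) Release() { fmt.Println("released", s.name) }

type statement struct{ boundStream *stream }

// executeBulkIngest hands the bound stream to an owner whose cleanup
// releases it, and nils out the statement's reference immediately —
// before any early return can leave a stale pointer for
// statement.Close() to double-release.
func (st *statement) executeBulkIngest(failEarly bool) error {
	owned := st.boundStream
	st.boundStream = nil // clear before any early returns
	defer owned.Release()

	if failEarly {
		// Early return: the deferred Release still runs exactly once,
		// and statement.Close() now sees a nil boundStream.
		return fmt.Errorf("upload failed")
	}
	return nil
}

func main() {
	st := &statement{boundStream: &stream{name: "s1"}}
	_ = st.executeBulkIngest(true)
	fmt.Println("boundStream nil:", st.boundStream == nil)
}
```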

@lidavidm
Contributor

Tests in progress: https://github.com/adbc-drivers/databricks/actions/runs/22333980000


Author

krinart commented Feb 24, 2026

Tests failed because bulk ingestion now has a required parameter: databricks.staging.volume_path

I think it needs to be set in the integration test setup, or given a default value.

Any suggestions?

@lidavidm
Contributor

It can be configured here:

setup = model.DriverSetup(
    database={
        "databricks.server_hostname": model.FromEnv("DATABRICKS_HOST"),
        # "databricks.access_token": model.FromEnv("DATABRICKS_ACCESSTOKEN"),
        "databricks.oauth.client_id": model.FromEnv("DATABRICKS_OAUTH_CLIENT_ID"),
        "databricks.oauth.client_secret": model.FromEnv(
            "DATABRICKS_OAUTH_CLIENT_SECRET"
        ),
        "databricks.http_path": model.FromEnv("DATABRICKS_HTTPPATH"),
        "databricks.catalog": "main",
        "databricks.schema": "adbc_testing",
    },
    connection={},
    statement={},
)

If we need server-side configuration, we can ask Databricks for help.

@lidavidm
Contributor

lidavidm commented Mar 4, 2026

Sorry, this slipped my attention...I will talk with Databricks when I get a chance to see if we can configure this in CI, and then get this merged

@lidavidm
Contributor

lidavidm commented Mar 5, 2026

Ok, let's see what they say!

@jatorre

jatorre commented Mar 15, 2026

Tested PR #247 — bulk ingest benchmarks

Built the branch locally and tested against Databricks SQL Warehouse (serverless). The staging + COPY INTO approach works great and is a massive improvement over the current row-by-row INSERT.

Setup

  • macOS arm64, DuckDB 1.5 adbc_scanner extension, adbc_insert() function
  • Databricks SQL Warehouse (serverless), Unity Catalog Managed Volume
  • databricks.staging.volume_path = catalog.schema.volume

Results

Synthetic data (2 columns: INT + VARCHAR):

| Rows | Old driver (row-by-row) | PR #247 (staging + COPY INTO) | Speedup |
|---|---|---|---|
| 100 | 134.65s (0.7 rows/sec) | 12.28s (8 rows/sec) | 11x |
| 1,000 | hung (>10 min) | 11.69s (86 rows/sec) | - |
| 10,000 | n/a | 7.26s (1,378 rows/sec) | - |
| 100,000 | n/a | 7.81s (12,804 rows/sec) | - |

Real geospatial data (multi-column with WKB BLOB geometry):

| Dataset | Rows | PR #247 insert time | Insert rate | Total (incl. geo conversion) |
|---|---|---|---|---|
| airports (points, ~40 cols) | 893 | 7.85s | 114 rows/sec | - |
| addresses (points, 6 cols) | 146,787 | 9.45s | 15,528 rows/sec | 10,729 rows/sec |
| albania buildings (polygons) | 425,300 | 18.10s | 23,493 rows/sec | 18,644 rows/sec |

The fixed overhead is ~5-7s (Parquet serialize + Volume upload + COPY INTO), so throughput scales well with dataset size.
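A fixed overhead plus a marginal per-row cost means effective throughput converges to the steady-state COPY INTO rate as datasets grow. A quick model of that claim — the constants are rough fits to the numbers above, not measured parameters:

```go
package main

import "fmt"

// effectiveRate models rows/sec for one bulk ingest with a fixed
// per-operation overhead (Parquet serialize + Volume upload + COPY INTO
// startup) plus a marginal per-row cost. Constants are assumptions.
func effectiveRate(rows float64) float64 {
	const fixedOverheadSec = 6.0       // ~5-7s observed above
	const marginalRowsPerSec = 25000.0 // assumed steady-state rate
	total := fixedOverheadSec + rows/marginalRowsPerSec
	return rows / total
}

func main() {
	// Small batches are dominated by the fixed overhead; large batches
	// approach marginalRowsPerSec, matching the trend in the tables.
	for _, n := range []float64{100, 10000, 1000000} {
		fmt.Printf("%8.0f rows -> %7.0f rows/sec\n", n, effectiveRate(n))
	}
}
```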

Notes

  • Volume must exist before first use — got 404 NOT_FOUND until I created it via the Unity Catalog REST API
  • The databricks.staging.volume_path option works well as a three-part name (catalog.schema.volume)
  • BLOB columns (WKB geometry) work fine through the Parquet staging path
  • Catalog names with hyphens (e.g. carto-dev-data) work correctly in the Files API URL path

Happy to help with CI configuration for testing — creating a Managed Volume is a single REST API call.
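For reference, that single call is POST /api/2.1/unity-catalog/volumes. A sketch that only builds the request (hostname and token are placeholders; sending it is left to the caller):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newCreateVolumeRequest builds the Unity Catalog REST call that creates
// a Managed Volume. Run the returned request against a real workspace
// with http.DefaultClient.Do(req) to execute it.
func newCreateVolumeRequest(host, token, catalog, schema, volume string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{
		"catalog_name": catalog,
		"schema_name":  schema,
		"name":         volume,
		"volume_type":  "MANAGED",
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost,
		"https://"+host+"/api/2.1/unity-catalog/volumes", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := newCreateVolumeRequest("example.cloud.databricks.com", "TOKEN",
		"main", "adbc_testing", "staging")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```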

@lidavidm
Contributor

(Databricks maintains the CI config here - I've asked about creating the volume so we can test this)
