feat(go): implement staging + COPY INTO bulk ingest with BulkIngestManager (#247)

krinart wants to merge 3 commits into adbc-drivers:main

Conversation
```go
// The manager's Close() releases the bound stream, so nil it out
// to prevent double-release in statement.Close()
s.boundStream = nil
```
This should be placed before any early returns, i.e. right after the defer above.
Tests in progress: https://github.com/adbc-drivers/databricks/actions/runs/22333980000
❌ Test failed: spiceai/adbc-databricks@8d056d1
Tests failed because bulk ingestion now has a required parameter. I think it either needs to be set in the integration-test setup or given a default value. Any suggestions?
It can be configured here: databricks/go/validation/tests/databricks.py, lines 50 to 64 (at 3fdc9f1). If we need server-side configuration we can ask Databricks for help.
Sorry, this slipped my attention... I will talk with Databricks when I get a chance to see if we can configure this in CI, and then get this merged.
Ok, let's see what they say!
Tested PR #247 — bulk ingest benchmarks

Built the branch locally and tested against a Databricks SQL Warehouse (serverless). The staging + COPY INTO approach works great and is a massive improvement over the current row-by-row INSERT.

Setup
Results

Synthetic data (2 columns: INT + VARCHAR):
Real geospatial data (multi-column with WKB BLOB geometry):
The fixed overhead is ~5–7 s (Parquet serialization + Volume upload + COPY INTO), so throughput scales well with dataset size.

Notes
Happy to help with CI configuration for testing — creating a Managed Volume is a single REST API call.
(Databricks maintains the CI config here - I've asked about creating the volume so we can test this)
What's Changed
Replaces row-by-row INSERT-based bulk ingest with the Staging + COPY INTO pattern using `driverbase.BulkIngestManager`.

Changes:

- `driverbase.BulkIngestImpl` interface implementation
- `staging_client.go` for Databricks Files API HTTP operations
- `databricks.staging.volume_path` and `databricks.staging.prefix` database options