feat(databricks-jdbc-driver): Add export bucket support for Google Cloud Storage #9407
Conversation
34675f0 to fc5097b
@KSDaemon BTW I also found this method

@qiao-x I think yes, but let's make it in a separate PR.

@KSDaemon Hi, how should I write tests for these changes? I hope we can get this done soon, so we can use it in released versions, since it really helps throughput (we see a 100x increase for large datasets).
```ts
return this.extractFilesFromGCS(
  { credentials: this.config.gcsCredentials },
  url.host,
  objectSearchPrefix + '.csv',
```
Is it important to add .csv here? Are there any other unrelated files that might be captured by objectSearchPrefix that should be excluded?
Having this:

```ts
protected async extractFilesFromGCS(
  gcsConfig: GoogleStorageClientConfig,
  bucketName: string,
  tableName: string
): Promise<string[]> {
  // ...
  const [files] = await bucket.getFiles({ prefix: `${tableName}/` });
```

Does Databricks create a folder for a table with a csv suffix? Like my_exported_table.csv/ with a bunch of real csv files?
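For illustration, the prefix-based listing discussed here can be mimicked on plain object names (a standalone sketch; `filterByPrefix` and the sample names are hypothetical, not part of the driver):

```typescript
// Hypothetical stand-in for bucket.getFiles({ prefix: `${tableName}/` }):
// keep only objects whose name starts with the table "folder" prefix.
function filterByPrefix(objectNames: string[], tableName: string): string[] {
  const prefix = `${tableName}/`;
  return objectNames.filter((name) => name.startsWith(prefix));
}

const objects = [
  'table-name.csv/_SUCCESS',
  'table-name.csv/part-00000-c000.csv',
  'other-table.csv/part-00000-c000.csv',
];

console.log(filterByPrefix(objects, 'table-name.csv'));
// → ['table-name.csv/_SUCCESS', 'table-name.csv/part-00000-c000.csv']
```

Note that a bare prefix match keeps everything under the folder, including marker objects such as `_SUCCESS`, which is what prompts the filtering questions below.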
It looks like this:

```
gs://bucket-name/
└── table-name.csv/
    ├── _SUCCESS
    ├── _committed_4308966877412207793
    ├── _started_4308966877412207793
    ├── part-00000-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140315-1-c000.csv
    ├── part-00001-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140316-1-c000.csv
    ├── part-00002-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140317-1-c000.csv
    └── part-00003-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140318-1-c000.csv
```

It's kind of weird, actually, since the folder name itself has a .csv suffix, like table-name.csv/.
```sql
CREATE TABLE database.schema.table_name
USING CSV
LOCATION 'gs://bucket-name/table-name.csv'
OPTIONS (escape = '"') AS
(
  select ...
)
```

Maybe we don't need the .csv suffix in the LOCATION? But I don't have a Databricks-on-AWS env to test whether it's the same behavior as exporting to GCS.
Yeah, I think you're right. We should remove the .csv suffix from the location.
```ts
decodeURIComponent(new URL(file).pathname).endsWith('.csv') ||
decodeURIComponent(new URL(file).pathname).endsWith('.csv.gz')
```
The answers to the questions above might affect these lines. Are they needed?
Yes, they are needed, since those files will also be returned from extractFilesFromGCS, and I'm not sure whether they would cause any problems on the Cube Store side (I observed those files in the Cube Store logs as well):

```
├── _SUCCESS
├── _committed_4308966877412207793
├── _started_4308966877412207793
```
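The extension check quoted earlier can be sketched as a standalone predicate that drops those marker objects (a sketch; `isCsvPart` and the sample URLs are hypothetical, not from the driver):

```typescript
// Mirrors the .csv / .csv.gz filter from the PR: keep only data part files,
// dropping marker objects such as _SUCCESS, _committed_*, _started_*.
function isCsvPart(fileUrl: string): boolean {
  const pathname = decodeURIComponent(new URL(fileUrl).pathname);
  return pathname.endsWith('.csv') || pathname.endsWith('.csv.gz');
}

const files = [
  'https://storage.googleapis.com/bucket/table-name.csv/_SUCCESS',
  'https://storage.googleapis.com/bucket/table-name.csv/part-00000-c000.csv',
  'https://storage.googleapis.com/bucket/table-name.csv/part-00001-c000.csv.gz',
];

console.log(files.filter(isCsvPart));
// → only the two part-0000*-c000 files; _SUCCESS is dropped
```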
I think this should not be a problem. Anyway, if we want to do some filtering, it should be done in extractFilesFromGCS in BaseDriver, for all drivers.
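A shared filter like that might look as follows (a sketch only; `excludeMarkerObjects` and its patterns are hypothetical, not an existing BaseDriver method):

```typescript
// Hypothetical central filter for a BaseDriver-style helper: drop
// Spark/Databricks marker objects so every driver returns only data files.
function excludeMarkerObjects(objectNames: string[]): string[] {
  const markerPatterns = [/(^|\/)_SUCCESS$/, /(^|\/)_committed_/, /(^|\/)_started_/];
  return objectNames.filter(
    (name) => !markerPatterns.some((re) => re.test(name))
  );
}

const listed = [
  'table-name.csv/_SUCCESS',
  'table-name.csv/_committed_4308966877412207793',
  'table-name.csv/_started_4308966877412207793',
  'table-name.csv/part-00000-c000.csv',
];

console.log(excludeMarkerObjects(listed));
// → ['table-name.csv/part-00000-c000.csv']
```

Doing this once in the shared helper would make the per-driver `.csv`/`.csv.gz` checks unnecessary, which is the point being made here.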
Regarding adding tests: you can have a look at https://github.com/cube-js/cube/pull/8730/files as an example. It's a bit noisy with other things, but I hope it's still clear how to do it. Have a look at the changes in
Yes, I rolled back those changes; I think it's better to include them in another PR. Removed them from the PR description.

@qiao-x I'll pull this PR and update it with CI tests. Let's see how it goes.
Check List
Issue Reference this PR resolves: Closes #9393
Description of Changes Made (if issue reference is not provided): gcs bucket type