
Conversation

@qiao-x (Contributor) commented Mar 31, 2025

Check List

  • Tests have been run in packages where changes made if available
  • Linter has been run for changed code
  • Tests for the changes have been added if not covered yet
  • Docs have been added / updated if required

Issue Reference this PR resolves

Closes #9393

Description of Changes Made (if issue reference is not provided)

  • Add GCS export bucket type

@qiao-x qiao-x requested a review from a team as a code owner March 31, 2025 06:08
@github-actions bot added the pr:community label (Contribution from Cube.js community members) on Mar 31, 2025
@qiao-x (Contributor, Author) commented Mar 31, 2025

#9393

@qiao-x qiao-x force-pushed the feature/gcs-export-bucket branch from 34675f0 to fc5097b on March 31, 2025 06:33
@KSDaemon KSDaemon self-assigned this Mar 31, 2025
@KSDaemon KSDaemon changed the title from "Databricks export bucket for google cloud storage" to "feat(databricks-jdbc-driver): Add export bucket support for Google Cloud Storage" on Mar 31, 2025
@qiao-x (Contributor, Author) commented Apr 3, 2025

@KSDaemon BTW, I also found that dropTable(tableName: string, options?: QueryOptions) doesn't handle table name quoting either. Should we unify the way a full table name is built in the Databricks driver? (Maybe in a separate PR.)
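
For reference, a unified helper could look roughly like this (a hedged sketch only, assuming Databricks/Spark SQL backtick quoting; the function name is illustrative, not an existing driver method):

// Illustrative sketch: quote each part of a dotted table name with backticks,
// escaping embedded backticks, so 'db.schema.my table' becomes `db`.`schema`.`my table`.
function quotedFullTableName(tableName: string): string {
  return tableName
    .split('.')
    .map((part) => `\`${part.replace(/`/g, '``')}\``)
    .join('.');
}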

@KSDaemon (Member) commented Apr 3, 2025

@qiao-x I think yes, but let's do that in a separate PR.

@qiao-x (Contributor, Author) commented Apr 7, 2025

@KSDaemon Hi, how should I write tests for these changes? I hope we can get this done soon so we can use it in released versions, since it's very helpful for throughput (we see a 100x improvement for large datasets).

return this.extractFilesFromGCS(
{ credentials: this.config.gcsCredentials },
url.host,
objectSearchPrefix+".csv",
Member:

Is it important to add .csv here? Are there any other unrelated files that might be captured by objectSearchPrefix that should be excluded?

Member:

Having this:

  protected async extractFilesFromGCS(
    gcsConfig: GoogleStorageClientConfig,
    bucketName: string,
    tableName: string
  ): Promise<string[]> {
    // ...
    const [files] = await bucket.getFiles({ prefix: `${tableName}/` });

Does Databricks create a folder for a table with a .csv suffix? Like my_exported_table.csv/ with a bunch of real CSV files inside?

Contributor Author:

It looks like this:

gs://bucket-name/
└── table-name.csv/
    ├── _SUCCESS
    ├── _committed_4308966877412207793
    ├── _started_4308966877412207793
    ├── part-00000-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140315-1-c000.csv
    ├── part-00001-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140316-1-c000.csv
    ├── part-00002-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140317-1-c000.csv
    └── part-00003-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140318-1-c000.csv

It's a bit odd, actually, since the folder name itself carries the .csv suffix, like table-name.csv/
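
For illustration, a minimal sketch of listing that layout with @google-cloud/storage (the bucket and table names are the placeholders from the listing above, and credentials are assumed to come from the environment):

import { Storage } from '@google-cloud/storage';

async function listDatabricksGcsExport(): Promise<string[]> {
  const storage = new Storage();
  const bucket = storage.bucket('bucket-name');

  // Databricks writes the export into a "folder" named `<table>.csv/`,
  // which is why the search prefix gets the `.csv` suffix appended.
  const [files] = await bucket.getFiles({ prefix: 'table-name.csv/' });

  // Besides the part-*.csv data files, the listing also returns metadata
  // objects such as _SUCCESS, _committed_* and _started_*.
  return files.map((f) => f.name);
}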

Contributor Author:

CREATE TABLE database.schema.table_name
USING CSV
LOCATION 'gs://bucket-name/table-name.csv'
OPTIONS (escape = '"') AS
(
 select ...
)

Maybe we don't need the .csv suffix in the LOCATION? But I don't have a Databricks-on-AWS environment to test whether it behaves the same as exporting to GCS.

Member:

Yeah, I think you're right. We should remove the .csv postfix from the location.

Comment on lines +751 to +752
decodeURIComponent(new URL(file).pathname).endsWith('.csv') ||
decodeURIComponent(new URL(file).pathname).endsWith('.csv.gz')
Member:

The answers to the questions above might affect these lines. Are they needed?

@qiao-x (Contributor, Author) commented Apr 8, 2025

Yes, they are needed, since those files will also be returned from extractFilesFromGCS, and I'm not sure whether they would cause any problems on the Cube Store side (I observed those files in the Cube Store logs as well):

├── _SUCCESS
├── _committed_4308966877412207793
└── _started_4308966877412207793

Member:

I think this should not be a problem. Anyway, if we want to do some filtering, it should be done in extractFilesFromGCS in BaseDriver so it applies to all drivers.
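
A minimal sketch of what that shared filtering could look like, assuming the file list is a set of URLs as in the diff above (the function name is illustrative, not an existing BaseDriver method):

// Illustrative helper, not the actual BaseDriver code: keep only data files
// and drop Databricks metadata objects (_SUCCESS, _committed_*, _started_*).
function filterCsvFileUrls(fileUrls: string[]): string[] {
  return fileUrls.filter((file) => {
    const path = decodeURIComponent(new URL(file).pathname);
    return path.endsWith('.csv') || path.endsWith('.csv.gz');
  });
}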

@KSDaemon (Member) commented Apr 7, 2025

And I think the conversion to a quoted tableFullName was not done in this PR :)

@KSDaemon (Member) commented Apr 7, 2025

Regarding adding tests: you can have a look at https://github.com/cube-js/cube/pull/8730/files as an example. It's a bit noisy with other changes, but I hope it still shows how to do it. Have a look at the changes in the cubejs-testing-drivers package — you need to:

  • add a test file
  • add a package.json script
  • extend the fixture (a hypothetical sketch of the env vars it carries follows below)
  • add snapshots (you can simply clone them from another driver and change the snapshot names)
  • drop a line into the drivers-tests CI yaml
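
Purely as an illustration of what the fixture extension carries (the env var names follow Cube's export bucket docs but should be double-checked against the driver code; the bucket and credentials values are placeholders):

// Hypothetical fixture excerpt for the GCS export bucket case; copy the real
// shape from an existing export-bucket fixture in cubejs-testing-drivers.
const exportBucketGcsEnv = {
  CUBEJS_DB_EXPORT_BUCKET_TYPE: 'gcs',
  CUBEJS_DB_EXPORT_BUCKET: 'databricks-export-test', // placeholder bucket name
  CUBEJS_DB_EXPORT_GCS_CREDENTIALS: '<base64-encoded service account JSON>', // placeholder
};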

@qiao-x (Contributor, Author) commented Apr 8, 2025

And I think the conversion to a quoted tableFullName was not done in this PR :)

Yes, I rolled those changes back; I think it's better to include them in another PR. Removed them from the PR description.

@KSDaemon (Member) commented Apr 9, 2025

@qiao-x I'll pull this PR and update it with CI tests. Let's see how it goes.


Labels

pr:community Contribution from Cube.js community members.

Development

Successfully merging this pull request may close these issues.

Databricks Driver: Export Bucket On GCS

2 participants