feat(databricks-jdbc-driver): Add export bucket support for Google Cloud Storage #9407
Conversation
34675f0 to fc5097b
@KSDaemon BTW I also found this method

@qiao-x I think yes, but let's make it in a separate PR.

@KSDaemon Hi, how should I write tests for these changes? I hope we can get this done soon, so we can use it in released versions, since it really helps throughput (we see a 100x increase for large datasets).
```ts
return this.extractFilesFromGCS(
  { credentials: this.config.gcsCredentials },
  url.host,
  objectSearchPrefix + '.csv',
```
Is it important to add .csv here? Are there any other unrelated files that might be captured by objectSearchPrefix that should be excluded?
Having this:

```ts
protected async extractFilesFromGCS(
  gcsConfig: GoogleStorageClientConfig,
  bucketName: string,
  tableName: string
): Promise<string[]> {
  // ...
  const [files] = await bucket.getFiles({ prefix: `${tableName}/` });
```

Does Databricks create a folder for a table with a csv suffix? Like my_exported_table.csv/ with a bunch of real csv files?
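For illustration, the prefix-based listing discussed here can be mimicked on plain object names (a standalone sketch; `filterByPrefix` and the sample names are hypothetical, not part of the driver):

```typescript
// Hypothetical stand-in for bucket.getFiles({ prefix: `${tableName}/` }):
// keep only objects whose name starts with the table "folder" prefix.
function filterByPrefix(objectNames: string[], tableName: string): string[] {
  const prefix = `${tableName}/`;
  return objectNames.filter((name) => name.startsWith(prefix));
}

const objects = [
  'table-name.csv/_SUCCESS',
  'table-name.csv/part-00000-c000.csv',
  'other-table.csv/part-00000-c000.csv',
];

console.log(filterByPrefix(objects, 'table-name.csv'));
// → ['table-name.csv/_SUCCESS', 'table-name.csv/part-00000-c000.csv']
```

Note that a bare prefix match keeps everything under the folder, including marker objects such as `_SUCCESS`, which is what prompts the filtering questions below.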
It looks like this:

```
gs://bucket-name/
└── table-name.csv/
    ├── _SUCCESS
    ├── _committed_4308966877412207793
    ├── _started_4308966877412207793
    ├── part-00000-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140315-1-c000.csv
    ├── part-00001-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140316-1-c000.csv
    ├── part-00002-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140317-1-c000.csv
    └── part-00003-tid-4308966877412207793-4aede004-bc41-469a-8d29-7b806221dbf4-140318-1-c000.csv
```

It's kind of weird, actually, since the folder name itself has a .csv suffix, like table-name.csv/.
```sql
CREATE TABLE database.schema.table_name
USING CSV
LOCATION 'gs://bucket-name/table-name.csv'
OPTIONS (escape = '"') AS
(
  select ...
)
```

Maybe we don't need the .csv suffix in the LOCATION? But I don't have a Databricks-on-AWS env to test whether it's the same behavior as exporting to GCS.
Yeah, I think you're right. We should remove the .csv suffix from the location.
```ts
decodeURIComponent(new URL(file).pathname).endsWith('.csv') ||
decodeURIComponent(new URL(file).pathname).endsWith('.csv.gz')
```
The answers to the questions above might affect these lines. Are they needed?
Yes, they are needed, since those files will also be returned from extractFilesFromGCS, and I'm not sure whether they would cause any problems on the Cube Store side (I observed those files in the Cube Store logs as well):

```
├── _SUCCESS
├── _committed_4308966877412207793
├── _started_4308966877412207793
```
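The extension check quoted earlier can be sketched as a standalone predicate that drops those marker objects (a sketch; `isCsvPart` and the sample URLs are hypothetical, not from the driver):

```typescript
// Mirrors the .csv / .csv.gz filter from the PR: keep only data part files,
// dropping marker objects such as _SUCCESS, _committed_*, _started_*.
function isCsvPart(fileUrl: string): boolean {
  const pathname = decodeURIComponent(new URL(fileUrl).pathname);
  return pathname.endsWith('.csv') || pathname.endsWith('.csv.gz');
}

const files = [
  'https://storage.googleapis.com/bucket/table-name.csv/_SUCCESS',
  'https://storage.googleapis.com/bucket/table-name.csv/part-00000-c000.csv',
  'https://storage.googleapis.com/bucket/table-name.csv/part-00001-c000.csv.gz',
];

console.log(files.filter(isCsvPart));
// → only the two part-0000*-c000 files; _SUCCESS is dropped
```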
I think this should not be a problem. Anyway, if we want to do some filtering, it should be done in extractFilesFromGCS in BaseDriver, for all drivers.
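A shared filter like that might look as follows (a sketch only; `excludeMarkerObjects` and its patterns are hypothetical, not an existing BaseDriver method):

```typescript
// Hypothetical central filter for a BaseDriver-style helper: drop
// Spark/Databricks marker objects so every driver returns only data files.
function excludeMarkerObjects(objectNames: string[]): string[] {
  const markerPatterns = [/(^|\/)_SUCCESS$/, /(^|\/)_committed_/, /(^|\/)_started_/];
  return objectNames.filter(
    (name) => !markerPatterns.some((re) => re.test(name))
  );
}

const listed = [
  'table-name.csv/_SUCCESS',
  'table-name.csv/_committed_4308966877412207793',
  'table-name.csv/_started_4308966877412207793',
  'table-name.csv/part-00000-c000.csv',
];

console.log(excludeMarkerObjects(listed));
// → ['table-name.csv/part-00000-c000.csv']
```

Doing this once in the shared helper would make the per-driver `.csv`/`.csv.gz` checks unnecessary, which is the point being made here.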
Regarding adding tests: you can have a look at https://github.com/cube-js/cube/pull/8730/files as an example. It's a bit noisy with other things, but I hope it's still clear how to do it. Have a look at the changes in
Yes, I rolled back those changes; I think it's better to include them in another PR. Removed them from the PR description.

@qiao-x I'll pull this PR and update it with CI tests. Let's see how it goes.
Check List
Issue Reference this PR resolves: Closes #9393
Description of Changes Made (if issue reference is not provided): gcs bucket type