
Commit 562b11f

Databricks Volumes v2 destination connector: update authentication details (#252)

1 parent: 7f744a6

File tree: 6 files changed, +155 −34 lines

snippets/destination_connectors/databricks_volumes.sh.mdx

Lines changed: 15 additions & 13 deletions

````diff
@@ -6,21 +6,23 @@
 unstructured-ingest \
   local \
     --input-path $LOCAL_FILE_INPUT_DIR \
-    --output-dir $LOCAL_FILE_OUTPUT_DIR \
-    --strategy hi_res \
-    --chunk-elements \
-    --embedding-provider langchain-huggingface \
-    --num-processes 2 \
-    --verbose \
-    --work-dir local-input \
     --partition-by-api \
-    --api-key $UNSTRUCTURED_API_KEY\
+    --api-key $UNSTRUCTURED_API_KEY \
     --partition-endpoint $UNSTRUCTURED_API_URL \
+    --strategy hi_res \
     --additional-partition-args="{\"split_pdf_page\":\"true\", \"split_pdf_allow_failed\":\"true\", \"split_pdf_concurrency_level\": 15}" \
+    --chunk-by-api \
+    --chunking-strategy by_title \
+    --chunk-api-key $UNSTRUCTURED_API_KEY \
+    --chunking-endpoint $UNSTRUCTURED_API_URL \
+    --embedding-provider langchain-huggingface \
+    --embedding-model-name sentence-transformers/all-mpnet-base-v2 \
   databricks-volumes \
-    --host "$DATABRICKS_HOST" \
-    --username "$DATABRICKS_USERNAME" \
-    --password "$DATABRICKS_PASSWORD" \
-    --volume "$DATABRICKS_VOLUME" \
-    --catalog "$DATABRICKS_CATALOG"
+    --host $DATABRICKS_HOST \
+    --token $DATABRICKS_TOKEN \
+    --cluster-id $DATABRICKS_CLUSTER_ID \
+    --catalog $DATABRICKS_CATALOG \
+    --schema $DATABRICKS_SCHEMA \
+    --volume $DATABRICKS_VOLUME \
+    --volume-path $DATABRICKS_VOLUME_PATH
 ```
````
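
Since the updated command reads everything from environment variables, a quick preflight check can catch a missing value before a long ingest run fails. A minimal sketch (Python, standard library only; the variable list mirrors the new snippet and assumes personal access token authentication):

```python
import os
import sys

# Variables consumed by the updated CLI command above. Adjust the list if you
# use a different authentication type or skip the optional schema/volume path.
REQUIRED = [
    "LOCAL_FILE_INPUT_DIR",
    "UNSTRUCTURED_API_KEY",
    "UNSTRUCTURED_API_URL",
    "DATABRICKS_HOST",
    "DATABRICKS_TOKEN",
    "DATABRICKS_CLUSTER_ID",
    "DATABRICKS_CATALOG",
    "DATABRICKS_SCHEMA",
    "DATABRICKS_VOLUME",
    "DATABRICKS_VOLUME_PATH",
]

missing = [name for name in REQUIRED if not os.getenv(name)]
if missing:
    sys.exit(f"Missing environment variables: {', '.join(missing)}")
print("All environment variables for the ingest command are set.")
```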

snippets/destination_connectors/databricks_volumes.v1.py.mdx

Lines changed: 9 additions & 4 deletions

````diff
@@ -26,13 +26,15 @@ def get_writer() -> Writer:
         connector_config=SimpleDatabricksVolumesConfig(
             host=os.getenv("DATABRICKS_HOST"),
             access_config=DatabricksVolumesAccessConfig(
-                username=os.getenv("DATABRICKS_USERNAME"),
-                password=os.getenv("DATABRICKS_PASSWORD")
+                token=os.getenv("DATABRICKS_TOKEN"),
+                cluster_id=os.getenv("DATABRICKS_CLUSTER_ID")
             ),
         ),
         write_config=DatabricksVolumesWriteConfig(
             catalog=os.getenv("DATABRICKS_CATALOG"),
+            schema=os.getenv("DATABRICKS_SCHEMA"),
             volume=os.getenv("DATABRICKS_VOLUME"),
+            volume_path=os.getenv("DATABRICKS_VOLUME_PATH")
         ),
     )

@@ -56,10 +58,13 @@ if __name__ == "__main__":
             partition_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
             strategy="hi_res",
         ),
-        chunking_config=ChunkingConfig(chunk_elements=True),
+        chunking_config=ChunkingConfig(
+            chunk_elements=True,
+            chunking_strategy="by_title",
+        ),
         embedding_config=EmbeddingConfig(
             provider="langchain-huggingface",
-            api_key=None,
+            model_name="sentence-transformers/all-mpnet-base-v2",
         ),
         writer=writer,
         writer_kwargs={},
````
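
The hunks above show only the changed region of the snippet. For orientation, here is how the reworked v1 writer configuration fits together as standalone code; the import path is an assumption based on the v1 connector layout, so verify it against your installed version of the library:

```python
import os

# Assumed v1 import path -- confirm against your installed unstructured version.
from unstructured.ingest.connector.databricks_volumes import (
    DatabricksVolumesAccessConfig,
    DatabricksVolumesWriteConfig,
    SimpleDatabricksVolumesConfig,
)

connector_config = SimpleDatabricksVolumesConfig(
    host=os.getenv("DATABRICKS_HOST"),
    access_config=DatabricksVolumesAccessConfig(
        token=os.getenv("DATABRICKS_TOKEN"),  # replaces username/password
        cluster_id=os.getenv("DATABRICKS_CLUSTER_ID"),
    ),
)

write_config = DatabricksVolumesWriteConfig(
    catalog=os.getenv("DATABRICKS_CATALOG"),
    schema=os.getenv("DATABRICKS_SCHEMA"),  # "default" is used if not specified
    volume=os.getenv("DATABRICKS_VOLUME"),
    volume_path=os.getenv("DATABRICKS_VOLUME_PATH"),  # optional path in the volume
)
```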

snippets/destination_connectors/databricks_volumes.v2.py.mdx

Lines changed: 15 additions & 5 deletions

````diff
@@ -37,18 +37,28 @@ if __name__ == "__main__":
                 "split_pdf_concurrency_level": 15
             }
         ),
-        chunker_config=ChunkerConfig(chunking_strategy="by_title"),
-        embedder_config=EmbedderConfig(embedding_provider="langchain-huggingface"),
+        chunker_config=ChunkerConfig(
+            chunk_by_api=True,
+            chunk_api_key=os.getenv("UNSTRUCTURED_API_KEY"),
+            chunking_endpoint=os.getenv("UNSTRUCTURED_API_URL"),
+            chunking_strategy="by_title"
+        ),
+        embedder_config=EmbedderConfig(
+            embedding_provider="langchain-huggingface",
+            embedding_model_name="sentence-transformers/all-mpnet-base-v2"
+        ),
         destination_connection_config=DatabricksVolumesConnectionConfig(
             access_config=DatabricksVolumesAccessConfig(
-                username=os.getenv("DATABRICKS_USERNAME"),
-                password=os.getenv("DATABRICKS_PASSWORD")
+                token=os.getenv("DATABRICKS_TOKEN"),
+                cluster_id=os.getenv("DATABRICKS_CLUSTER_ID")
             ),
             host=os.getenv("DATABRICKS_HOST")
         ),
         uploader_config=DatabricksVolumesUploaderConfig(
             catalog=os.getenv("DATABRICKS_CATALOG"),
-            volume=os.getenv("DATABRICKS_VOLUME")
+            schema=os.getenv("DATABRICKS_SCHEMA"),
+            volume=os.getenv("DATABRICKS_VOLUME"),
+            volume_path=os.getenv("DATABRICKS_VOLUME_PATH")
         )
     ).run()
 ```
````
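
As with v1, the v2 hunks omit the surrounding imports. Here is a standalone sketch of the new connection and uploader configuration; the import path is an assumption based on the v2 connector layout, so verify it against your installed version of unstructured-ingest:

```python
import os

# Assumed v2 import path -- confirm against your installed unstructured-ingest version.
from unstructured_ingest.v2.processes.connectors.databricks_volumes import (
    DatabricksVolumesAccessConfig,
    DatabricksVolumesConnectionConfig,
    DatabricksVolumesUploaderConfig,
)

connection_config = DatabricksVolumesConnectionConfig(
    host=os.getenv("DATABRICKS_HOST"),
    access_config=DatabricksVolumesAccessConfig(
        token=os.getenv("DATABRICKS_TOKEN"),  # personal access token auth
        cluster_id=os.getenv("DATABRICKS_CLUSTER_ID"),
    ),
)

uploader_config = DatabricksVolumesUploaderConfig(
    catalog=os.getenv("DATABRICKS_CATALOG"),
    schema=os.getenv("DATABRICKS_SCHEMA"),
    volume=os.getenv("DATABRICKS_VOLUME"),
    volume_path=os.getenv("DATABRICKS_VOLUME_PATH"),  # optional path in the volume
)
```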

snippets/general-shared-text/databricks-volumes-cli-api.mdx

Lines changed: 54 additions & 4 deletions

````diff
@@ -10,11 +10,61 @@ import AdditionalIngestDependencies from '/snippets/general-shared-text/ingest-d
 
 The following environment variables:
 
-- `DATABRICKS_HOST` - The Databricks compute resource's host name, represented by `--host` (CLI) or `host` (Python).
+- `DATABRICKS_HOST` - The Databricks host URL, represented by `--host` (CLI) or `host` (Python).
+- `DATABRICKS_CLUSTER_ID` - The Databricks compute resource ID, represented by `--cluster-id` (CLI) or `cluster_id` (Python).
 - `DATABRICKS_CATALOG` - The Databricks catalog name for the Volume, represented by `--catalog` (CLI) or `catalog` (Python).
+- `DATABRICKS_SCHEMA` - The Databricks schema name for the Volume, represented by `--schema` (CLI) or `schema` (Python). If not specified, `default` is used.
 - `DATABRICKS_VOLUME` - The Databricks Volume name, represented by `--volume` (CLI) or `volume` (Python).
+- `DATABRICKS_VOLUME_PATH` - Any optional path to access within the volume, represented by `--volume-path` (CLI) or `volume_path` (Python).
 
-Environment variables based on your authentication type, depending which types are supported by your cloud provider. For example, for username and password authentication:
+Environment variables based on your authentication type, depending on your cloud provider:
 
-- `DATABRICKS_USERNAME` - The Databricks account user's name, represented by `--username` (CLI) or `username` (Python).
-- `DATABRICKS_PASSWORD` - The Databricks account user's password, represented by `--password` (CLI) or `password` (Python).
+- For Databricks personal access token authentication (AWS, Azure, and GCP):
+
+  - `DATABRICKS_TOKEN` - The personal access token, represented by `--token` (CLI) or `token` (Python).
+
+- For username and password (basic) authentication (AWS only):
+
+  - `DATABRICKS_USERNAME` - The user's name, represented by `--username` (CLI) or `username` (Python).
+  - `DATABRICKS_PASSWORD` - The user's password, represented by `--password` (CLI) or `password` (Python).
+
+- For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP):
+
+  - `DATABRICKS_CLIENT_ID` - The client ID value for the corresponding service principal, represented by `--client-id` (CLI) or `client_id` (Python).
+  - `DATABRICKS_CLIENT_SECRET` - The OAuth secret value for the corresponding service principal, represented by `--client-secret` (CLI) or `client_secret` (Python).
+
+- For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional environment variables.
+
+- For Azure managed identities (MSI) authentication (Azure only):
+
+  - `ARM_CLIENT_ID` - The client ID value for the corresponding managed identity, represented by `--azure-client-id` (CLI) or `azure_client_id` (Python).
+  - If the target identity has not already been added to the workspace, then you must also specify
+    `DATABRICKS_AZURE_RESOURCE_ID`, represented by `--azure-workspace-resource-id` (CLI) or `azure_workspace_resource_id` (Python).
+
+- For Microsoft Entra ID service principal authentication (Azure only):
+
+  - `ARM_TENANT_ID` - The tenant ID value for the corresponding service principal, represented by `--azure-tenant-id` (CLI) or `azure_tenant_id` (Python).
+  - `ARM_CLIENT_ID` - The client ID value for the corresponding service principal, represented by `--azure-client-id` (CLI) or `azure_client_id` (Python).
+  - `ARM_CLIENT_SECRET` - The client secret value for the corresponding service principal, represented by `--azure-client-secret` (CLI) or `azure_client_secret` (Python).
+  - If the service principal has not already been added to the workspace, then you must also specify
+    `DATABRICKS_AZURE_RESOURCE_ID`, represented by `--azure-workspace-resource-id` (CLI) or `azure_workspace_resource_id` (Python).
+
+- For Azure CLI authentication (Azure only): No additional environment variables.
+
+- For Microsoft Entra ID user authentication (Azure only):
+
+  - `DATABRICKS_TOKEN` - The Entra ID token for the corresponding Entra ID user, represented by `--token` (CLI) or `token` (Python).
+
+- For Google Cloud Platform credentials authentication (GCP only):
+
+  - `GOOGLE_CREDENTIALS` - The local path to the corresponding Google Cloud service account's credentials file, represented by `--google-credentials` (CLI) or `google_credentials` (Python).
+
+- For Google Cloud Platform ID authentication (GCP only):
+
+  - `GOOGLE_SERVICE_ACCOUNT` - The Google Cloud service account's email address, represented by `--google-service-account` (CLI) or `google_service_account` (Python).
+
+- Alternatively, you can store the preceding settings in a local
+  [Databricks configuration profile](https://docs.databricks.com/en/dev-tools/auth/config-profiles.html) and then just
+  refer to the profile's name:
+
+  - `DATABRICKS_PROFILE` - The name of the Databricks configuration profile, represented by `--profile` (CLI) or `profile` (Python).
````
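
The list of authentication types above maps cleanly to a small lookup, which can be handy for validating a configuration before launching an ingest job. A sketch (standard library only; the dictionary keys are illustrative labels, not CLI values, and the variable names come from the list above):

```python
import os

# Environment variables required per authentication type, per the list above.
# OAuth U2M and Azure CLI authentication need no extra variables, so they are omitted.
AUTH_ENV_VARS = {
    "personal-access-token": ["DATABRICKS_TOKEN"],
    "basic": ["DATABRICKS_USERNAME", "DATABRICKS_PASSWORD"],
    "oauth-m2m": ["DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET"],
    "azure-msi": ["ARM_CLIENT_ID"],
    "entra-id-service-principal": ["ARM_TENANT_ID", "ARM_CLIENT_ID", "ARM_CLIENT_SECRET"],
    "entra-id-user": ["DATABRICKS_TOKEN"],
    "gcp-credentials": ["GOOGLE_CREDENTIALS"],
    "gcp-id": ["GOOGLE_SERVICE_ACCOUNT"],
    "profile": ["DATABRICKS_PROFILE"],
}

def missing_vars(auth_type: str) -> list[str]:
    """Return the environment variables that are still unset for an auth type."""
    return [name for name in AUTH_ENV_VARS[auth_type] if not os.getenv(name)]

# Example: check personal access token authentication.
print(missing_vars("personal-access-token") or "personal access token auth is configured")
```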

snippets/general-shared-text/databricks-volumes-platform.mdx

Lines changed: 26 additions & 5 deletions

````diff
@@ -2,14 +2,35 @@ Fill in the following fields:
 
 - **Name** (_required_): A unique name for this connector.
 - **Host** (_required_): The Databricks workspace host URL.
-- **Account ID** : The Databricks account ID, if needed.
-- **Username** : The Databricks username, if basic authentication is used.
-- **Password** : The associated Databricks password, if basic authentication is used.
-- **Token** : The Databricks personal access token, if personal access token authentication is used.
 - **Cluster ID** : The Databricks cluster ID.
 - **Catalog** (_required_): The name of the catalog to use.
 - **Schema** : The name of the associated schema. If not specified, **default** is used.
 - **Volume** (_required_): The name of the associated volume.
 - **Volume Path** : Any optional path to access within the volume.
 - **Overwrite** : Check this box if existing data should be overwritten.
 - **Encoding** : Any encoding to be applied to the data in the volume. If not specified, **utf-8** is used.
+
+Also fill in the following fields based on your authentication type, depending on your cloud provider:
+
+- For Databricks personal access token authentication (AWS, Azure, and GCP):
+
+  - **Token** : The Databricks personal access token value.
+
+- For username and password (basic) authentication (AWS only):
+
+  - **Username** : The Databricks username value.
+  - **Password** : The associated Databricks password value.
+
+The following authentication types are currently not supported:
+
+- OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP).
+- OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP).
+- Azure managed identities (MSI) authentication (Azure only).
+- Microsoft Entra ID service principal authentication (Azure only).
+- Azure CLI authentication (Azure only).
+- Microsoft Entra ID user authentication (Azure only).
+- Google Cloud Platform credentials authentication (GCP only).
+- Google Cloud Platform ID authentication (GCP only).
````
Lines changed: 36 additions & 3 deletions

````diff
@@ -1,6 +1,39 @@
 The Databricks Volumes prerequisites:
 
-- The Databricks compute resource's host name. Get the host name for [AWS](https://docs.databricks.com/integrations/compute-details.html), [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details), or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).
-- The Databricks authentication details. For more information, see the documentation for [AWS](https://docs.databricks.com/dev-tools/auth/index.html), [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/), or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).
+- The Databricks workspace URL. Get the workspace URL for
+  [AWS](https://docs.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids),
+  [Azure](https://learn.microsoft.com/azure/databricks/workspace/workspace-details#workspace-instance-names-urls-and-ids),
+  or [GCP](https://docs.gcp.databricks.com/workspace/workspace-details.html#workspace-instance-names-urls-and-ids).
+
+  Examples:
+
+  - AWS: `https://<workspace-id>.cloud.databricks.com`
+  - Azure: `https://adb-<workspace-id>.<random-number>.azuredatabricks.net`
+  - GCP: `https://<workspace-id>.<random-number>.gcp.databricks.com`
+
+- The Databricks compute resource's ID. Get the compute resource ID for
+  [AWS](https://docs.databricks.com/integrations/compute-details.html),
+  [Azure](https://learn.microsoft.com/azure/databricks/integrations/compute-details),
+  or [GCP](https://docs.gcp.databricks.com/integrations/compute-details.html).
+
+- The Databricks authentication details. For more information, see the documentation for
+  [AWS](https://docs.databricks.com/dev-tools/auth/index.html),
+  [Azure](https://learn.microsoft.com/azure/databricks/dev-tools/auth/),
+  or [GCP](https://docs.gcp.databricks.com/dev-tools/auth/index.html).
+
+  More specifically, you will need:
+
+  - For Databricks personal access token authentication (AWS, Azure, and GCP): The personal access token's value.
+  - For username and password (basic) authentication (AWS only): The user's name and password values.
+  - For OAuth machine-to-machine (M2M) authentication (AWS, Azure, and GCP): The client ID and OAuth secret values for the corresponding service principal.
+  - For OAuth user-to-machine (U2M) authentication (AWS, Azure, and GCP): No additional values.
+  - For Azure managed identities (MSI) authentication (Azure only): The client ID value for the corresponding managed identity.
+  - For Microsoft Entra ID service principal authentication (Azure only): The tenant ID, client ID, and client secret values for the corresponding service principal.
+  - For Azure CLI authentication (Azure only): No additional values.
+  - For Microsoft Entra ID user authentication (Azure only): The Entra ID token for the corresponding Entra ID user.
+  - For Google Cloud Platform credentials authentication (GCP only): The local path to the corresponding Google Cloud service account's credentials file.
+  - For Google Cloud Platform ID authentication (GCP only): The Google Cloud service account's email address.
+
 - The Databricks catalog name for the Volume. Get the catalog name for [AWS](https://docs.databricks.com/catalogs/manage-catalog.html), [Azure](https://learn.microsoft.com/azure/databricks/catalogs/manage-catalog), or [GCP](https://docs.gcp.databricks.com/catalogs/manage-catalog.html).
-- The Databricks Volume name. Get the volume name for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
+- The Databricks schema name for the Volume. Get the schema name for [AWS](https://docs.databricks.com/schemas/manage-schema.html), [Azure](https://learn.microsoft.com/azure/databricks/schemas/manage-schema), or [GCP](https://docs.gcp.databricks.com/schemas/manage-schema.html).
+- The Databricks Volume name, and optionally any path in that Volume that you want to access directly. Get the Volume information for [AWS](https://docs.databricks.com/files/volumes.html), [Azure](https://learn.microsoft.com/azure/databricks/files/volumes), or [GCP](https://docs.gcp.databricks.com/files/volumes.html).
````
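
The catalog, schema, and Volume names combine with the optional Volume path into a single Unity Catalog path, `/Volumes/<catalog>/<schema>/<volume>/<path>`, which is where uploads land. A small sketch of how those prerequisite values fit together (the function and example names are hypothetical):

```python
import posixpath

def volume_destination(catalog: str, schema: str, volume: str, volume_path: str = "") -> str:
    """Build the Unity Catalog path that uploads land in:
    /Volumes/<catalog>/<schema>/<volume>/<optional path>."""
    base = posixpath.join("/Volumes", catalog, schema or "default", volume)
    return posixpath.join(base, volume_path) if volume_path else base

# Example with hypothetical names:
print(volume_destination("main", "default", "landing", "raw/pdfs"))
# -> /Volumes/main/default/landing/raw/pdfs
```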
