
Commit 4dacbbd

Databricks Volumes source connector: workflow run trigger (#689)
1 parent b54af0a commit 4dacbbd

File tree

2 files changed: +144, −0 lines changed

docs.json

Lines changed: 1 addition & 0 deletions
@@ -274,6 +274,7 @@
       "pages": [
         "examplecode/tools/s3-events",
         "examplecode/tools/azure-storage-events",
+        "examplecode/tools/databricks-volumes-events",
         "examplecode/tools/gcs-events",
         "examplecode/tools/google-drive-events",
         "examplecode/tools/onedrive-events",

examplecode/tools/databricks-volumes-events.mdx

Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
---
title: Databricks Volumes event triggers
---

You can use Databricks Volumes events, such as files being uploaded to a volume, to automatically run Unstructured ETL+ workflows
that rely on those volumes as sources. This enables a no-touch approach in which Unstructured automatically processes files as they are uploaded to Databricks Volumes.

This example shows how to automate this process by adding a custom job in Lakeflow Jobs for your Databricks workspace in
[AWS](https://docs.databricks.com/aws/jobs/), [Azure](https://learn.microsoft.com/azure/databricks/jobs/), or
[GCP](https://docs.databricks.com/gcp/jobs). This job runs
whenever a file upload event is detected in the specified Databricks volume. The job uses a custom Databricks notebook to call the [Unstructured Workflow Endpoint](/api-reference/workflow/overview) to automatically run the
corresponding Unstructured ETL+ workflow within your Unstructured account.

<Note>
This example uses a custom job in Lakeflow Jobs and a custom Databricks notebook that you create and maintain.
Any issues with file detection, timing, or job execution could be related to your custom job or notebook,
rather than to Unstructured. If you are getting unexpected or no results, be sure to check your custom
job's run logs first for any informational and error messages.
</Note>

## Requirements

import GetStartedSimpleApiOnly from '/snippets/general-shared-text/get-started-simple-api-only.mdx'

To use this example, you will need the following:

- An Unstructured account, and an Unstructured API key for your account, as follows:

  <GetStartedSimpleApiOnly />

- The Unstructured Workflow Endpoint URL for your account, as follows:

  1. In the Unstructured UI, click **API Keys** on the sidebar.<br/>
  2. Note the value of the **Unstructured Workflow Endpoint** field.

- A Databricks Volumes source connector in your Unstructured account. [Learn how](/ui/sources/databricks-volumes).
- Any available [destination connector](/ui/destinations/overview) in your Unstructured account.
- A workflow that uses the preceding source and destination connectors. [Learn how](/ui/workflows).
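
If you want to confirm the endpoint URL and API key before you wire up the Databricks job, you can list the workflows in your account from any Python environment. The following is a minimal sketch, not part of the steps below; it assumes that the [Unstructured Workflow Endpoint](/api-reference/workflow/overview) exposes a workflow listing route at `GET /workflows` and reuses the `unstructured-api-key` header that the notebook code later in this example uses.

```python
import requests

# Placeholders: use the Unstructured Workflow Endpoint URL and API key
# from the requirements above.
unstructured_api_url = "<unstructured-api-url>"
unstructured_api_key = "<unstructured-api-key>"

# List the workflows in the account. A 200 response that includes your
# workflow confirms that the endpoint URL and API key are correct.
response = requests.get(
    f"{unstructured_api_url}/workflows",
    headers={"accept": "application/json", "unstructured-api-key": unstructured_api_key},
)
response.raise_for_status()
print(response.status_code)
print(response.json())
```
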
## Step 1: Create a notebook to run the Unstructured workflow

1. Sign in to the Databricks workspace within your Databricks account for AWS, Azure, or GCP that
   corresponds to the workspace you specified for your Databricks Volumes source connector.
2. On the sidebar, click **+ New > Notebook**.
3. Click the notebook's title and change it to something more descriptive, such as `Unstructured Workflow Runner Notebook`.
4. In the notebook's first cell, add the following code:

    ```python
    !pip install requests
    ```

5. Click **Edit > Insert cell below**.
6. In this second cell, add the following code:

    ```python
    import requests, os

    url = '<unstructured-api-url>' + '/workflows/<workflow-id>/run'

    # Use only one of the following two options to set your Unstructured API key,
    # and remove or comment out the other.

    # Option 1 (Recommended): Get your Unstructured API key from Databricks Secrets.
    api_key = dbutils.secrets.get(scope="<scope>", key="<key>")

    # Option 2: Get your Unstructured API key from an environment variable stored on
    # the notebook's attached compute resource.
    # api_key = os.getenv("UNSTRUCTURED_API_KEY")

    headers = {
        'accept': 'application/json',
        'content-type': 'application/json',
        'unstructured-api-key': api_key
    }

    json_data = {}

    try:
        response = requests.post(url, headers=headers, json=json_data)
        response.raise_for_status()
        print(f'Status Code: {response.status_code}')
        print('Response:', response.json())
    except Exception as e:
        print('An error occurred:', e)
    ```

7. Replace the placeholders in this second cell as follows:

    - Replace `<unstructured-api-url>` with the **Unstructured Workflow Endpoint** value that you noted earlier in the requirements.
    - Replace `<workflow-id>` with the ID of the workflow that you want to run.
    - For your Unstructured API key, do one of the following:

      - (Recommended) If you want to use Databricks Secrets, replace `<scope>` and `<key>` with the scope and key names for the existing secret that you have already created in Databricks Secrets.
        Learn how to work with Databricks Secrets for
        [AWS](https://docs.databricks.com/aws/security/secrets/#secrets-overview),
        [Azure](https://learn.microsoft.com/azure/databricks/security/secrets/#secrets-overview), or
        [GCP](https://docs.databricks.com/gcp/security/secrets#secrets-overview). A quick way to confirm that the notebook can read this secret is sketched after this list.
      - If you want to use environment variables on the attached compute resource, uncomment the Option 2 line, remove the Option 1 line, and set the `UNSTRUCTURED_API_KEY` environment variable on that compute resource to your Unstructured API key value. Learn how for
        [AWS](https://docs.databricks.com/aws/compute/configure#environment-variables),
        [Azure](https://learn.microsoft.com/azure/databricks/compute/configure#environment-variables), or
        [GCP](https://docs.databricks.com/gcp/compute/configure#environment-variables).
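
If you choose the Databricks Secrets option, you can confirm that the notebook can read the secret before you rely on it in the job. The following is a minimal sketch that uses the standard `dbutils.secrets` utilities in a scratch notebook cell; `<scope>` and `<key>` are the same placeholders as above.

```python
# Run this in a notebook cell attached to the same compute resource that the job will use.
# dbutils is available automatically in Databricks notebooks.

# List the keys in the scope to confirm that the scope is visible to you.
for secret in dbutils.secrets.list("<scope>"):
    print(secret.key)

# Reading the secret succeeds only if the scope and key names are correct. Databricks
# redacts the value if you try to print it, so print its length instead.
api_key = dbutils.secrets.get(scope="<scope>", key="<key>")
print(len(api_key))
```
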
## Step 2: Create a job in Lakeflow Jobs to run the notebook

1. With your Databricks workspace still open from the previous step, on the sidebar, click **Jobs & Pipelines**.
2. Expand **Create new**, and then click **Job**.
3. Click the job's title and change it to something more descriptive, such as `Unstructured Workflow Runner Job`.
4. On the **Tasks** tab, enter a **Task name**, such as `Run_Unstructured_Workflow_Runner_Notebook`.
5. With **Notebook** selected for **Type**, and with **Workspace** selected for **Source**, use the **Path** dropdown to select the notebook you created in the previous step.
6. For **Cluster**, select the cluster you want to use to run the notebook.
7. Click **Create task**.
8. In the **Job details** pane, under **Schedules & Triggers**, click **Add trigger**.
9. For **Trigger type**, select **File arrival**.
10. For **Storage location**, enter the path to the volume to monitor or, if you are monitoring a folder within that volume, the path to the folder. To get this path, do the following:

    a. On the sidebar, click **Catalog**.<br/>
    b. In the list of catalogs, expand the catalog that contains the volume you want to monitor.<br/>
    c. In the list of schemas (formerly known as databases), expand the schema that contains the volume you want to monitor.<br/>
    d. Expand **Volumes**.<br/>
    e. Click the volume you want to monitor.<br/>
    f. On the **Overview** tab, copy the path to the volume you want to monitor or, if you are monitoring a folder within that volume, click the path to the folder and then copy the path to that folder.<br/>

11. Click **Save**.
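
If you prefer to script the job setup instead of clicking through the UI, you can create an equivalent job through the Databricks Jobs REST API (version 2.1), including the file arrival trigger. The following is an illustrative sketch, not a drop-in replacement for the steps above; the workspace URL, personal access token, notebook path, cluster ID, and volume path are all placeholder assumptions that you must adjust for your workspace.

```python
import requests

# Placeholder values for your workspace; replace them before running.
DATABRICKS_HOST = "https://<your-workspace-url>"
DATABRICKS_TOKEN = "<your-personal-access-token>"

job_spec = {
    "name": "Unstructured Workflow Runner Job",
    "tasks": [
        {
            "task_key": "Run_Unstructured_Workflow_Runner_Notebook",
            "notebook_task": {
                "notebook_path": "/Users/<you>/Unstructured Workflow Runner Notebook"
            },
            "existing_cluster_id": "<cluster-id>",
        }
    ],
    # File arrival trigger on the volume (or folder) to monitor.
    "trigger": {
        "pause_status": "UNPAUSED",
        "file_arrival": {"url": "/Volumes/<catalog>/<schema>/<volume>/"},
    },
}

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job:", response.json().get("job_id"))
```

The `file_arrival.url` value is the same volume or folder path that the steps above tell you to copy from **Catalog**.
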
## Step 3: Trigger the job

1. With your Databricks workspace still open from the previous step, on the sidebar, click **Catalog**.
2. In the list of catalogs, expand the catalog that contains the volume that is being monitored.
3. In the list of schemas (formerly known as databases), expand the schema that contains the volume that is being monitored.
4. Expand **Volumes**.
5. Click the volume that is being monitored or, if you are monitoring a folder within that volume, click the folder.
6. Click **Upload to this volume**, and follow the on-screen instructions to upload a file to the volume or folder that is being monitored.
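
You can also trigger the job without the UI by uploading a file to the monitored location programmatically. The following is a minimal sketch that uses the Databricks Files API; the workspace URL, personal access token, local file name, and volume path are placeholder assumptions.

```python
import requests

# Placeholder values for your workspace; replace them before running.
DATABRICKS_HOST = "https://<your-workspace-url>"
DATABRICKS_TOKEN = "<your-personal-access-token>"

# Path inside the monitored volume (or folder) where the file will land.
volume_file_path = "/Volumes/<catalog>/<schema>/<volume>/sample.pdf"

# Upload the file's raw bytes. A new file arriving at this path fires the job's trigger.
with open("sample.pdf", "rb") as f:
    response = requests.put(
        f"{DATABRICKS_HOST}/api/2.0/fs/files{volume_file_path}",
        headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
        params={"overwrite": "true"},
        data=f,
    )
response.raise_for_status()
print("Uploaded:", volume_file_path)
```
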
## Step 4: View trigger results

1. With your Databricks workspace still open from the previous step, on the sidebar, click **Jobs & Pipelines**.
2. On the **Jobs & pipelines** tab, click the name of the job you created earlier in Step 2.
3. On the **Runs** tab, wait until the current job run shows a **Status** of **Succeeded**.
4. In the Unstructured user interface for your account, click **Jobs** on the sidebar.
5. In the list of jobs, click the newly running job for your workflow.
6. After the job status shows **Finished**, go to your destination location to see the results.
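
If you want to watch the Unstructured side from code instead of the UI, you can poll your account's jobs through the [Unstructured Workflow Endpoint](/api-reference/workflow/overview). The following is a rough sketch only; the `/jobs` route, the `workflow_id` filter, and the shape of the response are assumptions to verify against the API reference before you use them.

```python
import time

import requests

# Same endpoint URL, API key, and workflow ID placeholders as in the notebook code.
unstructured_api_url = "<unstructured-api-url>"
unstructured_api_key = "<unstructured-api-key>"

headers = {"accept": "application/json", "unstructured-api-key": unstructured_api_key}

# Poll the jobs list a few times and print what comes back, so you can see the
# job for your workflow move toward a finished state.
for _ in range(10):
    response = requests.get(
        f"{unstructured_api_url}/jobs",
        headers=headers,
        params={"workflow_id": "<workflow-id>"},
    )
    response.raise_for_status()
    print(response.json())
    time.sleep(30)
```
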
## Step 5 (Optional): Pause the trigger

To stop triggering the job, with your job still open in Lakeflow Jobs from Step 4, in the **Job details** pane, under **Schedules & Triggers**, click **Pause**.
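
You can also pause the trigger from code through the Databricks Jobs REST API (version 2.1). The following sketch reuses the placeholder workspace values from the earlier sketches and assumes you know the job's ID, which is shown on the job's page in Lakeflow Jobs.

```python
import requests

# Placeholder values for your workspace; replace them before running.
DATABRICKS_HOST = "https://<your-workspace-url>"
DATABRICKS_TOKEN = "<your-personal-access-token>"

# Pause the file arrival trigger on the job without deleting it. The file arrival
# settings are repeated here because new_settings replaces the trigger block as a whole.
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
    json={
        "job_id": 123456789,  # Replace with your job's ID from Lakeflow Jobs.
        "new_settings": {
            "trigger": {
                "pause_status": "PAUSED",
                "file_arrival": {"url": "/Volumes/<catalog>/<schema>/<volume>/"},
            }
        },
    },
)
response.raise_for_status()
print("Trigger paused.")
```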

0 commit comments
