
Commit d734a2b

Merge pull request #108144 from Jak-MS/solution-template-databricks-notebook
edit pass: Solution template databricks notebook
2 parents d80c541 + fda9bbc commit d734a2b


articles/data-factory/solution-template-databricks-notebook.md

Lines changed: 88 additions & 68 deletions
@@ -15,36 +15,43 @@ ms.date: 03/03/2020

# Transformation with Azure Databricks

In this tutorial, you create an end-to-end pipeline that contains the **Validation**, **Copy data**, and **Notebook** activities in Azure Data Factory.

- **Validation** ensures that your source dataset is ready for downstream consumption before you trigger the copy and analytics job.

- **Copy data** duplicates the source dataset to the sink storage, which is mounted as DBFS in the Azure Databricks notebook. In this way, the dataset can be directly consumed by Spark.

- **Notebook** triggers the Databricks notebook that transforms the dataset. It also adds the dataset to a processed folder or Azure SQL Data Warehouse.

For simplicity, the template in this tutorial doesn't create a scheduled trigger. You can add one if necessary.

![Diagram of the pipeline](media/solution-template-Databricks-notebook/pipeline-example.png)

## Prerequisites

- An Azure Blob storage account with a container called `sinkdata` for use as a sink.

  Make note of the storage account name, container name, and access key. You'll need these values later in the template.

- An Azure Databricks workspace.

## Import a notebook for Transformation

To import a **Transformation** notebook to your Databricks workspace:

1. Sign in to your Azure Databricks workspace, and then select **Import**.

   ![Menu command for importing a workspace](media/solution-template-Databricks-notebook/import-notebook.png)

   Your workspace path can be different from the one shown, but remember it for later.

1. Select **Import from: URL**. In the text box, enter `https://adflabstaging1.blob.core.windows.net/share/Transformations.html`.

   ![Selections for importing a notebook](media/solution-template-Databricks-notebook/import-from-url.png)

1. Now let's update the **Transformation** notebook with your storage connection information.

   In the imported notebook, go to **command 5**, as shown in the following code snippet.

   - Replace `<storage name>` and `<access key>` with your own storage connection information.
   - Use the storage account with the `sinkdata` container.

```python
# Supply storageName and accessKey values

# ... (the middle of command 5 is not shown in this diff) ...

print e  # Otherwise print the whole stack trace.
```
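
The diff shows only the first and last lines of command 5. As a rough sketch of what mounting the `sinkdata` container as DBFS typically looks like (the variable names and mount point below are illustrative placeholders, not the notebook's exact code):

```python
# Illustrative sketch only: mount the sinkdata blob container as DBFS.
# storageName, accessKey, and the mount point are placeholders.
storageName = "<storage name>"   # your storage account name
accessKey = "<access key>"       # your storage account access key

dbutils.fs.mount(
    source="wasbs://sinkdata@" + storageName + ".blob.core.windows.net/",
    mount_point="/mnt/sinkdata",
    extra_configs={"fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey})
```
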

1. Generate a **Databricks access token** for Data Factory to access Databricks.

   1. In your Databricks workspace, select your user profile icon in the upper right.
   1. Select **User Settings**.

      ![Menu command for user settings](media/solution-template-Databricks-notebook/user-setting.png)

   1. Select **Generate New Token** under the **Access Tokens** tab.
   1. Select **Generate**.

      !["Generate" button](media/solution-template-Databricks-notebook/generate-new-token.png)

   *Save the access token* for later use in creating a Databricks linked service. The access token looks something like `dapi32db32cbb4w6eee18b7d87e45exxxxxx`.

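Optionally, you can confirm that the token works before you use it in Data Factory by calling the Databricks REST API. The following is a minimal sketch, not part of the tutorial; replace `<databricks-instance>` with your workspace URL:

```python
# Minimal sketch: verify the access token by listing clusters through the
# Databricks REST API. Assumes the requests package is installed.
import requests

databricks_instance = "<databricks-instance>"  # for example, adb-1234567890123456.7.azuredatabricks.net
token = "dapi..."                              # the token you just generated

response = requests.get(
    f"https://{databricks_instance}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print(response.json())  # A JSON payload (possibly empty) means the token is valid
```
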
## How to use this template
1. Go to the **Transformation with Azure Databricks** template and create new linked services for the following connections.

   ![Connections setting](media/solution-template-Databricks-notebook/connections-preview.png)

   - **Source Blob Connection** - to access the source data.

     For this exercise, you can use the public blob storage that contains the source files. Reference the following screenshot for the configuration. Use the following **SAS URL** to connect to source storage (read-only access). An optional Python check of this URL is sketched after these steps.

     `https://storagewithdata.blob.core.windows.net/data?sv=2018-03-28&si=read%20and%20list&sr=c&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D`

     ![Selections for authentication method and SAS URL](media/solution-template-Databricks-notebook/source-blob-connection.png)

   - **Destination Blob Connection** - to store the copied data.

     In the **New linked service** window, select your sink storage blob.

     ![Sink storage blob as a new linked service](media/solution-template-Databricks-notebook/destination-blob-connection.png)

   - **Azure Databricks** - to connect to the Databricks cluster.

     Create a Databricks linked service by using the access token that you generated previously. You can opt to select an *interactive cluster* if you have one. This example uses the **New job cluster** option.

     ![Selections for connecting to the cluster](media/solution-template-Databricks-notebook/databricks-connection.png)

1. Select **Use this template**. You'll see a pipeline created.

   ![Create a pipeline](media/solution-template-Databricks-notebook/new-pipeline.png)

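If you want to confirm that the read-only SAS URL can list the source files before you wire it into the linked service, a quick check with the `azure-storage-blob` Python package looks something like the following sketch (the check is optional and not part of the template):

```python
# Optional sanity check: list the sample source files through the read-only SAS URL.
# Requires: pip install azure-storage-blob
from azure.storage.blob import ContainerClient

sas_url = (
    "https://storagewithdata.blob.core.windows.net/data"
    "?sv=2018-03-28&si=read%20and%20list&sr=c"
    "&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D"
)

container = ContainerClient.from_container_url(sas_url)
for blob in container.list_blobs():
    print(blob.name)  # Listing succeeds only if the SAS grants read and list access
```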

## Pipeline introduction and configuration

In the new pipeline, most settings are configured automatically with default values. Review the configurations of your pipeline and make any necessary changes.

1. In the **Validation** activity **Availability flag**, verify that the source **Dataset** value is set to the `SourceAvailabilityDataset` that you created earlier.

   ![Source dataset value](media/solution-template-Databricks-notebook/validation-settings.png)

1. In the **Copy data** activity **file-to-blob**, check the **Source** and **Sink** tabs. Change settings if necessary.

   - **Source** tab

     ![Source tab](media/solution-template-Databricks-notebook/copy-source-settings.png)

   - **Sink** tab

     ![Sink tab](media/solution-template-Databricks-notebook/copy-sink-settings.png)

1. In the **Notebook** activity **Transformation**, review and update the paths and settings as needed.

   **Databricks linked service** should be pre-populated with the value from a previous step, as shown:

   ![Populated value for the Databricks linked service](media/solution-template-Databricks-notebook/notebook-activity.png)

   To check the **Notebook** settings:

   1. Select the **Settings** tab. For **Notebook path**, verify that the default path is correct. You might need to browse and choose the correct notebook path.

      ![Notebook path](media/solution-template-Databricks-notebook/notebook-settings.png)

   1. Expand the **Base Parameters** selector and verify that the parameters match what is shown in the following screenshot. These parameters are passed to the Databricks notebook from Data Factory. (A hypothetical sketch of how the notebook might read them appears at the end of this section.)

      ![Base parameters](media/solution-template-Databricks-notebook/base-parameters.png)

1. Verify that the **Pipeline Parameters** match what is shown in the following screenshot:

   ![Pipeline parameters](media/solution-template-Databricks-notebook/pipeline-parameters.png)

1. Connect to your datasets.

   - **SourceAvailabilityDataset** - to check that the source data is available.

     ![Selections for linked service and file path for SourceAvailabilityDataset](media/solution-template-Databricks-notebook/source-availability-dataset.png)

   - **SourceFilesDataset** - to access the source data.

     ![Selections for linked service and file path for SourceFilesDataset](media/solution-template-Databricks-notebook/source-file-dataset.png)

   - **DestinationFilesDataset** - to copy the data into the sink destination location. Use the following values:

     - **Linked service** - `sinkBlob_LS`, created in a previous step.
     - **File path** - `sinkdata/staged_sink`.

     ![Selections for linked service and file path for DestinationFilesDataset](media/solution-template-Databricks-notebook/destination-dataset.png)

1. Select **Debug** to run the pipeline. You can find the link to Databricks logs for more detailed Spark logs. (A sketch for triggering a published run from code appears at the end of this section.)

   ![Link to Databricks logs from output](media/solution-template-Databricks-notebook/pipeline-run-output.png)

   You can also verify the data file by using Azure Storage Explorer.

   > [!NOTE]
   > For correlating with Data Factory pipeline runs, this example appends the pipeline run ID from the data factory to the output folder. This helps keep track of files generated by each run.
   > ![Appended pipeline run ID](media/solution-template-Databricks-notebook/verify-data-files.png)

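The base parameters and the run-ID convention above live on the notebook side. As a purely hypothetical sketch of how the **Transformation** notebook could read the parameters that Data Factory passes in and append the pipeline run ID to the output folder (the widget names are illustrative; use the names shown in the **Base Parameters** screenshot):

```python
# Hypothetical sketch: read base parameters passed from Data Factory and write
# the transformed output into a folder named after the pipeline run ID.
# The widget names below are illustrative, not the template's actual names.
input_path = dbutils.widgets.get("input")        # for example, /mnt/Data/source
output_path = dbutils.widgets.get("output")      # for example, /mnt/Data/staged_sink
run_id = dbutils.widgets.get("pipelineRunId")    # passed as @pipeline().RunId in Data Factory

output_folder = output_path.rstrip("/") + "/" + run_id

df = spark.read.csv(input_path, header=True)
# ... transformation logic ...
df.write.mode("overwrite").csv(output_folder, header=True)
```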

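Debug runs happen in the Data Factory UI, but if you later publish the pipeline and want to trigger and monitor it from code, a sketch with the `azure-mgmt-datafactory` package might look like the following (the resource group, factory, and pipeline names are placeholders, and the exact SDK surface can vary by version):

```python
# Hypothetical sketch: trigger a published pipeline run and poll its status.
# Requires: pip install azure-identity azure-mgmt-datafactory
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "<pipeline-name>"  # the pipeline created from the template

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = client.pipelines.create_run(resource_group, factory_name, pipeline_name, parameters={})
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
print(f"Pipeline run {run.run_id} finished with status: {status}")
```
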
## Next steps
