
Commit 724bbb7

Update solution-template-databricks-notebook.md
1 parent 05afbc2 commit 724bbb7

1 file changed

articles/data-factory/solution-template-databricks-notebook.md

Lines changed: 35 additions & 35 deletions
@@ -15,21 +15,21 @@ ms.date: 03/03/2020
# Transformation with Azure Databricks

-In this tutorial, you create an end-to-end pipeline containing the **Validation**, **Copy data**, and **Notebook** activities in Data Factory.
+In this tutorial, you create an end-to-end pipeline that contains the **Validation**, **Copy data**, and **Notebook** activities in Azure Data Factory.

- **Validation** ensures that your source dataset is ready for downstream consumption before you trigger the copy and analytics job.

-- **Copy data** duplicates the source dataset to the sink storage, which is mounted as DBFS in the Databricks notebook. In this way, the dataset can be directly consumed by Spark.
+- **Copy data** duplicates the source dataset to the sink storage, which is mounted as DBFS in the Azure Databricks notebook. In this way, the dataset can be directly consumed by Spark. (See the sketch after the following diagram.)

-- **Notebook** triggers the Databricks notebook that transforms the dataset. It also adds the dataset to a processed folder or SQL Data Warehouse.
+- **Notebook** triggers the Databricks notebook that transforms the dataset. It also adds the dataset to a processed folder or Azure SQL Data Warehouse.

For simplicity, the template in this tutorial doesn't create a scheduled trigger. You can add one if necessary.

-![1](media/solution-template-Databricks-notebook/pipeline-example.png)
+![Diagram of the pipeline](media/solution-template-Databricks-notebook/pipeline-example.png)
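After the **Copy data** activity lands the files, the notebook reads them straight from the DBFS mount. A minimal read sketch, assuming the `sinkdata` container is mounted at the hypothetical mount point `/mnt/Data` and the staged files are CSV (the `spark` session is predefined in a Databricks notebook):

```python
# Read the copied dataset directly from the DBFS mount.
# Assumption: /mnt/Data is a hypothetical mount point for the sinkdata
# container; adjust the path to match your notebook's mount.
df = spark.read.csv("/mnt/Data/staged_sink", header=True, inferSchema=True)
df.show(5)  # preview a few rows to confirm the copy landed
```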

## Prerequisites

-- A **blob storage account** with a container called `sinkdata` for use as **sink**
+- An **Azure Blob storage account** with a container called `sinkdata` for use as **sink**

Make note of the **storage account name**, **container name**, and **access key**. You'll need these values later in the template.

@@ -40,11 +40,11 @@ For simplicity, the template in this tutorial doesn't create a scheduled trigger
To import a **Transformation** notebook to your Databricks workspace:

1. Sign in to your Azure Databricks workspace, and then select **Import**.
-![2](media/solution-template-Databricks-notebook/import-notebook.png)
+![Menu command for importing a workspace](media/solution-template-Databricks-notebook/import-notebook.png)
Your workspace path can be different from the one shown, but remember it for later.
-1. Select **Import from: URL**. In the textbox, enter `https://adflabstaging1.blob.core.windows.net/share/Transformations.html`
+1. Select **Import from: URL**. In the text box, enter `https://adflabstaging1.blob.core.windows.net/share/Transformations.html`.

-![3](media/solution-template-Databricks-notebook/import-from-url.png)
+![Selections for importing a notebook](media/solution-template-Databricks-notebook/import-from-url.png)

1. Now let's update the **Transformation** notebook with your storage connection information.
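The connection details live inside the notebook itself (the updated cells aren't shown in this diff). If you need to mount the storage yourself, the standard Databricks pattern is `dbutils.fs.mount`. A sketch with placeholder values, assuming the hypothetical mount point `/mnt/Data`:

```python
# Mount the sinkdata container as DBFS (placeholder values).
# dbutils is predefined inside a Databricks notebook.
storage_account = "<your-storage-account-name>"  # from the prerequisites
access_key = "<your-access-key>"                 # from the prerequisites

dbutils.fs.mount(
    source=f"wasbs://sinkdata@{storage_account}.blob.core.windows.net",
    mount_point="/mnt/Data",  # hypothetical; match what the notebook expects
    extra_configs={
        f"fs.azure.account.key.{storage_account}.blob.core.windows.net": access_key
    },
)
```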

@@ -78,13 +78,13 @@ To import a **Transformation** notebook to your Databricks workspace:
1. Generate a **Databricks access token** for Data Factory to access Databricks.
1. In your Databricks workspace, select your user profile icon in the upper right.
1. Select **User Settings**.
-![4](media/solution-template-Databricks-notebook/user-setting.png)
+![Menu command for user settings](media/solution-template-Databricks-notebook/user-setting.png)
1. Select **Generate New Token** under the **Access Tokens** tab.
1. Select **Generate**.

-![5](media/solution-template-Databricks-notebook/generate-new-token.png)
+!["Generate" button](media/solution-template-Databricks-notebook/generate-new-token.png)

-**Save the access token** for later use in creating a Databricks linked service. The access token looks something like `dapi32db32cbb4w6eee18b7d87e45exxxxxx`.
+*Save the access token* for later use in creating a Databricks linked service. The access token looks something like `dapi32db32cbb4w6eee18b7d87e45exxxxxx`.
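Before you paste the token into a linked service, you can confirm it works by calling any authenticated Databricks REST endpoint with it. A minimal sketch using `requests`; the workspace URL and token value are placeholders:

```python
# Verify a Databricks access token against the REST API.
import requests

host = "https://<your-region>.azuredatabricks.net"  # placeholder workspace URL
token = "<your-access-token>"                       # the token you just generated

resp = requests.get(
    f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
resp.raise_for_status()  # 200 means the token authenticates
```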

## How to use this template

@@ -94,89 +94,89 @@ To import a **Transformation** notebook to your Databricks workspace:
- **Source Blob Connection** – to access the source data.

-For this exercise, you can use the public blob storage that contains the source files. Reference following screenshot for configuration. Use the following **SAS URL** to connect to source storage (read-only access):
+For this exercise, you can use the public blob storage that contains the source files. See the following screenshot for the configuration. Use the following **SAS URL** to connect to source storage (read-only access); a quick connection check is sketched after this list:

`https://storagewithdata.blob.core.windows.net/data?sv=2018-03-28&si=read%20and%20list&sr=c&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D`

-![6](media/solution-template-Databricks-notebook/source-blob-connection.png)
+![Selections for authentication method and SAS URL](media/solution-template-Databricks-notebook/source-blob-connection.png)

- **Destination Blob Connection** – to store the copied data.

In the **New linked service** window, select your sink storage blob.

-![7](media/solution-template-Databricks-notebook/destination-blob-connection.png)
+![Sink storage blob as a new linked service](media/solution-template-Databricks-notebook/destination-blob-connection.png)

- **Azure Databricks** – to connect to the Databricks cluster.

-Create a Databricks-linked service using the access key you generated previously. You may opt to select an *interactive cluster* if you have one. This example uses the **New job cluster** option.
+Create a Databricks linked service by using the access key that you generated previously. You can opt to select an *interactive cluster* if you have one. This example uses the **New job cluster** option.

-![8](media/solution-template-Databricks-notebook/databricks-connection.png)
+![Selections for connecting to the cluster](media/solution-template-Databricks-notebook/databricks-connection.png)
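To sanity-check the read-only SAS URL from the **Source Blob Connection** item above, you can list the source files outside Data Factory. A sketch using the `azure-storage-blob` package (v12 API); this is only a convenience check, not part of the template:

```python
# List the source files through the read-only SAS URL.
# Requires: pip install azure-storage-blob
from azure.storage.blob import ContainerClient

sas_url = (
    "https://storagewithdata.blob.core.windows.net/data"
    "?sv=2018-03-28&si=read%20and%20list&sr=c"
    "&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D"
)

container = ContainerClient.from_container_url(sas_url)
for blob in container.list_blobs():
    print(blob.name)  # files the Copy data activity will pick up
```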

1. Select **Use this template**. You'll see a pipeline created.

![Create a pipeline](media/solution-template-Databricks-notebook/new-pipeline.png)

## Pipeline introduction and configuration

In the new pipeline, most settings are configured automatically with default values. Review the configurations of your pipeline and make any necessary changes.

1. In the **Validation** activity **Availability flag**, verify that the source **Dataset** value is set to `SourceAvailabilityDataset`, which you created earlier.

-![12](media/solution-template-Databricks-notebook/validation-settings.png)
+![Source dataset value](media/solution-template-Databricks-notebook/validation-settings.png)

1. In the **Copy data** activity **file-to-blob**, check the source and sink tabs. Change settings if necessary.

-- Source tab
-![13](media/solution-template-Databricks-notebook/copy-source-settings.png)
+- **Source** tab
+![Source tab](media/solution-template-Databricks-notebook/copy-source-settings.png)

-- Sink tab
-![14](media/solution-template-Databricks-notebook/copy-sink-settings.png)
+- **Sink** tab
+![Sink tab](media/solution-template-Databricks-notebook/copy-sink-settings.png)

1. In the **Notebook** activity **Transformation**, review and update the paths and settings as needed.

-The **Databricks linked service** should be pre-populated with the value from a previous step, as shown:
-![16](media/solution-template-Databricks-notebook/notebook-activity.png)
+**Databricks linked service** should be pre-populated with the value from a previous step, as shown:
+![Populated value for the Databricks linked service](media/solution-template-Databricks-notebook/notebook-activity.png)

To check the **Notebook** settings:

-1. Select the **Settings** tab. For **Notebook path**, verify that the default path is correct. You may need to browse and choose the correct notebook path.
+1. Select the **Settings** tab. For **Notebook path**, verify that the default path is correct. You might need to browse and choose the correct notebook path.

-![17](media/solution-template-Databricks-notebook/notebook-settings.png)
+![Notebook path](media/solution-template-Databricks-notebook/notebook-settings.png)

1. Expand the **Base Parameters** selector and verify that the parameters match what is shown in the following screenshot. These parameters are passed to the Databricks notebook from Data Factory. (A sketch of consuming them follows these steps.)

![Base parameters](media/solution-template-Databricks-notebook/base-parameters.png)

1. Verify that the **Pipeline Parameters** match what is shown in the following screenshot:
-![15](media/solution-template-Databricks-notebook/pipeline-parameters.png)
+![Pipeline parameters](media/solution-template-Databricks-notebook/pipeline-parameters.png)

1. Connect to your datasets.

- **SourceAvailabilityDataset** - to check that the source data is available.

-![9](media/solution-template-Databricks-notebook/source-availability-dataset.png)
+![Selections for linked service and file path for SourceAvailabilityDataset](media/solution-template-Databricks-notebook/source-availability-dataset.png)

- **SourceFilesDataset** - to access the source data.

-![10](media/solution-template-Databricks-notebook/source-file-dataset.png)
+![Selections for linked service and file path for SourceFilesDataset](media/solution-template-Databricks-notebook/source-file-dataset.png)

- **DestinationFilesDataset** – to copy the data into the sink destination location. Use the following values:

- **Linked service** - `sinkBlob_LS`, created in a previous step.

- **File path** - `sinkdata/staged_sink`.

-![11](media/solution-template-Databricks-notebook/destination-dataset.png)
+![Selections for linked service and file path for DestinationFilesDataset](media/solution-template-Databricks-notebook/destination-dataset.png)

1. Select **Debug** to run the pipeline. You can find the link to the Databricks logs in the output for more detailed Spark logs.

-![18](media/solution-template-Databricks-notebook/pipeline-run-output.png)
+![Link to Databricks logs from output](media/solution-template-Databricks-notebook/pipeline-run-output.png)

-You can also verify the data file using storage explorer.
+You can also verify the data file by using Azure Storage Explorer.

> [!NOTE]
-> For correlating with Data Factory pipeline runs, this example appends the pipeline run ID from data factory to the output folder. This helps keep track of files generated by each run.
-> ![19](media/solution-template-Databricks-notebook/verify-data-files.png)
+> For correlating with Data Factory pipeline runs, this example appends the pipeline run ID from the data factory to the output folder. This helps keep track of files generated by each run.
+> ![Appended pipeline run ID](media/solution-template-Databricks-notebook/verify-data-files.png)
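On the notebook side, the base parameters surface as widgets, and the run-ID folder convention from the note amounts to building the output path from a run ID that the pipeline passes in. A sketch with hypothetical parameter names; match them to whatever the **Base Parameters** tab actually shows:

```python
# Consume Data Factory base parameters inside the Databricks notebook and
# write output under a per-run folder. Parameter names here are hypothetical.
input_path = dbutils.widgets.get("input")      # e.g. the staged source files
output_path = dbutils.widgets.get("output")    # e.g. the processed folder
run_id = dbutils.widgets.get("pipelineRunId")  # run ID passed from Data Factory

df = spark.read.csv(input_path, header=True, inferSchema=True)
df.write.mode("overwrite").parquet(f"{output_path}/{run_id}")
```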

## Next steps
