articles/data-factory/solution-template-databricks-notebook.md
# Transformation with Azure Databricks

In this tutorial, you create an end-to-end pipeline that contains the **Validation**, **Copy data**, and **Notebook** activities in Azure Data Factory.

- **Validation** ensures that your source dataset is ready for downstream consumption before you trigger the copy and analytics job.

- **Copy data** duplicates the source dataset to the sink storage, which is mounted as DBFS in the Azure Databricks notebook. In this way, the dataset can be directly consumed by Spark.

- **Notebook** triggers the Databricks notebook that transforms the dataset. It also adds the dataset to a processed folder or Azure SQL Data Warehouse.

For simplicity, the template in this tutorial doesn't create a scheduled trigger. You can add one if necessary.

## Prerequisites

- An Azure Blob storage account with a container called `sinkdata` for use as a sink.

  Make note of the storage account name, container name, and access key. You'll need these values later in the template.

- An Azure Databricks workspace. Create a new one if you don't already have a workspace.
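
If you want to confirm the values you noted before continuing, the following is a minimal optional sketch, not part of the template. It assumes the `azure-storage-blob` Python package; the account name and key are placeholders.

```python
# Optional check, not part of the template: confirm that the access key works
# and that the sinkdata container exists.
# pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient

storage_account_name = "<storage account name>"  # placeholder
storage_account_key = "<access key>"             # placeholder

service = BlobServiceClient(
    account_url="https://" + storage_account_name + ".blob.core.windows.net",
    credential=storage_account_key,
)
print("sinkdata exists:", service.get_container_client("sinkdata").exists())
```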

## Import a notebook for Transformation

To import a **Transformation** notebook to your Databricks workspace:

1. Sign in to your Azure Databricks workspace, and then select **Import**.

   

   Your workspace path can be different from the one shown, but remember it for later.

1. Select **Import from: URL**. In the text box, enter `https://adflabstaging1.blob.core.windows.net/share/Transformations.html`.

   

1. Now let's update the **Transformation** notebook with your storage connection information.

   In the imported notebook, go to **command 5** as shown in the following code snippet.

   - Replace `<storage name>` and `<access key>` with your own storage connection information.
   - Use the storage account with the `sinkdata` container.

   ```python
   # Supply storageName and accessKey values
   # ...
   print e # Otherwise print the whole stack trace.
   ```
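
   As a rough, hypothetical sketch of what a storage-mount cell like command 5 typically looks like (using the standard `dbutils.fs.mount` pattern for Azure Blob storage; the mount point and variable values here are placeholders, not necessarily what the imported notebook uses):

   ```python
   # Hypothetical sketch of a storage-mount cell; the imported notebook's actual
   # code may differ. Runs inside a Databricks notebook, where dbutils is available.
   storageName = "<storage name>"  # your storage account name
   accessKey = "<access key>"      # your storage account access key

   try:
       dbutils.fs.mount(
           source="wasbs://sinkdata@" + storageName + ".blob.core.windows.net/",
           mount_point="/mnt/sinkdata",  # placeholder mount point
           extra_configs={
               "fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey
           },
       )
   except Exception as e:
       print(e)  # Print the error if the mount fails.
   ```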

1. Generate a **Databricks access token** for Data Factory to access Databricks.

   1. In your Databricks workspace, select your user profile icon in the upper right.
   1. Select **User Settings**.

      

   1. Select **Generate New Token** under the **Access Tokens** tab.
   1. *Save the access token* for later use in creating a Databricks linked service. The access token looks something like `dapi32db32cbb4w6eee18b7d87e45exxxxxx`.
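
If you want to confirm that the token works before you use it in Data Factory, one option is a quick call to the Databricks REST API. This is an optional sketch, not part of the template; the workspace URL and token values are placeholders.

```python
# Optional check, not part of the template: list clusters with the new token
# to confirm that it authenticates against your workspace.
# pip install requests
import requests

databricks_host = "https://<your-workspace>.azuredatabricks.net"  # placeholder
databricks_token = "dapi...your-token..."                         # placeholder

response = requests.get(
    databricks_host + "/api/2.0/clusters/list",
    headers={"Authorization": "Bearer " + databricks_token},
)
response.raise_for_status()
print(response.json())
```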
## How to use this template

1. Go to the **Transformation with Azure Databricks** template and create new linked services for the following connections.

   - **Source Blob Connection** - to access the source data.

     For this exercise, you can use the public blob storage that contains the source files. Reference the following screenshot for the configuration. Use the following **SASURL** to connect to source storage (read-only access):

   - **Destination Blob Connection** - to store the copied data.

     In the **New linked service** window, select your sink storage blob.

     

   - **Azure Databricks** - to connect to the Databricks cluster.

     Create a Databricks linked service by using the access key that you generated previously. You can opt to select an *interactive cluster* if you have one. This example uses the **New job cluster** option.

     

1. Select **Use this template**. You'll see a pipeline created.

   
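
As a hypothetical illustration, the same Azure Databricks linked service can also be created programmatically with the `azure-mgmt-datafactory` Python SDK instead of through the portal UI. All names and IDs below are placeholders, and exact class or parameter names can vary by SDK version.

```python
# Hypothetical sketch: create an Azure Databricks linked service with the
# Data Factory management SDK. Placeholders throughout; not part of the template.
# pip install azure-identity azure-mgmt-datafactory
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

subscription_id = "<subscription id>"   # placeholder
resource_group = "<resource group>"     # placeholder
factory_name = "<data factory name>"    # placeholder

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

databricks_ls = AzureDatabricksLinkedService(
    domain="https://<your-workspace>.azuredatabricks.net",         # workspace URL
    access_token=SecureString(value="<databricks access token>"),  # token from the earlier step
    existing_cluster_id="<interactive cluster id>",                # or configure a new job cluster
)

adf_client.linked_services.create_or_update(
    resource_group,
    factory_name,
    "AzureDatabricksLinkedService",
    LinkedServiceResource(properties=databricks_ls),
)
```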
## Pipeline introduction and configuration

In the new pipeline, most settings are configured automatically with default values. Review the configurations of your pipeline and make any necessary changes.

1. In the **Validation** activity **Availability flag**, verify that the source **Dataset** value is set to `SourceAvailabilityDataset` that you created earlier.

1. In the **Copy data** activity **file-to-blob**, check the **Source** and **Sink** tabs. Change settings if necessary.

1. In the **Notebook** activity, **Databricks linked service** should be pre-populated with the value from a previous step, as shown:

   

   To check the **Notebook** settings:

   1. Select the **Settings** tab. For **Notebook path**, verify that the default path is correct. You might need to browse and choose the correct notebook path.

   1. Expand the **Base Parameters** selector and verify that the parameters match what is shown in the following screenshot. These parameters are passed to the Databricks notebook from Data Factory.

1. Check the datasets that the pipeline uses:

   - **SourceAvailabilityDataset** - to check that the source data is available.

   - **SourceFilesDataset** - for copying the source data.

   

   

You can also verify the data file by using Azure Storage Explorer.

> For correlating with Data Factory pipeline runs, this example appends the pipeline run ID from the data factory to the output folder. This helps keep track of files generated by each run.
>
> 