
Commit d734a2b

Merge pull request #108144 from Jak-MS/solution-template-databricks-notebook
edit pass: Solution template databricks notebook
2 parents d80c541 + fda9bbc commit d734a2b


articles/data-factory/solution-template-databricks-notebook.md

Lines changed: 88 additions & 68 deletions
@@ -15,36 +15,43 @@ ms.date: 03/03/2020

# Transformation with Azure Databricks

In this tutorial, you create an end-to-end pipeline that contains the **Validation**, **Copy data**, and **Notebook** activities in Azure Data Factory.

- **Validation** ensures that your source dataset is ready for downstream consumption before you trigger the copy and analytics job.

- **Copy data** duplicates the source dataset to the sink storage, which is mounted as DBFS in the Azure Databricks notebook. In this way, the dataset can be directly consumed by Spark.

- **Notebook** triggers the Databricks notebook that transforms the dataset. It also adds the dataset to a processed folder or Azure SQL Data Warehouse.

For simplicity, the template in this tutorial doesn't create a scheduled trigger. You can add one if necessary.

![Diagram of the pipeline](media/solution-template-Databricks-notebook/pipeline-example.png)

## Prerequisites

- An Azure Blob storage account with a container called `sinkdata` for use as a sink.

  Make note of the storage account name, container name, and access key. You'll need these values later in the template.

- An Azure Databricks workspace.

## Import a notebook for Transformation

To import a **Transformation** notebook to your Databricks workspace:

1. Sign in to your Azure Databricks workspace, and then select **Import**.

   ![Menu command for importing a workspace](media/solution-template-Databricks-notebook/import-notebook.png)

   Your workspace path can be different from the one shown, but remember it for later.

1. Select **Import from: URL**. In the text box, enter `https://adflabstaging1.blob.core.windows.net/share/Transformations.html`.

   ![Selections for importing a notebook](media/solution-template-Databricks-notebook/import-from-url.png)

1. Now let's update the **Transformation** notebook with your storage connection information.

   In the imported notebook, go to **command 5**, as shown in the following code snippet.

   - Replace `<storage name>` and `<access key>` with your own storage connection information.
   - Use the storage account with the `sinkdata` container.

```python
# Supply storageName and accessKey values

# ... (the middle of command 5 is not shown in this diff) ...

print e  # Otherwise print the whole stack trace.
```
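
The diff shows only the first and last lines of command 5. As a rough sketch of what mounting the `sinkdata` container as DBFS typically looks like (the variable names and mount point below are illustrative placeholders, not the notebook's exact code):

```python
# Illustrative sketch only: mount the sinkdata blob container as DBFS.
# storageName, accessKey, and the mount point are placeholders.
storageName = "<storage name>"   # your storage account name
accessKey = "<access key>"       # your storage account access key

dbutils.fs.mount(
    source="wasbs://sinkdata@" + storageName + ".blob.core.windows.net/",
    mount_point="/mnt/sinkdata",
    extra_configs={"fs.azure.account.key." + storageName + ".blob.core.windows.net": accessKey})
```
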

1. Generate a **Databricks access token** for Data Factory to access Databricks.

   1. In your Databricks workspace, select your user profile icon in the upper right.
   1. Select **User Settings**.

      ![Menu command for user settings](media/solution-template-Databricks-notebook/user-setting.png)

   1. Select **Generate New Token** under the **Access Tokens** tab.
   1. Select **Generate**.

      !["Generate" button](media/solution-template-Databricks-notebook/generate-new-token.png)

   *Save the access token* for later use in creating a Databricks linked service. The access token looks something like `dapi32db32cbb4w6eee18b7d87e45exxxxxx`.

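Optionally, you can confirm that the token works before you use it in Data Factory by calling the Databricks REST API. The following is a minimal sketch, not part of the tutorial; replace `<databricks-instance>` with your workspace URL:

```python
# Minimal sketch: verify the access token by listing clusters through the
# Databricks REST API. Assumes the requests package is installed.
import requests

databricks_instance = "<databricks-instance>"  # for example, adb-1234567890123456.7.azuredatabricks.net
token = "dapi..."                              # the token you just generated

response = requests.get(
    f"https://{databricks_instance}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
)
response.raise_for_status()
print(response.json())  # A JSON payload (possibly empty) means the token is valid
```
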
## How to use this template
1. Go to the **Transformation with Azure Databricks** template and create new linked services for the following connections.

   ![Connections setting](media/solution-template-Databricks-notebook/connections-preview.png)

   - **Source Blob Connection** - to access the source data.

     For this exercise, you can use the public blob storage that contains the source files. Reference the following screenshot for the configuration. Use the following **SAS URL** to connect to source storage (read-only access). An optional Python check of this URL is sketched after these steps.

     `https://storagewithdata.blob.core.windows.net/data?sv=2018-03-28&si=read%20and%20list&sr=c&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D`

     ![Selections for authentication method and SAS URL](media/solution-template-Databricks-notebook/source-blob-connection.png)

   - **Destination Blob Connection** - to store the copied data.

     In the **New linked service** window, select your sink storage blob.

     ![Sink storage blob as a new linked service](media/solution-template-Databricks-notebook/destination-blob-connection.png)

   - **Azure Databricks** - to connect to the Databricks cluster.

     Create a Databricks linked service by using the access token that you generated previously. You can opt to select an *interactive cluster* if you have one. This example uses the **New job cluster** option.

     ![Selections for connecting to the cluster](media/solution-template-Databricks-notebook/databricks-connection.png)

1. Select **Use this template**. You'll see a pipeline created.

   ![Create a pipeline](media/solution-template-Databricks-notebook/new-pipeline.png)

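If you want to confirm that the read-only SAS URL can list the source files before you wire it into the linked service, a quick check with the `azure-storage-blob` Python package looks something like the following sketch (the check is optional and not part of the template):

```python
# Optional sanity check: list the sample source files through the read-only SAS URL.
# Requires: pip install azure-storage-blob
from azure.storage.blob import ContainerClient

sas_url = (
    "https://storagewithdata.blob.core.windows.net/data"
    "?sv=2018-03-28&si=read%20and%20list&sr=c"
    "&sig=PuyyS6%2FKdB2JxcZN0kPlmHSBlD8uIKyzhBWmWzznkBw%3D"
)

container = ContainerClient.from_container_url(sas_url)
for blob in container.list_blobs():
    print(blob.name)  # Listing succeeds only if the SAS grants read and list access
```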

## Pipeline introduction and configuration

In the new pipeline, most settings are configured automatically with default values. Review the configurations of your pipeline and make any necessary changes.

1. In the **Validation** activity **Availability flag**, verify that the source **Dataset** value is set to the `SourceAvailabilityDataset` that you created earlier.

   ![Source dataset value](media/solution-template-Databricks-notebook/validation-settings.png)

1. In the **Copy data** activity **file-to-blob**, check the **Source** and **Sink** tabs. Change settings if necessary.

   - **Source** tab

     ![Source tab](media/solution-template-Databricks-notebook/copy-source-settings.png)

   - **Sink** tab

     ![Sink tab](media/solution-template-Databricks-notebook/copy-sink-settings.png)

1. In the **Notebook** activity **Transformation**, review and update the paths and settings as needed.

   **Databricks linked service** should be pre-populated with the value from a previous step, as shown:

   ![Populated value for the Databricks linked service](media/solution-template-Databricks-notebook/notebook-activity.png)

   To check the **Notebook** settings:

   1. Select the **Settings** tab. For **Notebook path**, verify that the default path is correct. You might need to browse and choose the correct notebook path.

      ![Notebook path](media/solution-template-Databricks-notebook/notebook-settings.png)

   1. Expand the **Base Parameters** selector and verify that the parameters match what is shown in the following screenshot. These parameters are passed to the Databricks notebook from Data Factory. (A hypothetical sketch of how the notebook might read them appears at the end of this section.)

      ![Base parameters](media/solution-template-Databricks-notebook/base-parameters.png)

1. Verify that the **Pipeline Parameters** match what is shown in the following screenshot:

   ![Pipeline parameters](media/solution-template-Databricks-notebook/pipeline-parameters.png)

1. Connect to your datasets.

   - **SourceAvailabilityDataset** - to check that the source data is available.

     ![Selections for linked service and file path for SourceAvailabilityDataset](media/solution-template-Databricks-notebook/source-availability-dataset.png)

   - **SourceFilesDataset** - to access the source data.

     ![Selections for linked service and file path for SourceFilesDataset](media/solution-template-Databricks-notebook/source-file-dataset.png)

   - **DestinationFilesDataset** - to copy the data into the sink destination location. Use the following values:

     - **Linked service** - `sinkBlob_LS`, created in a previous step.
     - **File path** - `sinkdata/staged_sink`.

     ![Selections for linked service and file path for DestinationFilesDataset](media/solution-template-Databricks-notebook/destination-dataset.png)

1. Select **Debug** to run the pipeline. You can find the link to Databricks logs for more detailed Spark logs. (A sketch for triggering a published run from code appears at the end of this section.)

   ![Link to Databricks logs from output](media/solution-template-Databricks-notebook/pipeline-run-output.png)

   You can also verify the data file by using Azure Storage Explorer.

   > [!NOTE]
   > For correlating with Data Factory pipeline runs, this example appends the pipeline run ID from the data factory to the output folder. This helps keep track of files generated by each run.
   > ![Appended pipeline run ID](media/solution-template-Databricks-notebook/verify-data-files.png)

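The base parameters and the run-ID convention above live on the notebook side. As a purely hypothetical sketch of how the **Transformation** notebook could read the parameters that Data Factory passes in and append the pipeline run ID to the output folder (the widget names are illustrative; use the names shown in the **Base Parameters** screenshot):

```python
# Hypothetical sketch: read base parameters passed from Data Factory and write
# the transformed output into a folder named after the pipeline run ID.
# The widget names below are illustrative, not the template's actual names.
input_path = dbutils.widgets.get("input")        # for example, /mnt/Data/source
output_path = dbutils.widgets.get("output")      # for example, /mnt/Data/staged_sink
run_id = dbutils.widgets.get("pipelineRunId")    # passed as @pipeline().RunId in Data Factory

output_folder = output_path.rstrip("/") + "/" + run_id

df = spark.read.csv(input_path, header=True)
# ... transformation logic ...
df.write.mode("overwrite").csv(output_folder, header=True)
```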

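Debug runs happen in the Data Factory UI, but if you later publish the pipeline and want to trigger and monitor it from code, a sketch with the `azure-mgmt-datafactory` package might look like the following (the resource group, factory, and pipeline names are placeholders, and the exact SDK surface can vary by version):

```python
# Hypothetical sketch: trigger a published pipeline run and poll its status.
# Requires: pip install azure-identity azure-mgmt-datafactory
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "<pipeline-name>"  # the pipeline created from the template

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

run = client.pipelines.create_run(resource_group, factory_name, pipeline_name, parameters={})
status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = client.pipeline_runs.get(resource_group, factory_name, run.run_id).status
print(f"Pipeline run {run.run_id} finished with status: {status}")
```
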
## Next steps
