Commit 7d3b536

Merge pull request #108888 from dagiro/ETL
ETL
2 parents 22e0196 + f1eca66

File tree: 1 file changed

articles/hdinsight/hdinsight-sales-insights-etl.md

Lines changed: 134 additions & 69 deletions
@@ -1,82 +1,118 @@
---
title: 'Tutorial: Create an end-to-end ETL pipeline to derive sales insights in Azure HDInsight'
description: Learn how to create ETL pipelines with Azure HDInsight to derive insights from sales data by using Spark on-demand clusters and Power BI.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: tutorial
ms.custom: hdinsightactive
ms.date: 03/24/2020
---

# Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight

In this tutorial, you'll build an end-to-end data pipeline that performs extract, transform, and load (ETL) operations. The pipeline will use [Apache Spark](./spark/apache-spark-overview.md) and Apache Hive clusters running on Azure HDInsight for querying and manipulating the data. You'll also use technologies like Azure Data Lake Storage Gen2 for data storage, and Power BI for visualization.

This data pipeline combines the data from various stores, removes any unwanted data, appends new data, and loads all this back to your storage to visualize business insights. Read more about ETL pipelines in [Extract, transform, and load (ETL) at scale](./hadoop/apache-hadoop-etl-at-scale.md).

![ETL architecture](./media/hdinsight-sales-insights-etl/architecture.png)

If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

## Prerequisites

* Azure CLI. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).

* A member of the [Azure built-in role - owner](../role-based-access-control/built-in-roles.md).

* [Power BI Desktop](https://www.microsoft.com/download/details.aspx?id=45331) to visualize business insights at the end of this tutorial.

## Create resources

### Clone the repository with scripts and data

1. Sign in to the [Azure portal](https://portal.azure.com).

1. Open Azure Cloud Shell from the top menu bar. Select your subscription for creating a file share if Cloud Shell prompts you.

    ![Open Azure Cloud Shell](./media/hdinsight-sales-insights-etl/hdinsight-sales-insights-etl-click-cloud-shell.png)

1. In the **Select environment** drop-down menu, choose **Bash**.

1. Ensure you're a member of the Azure role [owner](../role-based-access-control/built-in-roles.md). Replace `[email protected]` with your account, and then enter the command:

```azurecli
az role assignment list \
--assignee "[email protected]" \
--role "Owner"
```

If no record is returned, you aren't a member and won't be able to complete this tutorial.

1. List your subscriptions by entering the command:

```azurecli
az account list --output table
```

Note the ID of the subscription that you'll use for this project.

1. Set the subscription you'll use for this project. Replace `SUBSCRIPTIONID` with the actual value, then enter the command:

```azurecli
subscriptionID="SUBSCRIPTIONID"
az account set --subscription $subscriptionID
```

1. Create a new resource group for the project. Replace `RESOURCEGROUP` with the desired name, then enter the command:

```azurecli
resourceGroup="RESOURCEGROUP"
az group create --name $resourceGroup --location westus
```

1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:

```bash
git clone https://github.com/Azure-Samples/hdinsight-sales-insights-etl.git
cd hdinsight-sales-insights-etl
```

1. Ensure that the `salesdata`, `scripts`, and `templates` directories have been created. Verify with the following command:

```bash
ls
```

### Deploy Azure resources needed for the pipeline

1. Add execute permissions for all of the scripts by entering:

```bash
chmod +x scripts/*.sh
```

1. Execute the script. Replace `RESOURCE_GROUP_NAME` and `LOCATION` with the relevant values, then enter the command:

```bash
./scripts/resources.sh RESOURCE_GROUP_NAME LOCATION
```

The command will deploy the following resources:

* An Azure Blob storage account. This account will hold the company sales data.
* An Azure Data Lake Storage Gen2 account. This account will serve as the storage account for both HDInsight clusters. Read more about HDInsight and Data Lake Storage Gen2 in [Azure HDInsight integration with Data Lake Storage Gen2](https://azure.microsoft.com/blog/azure-hdinsight-integration-with-data-lake-storage-gen-2-preview-acl-and-security-update/).
* A user-assigned managed identity. This account gives the HDInsight clusters access to the Data Lake Storage Gen2 account.
* An Apache Spark cluster. This cluster will be used to clean up and transform the raw data.
* An Apache Hive [Interactive Query](./interactive-query/apache-interactive-query-get-started.md) cluster. This cluster will allow querying the sales data and visualizing it with Power BI.
* An Azure virtual network supported by network security group (NSG) rules. This virtual network allows the clusters to communicate and secures their communications.

Cluster creation can take around 20 minutes.

The `resources.sh` script contains the following commands. It isn't required for you to run these commands if you already executed the script in the previous step.

* `az group deployment create` - This command uses an Azure Resource Manager template (`resourcestemplate.json`) to create the specified resources with the desired configuration.

```azurecli
az group deployment create --name ResourcesDeployment \
```
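
The rest of the command falls outside the hunk shown here. As a rough sketch, assuming the `resourcestemplate.json` template named above and the `resourcesparameters.json` parameters file mentioned later in this tutorial, the full command would look something like:

```azurecli
az group deployment create --name ResourcesDeployment \
    --resource-group $resourceGroup \
    --template-file resourcestemplate.json \
    --parameters "@resourcesparameters.json"
```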
@@ -97,37 +133,64 @@ The default password for SSH access to the clusters is `Thisisapassword1`. If yo

### Verify deployment and collect resource information

1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Select **Deployments** under **Settings**. Select the name of your deployment, `ResourcesDeployment`. Here you can see the resources that have successfully deployed and the resources that are still in progress.
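
    You can also check the provisioning state from Cloud Shell. This is a minimal sketch rather than part of the tutorial's scripts; it assumes the deployment name `ResourcesDeployment` and the `$resourceGroup` variable set earlier:

```azurecli
az group deployment show \
    --name ResourcesDeployment \
    --resource-group $resourceGroup \
    --query properties.provisioningState \
    --output tsv
```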

1. To view the names of the clusters, enter the following command:

```azurecli
sparkCluster=$(az hdinsight list \
--resource-group $resourceGroup \
--query "[?contains(name,'spark')].{clusterName:name}" -o tsv)

llapCluster=$(az hdinsight list \
--resource-group $resourceGroup \
--query "[?contains(name,'llap')].{clusterName:name}" -o tsv)

echo $sparkCluster
echo $llapCluster
```

1. To view the Azure storage account and access key, enter the following command:

```azurecli
blobStorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.blobStorageName.value')

blobKey=$(az storage account keys list \
--account-name $blobStorageName \
--resource-group $resourceGroup \
--query [0].value -o tsv)

echo $blobStorageName
echo $blobKey
```

1. To view the Data Lake Storage Gen2 account and access key, enter the following command:

```azurecli
ADLSGen2StorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.adlsGen2StorageName.value')

adlsKey=$(az storage account keys list \
--account-name $ADLSGen2StorageName \
--resource-group $resourceGroup \
--query [0].value -o tsv)

echo $ADLSGen2StorageName
echo $adlsKey
```

### Create a data factory

Azure Data Factory is a tool that helps automate data pipelines in Azure. It's not the only way to accomplish these tasks, but it's a great way to automate the processes. For more information on Azure Data Factory, see the [Azure Data Factory documentation](https://azure.microsoft.com/services/data-factory/).

This data factory will have one pipeline with two activities:

* The first activity will copy the data from Azure Blob storage to the Data Lake Storage Gen2 storage account to mimic data ingestion.
* The second activity will transform the data in the Spark cluster. The script transforms the data by removing unwanted columns. It also appends a new column that calculates the revenue that a single transaction generates; an illustrative sketch follows this list.
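
The transformation itself lives in the repository's Spark script, so the exact schema isn't reproduced here. The following PySpark sketch only illustrates the kind of transformation described above: the column names are hypothetical placeholders, while the `files` file system and `transformed` folder match the locations mentioned later in this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SalesTransform").getOrCreate()

# Read the raw sales data that the copy activity landed in Data Lake Storage Gen2.
# STORAGEACCOUNT and all column names below are hypothetical placeholders.
raw = spark.read.csv(
    "abfss://files@STORAGEACCOUNT.dfs.core.windows.net/rawdata/sales.csv",
    header=True, inferSchema=True)

transformed = (raw
    # Remove columns the downstream reports don't need.
    .drop("internal_id", "comments")
    # Append a column with the revenue each transaction generates.
    .withColumn("total_revenue", col("quantity") * col("unit_price")))

# Write the cleaned data to the folder that the verification step checks.
transformed.write.mode("overwrite").csv(
    "abfss://files@STORAGEACCOUNT.dfs.core.windows.net/transformed/",
    header=True)
```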

To set up your Azure Data Factory pipeline, execute the following command:

```bash
./scripts/adf.sh
```

This script does the following things:
@@ -154,50 +217,52 @@ The first activity in the Data Factory pipeline that you've created moves the da

To trigger the pipelines, you can either:

* Trigger the Data Factory pipelines in PowerShell. Replace `RESOURCEGROUP` and `DataFactoryName` with the actual values, then run the following commands:

```powershell
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "RESOURCEGROUP" -DataFactoryName "DataFactoryName" -PipelineName "CopyPipeline_k8z"
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "RESOURCEGROUP" -DataFactoryName "DataFactoryName" -PipelineName "sparkTransformPipeline"
```

Or

* Open the data factory and select **Author & Monitor**. Trigger the copy pipeline and then the Spark pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).

To verify that the pipelines have run, you can take either of the following steps:

* Go to the **Monitor** section in your data factory through the portal.
* In Azure Storage Explorer, go to your Data Lake Storage Gen2 storage account. Go to the `files` file system, and then go to the `transformed` folder and check its contents to see if the pipeline succeeded.

For other ways to transform data by using HDInsight, see [this article on using Jupyter Notebook](/azure/hdinsight/spark/apache-spark-load-data-run-query).

### Create a table on the Interactive Query cluster to view data on Power BI

1. Copy the `query.hql` file to the LLAP cluster by using SCP. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command:

```bash
scp scripts/query.hql sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/
```

2. Use SSH to access the LLAP cluster. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command. If you haven't altered the `resourcesparameters.json` file, the password is `Thisisapassword1`.

```bash
ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net
```

3. Use the following command to run the script:

```bash
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f query.hql
```

This script will create a managed table on the Interactive Query cluster that you can access from Power BI.
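
The actual table definition ships in the repository's `query.hql` file, which this diff doesn't show. As an illustration of the pattern only, a managed table built from the transformed files might be created like this; the table name, column names, and storage account are assumptions:

```hiveql
-- External table over the transformed files (hypothetical schema).
CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
    region STRING,
    product STRING,
    quantity INT,
    total_revenue DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'abfss://files@STORAGEACCOUNT.dfs.core.windows.net/transformed';

-- Managed table that Power BI can query through the Interactive Query cluster.
CREATE TABLE IF NOT EXISTS sales AS
SELECT * FROM sales_raw;
```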

### Create a Power BI dashboard from sales data

1. Open Power BI Desktop.
1. Select **Get Data**.
1. Search for **HDInsight Interactive Query cluster**.
1. Paste the URI for your cluster there. It should be in the format `https://LLAPCLUSTERNAME.azurehdinsight.net`. Enter `default` for the database.
1. Enter the username and password that you use to access the cluster.
