---
title: 'Tutorial: Create an end-to-end ETL pipeline to derive sales insights in Azure HDInsight'
description: Learn how to create ETL pipelines with Azure HDInsight to derive insights from sales data by using Spark on-demand clusters and Power BI.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: tutorial
ms.custom: hdinsightactive
ms.date: 03/24/2020
---
# Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight
In this tutorial, you'll build an end-to-end data pipeline that performs extract, transform, and load (ETL) operations. The pipeline will use [Apache Spark](./spark/apache-spark-overview.md) and Apache Hive clusters running on Azure HDInsight for querying and manipulating the data. You'll also use technologies like Azure Data Lake Storage Gen2 for data storage, and Power BI for visualization.
This data pipeline combines the data from various stores, removes any unwanted data, appends new data, and loads all this back to your storage to visualize business insights. Read more about ETL pipelines in [Extract, transform, and load (ETL) at scale](./hadoop/apache-hadoop-etl-at-scale.md).
1. In the **Select environment** drop-down menu, choose **Bash**.
1. Ensure you're a member of the Azure role [owner](../role-based-access-control/built-in-roles.md). Replace `[email protected]` with your account and then enter the command:
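    A sketch of this check, assuming the standard `az role assignment list` parameters; the account shown is just the placeholder from the previous sentence:

    ```azurecli
    # List Owner role assignments for the given account; at least one row should be returned.
    az role assignment list \
    --role "Owner" \
    --assignee "[email protected]" \
    --output table
    ```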
If no record is returned, you aren't a member and won't be able to complete this tutorial.
1. List your subscriptions by entering the command:
```azurecli
az account list --output table
```
Note the ID of the subscription that you'll use for this project.
1. Set the subscription you'll use for this project. Replace `SUBSCRIPTIONID` with the actual value, then enter the command.
```azurecli
subscriptionID="SUBSCRIPTIONID"
az account set --subscription $subscriptionID
```
1. Create a new resource group for the project. Replace `RESOURCEGROUP` with the desired name, then enter the command.
```azurecli
resourceGroup="RESOURCEGROUP"
az group create --name $resourceGroup --location westus
```
1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:
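    A likely form of the command, as a sketch that assumes you clone the repository into your current Cloud Shell directory and work from the repository root:

    ```bash
    # Clone the sample data and scripts, then move into the repository.
    git clone https://github.com/Azure-Samples/hdinsight-sales-insights-etl.git
    cd hdinsight-sales-insights-etl
    ```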
1. Add execute permissions for all of the scripts by typing `chmod +x scripts/*.sh`.
1. Run the `resources.sh` script to deploy the required resources in Azure. Replace `<RESOURCE_GROUP_NAME>` and `<LOCATION>` with the actual values, then enter the command `./scripts/resources.sh <RESOURCE_GROUP_NAME> <LOCATION>`.
The command will deploy the following resources:
* An Azure Blob storage account. This account will hold the company sales data.
* An Azure Data Lake Storage Gen2 account. This account will serve as the storage account for both HDInsight clusters. Read more about HDInsight and Data Lake Storage Gen2 in [Azure HDInsight integration with Data Lake Storage Gen2](https://azure.microsoft.com/blog/azure-hdinsight-integration-with-data-lake-storage-gen-2-preview-acl-and-security-update/).
* A user-assigned managed identity. This account gives the HDInsight clusters access to the Data Lake Storage Gen2 account.
* An Apache Spark cluster. This cluster will be used to clean up and transform the raw data.
* An Apache Hive [Interactive Query](./interactive-query/apache-interactive-query-get-started.md) cluster. This cluster will allow querying the sales data and visualizing it with Power BI.
* An Azure virtual network supported by network security group (NSG) rules. This virtual network allows the clusters to communicate and secures their communications.
Cluster creation can take around 20 minutes.
The `resources.sh` script contains the following commands. You don't need to run these commands again if you already executed the script in the previous step.
* `az group deployment create` - This command uses an Azure Resource Manager template (`resourcestemplate.json`) to create the specified resources with the desired configuration.
```azurecli
    # Note: the arguments below are assumed from the template and parameter files referenced
    # in this tutorial; see resources.sh in the repository for the exact invocation.
    az group deployment create --name ResourcesDeployment \
        --resource-group $resourceGroup \
        --template-file resourcestemplate.json \
        --parameters "@resourcesparameters.json"
    ```

The default password for SSH access to the clusters is `Thisisapassword1`.
### Verify deployment and collect resource information
1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Select **Deployments** under **Settings**. Select the name of your deployment, `ResourcesDeployment`. Here you can see the resources that have successfully deployed and the resources that are still in progress.
1. After the deployment has finished, go to the Azure portal > **Resource groups** > `<RESOURCE_GROUP_NAME>`.

1. Locate the new Azure storage account that was created for storing the sales files. The name of the storage account begins with `blob` and then contains a random string. Do the following:

    1. Make a note of the storage account name for later use.

    1. Select the name of the Blob storage account.

    1. On the left side of the portal, under **Settings**, select **Access keys**.

    1. Copy the string in the **Key1** box and save it for later use.

1. Locate the Data Lake Storage Gen2 account that was created as storage for the HDInsight clusters. This account is located in the same resource group as the Blob storage account, but begins with `adlsgen2`. Do the following:

    1. Make a note of the name of the Data Lake Storage Gen2 account.

    1. Select the name of the Data Lake Storage Gen2 account.

    1. On the left side of the portal, under **Settings**, select **Access keys**.

    1. Copy the string in the **Key1** box and save it for later use.

> [!NOTE]
> After you know the names of the storage accounts, you can get the account keys by using the following command at the Azure Cloud Shell prompt:
>
> ```azurecli
> az storage account keys list \
> --account-name <STORAGE NAME> \
> --resource-group $resourceGroup \
> --output table
> ```

1. To view the names of the clusters, enter the following command:
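    One way to do this, sketched with the `az hdinsight list` command and the `$resourceGroup` variable set earlier:

    ```azurecli
    # List the HDInsight cluster names in the resource group.
    az hdinsight list \
    --resource-group $resourceGroup \
    --query "[].name" \
    --output table
    ```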
Azure Data Factory is a tool that helps automate data pipelines like this one. It's not the only way to accomplish these tasks, but it's a great way to automate the processes. For more information on Azure Data Factory, see the [Azure Data Factory documentation](https://azure.microsoft.com/services/data-factory/).
This data factory will have one pipeline with two activities:
* The first activity will copy the data from Azure Blob storage to the Data Lake Storage Gen 2 storage account to mimic data ingestion.
* The second activity will transform the data in the Spark cluster. The script transforms the data by removing unwanted columns. It also appends a new column that calculates the revenue that a single transaction generates.
To set up your Azure Data Factory pipeline, execute the following command:
```bash
./scripts/adf.sh
```
This script does the following things:
To trigger the pipelines, you can either:
* Trigger the Data Factory pipelines in PowerShell. Replace `DataFactoryName` with the actual Data Factory name, then run the following commands:
Or
* Open the data factory and select **Author & Monitor**. Trigger the copy pipeline and then the Spark pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).
To verify that the pipelines have run, you can take either of the following steps:
* Go to the **Monitor** section in your data factory through the portal.
* In Azure Storage Explorer, go to your Data Lake Storage Gen 2 storage account. Go to the `files` file system, and then go to the `transformed` folder and check its contents to see if the pipeline succeeded.
For other ways to transform data by using HDInsight, see [this article on using Jupyter Notebook](/azure/hdinsight/spark/apache-spark-load-data-run-query).
### Create a table on the Interactive Query cluster to view data on Power BI
1. Copy the `query.hql` file to the LLAP cluster by using SCP. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command:
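    A sketch of the copy, assuming the default `sshuser` account and the standard HDInsight SSH endpoint (`CLUSTERNAME-ssh.azurehdinsight.net`); adjust the local path to `query.hql` if yours differs:

    ```bash
    # Copy the Hive query to the LLAP cluster head node.
    scp query.hql sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/
    ```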
2. Use SSH to access the LLAP cluster. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command. If you haven't altered the `resourcesparameters.json` file, the password is `Thisisapassword1`.
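    A sketch, again assuming the default `sshuser` account and the standard SSH endpoint:

    ```bash
    # Open an SSH session to the LLAP cluster.
    ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net
    ```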