Commit 7d3b536

Merge pull request #108888 from dagiro/ETL
ETL
2 parents 22e0196 + f1eca66

File tree: 1 file changed

articles/hdinsight/hdinsight-sales-insights-etl.md

Lines changed: 134 additions & 69 deletions
@@ -1,82 +1,118 @@
---
title: 'Tutorial: Create an end-to-end ETL pipeline to derive sales insights in Azure HDInsight'
description: Learn how to create ETL pipelines with Azure HDInsight to derive insights from sales data by using Spark on-demand clusters and Power BI.
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: tutorial
ms.custom: hdinsightactive
ms.date: 03/24/2020
---

# Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight

In this tutorial, you'll build an end-to-end data pipeline that performs extract, transform, and load (ETL) operations. The pipeline will use [Apache Spark](./spark/apache-spark-overview.md) and Apache Hive clusters running on Azure HDInsight for querying and manipulating the data. You'll also use technologies like Azure Data Lake Storage Gen2 for data storage, and Power BI for visualization.

This data pipeline combines the data from various stores, removes any unwanted data, appends new data, and loads all this back to your storage to visualize business insights. Read more about ETL pipelines in [Extract, transform, and load (ETL) at scale](./hadoop/apache-hadoop-etl-at-scale.md).

![ETL architecture](./media/hdinsight-sales-insights-etl/architecture.png)

If you don't have an Azure subscription, create a [free account](https://azure.microsoft.com/free/?WT.mc_id=A261C142F) before you begin.

## Prerequisites

* Azure CLI. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).

* A member of the [Azure built-in role - owner](../role-based-access-control/built-in-roles.md).

* [Power BI Desktop](https://www.microsoft.com/download/details.aspx?id=45331) to visualize business insights at the end of this tutorial.

## Create resources

### Clone the repository with scripts and data

1. Sign in to the [Azure portal](https://portal.azure.com).

1. Open Azure Cloud Shell from the top menu bar. Select your subscription for creating a file share if Cloud Shell prompts you.

    ![Open Azure Cloud Shell](./media/hdinsight-sales-insights-etl/hdinsight-sales-insights-etl-click-cloud-shell.png)

1. In the **Select environment** drop-down menu, choose **Bash**.

1. Ensure you're a member of the Azure role [owner](../role-based-access-control/built-in-roles.md). Replace `[email protected]` with your account, and then enter the command:

```azurecli
az role assignment list \
--assignee "[email protected]" \
--role "Owner"
```

If no record is returned, you aren't a member and won't be able to complete this tutorial.

1. List your subscriptions by entering the command:

```azurecli
az account list --output table
```

Note the ID of the subscription that you'll use for this project.

1. Set the subscription you'll use for this project. Replace `SUBSCRIPTIONID` with the actual value, then enter the command:

```azurecli
subscriptionID="SUBSCRIPTIONID"
az account set --subscription $subscriptionID
```

1. Create a new resource group for the project. Replace `RESOURCEGROUP` with the desired name, then enter the command:

```azurecli
resourceGroup="RESOURCEGROUP"
az group create --name $resourceGroup --location westus
```

1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:

```bash
git clone https://github.com/Azure-Samples/hdinsight-sales-insights-etl.git
cd hdinsight-sales-insights-etl
```

1. Ensure that the `salesdata`, `scripts`, and `templates` directories have been created. Verify with the following command:

```bash
ls
```

### Deploy Azure resources needed for the pipeline

1. Add execute permissions for all of the scripts by entering:

```bash
chmod +x scripts/*.sh
```

1. Execute the script. Replace `RESOURCE_GROUP_NAME` and `LOCATION` with the relevant values, then enter the command:

```bash
./scripts/resources.sh RESOURCE_GROUP_NAME LOCATION
```

The command will deploy the following resources:

* An Azure Blob storage account. This account will hold the company sales data.
* An Azure Data Lake Storage Gen2 account. This account will serve as the storage account for both HDInsight clusters. Read more about HDInsight and Data Lake Storage Gen2 in [Azure HDInsight integration with Data Lake Storage Gen2](https://azure.microsoft.com/blog/azure-hdinsight-integration-with-data-lake-storage-gen-2-preview-acl-and-security-update/).
* A user-assigned managed identity. This account gives the HDInsight clusters access to the Data Lake Storage Gen2 account.
* An Apache Spark cluster. This cluster will be used to clean up and transform the raw data.
* An Apache Hive [Interactive Query](./interactive-query/apache-interactive-query-get-started.md) cluster. This cluster will allow querying the sales data and visualizing it with Power BI.
* An Azure virtual network supported by network security group (NSG) rules. This virtual network allows the clusters to communicate and secures their communications.

Cluster creation can take around 20 minutes.

The `resources.sh` script contains the following commands. It isn't required for you to run these commands if you already executed the script in the previous step.

* `az group deployment create` - This command uses an Azure Resource Manager template (`resourcestemplate.json`) to create the specified resources with the desired configuration.

```azurecli
az group deployment create --name ResourcesDeployment \
```
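
The rest of the command falls outside the hunk shown here. As a rough sketch, assuming the `resourcestemplate.json` template named above and the `resourcesparameters.json` parameters file mentioned later in this tutorial, the full command would look something like:

```azurecli
az group deployment create --name ResourcesDeployment \
    --resource-group $resourceGroup \
    --template-file resourcestemplate.json \
    --parameters "@resourcesparameters.json"
```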
@@ -97,37 +133,64 @@ The default password for SSH access to the clusters is `Thisisapassword1`. If yo

### Verify deployment and collect resource information

1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Select **Deployments** under **Settings**. Select the name of your deployment, `ResourcesDeployment`. Here you can see the resources that have successfully deployed and the resources that are still in progress.
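
    You can also check the provisioning state from Cloud Shell. This is a minimal sketch rather than part of the tutorial's scripts; it assumes the deployment name `ResourcesDeployment` and the `$resourceGroup` variable set earlier:

```azurecli
az group deployment show \
    --name ResourcesDeployment \
    --resource-group $resourceGroup \
    --query properties.provisioningState \
    --output tsv
```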

1. To view the names of the clusters, enter the following command:

```azurecli
sparkCluster=$(az hdinsight list \
--resource-group $resourceGroup \
--query "[?contains(name,'spark')].{clusterName:name}" -o tsv)

llapCluster=$(az hdinsight list \
--resource-group $resourceGroup \
--query "[?contains(name,'llap')].{clusterName:name}" -o tsv)

echo $sparkCluster
echo $llapCluster
```

1. To view the Azure storage account and access key, enter the following command:

```azurecli
blobStorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.blobStorageName.value')

blobKey=$(az storage account keys list \
--account-name $blobStorageName \
--resource-group $resourceGroup \
--query [0].value -o tsv)

echo $blobStorageName
echo $blobKey
```

1. To view the Data Lake Storage Gen2 account and access key, enter the following command:

```azurecli
ADLSGen2StorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.adlsGen2StorageName.value')

adlsKey=$(az storage account keys list \
--account-name $ADLSGen2StorageName \
--resource-group $resourceGroup \
--query [0].value -o tsv)

echo $ADLSGen2StorageName
echo $adlsKey
```

### Create a data factory

Azure Data Factory is a tool that helps automate data pipelines in Azure. It's not the only way to accomplish these tasks, but it's a great way to automate the processes. For more information on Azure Data Factory, see the [Azure Data Factory documentation](https://azure.microsoft.com/services/data-factory/).

This data factory will have one pipeline with two activities:

* The first activity will copy the data from Azure Blob storage to the Data Lake Storage Gen2 storage account to mimic data ingestion.
* The second activity will transform the data in the Spark cluster. The script transforms the data by removing unwanted columns. It also appends a new column that calculates the revenue that a single transaction generates; an illustrative sketch follows this list.
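
The transformation itself lives in the repository's Spark script, so the exact schema isn't reproduced here. The following PySpark sketch only illustrates the kind of transformation described above: the column names are hypothetical placeholders, while the `files` file system and `transformed` folder match the locations mentioned later in this article.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("SalesTransform").getOrCreate()

# Read the raw sales data that the copy activity landed in Data Lake Storage Gen2.
# STORAGEACCOUNT and all column names below are hypothetical placeholders.
raw = spark.read.csv(
    "abfss://files@STORAGEACCOUNT.dfs.core.windows.net/rawdata/sales.csv",
    header=True, inferSchema=True)

transformed = (raw
    # Remove columns the downstream reports don't need.
    .drop("internal_id", "comments")
    # Append a column with the revenue each transaction generates.
    .withColumn("total_revenue", col("quantity") * col("unit_price")))

# Write the cleaned data to the folder that the verification step checks.
transformed.write.mode("overwrite").csv(
    "abfss://files@STORAGEACCOUNT.dfs.core.windows.net/transformed/",
    header=True)
```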

To set up your Azure Data Factory pipeline, execute the following command:

```bash
./scripts/adf.sh
```

This script does the following things:
@@ -154,50 +217,52 @@ The first activity in the Data Factory pipeline that you've created moves the da

To trigger the pipelines, you can either:

* Trigger the Data Factory pipelines in PowerShell. Replace `RESOURCEGROUP` and `DataFactoryName` with the actual values, then run the following commands:

```powershell
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "RESOURCEGROUP" -DataFactoryName "DataFactoryName" -PipelineName "CopyPipeline_k8z"
Invoke-AzDataFactoryV2Pipeline -ResourceGroupName "RESOURCEGROUP" -DataFactoryName "DataFactoryName" -PipelineName "sparkTransformPipeline"
```

Or

* Open the data factory and select **Author & Monitor**. Trigger the copy pipeline and then the Spark pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).

To verify that the pipelines have run, you can take either of the following steps:

* Go to the **Monitor** section in your data factory through the portal.
* In Azure Storage Explorer, go to your Data Lake Storage Gen2 storage account. Go to the `files` file system, and then go to the `transformed` folder and check its contents to see if the pipeline succeeded.

For other ways to transform data by using HDInsight, see [this article on using Jupyter Notebook](/azure/hdinsight/spark/apache-spark-load-data-run-query).

### Create a table on the Interactive Query cluster to view data on Power BI

1. Copy the `query.hql` file to the LLAP cluster by using SCP. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command:

```bash
scp scripts/query.hql sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/
```

2. Use SSH to access the LLAP cluster. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command. If you haven't altered the `resourcesparameters.json` file, the password is `Thisisapassword1`.

```bash
ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net
```

3. Use the following command to run the script:

```bash
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f query.hql
```

This script will create a managed table on the Interactive Query cluster that you can access from Power BI.
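
The actual table definition ships in the repository's `query.hql` file, which this diff doesn't show. As an illustration of the pattern only, a managed table built from the transformed files might be created like this; the table name, column names, and storage account are assumptions:

```hiveql
-- External table over the transformed files (hypothetical schema).
CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (
    region STRING,
    product STRING,
    quantity INT,
    total_revenue DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'abfss://files@STORAGEACCOUNT.dfs.core.windows.net/transformed';

-- Managed table that Power BI can query through the Interactive Query cluster.
CREATE TABLE IF NOT EXISTS sales AS
SELECT * FROM sales_raw;
```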

### Create a Power BI dashboard from sales data

1. Open Power BI Desktop.
1. Select **Get Data**.
1. Search for **HDInsight Interactive Query cluster**.
1. Paste the URI for your cluster there. It should be in the format `https://LLAPCLUSTERNAME.azurehdinsight.net`. Enter `default` for the database.
1. Enter the username and password that you use to access the cluster.
