articles/hdinsight/hdinsight-sales-insights-etl.md (94 additions, 96 deletions)
@@ -7,7 +7,7 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: tutorial
 ms.custom: hdinsightactive
-ms.date: 03/24/2020
+ms.date: 03/27/2020
 ---
 
 # Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight
@@ -22,23 +22,28 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
 ## Prerequisites
 
-* Azure CLI. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).
+* Azure CLI - at least version 2.2.0. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).
+
+* jq, a command-line JSON processor. See [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/).
 
 * A member of the [Azure built-in role - owner](../role-based-access-control/built-in-roles.md).
 
-* [Power BI Desktop](https://www.microsoft.com/download/details.aspx?id=45331) to visualize business insights at the end of this tutorial.
+* If using PowerShell to trigger the Data Factory pipeline, you'll need the [Az Module](https://docs.microsoft.com/powershell/azure/overview).
+
+* [Power BI Desktop](https://aka.ms/pbiSingleInstaller) to visualize business insights at the end of this tutorial.
 
 ## Create resources
 
 ### Clone the repository with scripts and data
 
-1. Sign in to the [Azure portal](https://portal.azure.com).
+1. Log in to your Azure subscription. If you plan to use Azure Cloud Shell, select **Try it** in the upper-right corner of the code block. Otherwise, enter the following command:
 
-1. Open Azure Cloud Shell from the top menu bar. Select your subscription for creating a file share if Cloud Shell prompts you.
-1. In the **Select environment** drop-down menu, choose **Bash**.
+    ```azurecli-interactive
+    az login
+
+    # If you have multiple subscriptions, set the one to use
+    # az account set --subscription "SUBSCRIPTIONID"
+    ```
 
 1. Ensure you're a member of the Azure role [owner](../role-based-access-control/built-in-roles.md). Replace `[email protected]` with your account and then enter the command:
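The membership-check command itself falls outside the captured hunks. A minimal sketch of such a check, assuming the standard `az role assignment list` command (the exact query the tutorial uses may differ):

```azurecli
# List Owner role assignments for your account; no output means
# you aren't an Owner on the subscription
az role assignment list \
    --assignee "[email protected]" \
    --role "Owner" \
    --output table
```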
@@ -50,29 +55,7 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
 If no record is returned, you aren't a member and won't be able to complete this tutorial.
 
-1. List your subscriptions entering the command:
-
-    ```azurecli
-    az account list --output table
-    ```
-
-    Note the ID of the subscription that you'll use for this project.
-
-1. Set the subscription you'll use for this project. Replace `SUBSCRIPTIONID` with the actual value, then enter the command.
-
-    ```azurecli
-    subscriptionID="SUBSCRIPTIONID"
-    az account set --subscription $subscriptionID
-    ```
-
-1. Create a new resource group for the project. Replace `RESOURCEGROUP` with the desired name, then enter the command.
-
-    ```azurecli
-    resourceGroup="RESOURCEGROUP"
-    az group create --name $resourceGroup --location westus
-    ```
-
-1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:
+1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:
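The download command is cut off at the hunk boundary. A minimal sketch, assuming a plain `git clone` of the repository linked above (the `cd` into the default clone directory is an assumption):

```bash
# Clone the tutorial's data and scripts, then move into the repo
git clone https://github.com/Azure-Samples/hdinsight-sales-insights-etl.git
cd hdinsight-sales-insights-etl
```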
@@ -110,49 +99,26 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
 Cluster creation can take around 20 minutes.
 
-The `resources.sh` script contains the following commands. It isn't required for you to run these commands if you already executed the script in the previous step.
-
-* `az group deployment create` - This command uses an Azure Resource Manager template (`resourcestemplate.json`) to create the specified resources with the desired configuration.
-
-    ```azurecli
-    az group deployment create --name ResourcesDeployment \
-        --resource-group $resourceGroup \
-        --template-file resourcestemplate.json \
-        --parameters "@resourceparameters.json"
-    ```
-
-* `az storage blob upload-batch` - This command uploads the sales data .csv files into the newly created Blob storage account.
-
-The default password for SSH access to the clusters is `Thisisapassword1`. If you want to change the password, go to the `resourcesparameters.json` file and change the password for the `sparksshPassword`, `sparkClusterLoginPassword`, `llapClusterLoginPassword`, and `llapsshPassword` parameters.
+The default password for SSH access to the clusters is `Thisisapassword1`. If you want to change the password, go to the `./templates/resourcesparameters_remainder.json` file and change the password for the `sparksshPassword`, `sparkClusterLoginPassword`, `llapClusterLoginPassword`, and `llapsshPassword` parameters.
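The code block for the removed `az storage blob upload-batch` bullet was dropped in extraction. A hedged sketch of what such an upload typically looks like; the account name, container, and source path here are illustrative assumptions, not the script's actual values:

```azurecli
# Upload all sales .csv files to the Blob storage account
# (account name, container, and source path are assumptions)
az storage blob upload-batch \
    --account-name BLOBSTORAGEACCOUNT \
    --destination rawdata \
    --source ./salesdata \
    --pattern "*.csv"
```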
 ### Verify deployment and collect resource information
 
-1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Select **Deployments** under **Settings**. Select the name of your deployment, `ResourcesDeployment`. Here you can see the resources that have successfully deployed and the resources that are still in progress.
+1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Under **Settings**, select **Deployments**, then your deployment. Here you can see the resources that have successfully deployed and the resources that are still in progress.
 
 1. To view the names of the clusters, enter the following command:
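The command itself is cut off at the hunk boundary. One way to list the cluster names, assuming plain Azure CLI rather than whatever the tutorial's scripts emit:

```azurecli
# List the HDInsight clusters deployed into the resource group
az hdinsight list \
    --resource-group $resourceGroup \
    --query "[].name" \
    --output table
```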
@@ -200,35 +169,44 @@ This script does the following things:
 1. Obtains storage keys for the Data Lake Storage Gen2 and Blob storage accounts.
 1. Creates another resource deployment to create an Azure Data Factory pipeline, with its associated linked services and activities. It passes the storage keys as parameters to the template file so that the linked services can access the storage accounts correctly.
 
-    The Data Factory pipeline is deployed through the following command:
-
-    ```azurecli-interactive
-    az group deployment create --name ADFDeployment \
-        --resource-group $resourceGroup \
-        --template-file adftemplate.json \
-        --parameters "@adfparameters.json"
-    ```
-
 ## Run the data pipeline
 
 ### Trigger the Data Factory activities
 
 The first activity in the Data Factory pipeline that you've created moves the data from Blob storage to Data Lake Storage Gen2. The second activity applies the Spark transformations on the data and saves the transformed .csv files to a new location. The entire pipeline might take a few minutes to finish.
 
-To trigger the pipelines, you can either:
+To retrieve the Data Factory name, enter the following command:
 
-* Trigger the Data Factory pipelines in PowerShell. Replace `DataFactoryName` with the actual Data Factory name, then run the following commands:
+* Trigger the Data Factory pipeline in PowerShell. Replace `RESOURCEGROUP` and `DataFactoryName` with the appropriate values, then run the following commands:
+
+    Re-execute `Get-AzDataFactoryV2PipelineRun` as needed to monitor progress.
 
 Or
 
-* Open the data factory and select **Author & Monitor**. Trigger the copy pipeline and then the Spark pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).
+* Open the data factory and select **Author & Monitor**. Trigger the `IngestAndTransform` pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).
 
-To verify that the pipelines have run, you can take either of the following steps:
+To verify that the pipeline has run, you can take either of the following steps:
 
 * Go to the **Monitor** section in your data factory through the portal.
 * In Azure Storage Explorer, go to your Data Lake Storage Gen 2 storage account. Go to the `files` file system, and then go to the `transformed` folder and check its contents to see if the pipeline succeeded.
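The lookup and trigger commands referenced in this hunk fall outside the captured lines. Hedged sketches of both steps follow; the placeholder values are illustrative, and the pipeline name `IngestAndTransform` comes from the added bullet above. First, one way to look up the factory name with plain Azure CLI:

```azurecli
# List Data Factory instances in the resource group
az resource list \
    --resource-group $resourceGroup \
    --resource-type "Microsoft.DataFactory/factories" \
    --query "[].name" \
    --output table
```

Then a PowerShell sketch of triggering and monitoring the pipeline with the Az module cmdlets the document names:

```powershell
# Trigger the IngestAndTransform pipeline and capture the run ID
$runId = Invoke-AzDataFactoryV2Pipeline `
    -ResourceGroupName "RESOURCEGROUP" `
    -DataFactoryName "DataFactoryName" `
    -PipelineName "IngestAndTransform"

# Re-execute as needed to monitor progress
Get-AzDataFactoryV2PipelineRun `
    -ResourceGroupName "RESOURCEGROUP" `
    -DataFactoryName "DataFactoryName" `
    -PipelineRunId $runId
```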
@@ -237,37 +215,48 @@ For other ways to transform data by using HDInsight, see [this article on using
 ### Create a table on the Interactive Query cluster to view data on Power BI
 
-1. Copy the `query.hql` file to the LLAP cluster by using SCP. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command:
+1. Copy the `query.hql` file to the LLAP cluster by using SCP. Enter the command:
 
-2. Use SSH to access the LLAP cluster. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command. If you haven't altered the `resourcesparameters.json` file, the password is `Thisisapassword1`.
+    Reminder: The default password is `Thisisapassword1`.
+
+1. Use SSH to access the LLAP cluster. Enter the command:
 
-    This script will create a managed table on the Interactive Query cluster that you can access from Power BI.
+    This script will create a managed table on the Interactive Query cluster that you can access from Power BI.
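The SCP and SSH commands themselves are cut off at the hunk boundaries. Minimal sketches, assuming HDInsight's default `sshuser` account and standard `-ssh` endpoint; replace `LLAPCLUSTERNAME` with your cluster name:

```bash
# Copy the Hive query to the Interactive Query (LLAP) cluster head node
scp query.hql sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/

# Open an SSH session on the cluster
ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net
```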
 ### Create a Power BI dashboard from sales data
 
 1. Open Power BI Desktop.
 
-1. Select **Get Data**.
-1. Search for **HDInsight Interactive Query cluster**.
-1. Paste the URI for your cluster there. It should be in the format `https://LLAPCLUSTERNAME.azurehdinsight.net`.
-
-    Enter `default` for the database.
-1. Enter the username and password that you use to access the cluster.
+1. From the menu, navigate to **Get data** > **More...** > **Azure** > **HDInsight Interactive Query**.
+
+1. Select **Connect**.
 
-After the data is loaded, you can experiment with the dashboard that you want to create. See the following links to get started with Power BI dashboards:
+1. From the **HDInsight Interactive Query** dialog:
+    1. In the **Server** text box, enter the name of your LLAP cluster in the format of `https://LLAPCLUSTERNAME.azurehdinsight.net`.
+    1. In the **database** text box, enter `default`.
+    1. Select **OK**.
+
+1. From the **AzureHive** dialog:
+    1. In the **User name** text box, enter `admin`.
+    1. In the **Password** text box, enter `Thisisapassword1`.
+    1. Select **Connect**.
+
+1. From **Navigator**, select `sales` and/or `sales_raw` to preview the data. After the data is loaded, you can experiment with the dashboard that you want to create. See the following links to get started with Power BI dashboards:
 
 * [Introduction to dashboards for Power BI designers](https://docs.microsoft.com/power-bi/service-dashboards)
 * [Tutorial: Get started with the Power BI service](https://docs.microsoft.com/power-bi/service-get-started)
@@ -276,9 +265,18 @@ After the data is loaded, you can experiment with the dashboard that you want to
 If you're not going to continue to use this application, delete all resources by using the following command so that you aren't charged for them.
 
-```azurecli-interactive
-az group delete -n $resourceGroup
-```
+1. To remove the resource group, enter the command:
+
+    ```azurecli
+    az group delete -n $resourceGroup
+    ```
+
+1. To remove the service principal, enter the commands:
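The service-principal removal commands fall after the captured hunk. A hedged sketch, assuming the principal's `appId` was noted when the setup scripts created it; `SPAPPID` is an illustrative placeholder, not a value from the tutorial:

```azurecli
# Delete the service principal created for this tutorial
# (SPAPPID is a placeholder for the actual appId)
az ad sp delete --id SPAPPID
```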