
Commit 7ff22e7

Merge pull request #109358 from dagiro/ETL2
ETL2
2 parents 37fd5a8 + fdf91f7

File tree

1 file changed: +100 -97 lines

articles/hdinsight/hdinsight-sales-insights-etl.md

Lines changed: 100 additions & 97 deletions
@@ -7,7 +7,7 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: tutorial
 ms.custom: hdinsightactive
-ms.date: 03/24/2020
+ms.date: 04/15/2020
 ---

 # Tutorial: Create an end-to-end data pipeline to derive sales insights in Azure HDInsight
@@ -22,23 +22,28 @@ If you don't have an Azure subscription, create a [free account](https://azure.m

 ## Prerequisites

-* Azure CLI. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).
+* Azure CLI - at least version 2.2.0. See [Install the Azure CLI](https://docs.microsoft.com/cli/azure/install-azure-cli).
+
+* jq, a command-line JSON processor. See [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/).

 * A member of the [Azure built-in role - owner](../role-based-access-control/built-in-roles.md).

-* [Power BI Desktop](https://www.microsoft.com/download/details.aspx?id=45331) to visualize business insights at the end of this tutorial.
+* If using PowerShell to trigger the Data Factory pipeline, you'll need the [Az Module](https://docs.microsoft.com/powershell/azure/overview).
+
+* [Power BI Desktop](https://aka.ms/pbiSingleInstaller) to visualize business insights at the end of this tutorial.

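jq is new to this tutorial's prerequisites, and the revised steps below use it repeatedly to pull values out of the ARM deployment output files. Here is a minimal sketch of that pattern; the inline JSON and the `storage123` value are hypothetical stand-ins for the `resourcesoutputs_*.json` files the scripts generate:

```bash
# Hypothetical sample mirroring the shape of a resourcesoutputs_*.json file
echo '{"properties":{"outputs":{"blobStorageName":{"value":"storage123"}}}}' \
    | jq -r '.properties.outputs.blobStorageName.value'
# Prints: storage123
```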
 ## Create resources

 ### Clone the repository with scripts and data

-1. Sign in to the [Azure portal](https://portal.azure.com).
+1. Log in to your Azure subscription. If you plan to use Azure Cloud Shell, then select **Try it** in the upper-right corner of the code block. Otherwise, enter the command below:

-1. Open Azure Cloud Shell from the top menu bar. Select your subscription for creating a file share if Cloud Shell prompts you.
+    ```azurecli-interactive
+    az login

-    ![Open Azure Cloud Shell](./media/hdinsight-sales-insights-etl/hdinsight-sales-insights-etl-click-cloud-shell.png)
-
-1. In the **Select environment** drop-down menu, choose **Bash**.
+    # If you have multiple subscriptions, set the one to use
+    # az account set --subscription "SUBSCRIPTIONID"
+    ```

 1. Ensure you're a member of the Azure role [owner](../role-based-access-control/built-in-roles.md). Replace `user@contoso.com` with your account and then enter the command:

@@ -50,29 +55,7 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
     ```

     If no record is returned, you aren't a member and won't be able to complete this tutorial.

-1. List your subscriptions entering the command:
-
-    ```azurecli
-    az account list --output table
-    ```
-
-    Note the ID of the subscription that you'll use for this project.
-
-1. Set the subscription you'll use for this project. Replace `SUBSCRIPTIONID` with the actual value, then enter the command.
-
-    ```azurecli
-    subscriptionID="SUBSCRIPTIONID"
-    az account set --subscription $subscriptionID
-    ```
-
-1. Create a new resource group for the project. Replace `RESOURCEGROUP` with the desired name, then enter the command.
-
-    ```azurecli
-    resourceGroup="RESOURCEGROUP"
-    az group create --name $resourceGroup --location westus
-    ```
-
-1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:
+1. Download the data and scripts for this tutorial from the [HDInsight sales insights ETL repository](https://github.com/Azure-Samples/hdinsight-sales-insights-etl). Enter the following command:

     ```bash
     git clone https://github.com/Azure-Samples/hdinsight-sales-insights-etl.git
@@ -93,12 +76,20 @@ If you don't have an Azure subscription, create a [free account](https://azure.m
     chmod +x scripts/*.sh
     ```

-1. Execute the script. Replace `RESOURCE_GROUP_NAME` and `LOCATION` with the relevant values, then enter the command:
+1. Set a variable for the resource group. Replace `RESOURCE_GROUP_NAME` with the name of an existing or new resource group, then enter the command:
+
+    ```bash
+    resourceGroup="RESOURCE_GROUP_NAME"
+    ```
+
+1. Execute the script. Replace `LOCATION` with a desired value, then enter the command:

     ```bash
-    ./scripts/resources.sh RESOURCE_GROUP_NAME LOCATION
+    ./scripts/resources.sh $resourceGroup LOCATION
     ```

+    If you're not sure which region to specify, you can retrieve a list of supported regions for your subscription with the [az account list-locations](https://docs.microsoft.com/cli/azure/account?view=azure-cli-latest#az-account-list-locations) command, as shown below.
+
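For example, the following lists the region names your subscription can deploy to; any `name` value (such as `westus`) is a valid `LOCATION`:

```azurecli
# List regions available to your subscription
az account list-locations --query "[].name" --output table
```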
     The command will deploy the following resources:

     * An Azure Blob storage account. This account will hold the company sales data.
@@ -110,49 +101,26 @@ If you don't have an Azure subscription, create a [free account](https://azure.m

     Cluster creation can take around 20 minutes.

-The `resources.sh` script contains the following commands. It isn't required for you to run these commands if you already executed the script in the previous step.
-
-* `az group deployment create` - This command uses an Azure Resource Manager template (`resourcestemplate.json`) to create the specified resources with the desired configuration.
-
-    ```azurecli
-    az group deployment create --name ResourcesDeployment \
-        --resource-group $resourceGroup \
-        --template-file resourcestemplate.json \
-        --parameters "@resourceparameters.json"
-    ```
-
-* `az storage blob upload-batch` - This command uploads the sales data .csv files into the newly created Blob storage account by using this command:
-
-    ```azurecli
-    az storage blob upload-batch -d rawdata \
-        --account-name <BLOB STORAGE NAME> -s ./ --pattern *.csv
-    ```
-
-The default password for SSH access to the clusters is `Thisisapassword1`. If you want to change the password, go to the `resourcesparameters.json` file and change the password for the `sparksshPassword`, `sparkClusterLoginPassword`, `llapClusterLoginPassword`, and `llapsshPassword` parameters.
+The default password for SSH access to the clusters is `Thisisapassword1`. If you want to change the password, go to the `./templates/resourcesparameters_remainder.json` file and change the password for the `sparksshPassword`, `sparkClusterLoginPassword`, `llapClusterLoginPassword`, and `llapsshPassword` parameters.

 ### Verify deployment and collect resource information

-1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Select **Deployments** under **Settings**. Select the name of your deployment, `ResourcesDeployment`. Here you can see the resources that have successfully deployed and the resources that are still in progress.
+1. If you want to check the status of your deployment, go to the resource group on the Azure portal. Under **Settings**, select **Deployments**, then your deployment. Here you can see the resources that have successfully deployed and the resources that are still in progress.

 1. To view the names of the clusters, enter the following command:

-    ```azurecli
-    sparkCluster=$(az hdinsight list \
-        --resource-group $resourceGroup \
-        --query "[?contains(name,'spark')].{clusterName:name}" -o tsv)
-
-    llapCluster=$(az hdinsight list \
-        --resource-group $resourceGroup \
-        --query "[?contains(name,'llap')].{clusterName:name}" -o tsv)
+    ```bash
+    sparkClusterName=$(cat resourcesoutputs_remainder.json | jq -r '.properties.outputs.sparkClusterName.value')
+    llapClusterName=$(cat resourcesoutputs_remainder.json | jq -r '.properties.outputs.llapClusterName.value')

-    echo $sparkCluster
-    echo $llapCluster
+    echo "Spark Cluster" $sparkClusterName
+    echo "LLAP cluster" $llapClusterName
     ```

 1. To view the Azure storage account and access key, enter the following command:

     ```azurecli
-    blobStorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.blobStorageName.value')
+    blobStorageName=$(cat resourcesoutputs_storage.json | jq -r '.properties.outputs.blobStorageName.value')

     blobKey=$(az storage account keys list \
         --account-name $blobStorageName \
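If the jq lookups return empty values, you can cross-check the cluster names directly with the CLI; a quick sanity check, assuming `$resourceGroup` is still set from the earlier step:

```azurecli
# Both the Spark and the LLAP (Interactive Query) clusters should be listed
az hdinsight list --resource-group $resourceGroup --output table
```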
@@ -166,7 +134,7 @@ The default password for SSH access to the clusters is `Thisisapassword1`. If yo
 1. To view the Data Lake Storage Gen2 account and access key, enter the following command:

     ```azurecli
-    ADLSGen2StorageName=$(cat resourcesoutputs.json | jq -r '.properties.outputs.adlsGen2StorageName.value')
+    ADLSGen2StorageName=$(cat resourcesoutputs_storage.json | jq -r '.properties.outputs.adlsGen2StorageName.value')

     adlsKey=$(az storage account keys list \
         --account-name $ADLSGen2StorageName \
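With the account names and keys captured, you can optionally spot-check that the sales .csv files reached Blob storage. A sketch, assuming the `rawdata` container used by the upload step and the `$blobStorageName` and `$blobKey` variables from the previous step:

```azurecli
# List the uploaded sales data files in the rawdata container
az storage blob list \
    --container-name rawdata \
    --account-name $blobStorageName \
    --account-key $blobKey \
    --query "[].name" --output table
```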
@@ -186,10 +154,13 @@ This data factory will have one pipeline with two activities:
 * The first activity will copy the data from Azure Blob storage to the Data Lake Storage Gen 2 storage account to mimic data ingestion.
 * The second activity will transform the data in the Spark cluster. The script transforms the data by removing unwanted columns. It also appends a new column that calculates the revenue that a single transaction generates.

-To set up your Azure Data Factory pipeline, execute the following command:
+To set up your Azure Data Factory pipeline, execute the command below. You should still be in the `hdinsight-sales-insights-etl` directory.

 ```bash
-./scripts/adf.sh
+blobStorageName=$(cat resourcesoutputs_storage.json | jq -r '.properties.outputs.blobStorageName.value')
+ADLSGen2StorageName=$(cat resourcesoutputs_storage.json | jq -r '.properties.outputs.adlsGen2StorageName.value')
+
+./scripts/adf.sh $resourceGroup $ADLSGen2StorageName $blobStorageName
 ```
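If the script fails immediately, first confirm that the two jq lookups above resolved; a trivial check:

```bash
# Empty values here mean resourcesoutputs_storage.json is missing or malformed
echo "Blob storage account: $blobStorageName"
echo "ADLS Gen2 account: $ADLSGen2StorageName"
```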

 This script does the following things:
@@ -200,35 +171,47 @@
 1. Obtains storage keys for the Data Lake Storage Gen2 and Blob storage accounts.
 1. Creates another resource deployment to create an Azure Data Factory pipeline, with its associated linked services and activities. It passes the storage keys as parameters to the template file so that the linked services can access the storage accounts correctly.

-The Data Factory pipeline is deployed through the following command:
-
-```azurecli-interactive
-az group deployment create --name ADFDeployment \
-    --resource-group $resourceGroup \
-    --template-file adftemplate.json \
-    --parameters "@adfparameters.json"
-```
-
 ## Run the data pipeline

 ### Trigger the Data Factory activities

 The first activity in the Data Factory pipeline that you've created moves the data from Blob storage to Data Lake Storage Gen2. The second activity applies the Spark transformations on the data and saves the transformed .csv files to a new location. The entire pipeline might take a few minutes to finish.

-To trigger the pipelines, you can either:
+To retrieve the Data Factory name, enter the following command:

-* Trigger the Data Factory pipelines in PowerShell. Replace `DataFactoryName` with the actual Data Factory name, then run the following commands:
+```azurecli
+cat resourcesoutputs_adf.json | jq -r '.properties.outputs.factoryName.value'
+```
+
+To trigger the pipeline, you can either:
+
+* Trigger the Data Factory pipeline in PowerShell. Replace `RESOURCEGROUP` and `DataFactoryName` with the appropriate values, then run the following commands:

     ```powershell
-    Invoke-AzDataFactoryV2Pipeline -DataFactory DataFactoryName -PipelineName "CopyPipeline_k8z"
-    Invoke-AzDataFactoryV2Pipeline -DataFactory DataFactoryName -PipelineName "sparkTransformPipeline"
+    # If you have multiple subscriptions, set the one to use
+    # Select-AzSubscription -SubscriptionId "<SUBSCRIPTIONID>"
+
+    $resourceGroup="RESOURCEGROUP"
+    $dataFactory="DataFactoryName"
+
+    $pipeline = Invoke-AzDataFactoryV2Pipeline `
+        -ResourceGroupName $resourceGroup `
+        -DataFactoryName $dataFactory `
+        -PipelineName "IngestAndTransform"
+
+    Get-AzDataFactoryV2PipelineRun `
+        -ResourceGroupName $resourceGroup `
+        -DataFactoryName $dataFactory `
+        -PipelineRunId $pipeline
     ```

+    Re-execute `Get-AzDataFactoryV2PipelineRun` as needed to monitor progress.
+
 Or

-* Open the data factory and select **Author & Monitor**. Trigger the copy pipeline and then the Spark pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).
+* Open the data factory and select **Author & Monitor**. Trigger the `IngestAndTransform` pipeline from the portal. For information on triggering pipelines through the portal, see [Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory](hdinsight-hadoop-create-linux-clusters-adf.md#trigger-a-pipeline).

-To verify that the pipelines have run, you can take either of the following steps:
+To verify that the pipeline has run, you can take either of the following steps:

 * Go to the **Monitor** section in your data factory through the portal.
 * In Azure Storage Explorer, go to your Data Lake Storage Gen 2 storage account. Go to the `files` file system, and then go to the `transformed` folder and check its contents to see if the pipeline succeeded.
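If you'd rather keep the factory name in a shell variable for reuse on the Bash side, the same jq pattern applies; a small sketch (the `dataFactoryName` variable name is just a suggestion):

```bash
# Capture the Data Factory name emitted by the ADF deployment
dataFactoryName=$(cat resourcesoutputs_adf.json | jq -r '.properties.outputs.factoryName.value')
echo "Data Factory:" $dataFactoryName
```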
@@ -237,37 +220,48 @@ For other ways to transform data by using HDInsight, see [this article on using

 ### Create a table on the Interactive Query cluster to view data on Power BI

-1. Copy the `query.hql` file to the LLAP cluster by using SCP. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command:
+1. Copy the `query.hql` file to the LLAP cluster by using SCP. Enter the command:

     ```bash
-    scp scripts/query.hql sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/
+    llapClusterName=$(cat resourcesoutputs_remainder.json | jq -r '.properties.outputs.llapClusterName.value')
+    scp scripts/query.hql sshuser@$llapClusterName-ssh.azurehdinsight.net:/home/sshuser/
     ```

-2. Use SSH to access the LLAP cluster. Replace `LLAPCLUSTERNAME` with the actual name, then enter the command. If you haven't altered the `resourcesparameters.json` file, the password is `Thisisapassword1`.
+    Reminder: The default password is `Thisisapassword1`.
+
+1. Use SSH to access the LLAP cluster. Enter the command:

     ```bash
-    ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net
+    ssh sshuser@$llapClusterName-ssh.azurehdinsight.net
     ```

-3. Use the following command to run the script:
+1. Use the following command to run the script:

     ```bash
     beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -f query.hql
     ```

-    This script will create a managed table on the Interactive Query cluster that you can access from Power BI.
+    This script will create a managed table on the Interactive Query cluster that you can access from Power BI.

 ### Create a Power BI dashboard from sales data

 1. Open Power BI Desktop.
-1. Select **Get Data**.
-1. Search for **HDInsight Interactive Query cluster**.
-1. Paste the URI for your cluster there. It should be in the format `https://LLAPCLUSTERNAME.azurehdinsight.net`.

-    Enter `default` for the database.
-1. Enter the username and password that you use to access the cluster.
+1. From the menu, navigate to **Get data** > **More...** > **Azure** > **HDInsight Interactive Query**.

-After the data is loaded, you can experiment with the dashboard that you want to create. See the following links to get started with Power BI dashboards:
+1. Select **Connect**.
+
+1. From the **HDInsight Interactive Query** dialog:
+    1. In the **Server** text box, enter the name of your LLAP cluster in the format `https://LLAPCLUSTERNAME.azurehdinsight.net`.
+    1. In the **database** text box, enter `default`.
+    1. Select **OK**.
+
+1. From the **AzureHive** dialog:
+    1. In the **User name** text box, enter `admin`.
+    1. In the **Password** text box, enter `Thisisapassword1`.
+    1. Select **Connect**.
+
+1. From **Navigator**, select `sales` and/or `sales_raw` to preview the data. After the data is loaded, you can experiment with the dashboard that you want to create. See the following links to get started with Power BI dashboards:

 * [Introduction to dashboards for Power BI designers](https://docs.microsoft.com/power-bi/service-dashboards)
 * [Tutorial: Get started with the Power BI service](https://docs.microsoft.com/power-bi/service-get-started)
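Before pointing Power BI at the cluster, you can optionally confirm from the SSH session that `query.hql` created the tables; a quick check, assuming the script ran without errors:

```bash
# Expect sales and sales_raw in the default database
beeline -u 'jdbc:hive2://localhost:10001/;transportMode=http' -e 'SHOW TABLES;'
```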
@@ -276,9 +270,18 @@ After the data is loaded, you can experiment with the dashboard that you want to

 If you're not going to continue to use this application, delete all resources by using the following command so that you aren't charged for them.

-```azurecli-interactive
-az group delete -n $resourceGroup
-```
+1. To remove the resource group, enter the command:
+
+    ```azurecli
+    az group delete -n $resourceGroup
+    ```
+
+1. To remove the service principal, enter the commands:
+
+    ```azurecli
+    servicePrincipal=$(cat serviceprincipal.json | jq -r '.name')
+    az ad sp delete --id $servicePrincipal
+    ```

283286
## Next steps
284287
