In this tutorial, you learn how to create an [Apache Hadoop](https://hadoop.apache.org/) cluster, on demand, in Azure HDInsight using Azure Data Factory. You then use data pipelines in Azure Data Factory to run Hive jobs and delete the cluster. By the end of this tutorial, you learn how to operationalize a big data job run where cluster creation, job run, and cluster deletion are performed on a schedule.
This tutorial covers the following tasks:
> [!div class="checklist"]
> * Create an Azure storage account
If you don't have an Azure subscription, [create a free account](https://azure.microsoft.com/free/) before you begin.

## Prerequisites
* The PowerShell [Az Module](https://docs.microsoft.com/powershell/azure/overview) installed.
* An Azure Active Directory service principal. Once you've created the service principal, be sure to retrieve the **application ID** and **authentication key** using the instructions in the linked article. You need these values later in this tutorial. Also, make sure the service principal is a member of the *Contributor* role of the subscription or the resource group in which the cluster is created. For instructions to retrieve the required values and assign the right roles, see [Create an Azure Active Directory service principal](../active-directory/develop/howto-create-service-principal-portal.md).
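If you'd rather script this prerequisite, the following is a minimal sketch using Azure PowerShell. The display name and resource group name are placeholders, and how the authentication key is returned varies by Az module version, so treat this as an illustration rather than the tutorial's prescribed path:

```powershell
# Sketch only: create a service principal and grant it Contributor on the
# resource group that will hold the on-demand cluster (names are hypothetical).
$sp = New-AzADServicePrincipal -DisplayName "hdi-adf-tutorial-sp"

New-AzRoleAssignment `
    -ApplicationId $sp.ApplicationId `
    -RoleDefinitionName "Contributor" `
    -ResourceGroupName "hdi-adf-tutorial-rg"

# The application ID you need later in the tutorial:
$sp.ApplicationId
```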
## Create preliminary Azure objects
In this section, you create various objects that will be used for the HDInsight cluster you create on-demand. The created storage account will contain the sample [HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) script, `partitionweblogs.hql`, that you use to simulate a sample [Apache Hive](https://hive.apache.org/) job that runs on the cluster.
This section uses an Azure PowerShell script to create the storage account and copy the required files into it. The Azure PowerShell sample script in this section performs the following tasks:
1. Signs in to your Azure account.
2. Creates an Azure resource group.
3. Creates a Storage account.
4. Creates a Blob container in the storage account.
5. Copies the sample HiveQL script (**partitionweblogs.hql**) to the Blob container. The script is available at [https://hditutorialdata.blob.core.windows.net/adfhiveactivity/script/partitionweblogs.hql](https://hditutorialdata.blob.core.windows.net/adfhiveactivity/script/partitionweblogs.hql). The sample script is already available in another public Blob container. The PowerShell script below copies these files into the Azure Storage account it creates.
**To create a storage account and copy the files using Azure PowerShell:**
```powershell
# Excerpt: the beginning of the script (prompts for names, resource group,
# storage account, and storage context creation) is omitted here.

if(-not($sub))
{
    Connect-AzAccount
}

# If you have multiple subscriptions, set the one to use

# ...

Write-Host "`nCopying files ..." -ForegroundColor Green

$blobs = Get-AzStorageBlob `
    -Context $sourceContext `
    -Container $sourceContainerName `
    -Blob "hivescripts\hivescript.hql"

$blobs | Start-AzStorageBlobCopy `
    -DestContext $destContext `
    -DestContainer $destContainerName `
    -DestBlob "hivescripts\partitionweblogs.hql"

Write-Host "`nCopied files ..." -ForegroundColor Green

# List the copied blob to confirm the transfer.
Get-AzStorageBlob `
    -Context $destContext `
    -Container $destContainerName

Write-Host "`nScript completed" -ForegroundColor Green
```
**To verify the storage account creation:**
1. Sign in to the [Azure portal](https://portal.azure.com).

1. From the left, navigate to **All services** > **General** > **Resource groups**.

1. Select the resource group name you created in your PowerShell script. Use the filter if you have too many resource groups listed.

1. From the **Overview** view, you see one resource listed unless you share the resource group with other projects. That resource is the storage account with the name you specified earlier. Select the storage account name.

1. Select the **Containers** tile.

1. Select the **adfgetstarted** container. You see a folder called **hivescripts**.

1. Open the folder and make sure it contains the sample script file, **partitionweblogs.hql**.
## Understand the Azure Data Factory activity
[Azure Data Factory](../data-factory/introduction.md) orchestrates and automates the movement and transformation of data. Azure Data Factory can create an HDInsight Hadoop cluster just-in-time to process an input data slice and delete the cluster when the processing is complete.
In Azure Data Factory, a data factory can have one or more data pipelines. A data pipeline has one or more activities. There are two types of activities:
* Data movement activities
* Data transformation activities

In this article, you configure the Hive activity to create an on-demand HDInsight Hadoop cluster.

## Create a data factory

When creating the data factory, provide the following values:

|Property |Value |
|---|---|
|Resource group | Select **Use existing** and then select the resource group you created using the PowerShell script. |
|Version | Leave at **V2**. |
|Location | The location is automatically set to the location you specified while creating the resource group earlier. For this tutorial, the location is set to **East US**. |
|Enable GIT|Uncheck this box.|

4. Select **Create**. Creating a data factory might take 2 to 4 minutes.
5. Once the data factory is created, you'll receive a **Deployment succeeded** notification with a **Go to resource** button. Select **Go to resource** to open the Data Factory default view.
6. Select **Author & Monitor** to launch the Azure Data Factory authoring and monitoring portal.
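As an optional check, you can also confirm the data factory from PowerShell. A minimal sketch, assuming the Az.DataFactory module is installed; the names below are placeholders for the values you chose:

```powershell
# Verify the data factory exists (substitute your own names).
Get-AzDataFactoryV2 `
    -ResourceGroupName "<yourResourceGroup>" `
    -Name "<yourDataFactoryName>"
```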
## Create linked services

In this section, you author two linked services within your data factory.

For the Azure Storage linked service, provide the following values:

|Property |Value |
|---|---|
|Azure subscription |Select your subscription from the drop-down list.|
|Storage account name |Select the Azure Storage account you created as part of the PowerShell script.|
Select **Test connection**, and if it succeeds, select **Create**.

For the Azure HDInsight on-demand linked service, provide the following values:

|Property |Value |
|---|---|
|Cluster name prefix | Provide a value that will be prefixed to all the cluster types that are created by the data factory. |
|Subscription |Select your subscription from the drop-down list.|
|Select resource group | Select the resource group you created as part of the PowerShell script you used earlier.|
|OS type/Cluster SSH user name | Enter an SSH user name, commonly `sshuser`. |
|OS type/Cluster SSH password | Provide a password for the SSH user. |
|OS type/Cluster user name | Enter a cluster user name, commonly `admin`. |
|OS type/Cluster password | Provide a password for the cluster user. |

Then select **Create**.

## Create a pipeline

3. Make sure you have the Hive activity selected, select the **HDI Cluster** tab, and from the **HDInsight Linked Service** drop-down list, select **HDInsightLinkedService**, the linked service you created earlier for HDInsight.

## Monitor a pipeline
1. Select **Refresh** to refresh the status.
1. You can also select the **View Activity Runs** icon to see the activity run associated with the pipeline. In the screenshot below, you see only one activity run since there's only one activity in the pipeline you created. To switch back to the previous view, select **Pipelines** towards the top of the page.

## Clean up resources
With the on-demand HDInsight cluster creation, you don't need to explicitly delete the HDInsight cluster. The cluster is deleted based on the configuration you provided while creating the pipeline. However, even after the cluster is deleted, the storage accounts associated with the cluster continue to exist. This behavior is by design so that you can keep your data intact. However, if you don't want to persist the data, you may delete the storage account you created.
Alternatively, you can delete the entire resource group that you created for this tutorial. This deletes the storage account and the Azure Data Factory that you created.
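A minimal sketch of both cleanup options in PowerShell, assuming the names you used earlier in this tutorial:

```powershell
# Option 1: delete only the storage account (names are placeholders).
Remove-AzStorageAccount `
    -ResourceGroupName "<yourResourceGroup>" `
    -Name "<yourStorageAccountName>"

# Option 2: delete the entire resource group, which removes the storage
# account and the data factory together.
Remove-AzResourceGroup -Name "<yourResourceGroup>"
```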