Commit a2d54d2 (freshness17)
1 parent 24db687

6 files changed: +28 -24 lines

articles/hdinsight/hdinsight-hadoop-create-linux-clusters-adf.md

Lines changed: 28 additions & 24 deletions
@@ -6,16 +6,17 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: tutorial
-ms.date: 04/18/2019
+ms.date: 10/09/2019
 #Customer intent: As a data worker, I need to create a Hadoop cluster and run Hive jobs on demand
 ---
 
 # Tutorial: Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory
+
 [!INCLUDE [selector](../../includes/hdinsight-create-linux-cluster-selector.md)]
 
 In this tutorial, you learn how to create a [Apache Hadoop](https://hadoop.apache.org/) cluster, on demand, in Azure HDInsight using Azure Data Factory. You then use data pipelines in Azure Data Factory to run Hive jobs and delete the cluster. By the end of this tutorial, you learn how to operationalize a big data job run where cluster creation, job run, and cluster deletion are performed on a schedule.
 
-This tutorial covers the following tasks:
+This tutorial covers the following tasks:
 
 > [!div class="checklist"]
 > * Create an Azure storage account
@@ -33,11 +34,11 @@ If you don't have an Azure subscription, [create a free account](https://azure.m
 
 * The PowerShell [Az Module](https://docs.microsoft.com/powershell/azure/overview) installed.
 
-* An Azure Active Directory service principal. Once you have created the service principal, be sure to retrieve the **application ID** and **authentication key** using the instructions in the linked article. You need these values later in this tutorial. Also, make sure the service principal is a member of the *Contributor* role of the subscription or the resource group in which the cluster is created. For instructions to retrieve the required values and assign the right roles, see [Create an Azure Active Directory service principal](../active-directory/develop/howto-create-service-principal-portal.md).
+* An Azure Active Directory service principal. Once you've created the service principal, be sure to retrieve the **application ID** and **authentication key** using the instructions in the linked article. You need these values later in this tutorial. Also, make sure the service principal is a member of the *Contributor* role of the subscription or the resource group in which the cluster is created. For instructions to retrieve the required values and assign the right roles, see [Create an Azure Active Directory service principal](../active-directory/develop/howto-create-service-principal-portal.md).
 
 ## Create preliminary Azure objects
 
-In this section, you create various objects that will be used for the HDInsight cluster you create on-demand. The created storage account will contain the sample [HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) script (`partitionweblogs.hql`) that you use to simulate a sample [Apache Hive](https://hive.apache.org/) job that runs on the cluster.
+In this section, you create various objects that will be used for the HDInsight cluster you create on-demand. The created storage account will contain the sample [HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) script, `partitionweblogs.hql`, that you use to simulate a sample [Apache Hive](https://hive.apache.org/) job that runs on the cluster.
 
 This section uses an Azure PowerShell script to create the storage account and copy over the required files within the storage account. The Azure PowerShell sample script in this section performs the following tasks:
 
@@ -47,9 +48,6 @@ This section uses an Azure PowerShell script to create the storage account and c
 4. Creates a Blob container in the storage account
 5. Copies the sample HiveQL script (**partitionweblogs.hql**) the Blob container. The script is available at [https://hditutorialdata.blob.core.windows.net/adfhiveactivity/script/partitionweblogs.hql](https://hditutorialdata.blob.core.windows.net/adfhiveactivity/script/partitionweblogs.hql). The sample script is already available in another public Blob container. The PowerShell script below makes a copy of these files into the Azure Storage account it creates.
 
-> [!WARNING]
-> Storage account kind `BlobStorage` cannot be used for HDInsight clusters.
-
 **To create a storage account and copy the files using Azure PowerShell:**
 
 > [!IMPORTANT]
@@ -77,6 +75,10 @@ if(-not($sub))
 {
     Connect-AzAccount
 }
+
+# If you have multiple subscriptions, set the one to use
+# Select-AzSubscription -SubscriptionId "<SUBSCRIPTIONID>"
+
 #endregion
 
 ####################################
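Assembled, the sign-in block this hunk extends reads as the following sketch. The `if(-not($sub))` check comes from the hunk header; the `Get-AzSubscription` call that populates `$sub` is an assumption about the surrounding tutorial script, and the subscription ID is a placeholder.

```powershell
# Reuse an existing Azure context if one is available; otherwise prompt
# for an interactive sign-in. Assumes the Az PowerShell module is installed.
$sub = Get-AzSubscription -ErrorAction SilentlyContinue
if(-not($sub))
{
    Connect-AzAccount
}

# If you have multiple subscriptions, set the one to use
# Select-AzSubscription -SubscriptionId "<SUBSCRIPTIONID>"
```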
@@ -123,11 +125,13 @@ Write-Host "`nCopying files ..." -ForegroundColor Green
 
 $blobs = Get-AzStorageBlob `
     -Context $sourceContext `
-    -Container $sourceContainerName
+    -Container $sourceContainerName `
+    -Blob "hivescripts\hivescript.hql"
 
 $blobs|Start-AzStorageBlobCopy `
     -DestContext $destContext `
-    -DestContainer $destContainerName
+    -DestContainer $destContainerName `
+    -DestBlob "hivescripts\partitionweblogs.hql"
 
 Write-Host "`nCopied files ..." -ForegroundColor Green
 Get-AzStorageBlob `
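After this change, the copy step pins a single source blob and renames it at the destination. Assembled, the updated commands read as the following sketch, assuming the `$sourceContext`, `$destContext`, `$sourceContainerName`, and `$destContainerName` variables defined earlier in the tutorial script:

```powershell
# Fetch only the named source blob instead of every blob in the container.
$blobs = Get-AzStorageBlob `
    -Context $sourceContext `
    -Container $sourceContainerName `
    -Blob "hivescripts\hivescript.hql"

# Copy it to the destination storage account, renaming the blob
# to partitionweblogs.hql on the way.
$blobs | Start-AzStorageBlobCopy `
    -DestContext $destContext `
    -DestContainer $destContainerName `
    -DestBlob "hivescripts\partitionweblogs.hql"
```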
@@ -146,16 +150,16 @@ Write-host "`nScript completed" -ForegroundColor Green
 **To verify the storage account creation**
 
 1. Sign on to the [Azure portal](https://portal.azure.com).
-2. Select **Resource groups** on the left pane.
-3. Select the resource group name you created in your PowerShell script. Use the filter if you have too many resource groups listed.
-4. On the **Resources** tile, you see one resource listed unless you share the resource group with other projects. That resource is the storage account with the name you specified earlier. Select the storage account name.
-5. Select the **Blobs** tiles.
-6. Select the **adfgetstarted** container. You see a folder called **hivescripts**.
-7. Open the folder and make sure it contains the sample script file, **partitionweblogs.hql**.
+1. From the left, navigate to **All services** > **General** > **Resource groups**.
+1. Select the resource group name you created in your PowerShell script. Use the filter if you have too many resource groups listed.
+1. From the **Overview** view, you see one resource listed unless you share the resource group with other projects. That resource is the storage account with the name you specified earlier. Select the storage account name.
+1. Select the **Containers** tile.
+1. Select the **adfgetstarted** container. You see a folder called **hivescripts**.
+1. Open the folder and make sure it contains the sample script file, **partitionweblogs.hql**.
 
 ## Understand the Azure Data Factory activity
 
-[Azure Data Factory](../data-factory/introduction.md) orchestrates and automates the movement and transformation of data. Azure Data Factory can create an HDInsight Hadoop cluster just-in-time to process an input data slice and delete the cluster when the processing is complete.
+[Azure Data Factory](../data-factory/introduction.md) orchestrates and automates the movement and transformation of data. Azure Data Factory can create an HDInsight Hadoop cluster just-in-time to process an input data slice and delete the cluster when the processing is complete.
 
 In Azure Data Factory, a data factory can have one or more data pipelines. A data pipeline has one or more activities. There are two types of activities:
 
@@ -190,12 +194,13 @@ In this article, you configure the Hive activity to create an on-demand HDInsigh
 |Resource group | Select **Use existing** and then select the resource group you created using the PowerShell script. |
 |Version | Leave at **V2**. |
 |Location | The location is automatically set to the location you specified while creating the resource group earlier. For this tutorial, the location is set to **East US**. |
+|Enable GIT|Uncheck this box.|
 
 ![Create Azure Data Factory using Azure portal](./media/hdinsight-hadoop-create-linux-clusters-adf/create-data-factory-portal.png "Create Azure Data Factory using Azure portal")
 
 4. Select **Create**. Creating a data factory might take anywhere between 2 to 4 minutes.
 
-5. Once the data factory is created, you will receive a **Deployment succeeded** notification with a **Go to resource** button. Select **Go to resource** to open the Data Factory default view.
+5. Once the data factory is created, you'll receive a **Deployment succeeded** notification with a **Go to resource** button. Select **Go to resource** to open the Data Factory default view.
 
 6. Select **Author & Monitor** to launch the Azure Data Factory authoring and monitoring portal.

@@ -230,7 +235,7 @@ In this section, you author two linked services within your data factory.
 |Azure subscription |Select your subscription from the drop-down list.|
 |Storage account name |Select the Azure Storage account you created as part of the PowerShell script.|
 
-Then select **Finish**.
+Select **Test connection** and if successful, then select **Create**.
 
 ![Provide name for Azure Storage linked service](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-storage-linked-service-details.png "Provide name for Azure Storage linked service")
 
@@ -258,13 +263,12 @@ In this section, you author two linked services within your data factory.
 | Cluster name prefix | Provide a value that will be prefixed to all the cluster types that are created by the data factory. |
 |Subscription |Select your subscription from the drop-down list.|
 | Select resource group | Select the resource group you created as part of the PowerShell script you used earlier.|
-|Select region | Select a region from the drop-down list.|
 | OS type/Cluster SSH user name | Enter an SSH user name, commonly `sshuser`. |
 | OS type/Cluster SSH password | Provide a password for the SSH user |
 | OS type/Cluster user name | Enter a cluster user name, commonly `admin`. |
-| OS type/Cluster user password | Provide a password for the cluster user. |
+| OS type/Cluster password | Provide a password for the cluster user. |
 
-Then select **Finish**.
+Then select **Create**.
 
 ![Provide values for HDInsight linked service](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-linked-service-details.png "Provide values for HDInsight linked service")
 
@@ -278,7 +282,7 @@ In this section, you author two linked services within your data factory.
 
 ![Add activities to Data Factory pipeline](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-add-hive-pipeline.png "Add activities to Data Factory pipeline")
 
-3. Make sure you have the Hive activity selected, select the **HDI Cluster** tab, and from the **HDInsight Linked Service** drop-down list, select the linked service you created earlier, **HDinightLinkedService**, for HDInsight.
+3. Make sure you have the Hive activity selected, select the **HDI Cluster** tab, and from the **HDInsight Linked Service** drop-down list, select the linked service you created earlier, **HDInsightLinkedService**, for HDInsight.
 
 ![Provide HDInsight cluster details for the pipeline](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-hive-activity-select-hdinsight-linked-service.png "Provide HDInsight cluster details for the pipeline")
 
@@ -318,7 +322,7 @@ In this section, you author two linked services within your data factory.
 
 1. Select **Refresh** to refresh the status.
 
-1. You can also select the **View Activity Runs** icon to see the activity run associated with the pipeline. In the screenshot below, you see only one activity run since there is only one activity in the pipeline you created. To switch back to the previous view, select **Pipelines** towards the top of the page.
+1. You can also select the **View Activity Runs** icon to see the activity run associated with the pipeline. In the screenshot below, you see only one activity run since there's only one activity in the pipeline you created. To switch back to the previous view, select **Pipelines** towards the top of the page.
 
 ![Monitor the Azure Data Factory pipeline activity](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-monitor-pipeline-activity.png "Monitor the Azure Data Factory pipeline activity")
 
@@ -336,7 +340,7 @@ In this section, you author two linked services within your data factory.
 
 ## Clean up resources
 
-With the on-demand HDInsight cluster creation, you do not need to explicitly delete the HDInsight cluster. The cluster is deleted based on the configuration you provided while creating the pipeline. However, even after the cluster is deleted, the storage accounts associated with the cluster continue to exist. This behavior is by design so that you can keep your data intact. However, if you do not want to persist the data, you may delete the storage account you created.
+With the on-demand HDInsight cluster creation, you don't need to explicitly delete the HDInsight cluster. The cluster is deleted based on the configuration you provided while creating the pipeline. However, even after the cluster is deleted, the storage accounts associated with the cluster continue to exist. This behavior is by design so that you can keep your data intact. However, if you don't want to persist the data, you may delete the storage account you created.
 
 Alternatively, you can delete the entire resource group that you created for this tutorial. This deletes the storage account and the Azure Data Factory that you created.
 
Binary media files changed: 45.2 KB and 57.6 KB (image diffs not shown).
0 commit comments