Commit b6cbe5d

Merge pull request #112692 from dagiro/freshness_c45
freshness_c45
2 parents 96b0250 + 523083a commit b6cbe5d

File tree

1 file changed (+15 −15 lines changed)
articles/hdinsight/hdinsight-hadoop-create-linux-clusters-adf.md

Lines changed: 15 additions & 15 deletions
@@ -6,15 +6,15 @@ ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: tutorial
-ms.date: 03/18/2020
+ms.date: 04/24/2020
 #Customer intent: As a data worker, I need to create a Hadoop cluster and run Hive jobs on demand
 ---

 # Tutorial: Create on-demand Apache Hadoop clusters in HDInsight using Azure Data Factory

 [!INCLUDE [selector](../../includes/hdinsight-create-linux-cluster-selector.md)]

-In this tutorial, you learn how to create an [Apache Hadoop](./hadoop/apache-hadoop-introduction.md) cluster, on demand, in Azure HDInsight using Azure Data Factory. You then use data pipelines in Azure Data Factory to run Hive jobs and delete the cluster. By the end of this tutorial, you learn how to operationalize a big data job run where cluster creation, job run, and cluster deletion are performed on a schedule.
+In this tutorial, you learn how to create an [Apache Hadoop](./hadoop/apache-hadoop-introduction.md) cluster, on demand, in Azure HDInsight using Azure Data Factory. You then use data pipelines in Azure Data Factory to run Hive jobs and delete the cluster. By the end of this tutorial, you learn how to `operationalize` a big data job run where cluster creation, job run, and cluster deletion are done on a schedule.

 This tutorial covers the following tasks:
@@ -38,9 +38,9 @@ If you don't have an Azure subscription, [create a free account](https://azure.m

 ## Create preliminary Azure objects

-In this section, you create various objects that will be used for the HDInsight cluster you create on-demand. The created storage account will contain the sample [HiveQL](https://cwiki.apache.org/confluence/display/Hive/LanguageManual) script, `partitionweblogs.hql`, that you use to simulate a sample Apache Hive job that runs on the cluster.
+In this section, you create various objects that will be used for the HDInsight cluster you create on-demand. The created storage account will contain the sample HiveQL script, `partitionweblogs.hql`, that you use to simulate a sample Apache Hive job that runs on the cluster.

-This section uses an Azure PowerShell script to create the storage account and copy over the required files within the storage account. The Azure PowerShell sample script in this section performs the following tasks:
+This section uses an Azure PowerShell script to create the storage account and copy over the required files within the storage account. The Azure PowerShell sample script in this section does the following tasks:

 1. Signs in to Azure.
 2. Creates an Azure resource group.
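
For orientation while reading the diff, here is a minimal Az PowerShell sketch of the kind of script this section refers to. The resource names and the local file path are placeholders rather than the article's actual values, and the full sample script does more (prompts for names, error handling):

```powershell
# Minimal sketch only; the tutorial's full script adds prompts and error handling.
Connect-AzAccount                                  # 1. Signs in to Azure.

$resourceGroup  = "myadfrg"                        # placeholder name
$location       = "East US"
$storageAccount = "myadfstorage"                   # placeholder; must be globally unique

New-AzResourceGroup -Name $resourceGroup -Location $location   # 2. Creates a resource group.

$account = New-AzStorageAccount -ResourceGroupName $resourceGroup `
    -Name $storageAccount -Location $location -SkuName Standard_LRS
$context = $account.Context

# Container and folder layout that the later steps expect.
New-AzStorageContainer -Name "adfgetstarted" -Context $context
Set-AzStorageBlobContent -File ".\partitionweblogs.hql" `
    -Container "adfgetstarted" -Blob "hivescripts/partitionweblogs.hql" -Context $context
```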
@@ -154,7 +154,7 @@ Write-host "`nScript completed" -ForegroundColor Green
 1. Select the resource group name you created in your PowerShell script. Use the filter if you have too many resource groups listed.
 1. From the **Overview** view, you see one resource listed unless you share the resource group with other projects. That resource is the storage account with the name you specified earlier. Select the storage account name.
 1. Select the **Containers** tile.
-1. Select the **adfgetstarted** container. You see a folder called **hivescripts**.
+1. Select the **adfgetstarted** container. You see a folder called **`hivescripts`**.
 1. Open the folder and make sure it contains the sample script file, **partitionweblogs.hql**.
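
If you prefer to verify from the command line instead of the portal, a small sketch, assuming `$context` still holds the storage context from the script above:

```powershell
# Confirm the sample Hive script landed in the hivescripts folder.
Get-AzStorageBlob -Container "adfgetstarted" -Prefix "hivescripts/" -Context $context |
    Select-Object Name, Length
```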
## Understand the Azure Data Factory activity
@@ -170,7 +170,7 @@ In this article, you configure the Hive activity to create an on-demand HDInsigh

 1. An HDInsight Hadoop cluster is automatically created for you just-in-time to process the slice.

-2. The input data is processed by running a HiveQL script on the cluster. In this tutorial, the HiveQL script associated with the hive activity performs the following actions:
+2. The input data is processed by running a HiveQL script on the cluster. In this tutorial, the HiveQL script associated with the hive activity does the following actions:

    * Uses the existing table (*hivesampletable*) to create another table **HiveSampleOut**.
    * Populates the **HiveSampleOut** table with only specific columns from the original *hivesampletable*.
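
As a rough illustration of those two actions, here is the general shape such a script could take, written as a PowerShell here-string the way you might submit it ad hoc. This is an assumption for illustration only; the actual `partitionweblogs.hql` shipped with the sample may use different columns and settings:

```powershell
# Hypothetical approximation of the sample script's two actions; not the real file.
$hiveQuery = @"
CREATE TABLE IF NOT EXISTS HiveSampleOut (
    clientid    string,
    market      string,
    devicemodel string,
    state       string
);
INSERT OVERWRITE TABLE HiveSampleOut
SELECT clientid, market, devicemodel, state
FROM hivesampletable;
"@

# One way to run it interactively against a live cluster (Az.HDInsight module):
# Use-AzHDInsightCluster -ClusterName "<yourcluster>" -HttpCredential (Get-Credential)
# Invoke-AzHDInsightHiveJob -Query $hiveQuery
```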
@@ -181,7 +181,7 @@ In this article, you configure the Hive activity to create an on-demand HDInsigh

 1. Sign in to the [Azure portal](https://portal.azure.com/).

-2. From the left menu, navigate to **+ Create a resource** > **Analytics** > **Data Factory**.
+2. From the left menu, navigate to **`+ Create a resource`** > **Analytics** > **Data Factory**.

 ![Azure Data Factory on the portal](./media/hdinsight-hadoop-create-linux-clusters-adf/data-factory-azure-portal.png "Azure Data Factory on the portal")
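
The portal route above is what the tutorial follows; if you would rather script this step, a sketch using the Az.DataFactory module (the factory name here is a placeholder and must be globally unique):

```powershell
# Creates (or updates) a Data Factory v2 instance in the existing resource group.
Set-AzDataFactoryV2 -ResourceGroupName $resourceGroup `
    -Name "hdi-adf-tutorial-factory" -Location $location
```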
@@ -260,7 +260,7 @@ In this section, you author two linked services within your data factory.
 | Time to live | Provide the duration for which you want the HDInsight cluster to be available before being automatically deleted.|
 | Service principal ID | Provide the application ID of the Azure Active Directory service principal you created as part of the prerequisites. |
 | Service principal key | Provide the authentication key for the Azure Active Directory service principal. |
-| Cluster name prefix | Provide a value that will be prefixed to all the cluster types that are created by the data factory. |
+| Cluster name prefix | Provide a value that will be prefixed to all the cluster types created by the data factory. |
 |Subscription |Select your subscription from the drop-down list.|
 | Select resource group | Select the resource group you created as part of the PowerShell script you used earlier.|
 | OS type/Cluster SSH user name | Enter an SSH user name, commonly `sshuser`. |
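
The two service principal rows assume you completed the prerequisites; as a reminder of where those values come from, a hedged sketch (the display name is a placeholder, and the output property names shown are from recent Az versions):

```powershell
# Create a service principal and surface the two values the linked service asks for.
$sp = New-AzADServicePrincipal -DisplayName "hdi-adf-ondemand-sp"   # placeholder name
"Service principal ID:  $($sp.AppId)"
"Service principal key: $($sp.PasswordCredentials.SecretText)"      # property names vary by Az version
```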
@@ -282,7 +282,7 @@ In this section, you author two linked services within your data factory.

 ![Add activities to Data Factory pipeline](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-add-hive-pipeline.png "Add activities to Data Factory pipeline")

-1. Make sure you have the Hive activity selected, select the **HDI Cluster** tab, and from the **HDInsight Linked Service** drop-down list, select the linked service you created earlier, **HDInsightLinkedService**, for HDInsight.
+1. Make sure you have the Hive activity selected, select the **HDI Cluster** tab. And from the **HDInsight Linked Service** drop-down list, select the linked service you created earlier, **HDInsightLinkedService**, for HDInsight.

 ![Provide HDInsight cluster details for the pipeline](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-hive-activity-select-hdinsight-linked-service.png "Provide HDInsight cluster details for the pipeline")
@@ -294,9 +294,9 @@ In this section, you author two linked services within your data factory.

 ![Provide Hive script details for the pipeline](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-provide-script-path.png "Provide Hive script details for the pipeline")

-1. Under **Advanced** > **Parameters**, select **Auto-fill from script**. This option looks for any parameters in the Hive script that require values at runtime.
+1. Under **Advanced** > **Parameters**, select **`Auto-fill from script`**. This option looks for any parameters in the Hive script that require values at runtime.

-1. In the **value** text box, add the existing folder in the format `wasbs://adfgetstarted@<StorageAccount>.blob.core.windows.net/outputfolder/`. The path is case-sensitive. This is the path where the output of the script will be stored. The `wasbs` schema is necessary because storage accounts now have secure transfer required enabled by default.
+1. In the **value** text box, add the existing folder in the format `wasbs://adfgetstarted@<StorageAccount>.blob.core.windows.net/outputfolder/`. The path is case-sensitive. This path is where the output of the script will be stored. The `wasbs` schema is necessary because storage accounts now have secure transfer required enabled by default.

 ![Provide parameters for the Hive script](./media/hdinsight-hadoop-create-linux-clusters-adf/hdinsight-data-factory-provide-script-parameters.png "Provide parameters for the Hive script")
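
To avoid typos in that case-sensitive value, you can build it from the storage account name, and after a successful pipeline run the same context lets you peek at the output. A sketch, assuming `$storageAccount` and `$context` from the earlier script:

```powershell
# Build the case-sensitive wasbs URI the Hive parameter expects.
$outputFolder = "wasbs://adfgetstarted@$storageAccount.blob.core.windows.net/outputfolder/"
$outputFolder

# After the pipeline runs, list whatever the script wrote to the output folder.
Get-AzStorageBlob -Container "adfgetstarted" -Prefix "outputfolder/" -Context $context |
    Select-Object Name, LastModified
```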

@@ -342,9 +342,9 @@ In this section, you author two linked services within your data factory.

 ## Clean up resources

-With the on-demand HDInsight cluster creation, you don't need to explicitly delete the HDInsight cluster. The cluster is deleted based on the configuration you provided while creating the pipeline. However, even after the cluster is deleted, the storage accounts associated with the cluster continue to exist. This behavior is by design so that you can keep your data intact. However, if you don't want to persist the data, you may delete the storage account you created.
+With the on-demand HDInsight cluster creation, you don't need to explicitly delete the HDInsight cluster. The cluster is deleted based on the configuration you provided while creating the pipeline. Even after the cluster is deleted, the storage accounts associated with the cluster continue to exist. This behavior is by design so that you can keep your data intact. However, if you don't want to persist the data, you may delete the storage account you created.

-Alternatively, you can delete the entire resource group that you created for this tutorial. This deletes the storage account and the Azure Data Factory that you created.
+Or, you can delete the entire resource group that you created for this tutorial. This process deletes the storage account and the Azure Data Factory that you created.

 ### Delete the resource group
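
The portal steps follow; equivalently, the whole cleanup is one command from Az PowerShell. A sketch assuming `$resourceGroup` from the earlier script (deletion is irreversible, so the confirmation prompt is left on):

```powershell
# Removes the resource group together with the storage account and data factory.
Remove-AzResourceGroup -Name $resourceGroup
```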
@@ -354,13 +354,13 @@ Alternatively, you can delete the entire resource group that you created for thi
 1. On the **Resources** tile, you shall have the default storage account and the data factory listed unless you share the resource group with other projects.
 1. Select **Delete resource group**. Doing so deletes the storage account and the data stored in the storage account.

-![Azure portal delete resource group](./media/hdinsight-hadoop-create-linux-clusters-adf/delete-resource-group.png "Delete resource group")
+![`Azure portal delete resource group`](./media/hdinsight-hadoop-create-linux-clusters-adf/delete-resource-group.png "Delete resource group")

 1. Enter the resource group name to confirm deletion, and then select **Delete**.

 ## Next steps

-In this article, you learned how to use Azure Data Factory to create on-demand HDInsight cluster and run [Apache Hive](https://hive.apache.org/) jobs. Advance to the next article to learn how to create HDInsight clusters with custom configuration.
+In this article, you learned how to use Azure Data Factory to create on-demand HDInsight cluster and run Apache Hive jobs. Advance to the next article to learn how to create HDInsight clusters with custom configuration.

 > [!div class="nextstepaction"]
 > [Create Azure HDInsight clusters with custom configuration](hdinsight-hadoop-provision-linux-clusters.md)
