
Commit 4974b5f

Merge pull request #205469 from sreekzz/docs-editor/hdinsight-hadoop-provision-lin-1658383908
Add Disk attach feature
2 parents 1dd0a59 + 683fa54 commit 4974b5f

File tree

3 files changed: +55 −42 lines


articles/hdinsight/hdinsight-hadoop-provision-linux-clusters.md

Lines changed: 55 additions & 42 deletions
@@ -1,21 +1,21 @@
 ---
 title: Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more
-description: Set up Hadoop, Kafka, Spark, HBase, or Storm clusters for HDInsight from a browser, the Azure classic CLI, Azure PowerShell, REST, or SDK.
+description: Set up Hadoop, Kafka, Spark, or HBase clusters for HDInsight from a browser, the Azure classic CLI, Azure PowerShell, REST, or SDK.
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: hdinsightactive,hdiseo17may2017,seodec18, devx-track-azurecli
-ms.date: 03/30/2022
+ms.date: 07/22/2022
 ---
 
 # Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more
 
 [!INCLUDE [selector](includes/hdinsight-create-linux-cluster-selector.md)]
 
-Learn how to set up and configure Apache Hadoop, Apache Spark, Apache Kafka, Interactive Query, Apache HBase, or Apache Storm in HDInsight. Also, learn how to customize clusters and add security by joining them to a domain.
+Learn how to set up and configure Apache Hadoop, Apache Spark, Apache Kafka, Interactive Query, or Apache HBase in HDInsight. Also, learn how to customize clusters and add security by joining them to a domain.
 
 A Hadoop cluster consists of several virtual machines (nodes) that are used for distributed processing of tasks. Azure HDInsight handles implementation details of installation and configuration of individual nodes, so you only have to provide general configuration information.
 
 > [!IMPORTANT]
 > HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute, so you should always delete your cluster when it is no longer in use. Learn how to [delete a cluster.](hdinsight-delete-cluster.md)
 
 If you're using multiple clusters together, you'll want to create a virtual network, and if you're using a Spark cluster you'll also want to use the Hive Warehouse Connector. For more information, see [Plan a virtual network for Azure HDInsight](./hdinsight-plan-virtual-network-deployment.md) and [Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector](interactive-query/apache-hive-warehouse-connector.md).
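The per-minute, pro-rated billing described in the note above can be sketched as a small calculation. This is an editorial illustration only; the hourly rate below is hypothetical, and real prices come from the HDInsight pricing page.

```python
# Minimal sketch of HDInsight's pro-rated, per-minute billing model.
# hourly_rate_per_node is a hypothetical figure, not a real Azure price.

def cluster_cost(minutes_alive: int, hourly_rate_per_node: float, node_count: int) -> float:
    """Billing starts at creation and stops at deletion, pro-rated per minute."""
    return round(minutes_alive / 60 * hourly_rate_per_node * node_count, 2)

# A 6-node cluster left running for 90 minutes at a hypothetical $0.40/node/hour:
print(cluster_cost(90, 0.40, 6))  # → 3.6
```

Because billing never pauses while the cluster exists, deleting an idle cluster is the only way to stop the meter.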
@@ -63,8 +63,8 @@ You don't need to specify the cluster location explicitly: The cluster is in the
 
 Azure HDInsight currently provides the following cluster types, each with a set of components to provide certain functionalities.
 
-> [!IMPORTANT]
-> HDInsight clusters are available in various types, each for a single workload or technology. There is no supported method to create a cluster that combines multiple types, such as Storm and HBase on one cluster. If your solution requires technologies that are spread across multiple HDInsight cluster types, an [Azure virtual network](../virtual-network/index.yml) can connect the required cluster types.
+> [!IMPORTANT]
+> HDInsight clusters are available in various types, each for a single workload or technology. There is no supported method to create a cluster that combines multiple types, such as Spark and HBase on one cluster. If your solution requires technologies that are spread across multiple HDInsight cluster types, an [Azure virtual network](../virtual-network/index.yml) can connect the required cluster types.
 
 | Cluster type | Functionality |
 | --- | --- |
@@ -73,7 +73,6 @@ Azure HDInsight currently provides the following cluster types, each with a set
 | [Interactive Query](./interactive-query/apache-interactive-query-get-started.md) |In-memory caching for interactive and faster Hive queries |
 | [Kafka](kafka/apache-kafka-introduction.md) | A distributed streaming platform that can be used to build real-time streaming data pipelines and applications |
 | [Spark](spark/apache-spark-overview.md) |In-memory processing, interactive queries, micro-batch stream processing |
-| [Storm](storm/apache-storm-overview.md) |Real-time event processing |
 
 #### Version
 
@@ -89,15 +88,15 @@ With HDInsight clusters, you can configure two user accounts during cluster crea
 The HTTP username has the following restrictions:
 
 * Allowed special characters: `_` and `@`
-* Characters not allowed: #;."',\/:`!*?$(){}[]<>|&--=+%~^space
+* Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^space
 * Max length: 20
 
 The SSH username has the following restrictions:
 
 * Allowed special characters: `_` and `@`
-* Characters not allowed: #;."',\/:`!*?$(){}[]<>|&--=+%~^space
+* Characters not allowed: #;."',/:`!*?$(){}[]<>|&--=+%~^space
 * Max length: 64
-* Reserved names: hadoop, users, oozie, hive, mapred, ambari-qa, zookeeper, tez, hdfs, sqoop, yarn, hcat, ams, hbase, storm, administrator, admin, user, user1, test, user2, test1, user3, admin1, 1, 123, a, actuser, adm, admin2, aspnet, backup, console, david, guest, john, owner, root, server, sql, support, support_388945a0, sys, test2, test3, user4, user5, spark
+* Reserved names: hadoop, users, oozie, hive, mapred, ambari-qa, zookeeper, tez, hdfs, sqoop, yarn, hcat, ams, hbase, administrator, admin, user, user1, test, user2, test1, user3, admin1, 1, 123, a, actuser, adm, admin2, aspnet, backup, console, david, guest, john, owner, root, server, sql, support, support_388945a0, sys, test2, test3, user4, user5, spark
 
 ## Storage
 
@@ -115,7 +114,7 @@ HDInsight clusters can use the following storage options:
 
 For more information on storage options with HDInsight, see [Compare storage options for use with Azure HDInsight clusters](hdinsight-hadoop-compare-storage-options.md).
 
 > [!WARNING]
 > Using an additional storage account in a different location from the HDInsight cluster is not supported.
 
 During configuration, for the default storage endpoint you specify a blob container of an Azure Storage account or Data Lake Storage. The default storage contains application and system logs. Optionally, you can specify additional linked Azure Storage accounts and Data Lake Storage accounts that the cluster can access. The HDInsight cluster and the dependent storage accounts must be in the same Azure location.
@@ -125,7 +124,7 @@ During configuration, for the default storage endpoint you specify a blob contai
 > [!IMPORTANT]
 > Enabling secure storage transfer after creating a cluster can result in errors using your storage account and is not recommended. It is better to create a new cluster using a storage account with secure transfer already enabled.
 
 > [!NOTE]
 > Azure HDInsight does not automatically transfer, move, or copy your data stored in Azure Storage from one region to another.
 
### Metastore settings
@@ -134,7 +133,7 @@ You can create optional Hive or Apache Oozie metastores. However, not all cluste
 
 For more information, see [Use external metadata stores in Azure HDInsight](./hdinsight-use-external-metadata-stores.md).
 
 > [!IMPORTANT]
 > When you create a custom metastore, don't use dashes, hyphens, or spaces in the database name. This can cause the cluster creation process to fail.
 
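The naming rule in the note above can be checked up front. This is a hypothetical editorial helper, not an Azure SDK function; it only encodes the rule quoted in the note (no dashes, hyphens, or spaces in the metastore database name).

```python
import re

# Sketch of the custom-metastore naming rule from the IMPORTANT note above:
# dashes, hyphens, and spaces in the database name can fail cluster creation.

def valid_metastore_db_name(name: str) -> bool:
    return len(name) > 0 and re.search(r"[-\s]", name) is None

print(valid_metastore_db_name("hivemetastore01"))  # → True
print(valid_metastore_db_name("hive-metastore"))   # → False (contains a dash)
print(valid_metastore_db_name("hive metastore"))   # → False (contains a space)
```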
#### SQL database for Hive
@@ -154,7 +153,7 @@ To increase performance when using Oozie, use a custom metastore. A metastore ca
 
 Ambari is used to monitor HDInsight clusters, make configuration changes, and store cluster management information as well as job history. The custom Ambari DB feature allows you to deploy a new cluster and set up Ambari in an external database that you manage. For more information, see [Custom Ambari DB](./hdinsight-custom-ambari-db.md).
 
 > [!IMPORTANT]
 > You cannot reuse a custom Oozie metastore. To use a custom Oozie metastore, you must provide an empty Azure SQL Database when creating the HDInsight cluster.
 
## Security + networking
@@ -195,7 +194,7 @@ For more information, see [Managed identities in Azure HDInsight](./hdinsight-ma
 
 ## Configuration + pricing
 
-:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/azure-portal-cluster-configuration.png" alt-text="HDInsight choose your node size":::
+:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/azure-portal-cluster-configuration-disk-attach.png" alt-text="HDInsight choose your node size":::
 
 You're billed for node usage for as long as the cluster exists. Billing starts when a cluster is created and stops when the cluster is deleted. Clusters can't be de-allocated or put on hold.
 
@@ -207,25 +206,23 @@ Each cluster type has its own number of nodes, terminology for nodes, and defaul
 | --- | --- | --- |
 | Hadoop |Head node (2), Worker node (1+) |:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/hdinsight-hadoop-cluster-type-nodes.png" alt-text="HDInsight Hadoop cluster nodes" border="false"::: |
 | HBase |Head server (2), region server (1+), master/ZooKeeper node (3) |:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/hdinsight-hbase-cluster-type-setup.png" alt-text="HDInsight HBase cluster type setup" border="false"::: |
-| Storm |Nimbus node (2), supervisor server (1+), ZooKeeper node (3) |:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/hdinsight-storm-cluster-type-setup.png" alt-text="HDInsight storm cluster type setup" border="false"::: |
 | Spark |Head node (2), Worker node (1+), ZooKeeper node (3) (free for A1 ZooKeeper VM size) |:::image type="content" source="./media/hdinsight-hadoop-provision-linux-clusters/hdinsight-spark-cluster-type-setup.png" alt-text="HDInsight spark cluster type setup" border="false"::: |
 
 For more information, see [Default node configuration and virtual machine sizes for clusters](hdinsight-supported-node-configuration.md) in "What are the Hadoop components and versions in HDInsight?"
 
 The cost of HDInsight clusters is determined by the number of nodes and the virtual machine sizes for the nodes.
 
 Different cluster types have different node types, numbers of nodes, and node sizes:
+
 * Hadoop cluster type default:
-    * Two *head nodes*
-    * Four *Worker nodes*
-* Storm cluster type default:
-    * Two *Nimbus nodes*
-    * Three *ZooKeeper nodes*
-    * Four *supervisor nodes*
+    * Two *head nodes*
+    * Four *Worker nodes*
 
 If you're just trying out HDInsight, we recommend you use one Worker node. For more information about HDInsight pricing, see [HDInsight pricing](https://go.microsoft.com/fwLink/?LinkID=282635&clcid=0x409).
 
 > [!NOTE]
 > The cluster size limit varies among Azure subscriptions. Contact [Azure billing support](../azure-portal/supportability/how-to-create-azure-support-request.md) to increase the limit.
 
 When you use the Azure portal to configure the cluster, the node size is available through the **Configuration + pricing** tab. In the portal, you can also see the cost associated with the different node sizes.
@@ -239,14 +236,31 @@ When you deploy clusters, choose compute resources based on the solution you pla
 
 To find out what value you should use to specify a VM size while creating a cluster using the different SDKs or while using Azure PowerShell, see [VM sizes to use for HDInsight clusters](../cloud-services/cloud-services-sizes-specs.md#size-tables). From this linked article, use the value in the **Size** column of the tables.
 
 > [!IMPORTANT]
 > If you need more than 32 Worker nodes in a cluster, you must select a head node size with at least 8 cores and 14 GB of RAM.
 
 For more information, see [Sizes for virtual machines](../virtual-machines/sizes.md). For information about pricing of the various sizes, see [HDInsight pricing](https://azure.microsoft.com/pricing/details/hdinsight).
 
+### Disk attachment
+
+On each of the **NodeManager** machines, **LocalResources** are ultimately localized in the target directories.
+
+With the normal configuration, only the default disk is added as the local disk for NodeManager. For large applications, this disk space may not be enough, which can result in job failure.
+
+If the cluster is expected to run large data applications, you can choose to add extra disks to the **NodeManager**.
+
+You can choose the number of disks per VM; each disk is 1 TB in size.
+
+1. Go to the **Configuration + pricing** tab.
+1. Select the **Enable managed disk** option.
+1. Under **Standard disks**, enter the **Number of disks**.
+1. Choose your **Worker node**.
+
+You can verify the number of disks on the **Review + create** tab, under **Cluster configuration**.
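The steps above attach the chosen number of 1 TB standard managed disks to each Worker node, so the extra NodeManager local storage scales with both settings. A quick editorial sketch of that arithmetic:

```python
# Each attached standard managed disk is 1 TB (per the section above), and the
# disk count applies to every Worker node VM in the cluster.
DISK_SIZE_TB = 1

def extra_local_storage_tb(disks_per_node: int, worker_nodes: int) -> int:
    """Total extra NodeManager local storage added across all Worker nodes."""
    return disks_per_node * DISK_SIZE_TB * worker_nodes

# 2 extra disks on each of 4 Worker nodes adds 8 TB of local storage:
print(extra_local_storage_tb(disks_per_node=2, worker_nodes=4))  # → 8
```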
 ### Add application
 
-An HDInsight application is an application that users can install on a Linux-based HDInsight cluster. You can use applications provided by Microsoft, third parties, or that you develop yourself. For more information, see [Install third-party Apache Hadoop applications on Azure HDInsight](hdinsight-apps-install-applications.md).
+An HDInsight application is an application that users can install on a Linux-based HDInsight cluster. You can use applications provided by Microsoft or third parties, or applications that you develop yourself. For more information, see [Install third-party Apache Hadoop applications on Azure HDInsight](hdinsight-apps-install-applications.md).
 
 Most of the HDInsight applications are installed on an empty edge node. An empty edge node is a Linux virtual machine with the same client tools installed and configured as in the head node. You can use the edge node for accessing the cluster, testing your client applications, and hosting your client applications. For more information, see [Use empty edge nodes in HDInsight](hdinsight-apps-use-edge-node.md).
 
@@ -256,28 +270,27 @@ You can install additional components or customize cluster configuration by usin
 
 Some native Java components, like Apache Mahout and Cascading, can be run on the cluster as Java Archive (JAR) files. These JAR files can be distributed to Azure Storage and submitted to HDInsight clusters with Hadoop job submission mechanisms. For more information, see [Submit Apache Hadoop jobs programmatically](hadoop/submit-apache-hadoop-jobs-programmatically.md).
 
 > [!NOTE]
 > If you have issues deploying JAR files to HDInsight clusters, or calling JAR files on HDInsight clusters, contact [Microsoft Support](https://azure.microsoft.com/support/options/).
 >
 > Cascading is not supported by HDInsight and is not eligible for Microsoft Support. For lists of supported components, see [What's new in the cluster versions provided by HDInsight](hdinsight-component-versioning.md).
 
 Sometimes, you want to configure the following configuration files during the creation process:
 
 * clusterIdentity.xml
 * core-site.xml
 * gateway.xml
 * hbase-env.xml
 * hbase-site.xml
 * hdfs-site.xml
 * hive-env.xml
 * hive-site.xml
 * mapred-site.xml
 * oozie-site.xml
 * oozie-env.xml
-* storm-site.xml
 * tez-site.xml
 * webhcat-site.xml
 * yarn-site.xml
 
 For more information, see [Customize HDInsight clusters using Bootstrap](hdinsight-hadoop-customize-cluster-bootstrap.md).
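When these files are customized at creation time through an ARM template, the overrides are commonly expressed as a `configurations` object keyed by file name without the `.xml` suffix. The sketch below only illustrates that shape; the property names and values are illustrative assumptions, so check the Bootstrap article linked above for the exact schema your deployment method expects.

```python
import json

# Hedged sketch of the creation-time override shape for files like
# core-site.xml and hive-site.xml: a dict keyed by file name (no .xml),
# mapping to property overrides. Values here are illustrative only.
configurations = {
    "core-site": {"fs.trash.interval": "60"},
    "hive-site": {"hive.metastore.client.socket.timeout": "90s"},
}

print(json.dumps(configurations, indent=2))
```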
