
Commit a7be697

Committed by freshness39

1 parent 49f8a82

1 file changed (+30, -29 lines)

articles/hdinsight/hdinsight-scaling-best-practices.md

Lines changed: 30 additions & 29 deletions

````diff
@@ -6,16 +6,16 @@ ms.author: ashish
 ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 02/26/2020
+ms.date: 04/06/2020
 ---
 
 # Scale Azure HDInsight clusters
 
-HDInsight provides elasticity by giving you the option to scale up and scale down the number of worker nodes in your clusters. This elasticity, allows you to shrink a cluster after hours or on weekends, and expand it during peak business demands.
+HDInsight provides elasticity with options to scale up and scale down the number of worker nodes in your clusters. This elasticity allows you to shrink a cluster after hours or on weekends. And expand it during peak business demands.
 
-If you have periodic batch processing, the HDInsight cluster can be scaled up a few minutes prior to that operation, so that your cluster has adequate memory and CPU power.  Later, after the processing is done, and usage goes down again, you can scale down the HDInsight cluster to fewer worker nodes.
+Scale up your cluster before periodic batch processing so the cluster has adequate resources.  After processing completes, and usage goes down, scale down the HDInsight cluster to fewer worker nodes.
 
-You can scale a cluster manually using one of the methods outlined below, or use [autoscale](hdinsight-autoscale-clusters.md) options to have the system automatically scale up and down in response to CPU, memory, and other metrics.
+You can scale a cluster manually using one of the methods outlined below. You can also use [autoscale](hdinsight-autoscale-clusters.md) options to automatically scale up and down in response to certain metrics.
 
 > [!NOTE]
 > Only clusters with HDInsight version 3.1.3 or higher are supported. If you are unsure of the version of your cluster, you can check the Properties page.
````

````diff
@@ -26,37 +26,37 @@ Microsoft provides the following utilities to scale clusters:
 
 |Utility | Description|
 |---|---|
-|[PowerShell Az](https://docs.microsoft.com/powershell/azure)|[Set-AzHDInsightClusterSize](https://docs.microsoft.com/powershell/module/az.hdinsight/set-azhdinsightclustersize) -ClusterName \<Cluster Name> -TargetInstanceCount \<NewSize>|
-|[PowerShell AzureRM](https://docs.microsoft.com/powershell/azure/azurerm) |[Set-AzureRmHDInsightClusterSize](https://docs.microsoft.com/powershell/module/azurerm.hdinsight/set-azurermhdinsightclustersize) -ClusterName \<Cluster Name> -TargetInstanceCount \<NewSize>|
-|[Azure CLI](https://docs.microsoft.com/cli/azure/?view=azure-cli-latest)| [az hdinsight resize](https://docs.microsoft.com/cli/azure/hdinsight?view=azure-cli-latest#az-hdinsight-resize) --resource-group \<Resource group> --name \<Cluster Name> --workernode-count \<NewSize>|
-|[Azure Classic CLI](hdinsight-administer-use-command-line.md)|azure hdinsight cluster resize \<clusterName> \<Target Instance Count> |
+|[PowerShell Az](https://docs.microsoft.com/powershell/azure)|[`Set-AzHDInsightClusterSize`](https://docs.microsoft.com/powershell/module/az.hdinsight/set-azhdinsightclustersize) `-ClusterName CLUSTERNAME -TargetInstanceCount NEWSIZE`|
+|[PowerShell AzureRM](https://docs.microsoft.com/powershell/azure/azurerm) |[`Set-AzureRmHDInsightClusterSize`](https://docs.microsoft.com/powershell/module/azurerm.hdinsight/set-azurermhdinsightclustersize) `-ClusterName CLUSTERNAME -TargetInstanceCount NEWSIZE`|
+|[Azure CLI](https://docs.microsoft.com/cli/azure/?view=azure-cli-latest) | [`az hdinsight resize`](https://docs.microsoft.com/cli/azure/hdinsight?view=azure-cli-latest#az-hdinsight-resize) `--resource-group RESOURCEGROUP --name CLUSTERNAME --workernode-count NEWSIZE`|
+|[Azure Classic CLI](hdinsight-administer-use-command-line.md)|`azure hdinsight cluster resize CLUSTERNAME NEWSIZE` |
 |[Azure portal](https://portal.azure.com)|Open your HDInsight cluster pane, select **Cluster size** on the left-hand menu, then on the Cluster size pane, type in the number of worker nodes, and select Save.|
 
 ![Azure portal scale cluster option](./media/hdinsight-scaling-best-practices/azure-portal-settings-nodes.png)
 
 Using any of these methods, you can scale your HDInsight cluster up or down within minutes.
 
 > [!IMPORTANT]
-> * The Azure classic CLI is deprecated and should only be used with the classic deployment model. For all other deployments, use the [Azure CLI](https://docs.microsoft.com/cli/azure/?view=azure-cli-latest).
+> * The Azure classic CLI is deprecated and should only be used with the classic deployment model. For all other deployments, use the [Azure CLI](https://docs.microsoft.com/cli/azure/?view=azure-cli-latest).
 > * The PowerShell AzureRM module is deprecated. Please use the [Az module](https://docs.microsoft.com/powershell/azure/new-azureps-module-az?view=azps-1.4.0) whenever possible.
 
 ## Impact of scaling operations
 
-When you **add** nodes to your running HDInsight cluster (scale up), any pending or running jobs will not be affected. New jobs can be safely submitted while the scaling process is running. If the scaling operation fails for any reason, the failure will be handled to leave your cluster in a functional state.
+When you **add** nodes to your running HDInsight cluster (scale up), jobs won't be affected. New jobs can be safely submitted while the scaling process is running. If the scaling operation fails, the failure will leave your cluster in a functional state.
 
-If you **remove** nodes (scale down), any pending or running jobs will fail when the scaling operation completes. This failure is due to some of the services restarting during the scaling process. There is also a risk that your cluster can get stuck in safe mode during a manual scaling operation.
+If you **remove** nodes (scale down), pending or running jobs will fail when the scaling operation completes. This failure is because of some of the services restarting during the scaling process. Your cluster may get stuck in safe mode during a manual scaling operation.
 
 The impact of changing the number of data nodes varies for each type of cluster supported by HDInsight:
 
 * Apache Hadoop
 
-You can seamlessly increase the number of worker nodes in a Hadoop cluster that is running without impacting any pending or running jobs. New jobs can also be submitted while the operation is in progress. Failures in a scaling operation are gracefully handled so that the cluster is always left in a functional state.
+You can seamlessly increase the number of worker nodes in a running Hadoop cluster without impacting any jobs. New jobs can also be submitted while the operation is in progress. Failures in a scaling operation are gracefully handled. The cluster is always left in a functional state.
 
-When a Hadoop cluster is scaled down by reducing the number of data nodes, some of the services in the cluster are restarted. This behavior causes all running and pending jobs to fail at the completion of the scaling operation. You can, however, resubmit the jobs once the operation is complete.
+When a Hadoop cluster is scaled down with fewer data nodes, some services are restarted. This behavior causes all running and pending jobs to fail at the completion of the scaling operation. You can, however, resubmit the jobs once the operation is complete.
 
 * Apache HBase
 
-You can seamlessly add or remove nodes to your HBase cluster while it is running. Regional Servers are automatically balanced within a few minutes of completing the scaling operation. However, you can also manually balance the regional servers by logging in to the headnode of cluster and running the following commands from a command prompt window:
+You can seamlessly add or remove nodes to your HBase cluster while it's running. Regional Servers are automatically balanced within a few minutes of completing the scaling operation. However, you can manually balance the regional servers. Log in to the cluster headnode and run the following commands:
 
 ```bash
 pushd %HBASE_HOME%\bin
````
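
The scaling utilities in the table above all take a cluster name and a target worker count. The following is a minimal shell sketch, assuming the Azure CLI is installed and signed in; the helper name `resize_workers` is hypothetical and not part of the article. It validates the target count and warns below three worker nodes, the floor the article later recommends for avoiding HDFS safe mode.

```bash
# Minimal sketch: resize the worker nodes of an HDInsight cluster with the
# `az hdinsight resize` command from the utilities table.
# resize_workers is a hypothetical helper name.
resize_workers() {
    rg="$1"
    cluster="$2"
    count="$3"

    # Reject empty or non-numeric targets before calling Azure.
    case "$count" in
        ''|*[!0-9]*)
            echo "error: worker count must be a positive integer" >&2
            return 2
            ;;
    esac
    if [ "$count" -lt 1 ]; then
        echo "error: worker count must be at least 1" >&2
        return 2
    fi

    # The article recommends keeping at least three worker nodes
    # to avoid HDFS getting stuck in safe mode after a scale-down.
    if [ "$count" -lt 3 ]; then
        echo "warning: fewer than 3 worker nodes may leave HDFS stuck in safe mode" >&2
    fi

    az hdinsight resize --resource-group "$rg" --name "$cluster" --workernode-count "$count"
}

# Usage (requires a signed-in Azure CLI session):
# resize_workers RESOURCEGROUP CLUSTERNAME 4
```

Validation happens before any Azure call, so a typo in the target count fails fast instead of starting a resize operation.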

````diff
@@ -68,14 +68,14 @@ The impact of changing the number of data nodes varies for each type of cluster
 
 * Apache Storm
 
-You can seamlessly add or remove data nodes to your Storm cluster while it is running. However, after a successful completion of the scaling operation, you will need to rebalance the topology.
+You can seamlessly add or remove data nodes while Storm is running. However, after a successful completion of the scaling operation, you'll need to rebalance the topology.
 
 Rebalancing can be accomplished in two ways:
 
 * Storm web UI
 * Command-line interface (CLI) tool
 
-Refer to the [Apache Storm documentation](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html) for more details.
+For more information, see [Apache Storm documentation](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html).
 
 The Storm web UI is available on the HDInsight cluster:
 
````
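
Storm rebalancing from the command line goes through the `storm rebalance` command. The sketch below is a minimal wrapper, assuming it runs on a cluster node where the `storm` client is on the `PATH`; the helper name `rebalance_topology`, the example topology name `wordcount`, and the 30-second default wait are illustrative assumptions. The Apache Storm documentation linked above covers further options, such as per-component parallelism with `-e`.

```bash
# Minimal sketch: rebalance a Storm topology from the CLI after scaling.
# rebalance_topology is a hypothetical helper around `storm rebalance`.
#   $1  topology name
#   $2  new number of worker processes
#   $3  optional wait time in seconds (30 here is an arbitrary default)
rebalance_topology() {
    topology="$1"
    workers="$2"
    wait_secs="${3:-30}"

    case "$workers" in
        ''|*[!0-9]*)
            echo "error: worker count must be a positive integer" >&2
            return 2
            ;;
    esac

    # -w: seconds the topology stays deactivated before work is redistributed
    # -n: number of worker processes to spread the topology across
    storm rebalance "$topology" -w "$wait_secs" -n "$workers"
}

# Usage, from an SSH session on the Storm cluster:
# rebalance_topology wordcount 8
```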

````diff
@@ -105,15 +105,15 @@ To see a list of pending and running jobs, you can use the YARN **Resource Manag
 1. From the [Azure portal](https://portal.azure.com/), select your cluster. See [List and show clusters](./hdinsight-administer-use-portal-linux.md#showClusters) for the instructions. The cluster is opened in a new portal page.
 2. From the main view, navigate to **Cluster dashboards** > **Ambari home**. Enter your cluster credentials.
 3. From the Ambari UI, select **YARN** on the list of services on the left-hand menu.
-4. From the YARN page, select **Quick Links** and hover over the active head node, then select **ResourceManager UI**.
+4. From the YARN page, select **Quick Links** and hover over the active head node, then select **Resource Manager UI**.
 
-![Apache Ambari quick links ResourceManager UI](./media/hdinsight-scaling-best-practices/resource-manager-ui1.png)
+![Apache Ambari quick links Resource Manager UI](./media/hdinsight-scaling-best-practices/resource-manager-ui1.png)
 
-You may directly access the ResourceManager UI with `https://<HDInsightClusterName>.azurehdinsight.net/yarnui/hn/cluster`.
+You may directly access the Resource Manager UI with `https://<HDInsightClusterName>.azurehdinsight.net/yarnui/hn/cluster`.
 
 You see a list of jobs, along with their current state. In the screenshot, there's one job currently running:
 
-![ResourceManager UI applications](./media/hdinsight-scaling-best-practices/resourcemanager-ui-applications.png)
+![Resource Manager UI applications](./media/hdinsight-scaling-best-practices/resourcemanager-ui-applications.png)
 
 To manually kill that running application, execute the following command from the SSH shell:
 
````
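
Because a scale-down causes pending and running jobs to fail, it can help to check YARN for active applications from the command line first. The following minimal sketch assumes an SSH session on a head node where the `yarn` client is available; the helper names `list_active_apps` and `kill_active_apps` are hypothetical. It parses the output of `yarn application -list` and passes each ID to the same `yarn application -kill` command the article uses.

```bash
# Minimal sketch: find, and optionally kill, YARN applications that would
# fail anyway during a scale-down. Both helper names are hypothetical.
list_active_apps() {
    # -appStates filters the -list output by state; keep only rows whose
    # first column is an application ID, which skips the header lines.
    yarn application -list -appStates SUBMITTED,ACCEPTED,RUNNING 2>/dev/null \
        | awk '$1 ~ /^application_/ { print $1 }'
}

kill_active_apps() {
    for app in $(list_active_apps); do
        echo "Killing $app"
        yarn application -kill "$app"
    done
}

# Usage, from an SSH session on the head node:
# list_active_apps     # review what is still pending or running
# kill_active_apps     # only if you accept losing those jobs
```

Listing and killing are kept separate so the list can be reviewed before anything is terminated.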

````diff
@@ -129,11 +129,11 @@ yarn application -kill "application_1499348398273_0003"
 
 ### Getting stuck in safe mode
 
-When you scale down a cluster, HDInsight uses Apache Ambari management interfaces to first decommission the extra worker nodes, which replicate their HDFS blocks to other online worker nodes. After that, HDInsight safely scales the cluster down. HDFS goes into safe mode during the scaling operation, and is supposed to come out once the scaling is finished. In some cases, however, HDFS gets stuck in safe mode during a scaling operation because of file block under-replication.
+When you scale down a cluster, HDInsight uses Apache Ambari management interfaces to first decommission the extra worker nodes. The nodes replicate their HDFS blocks to other online worker nodes. After that, HDInsight safely scales the cluster down. HDFS goes into safe mode during the scaling operation. HDFS is supposed to come out once the scaling is finished. In some cases, however, HDFS gets stuck in safe mode during a scaling operation because of file block under-replication.
 
 By default, HDFS is configured with a `dfs.replication` setting of 1, which controls how many copies of each file block are available. Each copy of a file block is stored on a different node of the cluster.
 
-When HDFS detects that the expected number of block copies aren't available, HDFS enters safe mode and Ambari generates alerts. If HDFS enters safe mode for a scaling operation, but then cannot exit safe mode because the required number of nodes are not detected for replication, the cluster can become stuck in safe mode.
+When the expected number of block copies aren't available, HDFS enters safe mode and Ambari generates alerts. HDFS may enter safe mode for a scaling operation. The cluster may get stuck in safe mode if the required number of nodes aren't detected for replication.
 
 ### Example errors when safe mode is turned on
 
````
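
The under-replication that keeps HDFS in safe mode can be checked from the `hdfs fsck` summary. A minimal sketch, assuming an SSH session on a head node and the `hdfs://mycluster/` name service used in the article's leave-safe-mode command; `under_replicated_blocks` is a hypothetical helper, and the parsing assumes the fsck summary contains an `Under-replicated blocks:` line.

```bash
# Minimal sketch: print how many HDFS blocks are under-replicated.
# under_replicated_blocks is a hypothetical helper. The -D option points
# fsck at the cluster's HDFS, mirroring the article's dfsadmin command.
under_replicated_blocks() {
    hdfs fsck -D 'fs.default.name=hdfs://mycluster/' / 2>/dev/null \
        | sed -n 's/.*Under-replicated blocks:[[:space:]]*\([0-9][0-9]*\).*/\1/p' \
        | head -n 1
}

# Usage, from an SSH session on the head node:
# under_replicated_blocks    # prints the count from the fsck summary
```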

````diff
@@ -147,7 +147,7 @@ org.apache.http.conn.HttpHostConnectException: Connect to active-headnode-name.s
 
 You can review the name node logs from the `/var/log/hadoop/hdfs/` folder, near the time when the cluster was scaled, to see when it entered safe mode. The log files are named `Hadoop-hdfs-namenode-<active-headnode-name>.*`.
 
-The root cause of the previous errors is that Hive depends on temporary files in HDFS while running queries. When HDFS enters safe mode, Hive cannot run queries because it cannot write to HDFS. The temp files in HDFS are located in the local drive mounted to the individual worker node VMs, and replicated amongst other worker nodes at three replicas, minimum.
+The root cause was that Hive depends on temporary files in HDFS while running queries. When HDFS enters safe mode, Hive can't run queries because it can't write to HDFS. Temp files in HDFS are located in the local drive mounted to the individual worker node VMs. The files are replicated among other worker nodes at three replicas, minimum.
 
 ### How to prevent HDInsight from getting stuck in safe mode
 
````
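
Reviewing the NameNode logs around the scaling time can be narrowed to safe-mode entries with a filter. A minimal sketch, assuming read access to `/var/log/hadoop/hdfs/` on the active head node; `safe_mode_log_lines` is a hypothetical helper. It matches the file name prefix case-insensitively and the text `safe mode` case-insensitively, rather than any specific log message.

```bash
# Minimal sketch: print safe-mode related lines from the NameNode logs.
# safe_mode_log_lines is a hypothetical helper; $1 optionally overrides
# the log directory.
safe_mode_log_lines() {
    log_dir="${1:-/var/log/hadoop/hdfs}"
    # -iname: case-insensitive file name match; grep -h drops file names.
    find "$log_dir" -type f -iname 'hadoop-hdfs-namenode-*' \
        -exec grep -ih 'safe mode' {} +
}

# Usage, on the active head node (reading the logs may need sudo):
# safe_mode_log_lines | tail -n 20
```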

````diff
@@ -180,7 +180,8 @@ If Hive has left behind temporary files, then you can manually clean up those fi
 ```
 
 1. Stop Hive services and be sure all queries and jobs are completed.
-2. List the contents of the scratch directory found above, `hdfs://mycluster/tmp/hive/` to see if it contains any files:
+
+1. List the contents of the scratch directory found above, `hdfs://mycluster/tmp/hive/` to see if it contains any files:
 
 ```bash
 hadoop fs -ls -R hdfs://mycluster/tmp/hive/hive
````

````diff
@@ -198,7 +199,7 @@ If Hive has left behind temporary files, then you can manually clean up those fi
 -rw-r--r-- 3 hive hdfs 26 2017-07-06 20:30 hdfs://mycluster/tmp/hive/hive/c108f1c2-453e-400f-ac3e-e3a9b0d22699/inuse.info
 ```
 
-3. If you know Hive is done with these files, you can remove them. Be sure that Hive does not have any queries running by looking in the Yarn ResourceManager UI page.
+1. If you know Hive is done with these files, you can remove them. Be sure that Hive doesn't have any queries running by looking in the Yarn Resource Manager UI page.
 
 Example command line to remove files from HDFS:
 
````
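
Before removing anything, a file count can confirm whether the Hive scratch directory still holds leftovers. A minimal sketch, assuming the `hdfs://mycluster/tmp/hive/hive` path from the listing above; `hive_scratch_file_count` is a hypothetical helper. It only reads the directory and does not delete files.

```bash
# Minimal sketch: count files under the Hive scratch directory.
# hive_scratch_file_count is a hypothetical helper. `hadoop fs -count`
# prints DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME, so field 2 is the
# number of files.
hive_scratch_file_count() {
    dir="${1:-hdfs://mycluster/tmp/hive/hive}"
    hadoop fs -count "$dir" | awk '{ print $2 }'
}

# Usage, after stopping Hive services:
# hive_scratch_file_count
# hive_scratch_file_count hdfs://mycluster/tmp/hive/
```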

````diff
@@ -208,17 +209,17 @@ If Hive has left behind temporary files, then you can manually clean up those fi
 
 #### Scale HDInsight to three or more worker nodes
 
-If your clusters get stuck in safe mode frequently when scaling down to fewer than three worker nodes, and the previous steps don't work, then you can avoid your cluster going in to safe mode altogether by keeping at least three worker nodes.
+If your clusters get stuck in safe mode frequently when scaling down to fewer than three worker nodes, then keep at least three worker nodes.
 
-Retaining three worker nodes is more costly than scaling down to only one worker node, but it will prevent your cluster from getting stuck in safe mode.
+Having three worker nodes is more costly than scaling down to only one worker node. However, this action will prevent your cluster from getting stuck in safe mode.
 
 ### Scale HDInsight down to one worker node
 
-Even when the cluster is scaled down to 1 node, worker node 0 will still survive. Worker node 0 can never be decommissioned.
+Even when the cluster is scaled down to one node, worker node 0 will still survive. Worker node 0 can never be decommissioned.
 
 #### Run the command to leave safe mode
 
-The final option is to execute the leave safe mode command. If you know that the reason for HDFS entering safe mode is because of Hive file under-replication, you can execute the following command to leave safe mode:
+The final option is to execute the leave safe mode command. If HDFS entered safe mode because of Hive file under-replication, execute the following command to leave safe mode:
 
 ```bash
 hdfs dfsadmin -D 'fs.default.name=hdfs://mycluster/' -safemode leave
````
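
The leave-safe-mode command runs unconditionally. A slightly more defensive variant, shown here as a minimal sketch, first checks the state with `hdfs dfsadmin -safemode get`. The helper name `leave_safe_mode_if_on` is hypothetical, and the `hdfs://mycluster/` value is carried over from the article's command.

```bash
# Minimal sketch: leave HDFS safe mode only when it is currently on.
# leave_safe_mode_if_on is a hypothetical helper; `-safemode get` and
# `-safemode leave` are standard `hdfs dfsadmin` operations.
HDFS_DEFAULT_FS_OPT='fs.default.name=hdfs://mycluster/'

leave_safe_mode_if_on() {
    # On HA clusters, `-safemode get` prints one line per NameNode.
    if hdfs dfsadmin -D "$HDFS_DEFAULT_FS_OPT" -safemode get 2>/dev/null \
            | grep -q 'Safe mode is ON'; then
        echo "HDFS is in safe mode; leaving it"
        hdfs dfsadmin -D "$HDFS_DEFAULT_FS_OPT" -safemode leave
    else
        echo "HDFS is not in safe mode; nothing to do"
    fi
}

# Usage, from an SSH session on the head node:
# leave_safe_mode_if_on
```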
