
Commit 6ae715d

Merge pull request #113277 from dagiro/freshness_c63
freshness_c63
2 parents ffdb577 + fe08a32 commit 6ae715d

2 files changed: 36 additions and 55 deletions

articles/hdinsight/hdinsight-hadoop-linux-information.md

Lines changed: 10 additions & 45 deletions
@@ -5,9 +5,9 @@ author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
-ms.custom: hdinsightactive
+ms.custom: hdinsightactive,seoapr2020
 ms.topic: conceptual
-ms.date: 11/14/2019
+ms.date: 04/29/2020
 ---

 # Information about using HDInsight on Linux
@@ -90,21 +90,21 @@ Example data and JAR files can be found on Hadoop Distributed File System at `/e

 ## HDFS, Azure Storage, and Data Lake Storage

-In most Hadoop distributions, the data is stored in HDFS, which is backed by local storage on the machines in the cluster. Using local storage can be costly for a cloud-based solution where you're charged hourly or by minute for compute resources.
+In most Hadoop distributions, the data is stored in HDFS. HDFS is backed by local storage on the machines in the cluster. Using local storage can be costly for a cloud-based solution where you're charged hourly or by minute for compute resources.

-When using HDInsight, the data files are stored in a scalable and resilient way in the cloud using Azure Blob Storage and optionally Azure Data Lake Storage. These services provide the following benefits:
+When using HDInsight, the data files are stored in an adaptable and resilient way in the cloud using Azure Blob Storage and optionally Azure Data Lake Storage. These services provide the following benefits:

 * Cheap long-term storage.
 * Accessibility from external services such as websites, file upload/download utilities, various language SDKs, and web browsers.
-* Large file capacity and large scalable storage.
+* Large file capacity and large adaptable storage.

 For more information, see [Understanding blobs](https://docs.microsoft.com/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs) and [Data Lake Storage](https://azure.microsoft.com/services/storage/data-lake-storage/).

-When using either Azure Storage or Data Lake Storage, you don't have to do anything special from HDInsight to access the data. For example, the following command lists files in the `/example/data` folder regardless of whether it's stored on Azure Storage or Data Lake Storage:
+When using either Azure Storage or Data Lake Storage, you don't have to do anything special from HDInsight to access the data. For example, the following command lists files in the `/example/data` folder whether it's stored on Azure Storage or Data Lake Storage:

 hdfs dfs -ls /example/data

-In HDInsight, the data storage resources (Azure Blob Storage and Azure Data Lake Storage) are decoupled from compute resources. Therefore, you can create HDInsight clusters to do computation as you need, and later delete the cluster when the work is finished, meanwhile keeping your data files persisted safely in cloud storage as long as you need.
+In HDInsight, the data storage resources (Azure Blob Storage and Azure Data Lake Storage) are decoupled from compute resources. You can create HDInsight clusters to do computation as you need, and later delete the cluster when the work is finished. Meanwhile keeping your data files persisted safely in cloud storage as long as you need.

 ### <a name="URI-and-scheme"></a>URI and scheme
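The hunk above says that `hdfs dfs -ls /example/data` works unchanged against either storage type because the path resolves against the cluster's default storage. As a rough sketch only, the same folder could also be addressed with a fully qualified URI. The helper name `qualified_uri` and the container and account names are hypothetical, and the `wasbs://` and `abfs://` URI shapes are assumed from general HDInsight conventions; they do not appear in this diff.

```bash
# Build a fully qualified storage URI for a path (illustrative helper only).
# Arguments: SCHEME (wasbs | abfs), CONTAINER, ACCOUNT, PATH
# CONTAINER and ACCOUNT are placeholders you would replace with real names.
qualified_uri() {
    scheme=$1
    container=$2
    account=$3
    path=$4
    case "$scheme" in
        # Azure Blob Storage over TLS
        wasbs) printf 'wasbs://%s@%s.blob.core.windows.net%s\n' "$container" "$account" "$path" ;;
        # Azure Data Lake Storage Gen2
        abfs)  printf 'abfs://%s@%s.dfs.core.windows.net%s\n' "$container" "$account" "$path" ;;
        *)     echo "unknown scheme: $scheme" >&2; return 1 ;;
    esac
}

# Print the listing command for review; run it on the cluster over SSH.
echo "hdfs dfs -ls $(qualified_uri wasbs mycontainer myaccount /example/data)"
```

The helper only prints text, so it can be tried on any machine; the resulting `hdfs dfs -ls` command still has to be run on a cluster node.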

@@ -205,46 +205,11 @@ If using __Azure Data Lake Storage__, see the following links for ways that you

 ## <a name="scaling"></a>Scaling your cluster

-The cluster scaling feature allows you to dynamically change the number of data nodes used by a cluster. You can perform scaling operations while other jobs or processes are running on a cluster. See also, [Scale HDInsight clusters](./hdinsight-scaling-best-practices.md)
-
-The different cluster types are affected by scaling as follows:
-
-* **Hadoop**: When scaling down the number of nodes in a cluster, some of the services in the cluster are restarted. Scaling operations can cause jobs running or pending to fail at the completion of the scaling operation. You can resubmit the jobs once the operation is complete.
-* **HBase**: Regional servers are automatically balanced within a few minutes, once the scaling operation completes. To manually balance regional servers, use the following steps:
-
-1. Connect to the HDInsight cluster using SSH. For more information, see [Use SSH with HDInsight](hdinsight-hadoop-linux-use-ssh-unix.md).
-
-2. Use the following to start the HBase shell:
-
-hbase shell
-
-3. Once the HBase shell has loaded, use the following to manually balance the regional servers:
-
-balancer
-
-* **Storm**: You should rebalance any running Storm topologies after a scaling operation has been performed. Rebalancing allows the topology to readjust parallelism settings based on the new number of nodes in the cluster. To rebalance running topologies, use one of the following options:
-
-* **SSH**: Connect to the server and use the following command to rebalance a topology:
-
-storm rebalance TOPOLOGYNAME
-
-You can also specify parameters to override the parallelism hints originally provided by the topology. For example, `storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10` reconfigures the topology to 5 worker processes, 3 executors for the blue-spout component, and 10 executors for the yellow-bolt component.
-
-* **Storm UI**: Use the following steps to rebalance a topology using the Storm UI.
-
-1. Open `https://CLUSTERNAME.azurehdinsight.net/stormui` in your web browser, where `CLUSTERNAME` is the name of your Storm cluster. If prompted, enter the HDInsight cluster administrator (admin) name and password you specified when creating the cluster.
-2. Select the topology you wish to rebalance, then select the **Rebalance** button. Enter the delay before the rebalance operation is performed.
-
-* **Kafka**: You should rebalance partition replicas after scaling operations. For more information, see the [High availability of data with Apache Kafka on HDInsight](./kafka/apache-kafka-high-availability.md) document.
-
-For specific information on scaling your HDInsight cluster, see:
-
-* [Manage Apache Hadoop clusters in HDInsight by using the Azure portal](hdinsight-administer-use-portal-linux.md#scale-clusters)
-* [Manage Apache Hadoop clusters in HDInsight by using Azure CLI](hdinsight-administer-use-command-line.md#scale-clusters)
+The cluster scaling feature allows you to dynamically change the number of data nodes used by a cluster. You can do scaling operations while other jobs or processes are running on a cluster. See [Scale HDInsight clusters](./hdinsight-scaling-best-practices.md)

 ## How do I install Hue (or other Hadoop component)?

-HDInsight is a managed service. If Azure detects a problem with the cluster, it may delete the failing node and create a node to replace it. If you manually install things on the cluster, they aren't persisted when this operation occurs. Instead, use [HDInsight Script Actions](hdinsight-hadoop-customize-cluster-linux.md). A script action can be used to make the following changes:
+HDInsight is a managed service. If Azure detects a problem with the cluster, it may delete the failing node and create a node to replace it. When you manually install things on the cluster, they aren't persisted when this operation occurs. Instead, use [HDInsight Script Actions](hdinsight-hadoop-customize-cluster-linux.md). A script action can be used to make the following changes:

 * Install and configure a service or web site.
 * Install and configure a component that requires configuration changes on multiple nodes in the cluster.
@@ -253,7 +218,7 @@ Script Actions are Bash scripts. The scripts run during cluster creation, and ar

 ### Jar files

-Some Hadoop technologies are provided in self-contained jar files that contain functions used as part of a MapReduce job, or from inside Pig or Hive. They often don't require any setup, and can be uploaded to the cluster after creation and used directly. If you want to make sure the component survives reimaging of the cluster, you can store the jar file in the default storage for your cluster (WASB or ADL).
+Some Hadoop technologies provide self-contained jar files. These files contain functions used as part of a MapReduce job, or from inside Pig or Hive. They often don't require any setup, and can be uploaded to the cluster after creation and used directly. If you want to make sure the component survives reimaging of the cluster, store the jar file in the cluster default storage.

 For example, if you want to use the latest version of [Apache DataFu](https://datafu.incubator.apache.org/), you can download a jar containing the project and upload it to the HDInsight cluster. Then follow the DataFu documentation on how to use it from Pig or Hive.
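The "Jar files" hunk above recommends keeping a jar in the cluster default storage so it survives reimaging. A minimal sketch of that upload follows, assuming the standard `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` commands. The helper name `jar_upload_cmds`, the `JAR_DIR` variable, its `/example/jars` default, and the `datafu.jar` file name are all illustrative assumptions, not paths or names taken from the article.

```bash
# Hypothetical target folder in default storage (an assumption, adjust as needed).
JAR_DIR="${JAR_DIR:-/example/jars}"

# Print the commands that would copy a local jar into default storage.
# Nothing is executed here, so the output can be reviewed first.
jar_upload_cmds() {
    jar_path=$1
    # Create the target folder if it does not exist yet.
    printf 'hdfs dfs -mkdir -p %s\n' "$JAR_DIR"
    # Copy the jar, overwriting an older copy with the same name.
    printf 'hdfs dfs -put -f %s %s/%s\n' "$jar_path" "$JAR_DIR" "$(basename "$jar_path")"
}

jar_upload_cmds ./datafu.jar
```

Printing instead of executing keeps the sketch runnable off-cluster; on a head node the two printed lines could be run as-is.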

articles/hdinsight/hdinsight-scaling-best-practices.md

Lines changed: 26 additions & 10 deletions
@@ -7,7 +7,7 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.topic: conceptual
 ms.custom: seoapr2020
-ms.date: 04/23/2020
+ms.date: 04/29/2020
 ---

 # Scale Azure HDInsight clusters
@@ -69,28 +69,39 @@ The impact of changing the number of data nodes varies for each type of cluster

 * Apache Storm

-You can seamlessly add or remove data nodes while Storm is running. However, after a successful completion of the scaling operation, you'll need to rebalance the topology.
-
-Rebalancing can be accomplished in two ways:
+You can seamlessly add or remove data nodes while Storm is running. However, after a successful completion of the scaling operation, you'll need to rebalance the topology. Rebalancing allows the topology to readjust [parallelism settings](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html) based on the new number of nodes in the cluster. To rebalance running topologies, use one of the following options:

 * Storm web UI
-* Command-line interface (CLI) tool

-For more information, see [Apache Storm documentation](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html).
+Use the following steps to rebalance a topology using the Storm UI.
+
+1. Open `https://CLUSTERNAME.azurehdinsight.net/stormui` in your web browser, where `CLUSTERNAME` is the name of your Storm cluster. If prompted, enter the HDInsight cluster administrator (admin) name and password you specified when creating the cluster.
+
+1. Select the topology you wish to rebalance, then select the **Rebalance** button. Enter the delay before the rebalance operation is done.
+
+![HDInsight Storm scale rebalance](./media/hdinsight-scaling-best-practices/hdinsight-portal-scale-cluster-storm-rebalance.png)
+
+* Command-line interface (CLI) tool

-The Storm web UI is available on the HDInsight cluster:
+Connect to the server and use the following command to rebalance a topology:

-![HDInsight Storm scale rebalance](./media/hdinsight-scaling-best-practices/hdinsight-portal-scale-cluster-storm-rebalance.png)
+```bash
+storm rebalance TOPOLOGYNAME
+```

-Here is an example CLI command to rebalance the Storm topology:
+You can also specify parameters to override the parallelism hints originally provided by the topology. For example, the code below reconfigures the `mytopology` topology to 5 worker processes, 3 executors for the blue-spout component, and 10 executors for the yellow-bolt component.

87-
```console
94+
```bash
8895
## Reconfigure the topology "mytopology" to use 5 worker processes,
8996
## the spout "blue-spout" to use 3 executors, and
9097
## the bolt "yellow-bolt" to use 10 executors
9198
$ storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
9299
```

+* Kafka
+
+You should rebalance partition replicas after scaling operations. For more information, see the [High availability of data with Apache Kafka on HDInsight](./kafka/apache-kafka-high-availability.md) document.
+
 ## How to safely scale down a cluster

 ### Scale down a cluster with running jobs
@@ -247,3 +258,8 @@ Region servers are automatically balanced within a few minutes after completing
 ## Next steps

 * [Automatically scale Azure HDInsight clusters](hdinsight-autoscale-clusters.md)
+
+For specific information on scaling your HDInsight cluster, see:
+
+* [Manage Apache Hadoop clusters in HDInsight by using the Azure portal](hdinsight-administer-use-portal-linux.md#scale-clusters)
+* [Manage Apache Hadoop clusters in HDInsight by using Azure CLI](hdinsight-administer-use-command-line.md#scale-clusters)
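
The Storm hunk in this file describes `storm rebalance` with `-n` for the number of worker processes and repeated `-e component=executors` overrides. As a small sketch of how such a command line might be assembled after a scaling operation, the hypothetical helper below (the name `build_rebalance_cmd` is not from the article) prints the command rather than running it, so it can be reviewed before it is executed over SSH on the cluster.

```bash
# Print (do not run) a `storm rebalance` command line.
# Usage: build_rebalance_cmd TOPOLOGY WORKERS [COMPONENT=EXECUTORS ...]
build_rebalance_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: build_rebalance_cmd TOPOLOGY WORKERS [COMPONENT=EXECUTORS ...]" >&2
        return 1
    fi
    topology=$1
    workers=$2
    shift 2
    cmd="storm rebalance $topology -n $workers"
    # Each remaining argument becomes one -e parallelism override.
    for override in "$@"; do
        cmd="$cmd -e $override"
    done
    printf '%s\n' "$cmd"
}

# Same values as the example in the diff above.
build_rebalance_cmd mytopology 5 blue-spout=3 yellow-bolt=10
```

With the values from the diff, this prints the same `storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10` command that the added documentation shows.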
