articles/hdinsight/hdinsight-hadoop-linux-information.md
Lines changed: 10 additions & 45 deletions
Original file line number
Diff line number
Diff line change
@@ -5,9 +5,9 @@ author: hrasheed-msft
5
5
ms.author: hrasheed
6
6
ms.reviewer: jasonh
7
7
ms.service: hdinsight
8
-
ms.custom: hdinsightactive
8
+
ms.custom: hdinsightactive,seoapr2020
9
9
ms.topic: conceptual
10
-
ms.date: 11/14/2019
10
+
ms.date: 04/29/2020
11
11
---
12
12
13
13
# Information about using HDInsight on Linux
@@ -90,21 +90,21 @@ Example data and JAR files can be found on Hadoop Distributed File System at `/e
90
90
91
91
## HDFS, Azure Storage, and Data Lake Storage
92
92
93
-
In most Hadoop distributions, the data is stored in HDFS, which is backed by local storage on the machines in the cluster. Using local storage can be costly for a cloud-based solution where you're charged hourly or by minute for compute resources.
93
+
In most Hadoop distributions, the data is stored in HDFS. HDFS is backed by local storage on the machines in the cluster. Using local storage can be costly for a cloud-based solution where you're charged hourly or by minute for compute resources.
94
94
95
-
When using HDInsight, the data files are stored in a scalable and resilient way in the cloud using Azure Blob Storage and optionally Azure Data Lake Storage. These services provide the following benefits:
95
+
When using HDInsight, the data files are stored in an adaptable and resilient way in the cloud using Azure Blob Storage and optionally Azure Data Lake Storage. These services provide the following benefits:
96
96
97
97
* Cheap long-term storage.
98
98
* Accessibility from external services such as websites, file upload/download utilities, various language SDKs, and web browsers.
99
-
* Large file capacity and large scalable storage.
99
+
* Large file capacity and large adaptable storage.
100
100
101
101
For more information, see [Understanding blobs](https://docs.microsoft.com/rest/api/storageservices/understanding-block-blobs--append-blobs--and-page-blobs) and [Data Lake Storage](https://azure.microsoft.com/services/storage/data-lake-storage/).
102
102
103
-
When using either Azure Storage or Data Lake Storage, you don't have to do anything special from HDInsight to access the data. For example, the following command lists files in the `/example/data` folder regardless of whether it's stored on Azure Storage or Data Lake Storage:
103
+
When using either Azure Storage or Data Lake Storage, you don't have to do anything special from HDInsight to access the data. For example, the following command lists files in the `/example/data` folder whether it's stored on Azure Storage or Data Lake Storage:
104
104
105
105
hdfs dfs -ls /example/data
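The same path can also be written as a fully qualified URI when a specific storage account has to be named. The sketch below only prints the listing commands; it does not contact a cluster. The account name `myaccount` and container name `mycontainer` are hypothetical, and the `storage_uri` helper is an illustration rather than part of HDInsight. It assumes the commonly documented URI shapes for Azure Storage, Data Lake Storage Gen2, and Data Lake Storage Gen1.

```bash
# Build a fully qualified storage URI for a path.
# Arguments: KIND (blob|gen2|gen1) ACCOUNT CONTAINER PATH
# ACCOUNT and CONTAINER are placeholders; CONTAINER is ignored for gen1.
storage_uri() {
    local kind="$1" account="$2" container="$3" path="$4"
    case "$kind" in
        blob) printf 'wasbs://%s@%s.blob.core.windows.net%s\n' "$container" "$account" "$path" ;;
        gen2) printf 'abfs://%s@%s.dfs.core.windows.net%s\n' "$container" "$account" "$path" ;;
        gen1) printf 'adl://%s.azuredatalakestore.net%s\n' "$account" "$path" ;;
        *)    echo "unknown storage kind: $kind" >&2; return 1 ;;
    esac
}

# Print (do not run) the equivalent listing command for each storage kind.
for kind in blob gen2 gen1; do
    echo "hdfs dfs -ls $(storage_uri "$kind" myaccount mycontainer /example/data)"
done
```

On a cluster, the short form `/example/data` resolves against the default storage, so the qualified form is mainly useful for additional storage accounts.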
106
106
107
-
In HDInsight, the data storage resources (Azure Blob Storage and Azure Data Lake Storage) are decoupled from compute resources. Therefore, you can create HDInsight clusters to do computation as you need, and later delete the cluster when the work is finished, meanwhile keeping your data files persisted safely in cloud storage as long as you need.
107
+
In HDInsight, the data storage resources (Azure Blob Storage and Azure Data Lake Storage) are decoupled from compute resources. You can create HDInsight clusters to do computation as you need, and later delete the cluster when the work is finished. Meanwhile, your data files remain safely persisted in cloud storage for as long as you need.
108
108
109
109
### <a name="URI-and-scheme"></a>URI and scheme
110
110
@@ -205,46 +205,11 @@ If using __Azure Data Lake Storage__, see the following links for ways that you
205
205
206
206
## <a name="scaling"></a>Scaling your cluster
207
207
208
-
The cluster scaling feature allows you to dynamically change the number of data nodes used by a cluster. You can perform scaling operations while other jobs or processes are running on a cluster. See also, [Scale HDInsight clusters](./hdinsight-scaling-best-practices.md)
209
-
210
-
The different cluster types are affected by scaling as follows:
211
-
212
-
* **Hadoop**: When scaling down the number of nodes in a cluster, some of the services in the cluster are restarted. Scaling operations can cause jobs running or pending to fail at the completion of the scaling operation. You can resubmit the jobs once the operation is complete.
213
-
* **HBase**: Regional servers are automatically balanced within a few minutes, once the scaling operation completes. To manually balance regional servers, use the following steps:
214
-
215
-
1. Connect to the HDInsight cluster using SSH. For more information, see [Use SSH with HDInsight](hdinsight-hadoop-linux-use-ssh-unix.md).
216
-
217
-
2. Use the following to start the HBase shell:
218
-
219
-
hbase shell
220
-
221
-
3. Once the HBase shell has loaded, use the following to manually balance the regional servers:
222
-
223
-
balancer
224
-
225
-
* **Storm**: You should rebalance any running Storm topologies after a scaling operation has been performed. Rebalancing allows the topology to readjust parallelism settings based on the new number of nodes in the cluster. To rebalance running topologies, use one of the following options:
226
-
227
-
* **SSH**: Connect to the server and use the following command to rebalance a topology:
228
-
229
-
storm rebalance TOPOLOGYNAME
230
-
231
-
You can also specify parameters to override the parallelism hints originally provided by the topology. For example, `storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10` reconfigures the topology to 5 worker processes, 3 executors for the blue-spout component, and 10 executors for the yellow-bolt component.
232
-
233
-
* **Storm UI**: Use the following steps to rebalance a topology using the Storm UI.
234
-
235
-
1. Open `https://CLUSTERNAME.azurehdinsight.net/stormui` in your web browser, where `CLUSTERNAME` is the name of your Storm cluster. If prompted, enter the HDInsight cluster administrator (admin) name and password you specified when creating the cluster.
236
-
2. Select the topology you wish to rebalance, then select the **Rebalance** button. Enter the delay before the rebalance operation is performed.
237
-
238
-
* **Kafka**: You should rebalance partition replicas after scaling operations. For more information, see the [High availability of data with Apache Kafka on HDInsight](./kafka/apache-kafka-high-availability.md) document.
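The `storm rebalance` invocation described for Storm above can be assembled from its parts, which makes the role of each flag easier to see. This is a minimal sketch: `rebalance_cmd` is a hypothetical helper that only prints the command line, using the `-n` (worker count) and `-e component=executors` options shown in the example above. It does not run Storm.

```bash
# Compose a `storm rebalance` command line without executing it.
# Usage: rebalance_cmd TOPOLOGY WORKERS [COMPONENT=EXECUTORS ...]
rebalance_cmd() {
    if [ "$#" -lt 2 ]; then
        echo "usage: rebalance_cmd TOPOLOGY WORKERS [COMPONENT=EXECUTORS ...]" >&2
        return 1
    fi
    local topology="$1" workers="$2" cmd pair
    shift 2
    cmd="storm rebalance $topology -n $workers"
    # Each remaining argument overrides the parallelism hint of one component.
    for pair in "$@"; do
        cmd="$cmd -e $pair"
    done
    printf '%s\n' "$cmd"
}

# Reproduces the example from the text.
rebalance_cmd mytopology 5 blue-spout=3 yellow-bolt=10
```

Run the printed command from an SSH session on the cluster once the scaling operation has completed.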
239
-
240
-
For specific information on scaling your HDInsight cluster, see:
241
-
242
-
* [Manage Apache Hadoop clusters in HDInsight by using the Azure portal](hdinsight-administer-use-portal-linux.md#scale-clusters)
243
-
* [Manage Apache Hadoop clusters in HDInsight by using Azure CLI](hdinsight-administer-use-command-line.md#scale-clusters)
208
+
The cluster scaling feature allows you to dynamically change the number of data nodes used by a cluster. You can do scaling operations while other jobs or processes are running on a cluster. See [Scale HDInsight clusters](./hdinsight-scaling-best-practices.md).
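As an illustration of a scaling operation, the sketch below prints an Azure CLI command that changes the worker node count. It assumes a recent Azure CLI in which `az hdinsight resize` accepts `--workernode-count`; the resource group `myResourceGroup`, the cluster name `mycluster`, and the `resize_cmd` helper are hypothetical. Nothing is executed against Azure.

```bash
# Print an Azure CLI command that resizes an HDInsight cluster.
# Usage: resize_cmd RESOURCE_GROUP CLUSTER_NAME WORKER_COUNT
resize_cmd() {
    local group="$1" cluster="$2" count="$3"
    # Reject empty or non-numeric worker counts before building the command.
    case "$count" in
        ''|*[!0-9]*) echo "worker count must be a non-negative integer" >&2; return 1 ;;
    esac
    printf 'az hdinsight resize --resource-group %s --name %s --workernode-count %s\n' \
        "$group" "$cluster" "$count"
}

resize_cmd myResourceGroup mycluster 6
```

Check the linked scaling article for cluster-type-specific steps, such as rebalancing, before and after running the real command.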
244
209
245
210
## How do I install Hue (or other Hadoop component)?
246
211
247
-
HDInsight is a managed service. If Azure detects a problem with the cluster, it may delete the failing node and create a node to replace it. If you manually install things on the cluster, they aren't persisted when this operation occurs. Instead, use [HDInsight Script Actions](hdinsight-hadoop-customize-cluster-linux.md). A script action can be used to make the following changes:
212
+
HDInsight is a managed service. If Azure detects a problem with the cluster, it may delete the failing node and create a node to replace it. Anything you install manually on the cluster isn't persisted when this operation occurs. Instead, use [HDInsight Script Actions](hdinsight-hadoop-customize-cluster-linux.md). A script action can be used to make the following changes:
248
213
249
214
* Install and configure a service or web site.
250
215
* Install and configure a component that requires configuration changes on multiple nodes in the cluster.
@@ -253,7 +218,7 @@ Script Actions are Bash scripts. The scripts run during cluster creation, and ar
253
218
254
219
### Jar files
255
220
256
-
Some Hadoop technologies are provided in self-contained jar files that contain functions used as part of a MapReduce job, or from inside Pig or Hive. They often don't require any setup, and can be uploaded to the cluster after creation and used directly. If you want to make sure the component survives reimaging of the cluster, you can store the jar file in the default storage for your cluster (WASB or ADL).
221
+
Some Hadoop technologies provide self-contained jar files. These files contain functions used as part of a MapReduce job, or from inside Pig or Hive. They often don't require any setup, and can be uploaded to the cluster after creation and used directly. If you want to make sure the component survives reimaging of the cluster, store the jar file in the cluster's default storage.
257
222
258
223
For example, if you want to use the latest version of [Apache DataFu](https://datafu.incubator.apache.org/), you can download a jar containing the project and upload it to the HDInsight cluster. Then follow the DataFu documentation on how to use it from Pig or Hive.
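Storing a jar in the default storage can be done with the standard `hdfs dfs` commands. The sketch below prints the two commands rather than running them. The destination directory `/shared/jars`, the file name `datafu.jar`, and the `jar_upload_cmds` helper are hypothetical choices for illustration.

```bash
# Print the commands that copy a local jar into cluster default storage.
# Usage: jar_upload_cmds LOCAL_JAR [DEST_DIR]
# DEST_DIR defaults to /shared/jars, a hypothetical location.
jar_upload_cmds() {
    local jar="$1" dest_dir="${2:-/shared/jars}"
    # Create the target directory if it does not exist yet.
    printf 'hdfs dfs -mkdir -p %s\n' "$dest_dir"
    # -f overwrites an older copy of the jar with the same name.
    printf 'hdfs dfs -put -f %s %s/\n' "$jar" "$dest_dir"
}

jar_upload_cmds datafu.jar
```

Because the destination lives in Azure Storage or Data Lake Storage rather than on a node's local disk, the jar stays available if a node is replaced.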
articles/hdinsight/hdinsight-scaling-best-practices.md
Lines changed: 26 additions & 10 deletions
@@ -7,7 +7,7 @@ ms.reviewer: jasonh
7
7
ms.service: hdinsight
8
8
ms.topic: conceptual
9
9
ms.custom: seoapr2020
10
-
ms.date: 04/23/2020
10
+
ms.date: 04/29/2020
11
11
---
12
12
13
13
# Scale Azure HDInsight clusters
@@ -69,28 +69,39 @@ The impact of changing the number of data nodes varies for each type of cluster
69
69
70
70
* Apache Storm
71
71
72
-
You can seamlessly add or remove data nodes while Storm is running. However, after a successful completion of the scaling operation, you'll need to rebalance the topology.
73
-
74
-
Rebalancing can be accomplished in two ways:
72
+
You can seamlessly add or remove data nodes while Storm is running. However, after a successful completion of the scaling operation, you'll need to rebalance the topology. Rebalancing allows the topology to readjust [parallelism settings](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html) based on the new number of nodes in the cluster. To rebalance running topologies, use one of the following options:
75
73
76
74
* Storm web UI
77
-
* Command-line interface (CLI) tool
78
75
79
-
For more information, see [Apache Storm documentation](https://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html).
76
+
Use the following steps to rebalance a topology using the Storm UI.
77
+
78
+
1. Open `https://CLUSTERNAME.azurehdinsight.net/stormui` in your web browser, where `CLUSTERNAME` is the name of your Storm cluster. If prompted, enter the HDInsight cluster administrator (admin) name and password you specified when creating the cluster.
79
+
80
+
1. Select the topology you wish to rebalance, then select the **Rebalance** button. Enter the delay before the rebalance operation is done.
Here is an example CLI command to rebalance the Storm topology:
92
+
You can also specify parameters to override the parallelism hints originally provided by the topology. For example, the code below reconfigures the `mytopology` topology to 5 worker processes, 3 executors for the blue-spout component, and 10 executors for the yellow-bolt component.
86
93
87
-
```console
94
+
```bash
88
95
## Reconfigure the topology "mytopology" to use 5 worker processes,
You should rebalance partition replicas after scaling operations. For more information, see the [High availability of data with Apache Kafka on HDInsight](./kafka/apache-kafka-high-availability.md) document.
104
+
94
105
## How to safely scale down a cluster
95
106
96
107
### Scale down a cluster with running jobs
@@ -247,3 +258,8 @@ Region servers are automatically balanced within a few minutes after completing