Learn how to submit MapReduce jobs using the HDInsight .NET SDK. HDInsight clusters come with a jar file containing some MapReduce samples. The jar file is `/example/jars/hadoop-mapreduce-examples.jar`. One of the samples is **wordcount**. You develop a C# console application to submit a wordcount job. The job reads the `/example/data/gutenberg/davinci.txt` file and outputs the results to `/example/data/davinciwordcount`. If you want to rerun the application, you must clean up the output folder.
> [!NOTE]
> The steps in this article must be performed from a Windows client. For information on using a Linux, OS X, or Unix client to work with Hive, use the tab selector shown on the top of the article.
@@ -34,7 +34,7 @@ The HDInsight .NET SDK provides .NET client libraries, which make it easier to w
-1. Copy the code into **Program.cs**. Then edit the code by setting the values for: `existingClusterName`, `existingClusterPassword`, `defaultStorageAccountName`, `defaultStorageAccountKey`, and `defaultStorageContainerName`.
+1. Copy the code below into **Program.cs**. Then edit the code by setting the values for `existingClusterName`, `existingClusterPassword`, `defaultStorageAccountName`, `defaultStorageAccountKey`, and `defaultStorageContainerName`.
```csharp
using System.Collections.Generic;
@@ -155,7 +155,7 @@ The HDInsight .NET SDK provides .NET client libraries, which make it easier to w
1. Press **F5** to run the application.
-To run the job again, you must change the job output folder name, in the sample its `/example/data/davinciwordcount`.
+To run the job again, you must change the job output folder name; in the sample it's `/example/data/davinciwordcount`.
When the job completes successfully, the application prints the content of the output file `part-r-00000`.
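For orientation, the core of that submission looks roughly like the following sketch. It assumes the `Microsoft.Azure.Management.HDInsight.Job` NuGet package; exact type names and namespaces can vary by SDK version, and the article's full sample also waits for job completion and reads the output.

```csharp
// Sketch only: assumes the Microsoft.Azure.Management.HDInsight.Job package;
// the full article sample also polls for completion and reads the output.
using System;
using System.Collections.Generic;
using Microsoft.Azure; // BasicAuthenticationCloudCredentials (namespace may vary by version)
using Microsoft.Azure.Management.HDInsight.Job;
using Microsoft.Azure.Management.HDInsight.Job.Models;

class Program
{
    static void Main()
    {
        // Values you set in the step above (placeholders here).
        const string existingClusterName = "<cluster name>";
        const string existingClusterPassword = "<cluster password>";

        var credentials = new BasicAuthenticationCloudCredentials
        {
            Username = "admin",
            Password = existingClusterPassword
        };
        var client = new HDInsightJobManagementClient(
            existingClusterName + ".azurehdinsight.net", credentials);

        // Submit the wordcount sample from the examples jar.
        var parameters = new MapReduceJobSubmissionParameters
        {
            JarFile = "/example/jars/hadoop-mapreduce-examples.jar",
            JarClass = "wordcount",
            Arguments = new List<string>
            {
                "/example/data/gutenberg/davinci.txt", // input file
                "/example/data/davinciwordcount"       // output folder (must not already exist)
            }
        };
        var jobResponse = client.JobManagement.SubmitMapReduceJob(parameters);
        Console.WriteLine("Job ID: " + jobResponse.JobSubmissionJsonResponse.Id);
    }
}
```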
articles/hdinsight/hdinsight-autoscale-clusters.md: 2 additions & 2 deletions
@@ -26,7 +26,7 @@ Schedule-based scaling can be used:
Load-based scaling can be used:
-* When the load patterns fluctuate substantially and unpredictably during the day, for example, order data processing with random fluctuations in load patterns based on various factors.
+* When the load patterns fluctuate substantially and unpredictably during the day. For example, order data processing with random fluctuations in load patterns based on various factors.
### Cluster metrics
@@ -228,7 +228,7 @@ All of the cluster status messages that you might see are explained in the follo
| Updating | The cluster Autoscale configuration is being updated. |
| HDInsight configuration | A cluster scale up or scale down operation is in progress. |
| Updating Error | HDInsight encountered issues during the Autoscale configuration update. Customers can choose to either retry the update or disable autoscale. |
-| Error | Something is wrong with the cluster, and it'sn't usable. Delete this cluster and create a new one. |
+| Error | Something is wrong with the cluster, and it isn't usable. Delete this cluster and create a new one. |
To view the current number of nodes in your cluster, go to the **Cluster size** chart on the **Overview** page for your cluster. Or select **Cluster size** under **Settings**.
articles/hdinsight/hdinsight-hadoop-manage-ambari-rest-api.md: 10 additions & 10 deletions
@@ -21,7 +21,7 @@ Apache Ambari simplifies the management and monitoring of Hadoop clusters by pro
* A Hadoop cluster on HDInsight. See [Get Started with HDInsight on Linux](hadoop/apache-hadoop-linux-tutorial-get-started.md).
-* Bash on Ubuntu on Windows 10. The examples in this article use the Bash shell on Windows 10. See [Windows Subsystem for Linux Installation Guide for Windows 10](/windows/wsl/install-win10) for installation steps. Other [Unix shells](https://www.gnu.org/software/bash/) works as well. The examples, with some slight modifications, can work on a Windows Command prompt. Or you can use Windows PowerShell.
+* Bash on Ubuntu on Windows 10. The examples in this article use the Bash shell on Windows 10. See [Windows Subsystem for Linux Installation Guide for Windows 10](/windows/wsl/install-win10) for installation steps. Other [Unix shells](https://www.gnu.org/software/bash/) work as well. The examples, with some slight modifications, can work on a Windows Command prompt. Or you can use Windows PowerShell.
* jq, a command-line JSON processor. See [https://stedolan.github.io/jq/](https://stedolan.github.io/jq/).
@@ -41,7 +41,7 @@ For Enterprise Security Package clusters, instead of `admin`, use a fully qualif
### Setup (Preserve credentials)
-Preserve your credentials to avoid reentering them for each example. The cluster name preserved in a separate step.
+Preserve your credentials to avoid reentering them for each example. The cluster name is preserved in a separate step.
**A. Bash**
Edit the script by replacing `PASSWORD` with your actual password. Then enter the command.
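A minimal Bash sketch of this setup, with illustrative variable names rather than the article's exact script:

```bash
# Sketch: prompt once for the cluster login (admin) password and keep it
# in an environment variable so the later examples can reuse it.
read -s -p "Enter the cluster admin password: " PASSWORD
export PASSWORD

# The cluster name is preserved in a separate step, for example:
export CLUSTERNAME='my-hdinsight-cluster'
```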
@@ -185,7 +185,7 @@ foreach($item in $respObj.items) {
### Get the default storage
-HDInsight clusters must use an Azure Storage Account or Data Lake Storage as the default storage. You can use Ambari to retrieve this information after the cluster created. For example, if you want to read/write data to the container outside HDInsight.
+HDInsight clusters must use an Azure Storage account or Data Lake Storage as the default storage. You can use Ambari to retrieve this information after the cluster has been created, for example, if you want to read/write data to the container from outside HDInsight.
The following examples retrieve the default storage configuration from the cluster:
-> These examples return the first configuration applied to the server (`service_config_version=1`) which contains this information. If you retrieve a value that modified after cluster creation, you may need to list the configuration versions and retrieve the latest one.
+> These examples return the first configuration applied to the server (`service_config_version=1`), which contains this information. If you retrieve a value that has been modified after cluster creation, you may need to list the configuration versions and retrieve the latest one.
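A hedged Bash sketch of such a retrieval (the Ambari endpoint is standard, but the exact jq filter used in the article may differ):

```bash
# Sketch: read the default filesystem (fs.defaultFS) from the first
# HDFS service configuration version. Assumes $PASSWORD and $CLUSTERNAME
# were exported in the setup step.
curl -u admin:$PASSWORD -sS -G \
  "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/configurations/service_config_versions?service_name=HDFS&service_config_version=1" \
  | jq -r '.items[].configurations[].properties["fs.defaultFS"] | select(. != null)'
```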
The return value is similar to one of the following examples:
@@ -310,7 +310,7 @@ This example returns a JSON document containing the current configuration for th
```
**B. PowerShell**
-The PowerShell script uses [jq](https://stedolan.github.io/jq/). Edit `C:\HD\jq\jq-win64` to reflect your actual path and version of [jq](https://stedolan.github.io/jq/).
+The PowerShell script uses [jq](https://stedolan.github.io/jq/). Edit `C:\HD\jq\jq-win64` below to reflect your actual path and version of [jq](https://stedolan.github.io/jq/).
@@ -385,7 +385,7 @@ This example returns a JSON document containing the current configuration for th
At this point, the Ambari web UI indicates the Spark service needs to be restarted before the new configuration can take effect. Use the following steps to restart the service.
-1. Use the following to enable maintenance mode for the Spark 2 service:
+1. Use the following to enable maintenance mode for the Spark2 service:
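A hedged sketch of that call: Ambari sets maintenance mode with a `PUT` against the service resource (the `context` string is illustrative):

```bash
# Sketch: turn on maintenance mode for the SPARK2 service via the Ambari REST API.
curl -u admin:$PASSWORD -sS -H "X-Requested-By: ambari" -X PUT \
  -d '{"RequestInfo": {"context": "turning on maintenance mode for SPARK2"}, "Body": {"ServiceInfo": {"maintenance_state": "ON"}}}' \
  "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/SPARK2"
```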
@@ -453,7 +453,7 @@ At this point, the Ambari web UI indicates the Spark service needs to be restar
> The `href` value returned by this URI is using the internal IP address of the cluster node. To use it from outside the cluster, replace the `10.0.0.18:8080` portion with the FQDN of the cluster.
4. Verify the request.
-Edit the command by replacing `29` with the actual value for `id` returned from the prior step. The following commands retrieve the status of the request:
+Edit the command below by replacing `29` with the actual value for `id` returned from the prior step. The following commands retrieve the status of the request:
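A hedged sketch of that status check (`29` is the example request ID; the jq filter is illustrative):

```bash
# Sketch: retrieve the status of Ambari request 29; replace 29 with the
# id value returned when you submitted the restart request.
curl -u admin:$PASSWORD -sS -G \
  "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/requests/29" \
  | jq -r '.Requests.request_status'
```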
articles/hdinsight/hdinsight-hadoop-script-actions-linux.md: 4 additions & 4 deletions
@@ -34,7 +34,7 @@ When you develop a custom script for an HDInsight cluster, there are several bes
* [Target the Apache Hadoop version](#bPS1)
* [Target the OS Version](#bps10)
* [Provide stable links to script resources](#bPS2)
-* [Use precompiled resources](#bPS4)
+* [Use pre-compiled resources](#bPS4)
* [Ensure that the cluster customization script is idempotent](#bPS3)
* [Ensure high availability of the cluster architecture](#bPS5)
* [Configure the custom components to use Azure Blob storage](#bPS6)
@@ -118,15 +118,15 @@ The best practice is to download and archive everything in an Azure Storage acco
For example, the samples provided by Microsoft are stored in the `https://hdiconfigactions.blob.core.windows.net/` storage account. This location is a public, read-only container maintained by the HDInsight team.
-### <a name="bPS4"></a>Use precompiled resources
+### <a name="bPS4"></a>Use pre-compiled resources
-To reduce the time it takes to run the script, avoid operations that compile resources from source code. For example, precompile resources and store them in an Azure Storage account blob in the same data center as HDInsight.
+To reduce the time it takes to run the script, avoid operations that compile resources from source code. For example, pre-compile resources and store them in an Azure Storage account blob in the same data center as HDInsight.
### <a name="bPS3"></a>Ensure that the cluster customization script is idempotent
Scripts must be idempotent. If the script runs multiple times, it should return the cluster to the same state every time.
-If the script runs multiple times, the script modifies configuration files shouldn't add duplicate entries.
+If a script that modifies configuration files runs multiple times, it shouldn't add duplicate entries.
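A common Bash idiom for that kind of idempotent edit (a sketch; the file path and setting are hypothetical):

```bash
# Sketch: append a configuration line only if it isn't already present,
# so repeated runs don't create duplicate entries.
LINE='export EXAMPLE_OPTS="-Dexample.setting=true"'  # hypothetical setting
FILE='/etc/example/conf/example-env.sh'              # hypothetical config file
grep -qxF "$LINE" "$FILE" || echo "$LINE" >> "$FILE"
```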
### <a name="bPS5"></a>Ensure high availability of the cluster architecture
articles/hdinsight/hdinsight-phoenix-in-hdinsight.md: 3 additions & 3 deletions
@@ -11,7 +11,7 @@ ms.date: 05/22/2024
[Apache Phoenix](https://phoenix.apache.org/) is an open source, massively parallel relational database layer built on [Apache HBase](hbase/apache-hbase-overview.md). Phoenix allows you to use SQL-like queries over HBase. Phoenix uses JDBC drivers underneath to enable users to create, delete, alter SQL tables, indexes, views and sequences, and upsert rows individually and in bulk. Phoenix uses noSQL native compilation rather than using MapReduce to compile queries, enabling the creation of low-latency applications on top of HBase. Phoenix adds coprocessors to support running client-supplied code in the address space of the server, executing the code colocated with the data. This approach minimizes client/server data transfer.
-Apache Phoenix opens up big data queries to nondevelopers who can use a SQL-like syntax rather than programming. Phoenix is highly optimized for HBase, unlike other tools such as [Apache Hive](hadoop/hdinsight-use-hive.md) and Apache Spark SQL. The benefit to developers is writing highly performant queries with much less code.
+Apache Phoenix opens up big data queries to non-developers who can use a SQL-like syntax rather than programming. Phoenix is highly optimized for HBase, unlike other tools such as [Apache Hive](hadoop/hdinsight-use-hive.md) and Apache Spark SQL. The benefit to developers is writing highly performant queries with much less code.
When you submit a SQL query, Phoenix compiles the query to HBase native calls and runs the scan (or plan) in parallel for optimization. This layer of abstraction frees the developer from writing MapReduce jobs, to focus instead on the business logic and the workflow of their application around Phoenix's big data storage.
@@ -89,9 +89,9 @@ ALTER TABLE my_other_table SET TRANSACTIONAL=true;
### Salted Tables
-*Region server hotspotting* can occur when writing records with sequential keys to HBase. Though you may have multiple region servers in your cluster, your writes are all occurring on just one. This concentration creates the hotspotting issue where, instead of your write workload being distributed across all of the available region servers, just one is handling the load. Since each region has a predefined maximum size, when a region reaches that size limit, split into two small regions. When that happens, one of these new regions takes all new records, becoming the new hotspot.
+*Region server hotspotting* can occur when writing records with sequential keys to HBase. Though you may have multiple region servers in your cluster, your writes are all occurring on just one. This concentration creates the hotspotting issue where, instead of your write workload being distributed across all of the available region servers, just one is handling the load. Since each region has a predefined maximum size, when a region reaches that size limit, it's split into two smaller regions. When that happens, one of these new regions takes all new records, becoming the new hotspot.
-To mitigate this problem and achieve better performance, presplit tables so that all of the region servers are equally used. Phoenix provides *salted tables*, transparently adding the salting byte to the row key for a particular table. The table is presplit on the salt byte boundaries to ensure equal load distribution among region servers during the initial phase of the table. This approach distributes the write workload across all of the available region servers, improving the write and read performance. To salt a table, specify the `SALT_BUCKETS` table property when the table is created:
+To mitigate this problem and achieve better performance, pre-split tables so that all of the region servers are equally used. Phoenix provides *salted tables*, transparently adding the salting byte to the row key for a particular table. The table is pre-split on the salt byte boundaries to ensure equal load distribution among region servers during the initial phase of the table. This approach distributes the write workload across all of the available region servers, improving the write and read performance. To salt a table, specify the `SALT_BUCKETS` table property when the table is created:
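A minimal sketch of such a statement (`SALT_BUCKETS` is the real Phoenix table property; the table and columns are illustrative):

```sql
-- Sketch: create a salted Phoenix table spread across 4 salt buckets
-- (SALT_BUCKETS accepts values from 0 to 256).
CREATE TABLE sales_history (
    sale_id BIGINT NOT NULL PRIMARY KEY,
    amount  DECIMAL
) SALT_BUCKETS = 4;
```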
articles/hdinsight/hdinsight-selecting-vm-size.md: 3 additions & 3 deletions
@@ -11,13 +11,13 @@ ms.date: 05/22/2024
This article discusses how to select the right VM size for the various nodes in your HDInsight cluster.
-Begin by understanding how the properties of a virtual machine such as CPU processing, RAM size, and network latency affects the processing of your workloads. Next, think about your application and how it matches with what different VM families are optimized for. Make sure that the VM family that you would like to use is compatible with the cluster type that you plan to deploy. For a list of all supported and recommended VM sizes for each cluster type, see [Azure HDInsight supported node configurations](hdinsight-supported-node-configuration.md). Lastly, you can use a benchmarking process to test some sample workloads and check which SKU within that family is right for you.
+Begin by understanding how the properties of a virtual machine such as CPU processing, RAM size, and network latency affect the processing of your workloads. Next, think about your application and how it matches with what different VM families are optimized for. Make sure that the VM family that you would like to use is compatible with the cluster type that you plan to deploy. For a list of all supported and recommended VM sizes for each cluster type, see [Azure HDInsight supported node configurations](hdinsight-supported-node-configuration.md). Lastly, you can use a benchmarking process to test some sample workloads and check which SKU within that family is right for you.
For more information on planning other aspects of your cluster such as selecting a storage type or cluster size, see [Capacity planning for HDInsight clusters](hdinsight-capacity-planning.md).
## VM properties and big data workloads
-The VM size and type determined by CPU processing power, RAM size, and network latency:
+The VM size and type are determined by CPU processing power, RAM size, and network latency:
- CPU: The VM size dictates the number of cores. The more cores, the greater the degree of parallel computation each node can achieve. Also, some VM types have faster cores.
@@ -40,7 +40,7 @@ Virtual machine families in Azure are optimized to suit different use cases. In
## Cost saving VM types for light workloads
-If you have light processing requirements, the [F-series](https://azure.microsoft.com/blog/f-series-vm-size/) can be a good choice to get started with HDInsight. At a lower per-hour list price, the F-series are the best value in price-performance in the Azure portfolio based on the Azure Compute Unit (ACU) per vCPU.
+If you have light processing requirements, the [F-series](https://azure.microsoft.com/blog/f-series-vm-size/) can be a good choice to get started with HDInsight. At a lower per-hour list price, the F-series is the best value in price-performance in the Azure portfolio based on the Azure Compute Unit (ACU) per vCPU.
The following table describes the cluster types and node types that can be created with the Fsv2-series VMs.