
Commit 639482c

Merge pull request #293971 from sreekzz/feb-freshness-index
Feb Freshness Index update
2 parents ab669c1 + 3515d15 commit 639482c

18 files changed: +46 -46 lines changed

articles/hdinsight/hadoop/apache-hadoop-connect-hive-jdbc-driver.md

Lines changed: 2 additions & 2 deletions

```diff
@@ -4,7 +4,7 @@ description: Use the JDBC driver from a Java application to submit Apache Hive q
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, devx-track-extended-java
-ms.date: 02/12/2024
+ms.date: 02/03/2025
 ---

 # Query Apache Hive through the JDBC driver in HDInsight
@@ -145,7 +145,7 @@ at java.util.concurrent.FutureTask.get(FutureTask.java:206)

 **Symptoms**: HDInsight unexpectedly disconnects the connection when trying to download a huge amount of data (say several GBs) through JDBC/ODBC.

-**Cause**: The limitation on Gateway nodes causes this error. When getting data from JDBC/ODBC, all data needs to pass through the Gateway node. However, a gateway isn't designed to download a huge amount of data, so the Gateway might close the connection if it can't handle the traffic.
+**Cause**: The limitation on Gateway nodes causes this error. When you get data from JDBC/ODBC, all data needs to pass through the Gateway node. However, a gateway isn't designed to download a huge amount of data, so the Gateway might close the connection if it can't handle the traffic.

 **Resolution**: Avoid using JDBC/ODBC driver to download huge amounts of data. Copy data directly from blob storage instead.
```

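The troubleshooting entry above assumes the usual JDBC path through the HDInsight gateway. As context, here is a minimal Java sketch of that pattern; the cluster name, credentials, and sample table are placeholders, the Hive JDBC driver is assumed to be on the classpath, and the exact connection-string form should be confirmed against the article itself rather than taken from this sketch.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster name and credentials; the URL follows the
        // HTTPS-over-gateway form used for HDInsight (confirm against the article).
        String url = "jdbc:hive2://CLUSTERNAME.azurehdinsight.net:443/default;"
                + "transportMode=http;ssl=true;httpPath=/hive2";
        try (Connection conn = DriverManager.getConnection(url, "admin", "CLUSTER_PASSWORD");
             Statement stmt = conn.createStatement();
             // Keep result sets small: the gateway is not built for bulk downloads.
             ResultSet rs = stmt.executeQuery(
                     "SELECT clientid, devicemake FROM hivesampletable LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
        }
    }
}
```

Keeping queries bounded, as with the `LIMIT` above, avoids the bulk-download scenario the resolution warns about; large extracts should be copied directly from blob storage instead.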
articles/hdinsight/hadoop/apache-hadoop-run-custom-programs.md

Lines changed: 7 additions & 7 deletions

```diff
@@ -4,7 +4,7 @@ description: When and how to run custom Apache MapReduce programs on Azure HDIns
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive
-ms.date: 02/12/2024
+ms.date: 02/03/2025
 ---

 # Run custom MapReduce programs
@@ -13,28 +13,28 @@ Apache Hadoop-based big data systems such as HDInsight enable data processing us

 | Query mechanism | Advantages | Considerations |
 | --- | --- | --- |
-| **Apache Hive using HiveQL** | <ul><li>An excellent solution for batch processing and analysis of large amounts of immutable data, for data summarization, and for on demand querying. It uses a familiar SQL-like syntax.</li><li>It can be used to produce persistent tables of data that can be easily partitioned and indexed.</li><li>Multiple external tables and views can be created over the same data.</li><li>It supports a simple data warehouse implementation that provides massive scale-out and fault-tolerance capabilities for data storage and processing.</li></ul> | <ul><li>It requires the source data to have at least some identifiable structure.</li><li>It isn't suitable for real-time queries and row level updates. It's best used for batch jobs over large sets of data.</li><li>It might not be able to carry out some types of complex processing tasks.</li></ul> |
+| **Apache Hive using HiveQL** | <ul><li>An excellent solution for batch processing and analysis of large amounts of immutable data, for data summarization, and for on demand querying. It uses a familiar SQL-like syntax.</li><li>It can be used to that produce persistent tables of data that can be easily partitioned and indexed.</li><li>Multiple external tables and views can be created over the same data.</li><li>It supports a simple data warehouse implementation that provides massive scale-out and fault-tolerance capabilities for data storage and processing.</li></ul> | <ul><li>It requires the source data to have at least some identifiable structure.</li><li>It isn't suitable for real-time queries and row level updates. It's best used for batch jobs over large sets of data.</li><li>It might not be able to carry out some types of complex processing tasks.</li></ul> |
 | **Apache Pig using Pig Latin** | <ul><li>An excellent solution for manipulating data as sets, merging and filtering datasets, applying functions to records or groups of records, and for restructuring data by defining columns, by grouping values, or by converting columns to rows.</li><li>It can use a workflow-based approach as a sequence of operations on data.</li></ul> | <ul><li>SQL users may find Pig Latin is less familiar and more difficult to use than HiveQL.</li><li>The default output is usually a text file and so can be more difficult to use with visualization tools such as Excel. Typically you'll layer a Hive table over the output.</li></ul> |
-| **Custom map/reduce** | <ul><li>It provides full control over the map and reduces phases, and execution.</li><li>It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network.</li><li>The components can be written in a range of well-known languages.</li></ul> | <ul><li>It's more difficult than using Pig or Hive because you must create your own map and reduce components.</li><li>Processes that require joining sets of data are more difficult to implement.</li><li>Even though there are test frameworks available, debugging code is more complex than a normal application because the code runs as a batch job under the control of the Hadoop job scheduler.</li></ul> |
+| **Custom MapReduce** | <ul><li>It provides full control over the map and reduces phases, and execution.</li><li>It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network.</li><li>The components can be written in a range of well-known languages.</li></ul> | <ul><li>It's more difficult than using Pig or Hive because you must create your own map and reduce components.</li><li>Processes that require joining sets of data are more difficult to implement.</li><li>Even though there are test frameworks available, debugging code is more complex than a normal application because the code runs as a batch job under the control of the Hadoop job scheduler.</li></ul> |
 | `Apache HCatalog` | <ul><li>It abstracts the path details of storage, making administration easier and removing the need for users to know where the data is stored.</li><li>It enables notification of events such as data availability, allowing other tools such as Oozie to detect when operations have occurred.</li><li>It exposes a relational view of data, including partitioning by key, and makes the data easy to access.</li></ul> | <ul><li>It supports RCFile, CSV text, JSON text, SequenceFile, and ORC file formats by default, but you may need to write a custom SerDe for other formats.</li><li>`HCatalog` isn't thread-safe.</li><li>There are some restrictions on the data types for columns when using the `HCatalog` loader in Pig scripts. For more information, see [HCatLoader Data Types](https://cwiki.apache.org/confluence/display/Hive/HCatalog%20LoadStore#HCatalogLoadStore-HCatLoaderDataTypes) in the Apache `HCatalog` documentation.</li></ul> |

 Typically, you use the simplest of these approaches that can provide the results you require. For example, you may be able to achieve such results by using just Hive, but for more complex scenarios you may need to use Pig, or even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig, that custom map and reduce components can provide better performance by allowing you to fine-tune and optimize the processing.

-## Custom map/reduce components
+## Custom MapReduce components

-Map/reduce code consists of two separate functions implemented as **map** and **reduce** components. The **map** component is run in parallel on multiple cluster nodes, each node applying the mapping to the node's own subset of the data. The **reduce** component collates and summarizes the results from all the map functions. For more information on these two components, see [Use MapReduce in Hadoop on HDInsight](hdinsight-use-mapreduce.md).
+MapReduce code consists of two separate functions implemented as **map** and **reduce** components. The **map** component is run in parallel on multiple cluster nodes, each node applying the mapping to the node's own subset of the data. The **reduce** component collates and summarizes the results from all the map functions. For more information on these two components, see [Use MapReduce in Hadoop on HDInsight](hdinsight-use-mapreduce.md).

 In most HDInsight processing scenarios, it's simpler and more efficient to use a higher-level abstraction such as Pig or Hive. You can also create custom map and reduce components for use within Hive scripts to perform more sophisticated processing.

-Custom map/reduce components are typically written in Java. Hadoop provides a streaming interface that also allows components to be used that are developed in other languages such as C#, F#, Visual Basic, Python, and JavaScript.
+Custom MapReduce components are typically written in Java. Hadoop provides a streaming interface that also allows components to be used that are developed in other languages such as C#, F#, Visual Basic, Python, and JavaScript.

 * For a walkthrough on developing custom Java MapReduce programs, see [Develop Java MapReduce programs for Hadoop on HDInsight](apache-hadoop-develop-deploy-java-mapreduce-linux.md).

 Consider creating your own map and reduce components for the following conditions:

 * You need to process data that is completely unstructured by parsing the data and using custom logic to obtain structured information from it.
 * You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive without resorting to creating a UDF. For example, you might need to use an external geocoding service to convert latitude and longitude coordinates or IP addresses in the source data to geographical location names.
-* You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components by using the Hadoop streaming interface.
+* You want to reuse your existing .NET, Python, or JavaScript code in MapReduce components by using the Hadoop streaming interface.

 ## Upload and run your custom MapReduce program
```

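The custom-MapReduce discussion in the diff above is easier to picture with a concrete map and reduce pair. The following is a minimal word-count-style sketch against the stock Hadoop MapReduce API; the class names are illustrative, and the job driver plus input/output configuration are omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal word-count style map and reduce components (illustrative names).
public class WordCountSketch {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each node maps its own subset of the input, one line at a time.
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // emit (word, 1) for each token
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get(); // collate the counts produced by every mapper
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The mapper runs in parallel across the cluster nodes and emits intermediate pairs, and the reducer collates them, which is exactly the two-phase split the article describes.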
articles/hdinsight/hbase/apache-hbase-phoenix-zeppelin.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ description: Learn how to use Apache Zeppelin to run Apache Base queries with Ph
 ms.service: azure-hdinsight
 ms.custom: hdinsightactive
 ms.topic: how-to
-ms.date: 02/12/2024
+ms.date: 02/03/2025
 ---

 # Use Apache Zeppelin to run Apache Phoenix queries over Apache HBase in Azure HDInsight
```

articles/hdinsight/hbase/apache-hbase-using-phoenix-query-server-rest-sdk.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ description: Install and use the REST SDK for the Phoenix Query Server in Azure
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: "hdinsightactive, devx-track-csharp"
-ms.date: 02/12/2024
+ms.date: 02/03/2025
 ---

 # Apache Phoenix Query Server REST SDK
```

articles/hdinsight/hbase/hbase-troubleshoot-timeouts-hbase-hbck.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ title: Timeouts with 'hbase hbck' command in Azure HDInsight
 description: Time out issue with 'hbase hbck' command when fixing region assignments
 ms.service: azure-hdinsight
 ms.topic: troubleshooting
-ms.date: 02/20/2024
+ms.date: 02/03/2025
 ---

 # Scenario: Timeouts with 'hbase hbck' command in Azure HDInsight
```

articles/hdinsight/hdinsight-apps-publish-applications.md

Lines changed: 5 additions & 5 deletions

```diff
@@ -4,7 +4,7 @@ description: Learn how to create an HDInsight application, and then publish it i
 ms.service: azure-hdinsight
 ms.custom: hdinsightactive
 ms.topic: how-to
-ms.date: 02/12/2024
+ms.date: 02/03/2025

 ---
 # Publish an HDInsight application in the Azure Marketplace
@@ -47,7 +47,7 @@ When an application is installed on a cluster (either on an existing cluster, or
 > [!IMPORTANT]
 > The name of the application installation script must be unique for a specific cluster. The script name must have the following format:
 >
-> "name": "[concat('hue-install-v0','-' ,uniquestring(‘applicationName’)]"
+> `"name": "[concat('hue-install-v0','-' ,uniquestring(‘applicationName’)]"`
 >
 > The script name has three parts:
 >
@@ -60,9 +60,9 @@ When an application is installed on a cluster (either on an existing cluster, or

 The installation script must have the following characteristics:
 * The script is idempotent. Multiple calls to the script produce the same result.
-* The script is properly versioned. Use a different location for the script when you are upgrading or testing changes. This ensures that customers who are installing the application are not affected by your updates or testing.
+* The script is properly versioned. Use a different location for the script when you're upgrading or testing changes. This ensures that customers who are installing the application aren't affected by your updates or testing.
 * The script has adequate logging at each point. Usually, script logs are the only way to debug application installation issues.
-* Calls to external services or resources have adequate retries so that the installation is not affected by transient network issues.
+* Calls to external services or resources have adequate retries so that the installation isn't affected by transient network issues.
 * If your script starts services on the nodes, services are monitored and configured to start automatically if a node reboot occurs.

 ## Package the application
@@ -83,7 +83,7 @@ To publish an HDInsight application:
 2. In the left menu, select **Solution templates**.
 3. Enter a title, and then select **Create a new solution template**.
 4. If you haven't already registered your organization, select **Create Dev Center account and join the Azure program**. For more information, see [Create a Microsoft Developer account](../marketplace/overview.md).
-5. Select **Define some Topologies to get Started**. A solution template is a "parent" to all its topologies. You can define multiple topologies in one offer or solution template. When an offer is pushed to staging, it is pushed with all its topologies.
+5. Select **Define some Topologies to get Started**. A solution template is a "parent" to all its topologies. You can define multiple topologies in one offer or solution template. When an offer is pushed to staging, it's pushed with all its topologies.
 6. Enter a topology name, and then select **+**.
 7. Enter a new version, and then select **+**.
 8. Upload the .zip file you created when you packaged the application.
```

articles/hdinsight/hdinsight-known-issues-conda-version-regression.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -3,7 +3,7 @@ title: Conda Version Regression in a recent HDInsight release
 description: Known issue affecting image version 5.1.3000.0.2308052231
 ms.service: azure-hdinsight
 ms.topic: troubleshooting-known-issue
-ms.date: 02/22/2024
+ms.date: 02/03/2025
 ---

 # Conda version regression in a recent HDInsight release
```

articles/hdinsight/hdinsight-os-patching.md

Lines changed: 1 addition & 1 deletion

```diff
@@ -4,7 +4,7 @@ description: Learn how to configure OS patching schedule for Linux-based HDInsig
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive, linux-related-content
-ms.date: 02/12/2024
+ms.date: 02/03/2025
 ---

 # Configure the OS patching schedule for Linux-based HDInsight clusters
```

articles/hdinsight/hdinsight-upgrade-cluster.md

Lines changed: 5 additions & 5 deletions

```diff
@@ -5,14 +5,14 @@ description: Learn guidelines to migrate your Azure HDInsight cluster to a newer
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive
-ms.date: 02/21/2024
+ms.date: 02/03/2025
 ---
 # Migrate HDInsight cluster to a newer version

 To take advantage of the latest HDInsight features, we recommend that HDInsight clusters be regularly migrated to latest version. HDInsight doesn't support in-place upgrades where an existing cluster is upgraded to a newer component version. You must create a new cluster with the desired component and platform version and then migrate your applications to use the new cluster. Follow the below guidelines to migrate your HDInsight cluster versions.

 > [!NOTE]
-> If you are creating a Hive cluster with a primary storage container, copy it from an existing HDInsight cluster. Don'tt copy the complete content. Copy only the data folders which are configured.
+> If you're creating a Hive cluster with a primary storage container, copy it from an existing HDInsight cluster. Don't copy the complete content. Copy only the data folders which are configured.

 ## Migration tasks

@@ -24,9 +24,9 @@ The workflow to upgrade HDInsight Cluster is as follows.
 3. Copy existing jobs, data sources, and sinks to the new environment.
 4. Perform validation testing to make sure that your jobs work as expected on the new cluster.

-Once you've verified that everything works as expected, schedule downtime for the migration. During this downtime, do the following actions:
+Once you have verified that everything works as expected, schedule downtime for the migration. During this downtime, do the following actions:

-1. Back up any transient data stored locally on the cluster nodes. For example, if you've data stored directly on a head node.
+1. Back up any transient data stored locally on the cluster nodes. For example, if you have data stored directly on a head node.
 1. [Delete the existing cluster](./hdinsight-delete-cluster.md).
 1. Create a cluster in the same VNET subnet with latest (or supported) HDI version using the same default data store that the previous cluster used. This allows the new cluster to continue working against your existing production data.
 1. Import any transient data you backed up.
@@ -58,7 +58,7 @@ As mentioned above, Microsoft recommends that HDInsight clusters be regularly mi
 * **Third-party software**. Customers have the ability to install third-party software on their HDInsight clusters; however, we'll recommend recreating the cluster if it breaks the existing functionality.
 * **Multiple workloads on the same cluster**. In HDInsight 4.0, the Hive Warehouse Connector needs separate clusters for Spark and Interactive Query workloads. [Follow these steps to set up both clusters in Azure HDInsight](interactive-query/apache-hive-warehouse-connector.md). Similarly, integrating [Spark with HBASE](hdinsight-using-spark-query-hbase.md) requires two different clusters.
 * **Custom Ambari DB password changed**. The Ambari DB password is set during cluster creation and there's no current mechanism to update it. If a customer deploys the cluster with a [custom Ambari DB](hdinsight-custom-ambari-db.md), they have the ability to change the DB password on the SQL DB; however, there's no way to update this password for a running HDInsight cluster.
-* **Modifying HDInsight Load Balancers**. The HDInsight load balancers that are automatically deployed for Ambari and SSH access **should not** be modified or deleted. If you modify the HDInsight load balancer(s) and it breaks the cluster functionality, you will be advised to redeploy the cluster.
+* **Modifying HDInsight Load Balancers**. The HDInsight load balancers that are automatically deployed for Ambari and SSH access **should not** be modified or deleted. If you modify the HDInsight load balancers and it breaks the cluster functionality, you will be advised to redeploy the cluster.
 * **Reusing Ranger 4.X Databases in 5.X**. HDInsight 5.1 has [Apache Ranger version 2.3.0](https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+2.3.0+-+Release+Notes) which is major version upgrade from 1.2.0 in HDInsight 4.X clusters. Reuse of an HDInsight 4.X Ranger database in HDInsight 5.1 would prevent the Ranger service from starting due to differences in the DB schema. You would need to create an empty Ranger database to successfully deploy HDInsight 5.1 ESP clusters.

 ## Next steps
```

articles/hdinsight/hdinsight-using-spark-query-hbase.md

Lines changed: 3 additions & 3 deletions

```diff
@@ -4,7 +4,7 @@ description: Use the Spark HBase Connector to read and write data from a Spark c
 ms.service: azure-hdinsight
 ms.topic: how-to
 ms.custom: hdinsightactive
-ms.date: 02/27/2024
+ms.date: 02/03/2025
 ---

 # Use Apache Spark to read and write Apache HBase data
@@ -105,11 +105,11 @@ wasb://sparkcon-2020-08-03t18-17-37-853z@sparkconhdistorage.blob.core.windows.ne
 |Persisted|yes|

-* You can specify how often you want this cluster to automatically check if update. Default: -s “*/1 * * * *” -h 0 (In this example, the Spark cron runs every minute, while the HBase cron doesn't run)
+* You can specify how often you want this cluster to automatically check if update. Default: `-s “*/1 * * * *” -h 0` (In this example, the Spark cron job runs every minute, while the HBase cron doesn't run)
 * Since HBase cron isn't set up by default, you need to rerun this script when perform scaling to your HBase cluster. If your HBase cluster scales often, you may choose to set up HBase cron job automatically. For example: `-s '*/1 * * * *' -h '*/30 * * * *' -d "securehadooprc"` configures the script to perform checks every 30 minutes. This will run HBase cron schedule periodically to automate downloading of new HBase information on the common storage account to local node.

 >[!NOTE]
->These scripts works only on HDI 5.0 and HDI 5.1 clusters.
+>These scripts work only on HDI 5.0 and HDI 5.1 clusters.
```
