articles/hdinsight/hadoop/apache-hadoop-run-custom-programs.md
Lines changed: 3 additions & 3 deletions
@@ -4,7 +4,7 @@ description: When and how to run custom Apache MapReduce programs on Azure HDIns
ms.service: hdinsight
ms.topic: how-to
ms.custom: hdinsightactive
-ms.date: 01/31/2023
+ms.date: 02/12/2024
---
# Run custom MapReduce programs
@@ -15,8 +15,8 @@ Apache Hadoop-based big data systems such as HDInsight enable data processing us
| --- | --- | --- |
|**Apache Hive using HiveQL**| <ul><li>An excellent solution for batch processing and analysis of large amounts of immutable data, for data summarization, and for on demand querying. It uses a familiar SQL-like syntax.</li><li>It can be used to produce persistent tables of data that can be easily partitioned and indexed.</li><li>Multiple external tables and views can be created over the same data.</li><li>It supports a simple data warehouse implementation that provides massive scale-out and fault-tolerance capabilities for data storage and processing.</li></ul> | <ul><li>It requires the source data to have at least some identifiable structure.</li><li>It isn't suitable for real-time queries and row level updates. It's best used for batch jobs over large sets of data.</li><li>It might not be able to carry out some types of complex processing tasks.</li></ul> |
|**Apache Pig using Pig Latin**| <ul><li>An excellent solution for manipulating data as sets, merging and filtering datasets, applying functions to records or groups of records, and for restructuring data by defining columns, by grouping values, or by converting columns to rows.</li><li>It can use a workflow-based approach as a sequence of operations on data.</li></ul> | <ul><li>SQL users may find Pig Latin is less familiar and more difficult to use than HiveQL.</li><li>The default output is usually a text file and so can be more difficult to use with visualization tools such as Excel. Typically you'll layer a Hive table over the output.</li></ul> |
-|**Custom map/reduce**| <ul><li>It provides full control over the map and reduce phases, and execution.</li><li>It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network.</li><li>The components can be written in a range of well-known languages.</li></ul> | <ul><li>It's more difficult than using Pig or Hive because you must create your own map and reduce components.</li><li>Processes that require joining sets of data are more difficult to implement.</li><li>Even though there are test frameworks available, debugging code is more complex than a normal application because the code runs as a batch job under the control of the Hadoop job scheduler.</li></ul> |
-|**Apache HCatalog**| <ul><li>It abstracts the path details of storage, making administration easier and removing the need for users to know where the data is stored.</li><li>It enables notification of events such as data availability, allowing other tools such as Oozie to detect when operations have occurred.</li><li>It exposes a relational view of data, including partitioning by key, and makes the data easy to access.</li></ul> | <ul><li>It supports RCFile, CSV text, JSON text, SequenceFile, and ORC file formats by default, but you may need to write a custom SerDe for other formats.</li><li>HCatalog isn't thread-safe.</li><li>There are some restrictions on the data types for columns when using the HCatalog loader in Pig scripts. For more information, see [HCatLoader Data Types](https://cwiki.apache.org/confluence/display/Hive/HCatalog%20LoadStore#HCatalogLoadStore-HCatLoaderDataTypes) in the Apache HCatalog documentation.</li></ul> |
+|**Custom map/reduce**| <ul><li>It provides full control over the map and reduces phases, and execution.</li><li>It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network.</li><li>The components can be written in a range of well-known languages.</li></ul> | <ul><li>It's more difficult than using Pig or Hive because you must create your own map and reduce components.</li><li>Processes that require joining sets of data are more difficult to implement.</li><li>Even though there are test frameworks available, debugging code is more complex than a normal application because the code runs as a batch job under the control of the Hadoop job scheduler.</li></ul> |
+|`Apache HCatalog`| <ul><li>It abstracts the path details of storage, making administration easier and removing the need for users to know where the data is stored.</li><li>It enables notification of events such as data availability, allowing other tools such as Oozie to detect when operations have occurred.</li><li>It exposes a relational view of data, including partitioning by key, and makes the data easy to access.</li></ul> | <ul><li>It supports RCFile, CSV text, JSON text, SequenceFile, and ORC file formats by default, but you may need to write a custom SerDe for other formats.</li><li>`HCatalog` isn't thread-safe.</li><li>There are some restrictions on the data types for columns when using the `HCatalog` loader in Pig scripts. For more information, see [HCatLoader Data Types](https://cwiki.apache.org/confluence/display/Hive/HCatalog%20LoadStore#HCatalogLoadStore-HCatLoaderDataTypes) in the Apache `HCatalog` documentation.</li></ul> |
Typically, you use the simplest of these approaches that can provide the results you require. For example, you may be able to achieve such results by using just Hive, but for more complex scenarios you may need to use Pig, or even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig, that custom map and reduce components can provide better performance by allowing you to fine-tune and optimize the processing.
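
To make "your own map and reduce components" concrete, here is a minimal word-count sketch in Python. It is not taken from the article and does not use any HDInsight or Hadoop API: the function names (`map_words`, `shuffle`, `reduce_counts`, `run_word_count`) are hypothetical, and the `shuffle` helper only imitates, in a single local process, the grouping step that the framework would normally perform between the two phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key. Assumption: a local stand-in for the
    framework's shuffle-and-sort stage, not a Hadoop API."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def run_word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in order within one process."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, for illustration only.
    print(run_word_count(["Hive Pig Hive", "pig mapreduce"]))
```

Even this small job needs an explicit map function, reduce function, and grouping logic, which illustrates why trying Hive or Pig first is usually the simpler path.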