articles/hdinsight/hdinsight-hadoop-optimize-hive-query.md
ms.reviewer: jasonh
ms.service: hdinsight
ms.custom: hdinsightactive
ms.topic: conceptual
ms.date: 11/14/2019
---

# Optimize Apache Hive queries in Azure HDInsight
In Azure HDInsight, there are several cluster types and technologies that can run Hive queries.

For example, choose the **Interactive Query** cluster type to optimize for ad hoc, interactive queries. Choose the Apache **Hadoop** cluster type to optimize for Hive queries used as a batch process. **Spark** and **HBase** cluster types can also run Hive queries. For more information on running Hive queries on various HDInsight cluster types, see [What is Apache Hive and HiveQL on Azure HDInsight?](hadoop/hdinsight-use-hive.md).

HDInsight clusters of the Hadoop cluster type aren't optimized for performance by default. This article describes some of the most common Hive performance optimization methods that you can apply to your queries.

## Scale out worker nodes
Increasing the number of worker nodes in an HDInsight cluster allows the work to use more mappers and reducers running in parallel. There are two ways to scale out in HDInsight:

* When you create a cluster, you can specify the number of worker nodes by using the Azure portal, Azure PowerShell, or the command-line interface. For more information, see [Create HDInsight clusters](hdinsight-hadoop-provision-linux-clusters.md). The following screenshot shows the worker node configuration on the Azure portal:
* After creation, you can also edit the number of worker nodes to scale out a cluster further without recreating one:

For more information about scaling HDInsight, see Scale HDInsight clusters.

Tez is faster because:
* **Executes the Directed Acyclic Graph (DAG) as a single job**. In the MapReduce engine, the DAG requires each set of mappers to be followed by one set of reducers, which causes multiple MapReduce jobs to be spun off for each Hive query. Tez doesn't have that constraint and can process a complex DAG as one job, which minimizes job startup overhead.
* **Avoids unnecessary writes**. In the MapReduce engine, multiple jobs are used to process the same Hive query, and the output of each MapReduce job is written to HDFS as intermediate data. Because Tez minimizes the number of jobs for each Hive query, it can avoid unnecessary writes.
* **Minimizes start-up delays**. Tez is better able to minimize start-up delay by reducing the number of mappers it needs to start and also by improving optimization throughout.
* **Reuses containers**. Whenever possible, Tez reuses containers to reduce the latency of starting them up.
* **Continuous optimization techniques**. Traditionally, optimization is done during the compilation phase. However, more information about the inputs becomes available at runtime, which allows for better optimization. Tez uses continuous optimization techniques that let it optimize the plan further into the runtime phase.
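
Whether Tez is already the default engine depends on the cluster version. As a minimal sketch, you can select the engine for the current session with the `hive.execution.engine` property:

```hive
-- Use the Tez execution engine for the current session.
-- (Many HDInsight cluster versions already default to Tez.)
SET hive.execution.engine=tez;

-- To compare against the classic MapReduce engine, switch back with:
-- SET hive.execution.engine=mr;
```
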
* **Dynamic partitioning** means that you want Hive to create partitions automatically for you. Since you've already created the partitioned table from the staging table, all you need to do is insert data into the partitioned table:
```hive
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT INTO TABLE lineitem_part
PARTITION (L_SHIPDATE)
SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY,
    L_SUPPKEY as L_SUPPKEY, L_LINENUMBER as L_LINENUMBER,
    L_QUANTITY as L_QUANTITY, L_EXTENDEDPRICE as L_EXTENDEDPRICE,
    L_DISCOUNT as L_DISCOUNT, L_TAX as L_TAX, L_RETURNFLAG as L_RETURNFLAG,
    L_LINESTATUS as L_LINESTATUS, L_SHIPDATE as L_SHIPDATE_PS,
    L_COMMITDATE as L_COMMITDATE, L_RECEIPTDATE as L_RECEIPTDATE,
    L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE,
    L_COMMENT as L_COMMENT, L_SHIPDATE as L_SHIPDATE FROM lineitem;
```
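
The `INSERT` above assumes the partitioned table `lineitem_part` already exists. As a rough sketch only (the column types below are assumptions, not taken from this article), such a table declares `L_SHIPDATE` in the `PARTITIONED BY` clause rather than in the regular column list:

```hive
-- Hypothetical DDL for the partitioned target table; column types are assumed.
CREATE TABLE lineitem_part (
    L_ORDERKEY INT,
    L_PARTKEY INT,
    L_SUPPKEY INT,
    L_LINENUMBER INT,
    L_QUANTITY DOUBLE,
    L_EXTENDEDPRICE DOUBLE,
    L_DISCOUNT DOUBLE,
    L_TAX DOUBLE,
    L_RETURNFLAG STRING,
    L_LINESTATUS STRING,
    L_SHIPDATE_PS STRING,
    L_COMMITDATE STRING,
    L_RECEIPTDATE STRING,
    L_SHIPINSTRUCT STRING,
    L_SHIPMODE STRING,
    L_COMMENT STRING
)
PARTITIONED BY (L_SHIPDATE STRING);
```
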
For more information, see [Partitioned Tables](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables).
## Use the ORCFile format
Hive supports different file formats. For example:
* **Text**: the default file format; it works with most scenarios.

Next, you insert data into the ORC table from the staging table. For example:
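
The exact statements are not shown in this excerpt. As a hedged sketch, assuming an ORC-backed table named `lineitem_orc` and a text-format staging table named `lineitem` (both names are illustrative), the load looks like this:

```hive
-- Sketch only: lineitem_orc is assumed to have been created with STORED AS ORC,
-- and lineitem is the text-format staging table.
INSERT INTO TABLE lineitem_orc
SELECT * FROM lineitem;
```
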