Commit d7f3649

Merge pull request #96042 from dagiro/freshness51

freshness51

2 parents 0feb684 + 6a0d7d4

4 files changed (+19 −18 lines changed)
articles/hdinsight/hdinsight-hadoop-optimize-hive-query.md

Lines changed: 19 additions & 18 deletions
@@ -7,7 +7,7 @@ ms.reviewer: jasonh
 ms.service: hdinsight
 ms.custom: hdinsightactive
 ms.topic: conceptual
-ms.date: 03/21/2019
+ms.date: 11/14/2019
 ---
 
 # Optimize Apache Hive queries in Azure HDInsight
@@ -16,15 +16,15 @@ In Azure HDInsight, there are several cluster types and technologies that can ru
 
 For example, choose **Interactive Query** cluster type to optimize for ad hoc, interactive queries. Choose Apache **Hadoop** cluster type to optimize for Hive queries used as a batch process. **Spark** and **HBase** cluster types can also run Hive queries. For more information on running Hive queries on various HDInsight cluster types, see [What is Apache Hive and HiveQL on Azure HDInsight?](hadoop/hdinsight-use-hive.md).
 
-HDInsight clusters of Hadoop cluster type are not optimized for performance by default. This article describes some of the most common Hive performance optimization methods that you can apply to your queries.
+HDInsight clusters of Hadoop cluster type aren't optimized for performance by default. This article describes some of the most common Hive performance optimization methods that you can apply to your queries.
 
 ## Scale out worker nodes
 
 Increasing the number of worker nodes in an HDInsight cluster allows the work to leverage more mappers and reducers to be run in parallel. There are two ways you can increase scale out in HDInsight:
 
 * At the time when you create a cluster, you can specify the number of worker nodes using the Azure portal, Azure PowerShell, or command-line interface. For more information, see [Create HDInsight clusters](hdinsight-hadoop-provision-linux-clusters.md). The following screenshot shows the worker node configuration on the Azure portal:
 
-![Azure portal cluster size nodes](./media/hdinsight-hadoop-optimize-hive-query/hdinsight-scaleout-1.png "scaleout_1")
+![Azure portal cluster size nodes](./media/hdinsight-hadoop-optimize-hive-query/azure-portal-cluster-configuration-pricing-hadoop.png "scaleout_1")
 
 * After creation, you can also edit the number of worker nodes to scale out a cluster further without recreating one:
 
@@ -40,8 +40,8 @@ For more information about scaling HDInsight, see [Scale HDInsight clusters](hdi
 
 Tez is faster because:
 
-* **Execute Directed Acyclic Graph (DAG) as a single job in the MapReduce engine**. The DAG requires each set of mappers to be followed by one set of reducers. This causes multiple MapReduce jobs to be spun off for each Hive query. Tez does not have such constraint and can process complex DAG as one job thus minimizing job startup overhead.
-* **Avoids unnecessary writes**. Multiple jobs are used to process the same Hive query in the MapReduce engine. The output of each MapReduce job is written to HDFS for intermediate data. Since Tez minimizes number of jobs for each Hive query, it is able to avoid unnecessary writes.
+* **Execute Directed Acyclic Graph (DAG) as a single job in the MapReduce engine**. The DAG requires each set of mappers to be followed by one set of reducers. This causes multiple MapReduce jobs to be spun off for each Hive query. Tez doesn't have such constraint and can process complex DAG as one job thus minimizing job startup overhead.
+* **Avoids unnecessary writes**. Multiple jobs are used to process the same Hive query in the MapReduce engine. The output of each MapReduce job is written to HDFS for intermediate data. Since Tez minimizes number of jobs for each Hive query, it's able to avoid unnecessary writes.
 * **Minimizes start-up delays**. Tez is better able to minimize start-up delay by reducing the number of mappers it needs to start and also improving optimization throughout.
 * **Reuses containers**. Whenever possible Tez is able to reuse containers to ensure that latency due to starting up containers is reduced.
 * **Continuous optimization techniques**. Traditionally optimization was done during compilation phase. However more information about the inputs is available that allow for better optimization during runtime. Tez uses continuous optimization techniques that allow it to optimize the plan further into the runtime phase.
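
As an illustrative sketch (not part of this commit's diff), switching a Hive session to the Tez engine takes one setting; `hive.execution.engine` is a standard Hive configuration property:

```hive
-- Run subsequent queries in this session on Tez instead of MapReduce
SET hive.execution.engine=tez;
```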
@@ -75,7 +75,7 @@ CREATE TABLE lineitem_part
 (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT,L_LINENUMBER INT,
 L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
 L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
-L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, 
+L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING,
 L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
 PARTITIONED BY(L_SHIPDATE STRING)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
@@ -88,34 +88,35 @@ Once the partitioned table is created, you can either create static partitioning
 
 ```sql
 INSERT OVERWRITE TABLE lineitem_part
-PARTITION (L_SHIPDATE = 5/23/1996 12:00:00 AM)
-SELECT * FROM lineitem 
-WHERE lineitem.L_SHIPDATE = 5/23/1996 12:00:00 AM
+PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
+SELECT * FROM lineitem
+WHERE lineitem.L_SHIPDATE = '5/23/1996 12:00:00 AM'
 
-ALTER TABLE lineitem_part ADD PARTITION (L_SHIPDATE = 5/23/1996 12:00:00 AM’))
-LOCATION wasb://sampledata@ignitedemo.blob.core.windows.net/partitions/5_23_1996/'
+ALTER TABLE lineitem_part ADD PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
+LOCATION 'wasb://sampledata@ignitedemo.blob.core.windows.net/partitions/5_23_1996/'
 ```
 
-* **Dynamic partitioning** means that you want Hive to create partitions automatically for you. Since you have already created the partitioning table from the staging table, all you need to do is insert data to the partitioned table:
+* **Dynamic partitioning** means that you want Hive to create partitions automatically for you. Since you've already created the partitioning table from the staging table, all you need to do is insert data to the partitioned table:
 
 ```hive
 SET hive.exec.dynamic.partition = true;
 SET hive.exec.dynamic.partition.mode = nonstrict;
 INSERT INTO TABLE lineitem_part
 PARTITION (L_SHIPDATE)
-SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY ,
+SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY,
 L_SUPPKEY as L_SUPPKEY, L_LINENUMBER as L_LINENUMBER,
 L_QUANTITY as L_QUANTITY, L_EXTENDEDPRICE as L_EXTENDEDPRICE,
 L_DISCOUNT as L_DISCOUNT, L_TAX as L_TAX, L_RETURNFLAG as L_RETURNFLAG,
 L_LINESTATUS as L_LINESTATUS, L_SHIPDATE as L_SHIPDATE_PS,
 L_COMMITDATE as L_COMMITDATE, L_RECEIPTDATE as L_RECEIPTDATE,
-L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE, 
+L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE,
 L_COMMENT as L_COMMENT, L_SHIPDATE as L_SHIPDATE FROM lineitem;
 ```
 
 For more information, see [Partitioned Tables](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PartitionedTables).
 
 ## Use the ORCFile format
+
 Hive supports different file formats. For example:
 
 * **Text**: the default file format and works with most scenarios.
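
As an illustrative sketch of what the partitioning above buys you (assuming the `lineitem_part` table from this hunk), a query that filters on the partition column scans only the matching partition, and `SHOW PARTITIONS` lists what Hive created:

```sql
-- Partition pruning: only the 5/23/1996 partition is read
SELECT COUNT(*) FROM lineitem_part
WHERE L_SHIPDATE = '5/23/1996 12:00:00 AM';

-- Verify which partitions exist
SHOW PARTITIONS lineitem_part;
```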
@@ -146,19 +147,19 @@ Next, you insert data to the ORC table from the staging table. For example:
 
 ```sql
 INSERT INTO TABLE lineitem_orc
-SELECT L_ORDERKEY as L_ORDERKEY, 
-L_PARTKEY as L_PARTKEY , 
+SELECT L_ORDERKEY as L_ORDERKEY,
+L_PARTKEY as L_PARTKEY ,
 L_SUPPKEY as L_SUPPKEY,
 L_LINENUMBER as L_LINENUMBER,
-L_QUANTITY as L_QUANTITY, 
+L_QUANTITY as L_QUANTITY,
 L_EXTENDEDPRICE as L_EXTENDEDPRICE,
 L_DISCOUNT as L_DISCOUNT,
 L_TAX as L_TAX,
 L_RETURNFLAG as L_RETURNFLAG,
 L_LINESTATUS as L_LINESTATUS,
 L_SHIPDATE as L_SHIPDATE,
 L_COMMITDATE as L_COMMITDATE,
-L_RECEIPTDATE as L_RECEIPTDATE, 
+L_RECEIPTDATE as L_RECEIPTDATE,
 L_SHIPINSTRUCT as L_SHIPINSTRUCT,
 L_SHIPMODE as L_SHIPMODE,
 L_COMMENT as L_COMMENT
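
The `CREATE TABLE` statement for `lineitem_orc` falls outside this hunk's context window; as a hypothetical sketch (assuming the same lineitem columns as the staging table, abbreviated to four here), an ORC-backed table only needs a `STORED AS ORC` clause:

```sql
-- Hypothetical, abbreviated definition of the ORC target table
CREATE TABLE lineitem_orc
(L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT)
STORED AS ORC;
```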