articles/hdinsight/hadoop/hdinsight-use-hive.md

---
title: What is Apache Hive and HiveQL - Azure HDInsight
description: Apache Hive is a data warehouse system for Apache Hadoop. You can query data stored in Hive using HiveQL, which is similar to Transact-SQL. In this document, learn how to use Hive and HiveQL with Azure HDInsight.
keywords: hiveql,what is hive,hadoop hiveql,how to use hive,learn hive
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 02/28/2020
---

# What is Apache Hive and HiveQL on Azure HDInsight?

Apache Hive is a data warehouse system for Apache Hadoop. You can query data stored in Hive using HiveQL, which is similar to Transact-SQL.

Hive allows you to project structure on largely unstructured data. After you define the structure, you can use HiveQL to query the data without knowledge of Java or MapReduce.

HDInsight provides several cluster types, which are tuned for specific workloads. The following cluster types are most often used for Hive queries:

|Cluster type|Description|
|---|---|
|Interactive Query|A Hadoop cluster that provides [Low Latency Analytical Processing (LLAP)](https://cwiki.apache.org/confluence/display/Hive/LLAP) functionality to improve response times for interactive queries. For more information, see the [Start with Interactive Query in HDInsight](../interactive-query/apache-interactive-query-get-started.md) document.|
|Hadoop|A Hadoop cluster that is tuned for batch processing workloads. For more information, see the [Start with Apache Hadoop in HDInsight](../hadoop/apache-hadoop-linux-tutorial-get-started.md) document.|
|Spark|Apache Spark has built-in functionality for working with Hive. For more information, see the [Start with Apache Spark on HDInsight](../spark/apache-spark-jupyter-spark-sql.md) document.|
|HBase|HiveQL can be used to query data stored in Apache HBase. For more information, see the [Start with Apache HBase on HDInsight](../hbase/apache-hbase-tutorial-get-started-linux.md) document.|

## How to use Hive

There are two types of tables that you can create with Hive: internal tables, which Hive manages completely, and external tables, which only store the table definition while the data remains in its original location.

Use external tables when one of the following conditions applies:

* The data is also used outside of Hive. For example, the data files are updated by another process (that doesn't lock the files.)
* Data needs to remain in the underlying location, even after dropping the table.
* You need a custom location, such as a non-default storage account.
* A program other than Hive manages the data format, location, and so on.

For more information, see the [Hive Internal and External Tables Intro](https://blogs.msdn.microsoft.com/cindygross/2013/02/05/hdinsight-hive-internal-and-external-tables-intro/) blog post.

Hive can also be extended through **user-defined functions (UDF)**. A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL. For an example, see the following document:

* [An example Apache Hive user-defined function to convert date/time formats to Hive timestamp](https://github.com/Azure-Samples/hdinsight-java-hive-udf)
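
Once a UDF jar is built, registering and invoking it from HiveQL follows a standard pattern. The sketch below is illustrative only: the jar path, function name, class name, and table are placeholders, not taken from the linked sample.

```hql
-- Illustrative names: the jar path, class, function, and table are placeholders.
ADD JAR wasbs:///example/jars/ExampleUDF.jar;
CREATE TEMPORARY FUNCTION convert_date AS 'com.example.hive.udf.ExampleUDF';
SELECT convert_date(someColumn) FROM someTable LIMIT 5;
```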

## Example data

Hive on HDInsight comes pre-loaded with an internal table named `hivesampletable`. HDInsight also provides example data sets that can be used with Hive. These data sets are stored in the `/example/data` and `/HdiSamples` directories. These directories exist in the default storage for your cluster.
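
For example, the pre-loaded table can be queried directly; the following is a minimal illustration:

```hql
-- hivesampletable ships with HDInsight clusters.
SELECT * FROM hivesampletable LIMIT 10;
```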

## Example Hive query

The following HiveQL statements project columns onto the `/example/data/sample.log` file:
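
A minimal sketch of such statements, assuming seven space-delimited string columns `t1` through `t7` and the table name `log4jLogs` used in the discussion that follows:

```hql
-- Drop any existing table, then define an external table over /example/data.
DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';
-- Count [ERROR] rows, reading only the .log files in the directory.
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log'
    GROUP BY t4;
```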
In the previous example, the HiveQL statements perform the following actions:

|Statement|Description|
|---|---|
|DROP TABLE|If the table already exists, delete it.|
|CREATE EXTERNAL TABLE|Creates a new **external** table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format.|
|ROW FORMAT|Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.|
|STORED AS TEXTFILE LOCATION|Tells Hive where the data is stored (the `example/data` directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory.|
|SELECT|Selects a count of all rows where the column **t4** contains the value **[ERROR]**. This statement returns a value of **3** because there are three rows that contain this value.|
|INPUT__FILE__NAME LIKE '%.log'|Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that we should only return data from files ending in .log.|

> [!NOTE]
> External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process, or MapReduce operation.
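
The results of a query like this can also be captured in an internal, ORC-backed table. A minimal sketch, assuming the `log4jLogs` table above and an illustrative `errorLogs` table:

```hql
-- No EXTERNAL keyword, so this creates an internal, Hive-managed table.
CREATE TABLE IF NOT EXISTS errorLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
STORED AS ORC;
-- Copy the [ERROR] rows from the external table into the internal one.
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7
    FROM log4jLogs WHERE t4 = '[ERROR]';
```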

These statements perform the following actions:

|Statement|Description|
|---|---|
|CREATE TABLE IF NOT EXISTS|If the table doesn't exist, create it. Because the **EXTERNAL** keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive.|
|STORED AS ORC|Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.|
|INSERT OVERWRITE ... SELECT|Selects rows from the **log4jLogs** table that contain **[ERROR]**, and then inserts the data into the **errorLogs** table.|

> [!NOTE]
> Unlike external tables, dropping an internal table also deletes the underlying data.

## Improve Hive query performance

### Apache Tez

[Apache Tez](https://tez.apache.org) is a framework that allows data-intensive applications, such as Hive, to run much more efficiently at scale. Tez is enabled by default. The [Apache Hive on Tez design documents](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez) contain details about the implementation choices and tuning configurations.
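
Although Tez is already the default on these clusters, a session can make the engine choice explicit through the standard `hive.execution.engine` property; a minimal illustration:

```hql
-- Tez is the default on HDInsight; this setting simply makes the choice explicit.
set hive.execution.engine=tez;
```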