Commit 4407a52

Merge pull request #106099 from dagiro/freshness8

freshness8

2 parents cd0b288 + 4652eec

File tree

1 file changed (+26 -31 lines)


articles/hdinsight/hadoop/hdinsight-use-hive.md

Lines changed: 26 additions & 31 deletions
@@ -1,14 +1,13 @@
 ---
 title: What is Apache Hive and HiveQL - Azure HDInsight
 description: Apache Hive is a data warehouse system for Apache Hadoop. You can query data stored in Hive using HiveQL, which similar to Transact-SQL. In this document, learn how to use Hive and HiveQL with Azure HDInsight.
-keywords: hiveql,what is hive,hadoop hiveql,how to use hive,learn hive,what is hive
 author: hrasheed-msft
 ms.author: hrasheed
 ms.reviewer: jasonh
 ms.service: hdinsight
-ms.custom: hdinsightactive,hdiseo17may2017
 ms.topic: conceptual
-ms.date: 10/04/2019
+ms.custom: hdinsightactive,hdiseo17may2017
+ms.date: 02/28/2020
 ---
 
 # What is Apache Hive and HiveQL on Azure HDInsight?
@@ -19,13 +18,12 @@ Hive allows you to project structure on largely structured data. After you defin
 
 HDInsight provides several cluster types, which are tuned for specific workloads. The following cluster types are most often used for Hive queries:
 
-* __Interactive Query__: A Hadoop cluster that provides [Low Latency Analytical Processing (LLAP)](https://cwiki.apache.org/confluence/display/Hive/LLAP) functionality to improve response times for interactive queries. For more information, see the [Start with Interactive Query in HDInsight](../interactive-query/apache-interactive-query-get-started.md) document.
-
-* __Hadoop__: A Hadoop cluster that is tuned for batch processing workloads. For more information, see the [Start with Apache Hadoop in HDInsight](../hadoop/apache-hadoop-linux-tutorial-get-started.md) document.
-
-* __Spark__: Apache Spark has built-in functionality for working with Hive. For more information, see the [Start with Apache Spark on HDInsight](../spark/apache-spark-jupyter-spark-sql.md) document.
-
-* __HBase__: HiveQL can be used to query data stored in Apache HBase. For more information, see the [Start with Apache HBase on HDInsight](../hbase/apache-hbase-tutorial-get-started-linux.md) document.
+|Cluster type |Description|
+|---|---|
+|Interactive Query|A Hadoop cluster that provides [Low Latency Analytical Processing (LLAP)](https://cwiki.apache.org/confluence/display/Hive/LLAP) functionality to improve response times for interactive queries. For more information, see the [Start with Interactive Query in HDInsight](../interactive-query/apache-interactive-query-get-started.md) document.|
+|Hadoop|A Hadoop cluster that is tuned for batch processing workloads. For more information, see the [Start with Apache Hadoop in HDInsight](../hadoop/apache-hadoop-linux-tutorial-get-started.md) document.|
+|Spark|Apache Spark has built-in functionality for working with Hive. For more information, see the [Start with Apache Spark on HDInsight](../spark/apache-spark-jupyter-spark-sql.md) document.|
+|HBase|HiveQL can be used to query data stored in Apache HBase. For more information, see the [Start with Apache HBase on HDInsight](../hbase/apache-hbase-tutorial-get-started-linux.md) document.|
 
 ## How to use Hive
 
@@ -80,10 +78,10 @@ There are two types of tables that you can create with Hive:
 
 Use external tables when one of the following conditions apply:
 
-* The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
+* The data is also used outside of Hive. For example, the data files are updated by another process (that doesn't lock the files.)
 * Data needs to remain in the underlying location, even after dropping the table.
 * You need a custom location, such as a non-default storage account.
-* A program other than hive manages the data format, location, etc.
+* A program other than hive manages the data format, location, and so on.
 
 For more information, see the [Hive Internal and External Tables Intro](https://blogs.msdn.microsoft.com/cindygross/2013/02/05/hdinsight-hive-internal-and-external-tables-intro/) blog post.
 
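The external-versus-internal distinction in this hunk comes down to who owns the data. A minimal HiveQL sketch of the difference (the table names here are hypothetical, chosen only for illustration):

```hql
-- External table: Hive stores only the table definition; the files stay put.
CREATE EXTERNAL TABLE rawlogs (line string)
STORED AS TEXTFILE LOCATION '/example/data/';

-- Internal (managed) table: Hive owns both the definition and the data.
CREATE TABLE stagedlogs (line string);

DROP TABLE rawlogs;     -- definition removed; files under /example/data/ remain
DROP TABLE stagedlogs;  -- definition AND underlying data are deleted
```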
@@ -101,11 +99,11 @@ Hive can also be extended through **user-defined functions (UDF)**. A UDF allows
 
 * [An example Apache Hive user-defined function to convert date/time formats to Hive timestamp](https://github.com/Azure-Samples/hdinsight-java-hive-udf)
 
-## <a id="data"></a>Example data
+## Example data
 
 Hive on HDInsight comes pre-loaded with an internal table named `hivesampletable`. HDInsight also provides example data sets that can be used with Hive. These data sets are stored in the `/example/data` and `/HdiSamples` directories. These directories exist in the default storage for your cluster.
 
-## <a id="job"></a>Example Hive query
+## Example Hive query
 
 The following HiveQL statements project columns onto the `/example/data/sample.log` file:
 
@@ -128,17 +126,14 @@ SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
 
 In the previous example, the HiveQL statements perform the following actions:
 
-* `DROP TABLE`: If the table already exists, delete it.
-
-* `CREATE EXTERNAL TABLE`: Creates a new **external** table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format.
-
-* `ROW FORMAT`: Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.
-
-* `STORED AS TEXTFILE LOCATION`: Tells Hive where the data is stored (the `example/data` directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory.
-
-* `SELECT`: Selects a count of all rows where the column **t4** contains the value **[ERROR]**. This statement returns a value of **3** because there are three rows that contain this value.
-
-* `INPUT__FILE__NAME LIKE '%.log'` - Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that we should only return data from files ending in .log.
+|Statement |Description |
+|---|---|
+|DROP TABLE|If the table already exists, delete it.|
+|CREATE EXTERNAL TABLE|Creates a new **external** table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format.|
+|ROW FORMAT|Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.|
+|STORED AS TEXTFILE LOCATION|Tells Hive where the data is stored (the `example/data` directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory.|
+|SELECT|Selects a count of all rows where the column **t4** contains the value **[ERROR]**. This statement returns a value of **3** because there are three rows that contain this value.|
+|INPUT__FILE__NAME LIKE '%.log'|Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that we should only return data from files ending in .log.|
 
 > [!NOTE]
 > External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process, or MapReduce operation.
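For orientation, statements matching these descriptions could look like the following sketch. The article's actual code block falls outside this diff's context lines, so this is a reconstruction from the hunk headers and the descriptions above, not the file's verbatim content:

```hql
DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';
-- Count [ERROR] rows, restricted to .log files so mismatched files are skipped
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log'
    GROUP BY t4;
```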
@@ -164,18 +159,18 @@ SELECT t1, t2, t3, t4, t5, t6, t7
 
 These statements perform the following actions:
 
-* `CREATE TABLE IF NOT EXISTS`: If the table does not exist, create it. Because the **EXTERNAL** keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive.
-
-* `STORED AS ORC`: Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.
-
-* `INSERT OVERWRITE ... SELECT`: Selects rows from the **log4jLogs** table that contains **[ERROR]**, and then inserts the data into the **errorLogs** table.
+|Statement |Description |
+|---|---|
+|CREATE TABLE IF NOT EXISTS|If the table doesn't exist, create it. Because the **EXTERNAL** keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive.|
+|STORED AS ORC|Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.|
+|INSERT OVERWRITE ... SELECT|Selects rows from the **log4jLogs** table that contains **[ERROR]**, and then inserts the data into the **errorLogs** table.|
 
 > [!NOTE]
 > Unlike external tables, dropping an internal table also deletes the underlying data.
 
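The internal-table statements summarized in this hunk could be sketched as follows (reconstructed from the descriptions and the `SELECT t1, t2, ...` hunk header; the file's verbatim block is outside the diff context):

```hql
-- Internal (managed) table stored in ORC format
CREATE TABLE IF NOT EXISTS errorLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
STORED AS ORC;
-- Copy only the [ERROR] rows from the external table into the managed one
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7
    FROM log4jLogs WHERE t4 = '[ERROR]';
```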
 ## Improve Hive query performance
 
-### <a id="usetez"></a>Apache Tez
+### Apache Tez
 
 [Apache Tez](https://tez.apache.org) is a framework that allows data intensive applications, such as Hive, to run much more efficiently at scale. Tez is enabled by default. The [Apache Hive on Tez design documents](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez) contains details about the implementation choices and tuning configurations.
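Since Tez is enabled by default, no action is normally needed. For completeness, the execution engine is controlled by the standard Hive setting `hive.execution.engine`, which can be set per session:

```hql
-- Select the execution engine for the current session
-- (already the default on HDInsight Hadoop clusters).
set hive.execution.engine=tez;
```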
