articles/hdinsight/hadoop/hdinsight-use-hive.md

---
title: What is Apache Hive and HiveQL - Azure HDInsight
description: Apache Hive is a data warehouse system for Apache Hadoop. You can query data stored in Hive using HiveQL, which is similar to Transact-SQL. In this document, learn how to use Hive and HiveQL with Azure HDInsight.
keywords: hiveql,what is hive,hadoop hiveql,how to use hive,learn hive
author: hrasheed-msft
ms.author: hrasheed
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.custom: hdinsightactive,hdiseo17may2017
ms.date: 02/28/2020
---

# What is Apache Hive and HiveQL on Azure HDInsight?

Apache Hive is a data warehouse system for Apache Hadoop. You can query data stored in Hive using HiveQL, which is similar to Transact-SQL.

Hive allows you to project structure on largely unstructured data. After you define the structure, you can use HiveQL to query the data without knowledge of Java or MapReduce.

HDInsight provides several cluster types, which are tuned for specific workloads. The following cluster types are most often used for Hive queries:

|Cluster type|Description|
|---|---|
|Interactive Query|A Hadoop cluster that provides [Low Latency Analytical Processing (LLAP)](https://cwiki.apache.org/confluence/display/Hive/LLAP) functionality to improve response times for interactive queries. For more information, see the [Start with Interactive Query in HDInsight](../interactive-query/apache-interactive-query-get-started.md) document.|
|Hadoop|A Hadoop cluster that is tuned for batch processing workloads. For more information, see the [Start with Apache Hadoop in HDInsight](../hadoop/apache-hadoop-linux-tutorial-get-started.md) document.|
|Spark|Apache Spark has built-in functionality for working with Hive. For more information, see the [Start with Apache Spark on HDInsight](../spark/apache-spark-jupyter-spark-sql.md) document.|
|HBase|HiveQL can be used to query data stored in Apache HBase. For more information, see the [Start with Apache HBase on HDInsight](../hbase/apache-hbase-tutorial-get-started-linux.md) document.|

## How to use Hive

There are two types of tables that you can create with Hive: internal tables, which Hive manages completely, and external tables, which only store the table definition while the data remains in its original location.

Use external tables when one of the following conditions applies:

* The data is also used outside of Hive. For example, the data files are updated by another process (that doesn't lock the files.)
* Data needs to remain in the underlying location, even after dropping the table.
* You need a custom location, such as a non-default storage account.
* A program other than Hive manages the data format, location, and so on.

For more information, see the [Hive Internal and External Tables Intro](https://blogs.msdn.microsoft.com/cindygross/2013/02/05/hdinsight-hive-internal-and-external-tables-intro/) blog post.

Hive can also be extended through **user-defined functions (UDF)**. A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL. For an example, see the following document:

* [An example Apache Hive user-defined function to convert date/time formats to Hive timestamp](https://github.com/Azure-Samples/hdinsight-java-hive-udf)
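
Once a UDF jar is built, registering and invoking it from HiveQL follows a standard pattern. The sketch below is illustrative only: the jar path, function name, class name, and table are placeholders, not taken from the linked sample.

```hql
-- Illustrative names: the jar path, class, function, and table are placeholders.
ADD JAR wasbs:///example/jars/ExampleUDF.jar;
CREATE TEMPORARY FUNCTION convert_date AS 'com.example.hive.udf.ExampleUDF';
SELECT convert_date(someColumn) FROM someTable LIMIT 5;
```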

## Example data

Hive on HDInsight comes pre-loaded with an internal table named `hivesampletable`. HDInsight also provides example data sets that can be used with Hive. These data sets are stored in the `/example/data` and `/HdiSamples` directories. These directories exist in the default storage for your cluster.
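
For example, the pre-loaded table can be queried directly; the following is a minimal illustration:

```hql
-- hivesampletable ships with HDInsight clusters.
SELECT * FROM hivesampletable LIMIT 10;
```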

## Example Hive query

The following HiveQL statements project columns onto the `/example/data/sample.log` file:
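
A minimal sketch of such statements, assuming seven space-delimited string columns `t1` through `t7` and the table name `log4jLogs` used in the discussion that follows:

```hql
-- Drop any existing table, then define an external table over /example/data.
DROP TABLE log4jLogs;
CREATE EXTERNAL TABLE log4jLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/example/data/';
-- Count [ERROR] rows, reading only the .log files in the directory.
SELECT t4 AS sev, COUNT(*) AS count FROM log4jLogs
    WHERE t4 = '[ERROR]' AND INPUT__FILE__NAME LIKE '%.log'
    GROUP BY t4;
```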
In the previous example, the HiveQL statements perform the following actions:

|Statement|Description|
|---|---|
|DROP TABLE|If the table already exists, delete it.|
|CREATE EXTERNAL TABLE|Creates a new **external** table in Hive. External tables only store the table definition in Hive. The data is left in the original location and in the original format.|
|ROW FORMAT|Tells Hive how the data is formatted. In this case, the fields in each log are separated by a space.|
|STORED AS TEXTFILE LOCATION|Tells Hive where the data is stored (the `example/data` directory) and that it's stored as text. The data can be in one file or spread across multiple files within the directory.|
|SELECT|Selects a count of all rows where the column **t4** contains the value **[ERROR]**. This statement returns a value of **3** because there are three rows that contain this value.|
|INPUT__FILE__NAME LIKE '%.log'|Hive attempts to apply the schema to all files in the directory. In this case, the directory contains files that don't match the schema. To prevent garbage data in the results, this statement tells Hive that we should only return data from files ending in .log.|

> [!NOTE]
> External tables should be used when you expect the underlying data to be updated by an external source. For example, an automated data upload process, or MapReduce operation.
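
The results of a query like this can also be captured in an internal, ORC-backed table. A minimal sketch, assuming the `log4jLogs` table above and an illustrative `errorLogs` table:

```hql
-- No EXTERNAL keyword, so this creates an internal, Hive-managed table.
CREATE TABLE IF NOT EXISTS errorLogs (
    t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
STORED AS ORC;
-- Copy the [ERROR] rows from the external table into the internal one.
INSERT OVERWRITE TABLE errorLogs
SELECT t1, t2, t3, t4, t5, t6, t7
    FROM log4jLogs WHERE t4 = '[ERROR]';
```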

These statements perform the following actions:

|Statement|Description|
|---|---|
|CREATE TABLE IF NOT EXISTS|If the table doesn't exist, create it. Because the **EXTERNAL** keyword isn't used, this statement creates an internal table. The table is stored in the Hive data warehouse and is managed completely by Hive.|
|STORED AS ORC|Stores the data in Optimized Row Columnar (ORC) format. ORC is a highly optimized and efficient format for storing Hive data.|
|INSERT OVERWRITE ... SELECT|Selects rows from the **log4jLogs** table that contain **[ERROR]**, and then inserts the data into the **errorLogs** table.|

> [!NOTE]
> Unlike external tables, dropping an internal table also deletes the underlying data.

## Improve Hive query performance

### Apache Tez

[Apache Tez](https://tez.apache.org) is a framework that allows data-intensive applications, such as Hive, to run much more efficiently at scale. Tez is enabled by default. The [Apache Hive on Tez design documents](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez) contain details about the implementation choices and tuning configurations.
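
Although Tez is already the default on these clusters, a session can make the engine choice explicit through the standard `hive.execution.engine` property; a minimal illustration:

```hql
-- Tez is the default on HDInsight; this setting simply makes the choice explicit.
set hive.execution.engine=tez;
```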