
Commit 42b24a4

Merge pull request #113660 from nis-goel/nisgoelwiki
Update Hive Warehouse Connector documentation
2 parents 9f274ca + 72e4044 commit 42b24a4

4 files changed: +378 -142 lines changed


articles/hdinsight/TOC.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -762,8 +762,12 @@
   href: ./hadoop/apache-hadoop-hive-pig-udf-dotnet-csharp.md
 - name: Use Python with Apache Hive and Apache Pig
   href: ./hadoop/python-udf-hdinsight.md
-- name: Apache Hive with Apache Spark
+- name: HWC integration with Apache Spark and Apache Hive
   href: ./interactive-query/apache-hive-warehouse-connector.md
+- name: HWC and Apache Spark operations
+  href: ./interactive-query/apache-hive-warehouse-connector-operations.md
+- name: HWC integration with Apache Zeppelin
+  href: ./interactive-query/apache-hive-warehouse-connector-zeppelin.md
 - name: Apache Hive with Hadoop
   href: ./hadoop/hdinsight-use-hive.md
 - name: Use the Apache Hive View
```

articles/hdinsight/interactive-query/apache-hive-warehouse-connector-operations.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Apache Spark operations supported by Hive Warehouse Connector in Azure HDInsight
description: Learn about the different capabilities of Hive Warehouse Connector on Azure HDInsight.
author: nis-goel
ms.author: nisgoel
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 05/22/2020
---

# Apache Spark operations supported by Hive Warehouse Connector in Azure HDInsight

This article shows Spark-based operations supported by Hive Warehouse Connector (HWC). All of the examples shown below are executed through the Apache Spark shell.

## Prerequisite

Complete the [Hive Warehouse Connector setup](./apache-hive-warehouse-connector.md#hive-warehouse-connector-setup) steps.

## Getting started

To start a spark-shell session, do the following steps:

1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From your ssh session, execute the following command to note the `hive-warehouse-connector-assembly` version:

    ```bash
    ls /usr/hdp/current/hive_warehouse_connector
    ```

1. Edit the code below with the `hive-warehouse-connector-assembly` version identified above. Then execute the command to start the spark shell:

    ```bash
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<STACK_VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```

1. After you start spark-shell, a Hive Warehouse Connector instance can be started using the following commands:

    ```scala
    import com.hortonworks.hwc.HiveWarehouseSession
    val hive = HiveWarehouseSession.session(spark).build()
    ```

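
Optionally, run a quick check that the new session can reach HiveServer2 Interactive. This is a minimal sketch that assumes the `hive` session built in the previous step; `showDatabases()` is part of the HWC API and returns the visible databases as a DataFrame.

```scala
// Optional sanity check: list the databases visible through the connector.
hive.showDatabases().show()
```
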
## Creating Spark DataFrames using Hive queries

The results of all queries using the HWC library are returned as a DataFrame. The following example demonstrates how to run a basic Hive query.

```scala
hive.setDatabase("default")
val df = hive.executeQuery("select * from hivesampletable")
df.filter("state = 'Colorado'").show()
```

The results of the query are Spark DataFrames, which can be used with Spark libraries like MLlib and Spark SQL.

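
To illustrate that interoperability, the following sketch registers the DataFrame returned above as a temporary view and queries it with Spark SQL. The view name `hivesampletable_view` is only illustrative; `df` is the DataFrame from the previous example.

```scala
// Register the HWC result as a temporary view so it can be queried with Spark SQL.
df.createOrReplaceTempView("hivesampletable_view")

// Standard Spark SQL over the registered view; the result is again a Spark DataFrame.
spark.sql("select state, count(*) as cnt from hivesampletable_view group by state").show()
```
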
## Writing out Spark DataFrames to Hive tables

Spark doesn't natively support writing to Hive's managed ACID tables. However, using HWC, you can write out any DataFrame to a Hive table. You can see this functionality at work in the following example:

1. Create a table called `sampletable_colorado` and specify its columns using the following command:

    ```scala
    hive.createTable("sampletable_colorado").column("clientid","string").column("querytime","string").column("market","string").column("deviceplatform","string").column("devicemake","string").column("devicemodel","string").column("state","string").column("country","string").column("querydwelltime","double").column("sessionid","bigint").column("sessionpagevieworder","bigint").create()
    ```

1. Filter the table `hivesampletable` where the column `state` equals `Colorado`. This Hive query returns a Spark DataFrame that is then saved in the Hive table `sampletable_colorado` using the `write` function.

    ```scala
    hive.table("hivesampletable").filter("state = 'Colorado'").write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("append").option("table","sampletable_colorado").save()
    ```

1. View the results with the following command:

    ```scala
    hive.table("sampletable_colorado").show()
    ```

    ![hive warehouse connector show hive table](./media/apache-hive-warehouse-connector/hive-warehouse-connector-show-hive-table.png)

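
As a side note on the `format` string used above: after `import com.hortonworks.hwc.HiveWarehouseSession._`, HWC also exposes a `HIVE_WAREHOUSE_CONNECTOR` constant that resolves to the same data source name. The following is a sketch under that assumption, reusing the session and tables from the previous steps:

```scala
// Assumes HIVE_WAREHOUSE_CONNECTOR is brought into scope by this import.
import com.hortonworks.hwc.HiveWarehouseSession._

// Same append as above, written with the constant instead of the fully qualified class name.
hive.table("hivesampletable").filter("state = 'Colorado'").write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table","sampletable_colorado").save()
```
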
## Structured streaming writes

Using Hive Warehouse Connector, you can use Spark streaming to write data into Hive tables.

> [!IMPORTANT]
> Structured streaming writes aren't supported in ESP-enabled Spark 4.0 clusters.

Follow the steps below to ingest data from a Spark stream on localhost port 9999 into a Hive table via Hive Warehouse Connector.

1. From your open Spark shell, begin a Spark stream with the following command:

    ```scala
    val lines = spark.readStream.format("socket").option("host", "localhost").option("port",9999).load()
    ```

1. Generate data for the Spark stream that you created by doing the following steps:
    1. Open a second SSH session on the same Spark cluster.
    1. At the command prompt, type `nc -lk 9999`. This command uses the netcat utility to send data from the command line to the specified port.

1. Return to the first SSH session and create a new Hive table to hold the streaming data. At the spark-shell, enter the following command:

    ```scala
    hive.createTable("stream_table").column("value","string").create()
    ```

1. Then write the streaming data to the newly created table using the following command (a sketch for stopping the query cleanly follows these steps):

    ```scala
    lines.filter("value = 'HiveSpark'").writeStream.format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource").option("database", "default").option("table","stream_table").option("metastoreUri",spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")).option("checkpointLocation","/tmp/checkpoint1").start()
    ```

    > [!IMPORTANT]
    > The `metastoreUri` and `database` options must currently be set manually due to a known issue in Apache Spark. For more information about this issue, see [SPARK-25460](https://issues.apache.org/jira/browse/SPARK-25460).

1. Return to the second SSH session and enter the following values:

    ```bash
    foo
    HiveSpark
    bar
    ```

1. Return to the first SSH session and note the brief activity. Use the following command to view the data:

    ```scala
    hive.table("stream_table").show()
    ```

Use **Ctrl + C** to stop netcat on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.

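
To stop the streaming write from inside spark-shell instead of exiting the shell, capture the handle that `start()` returns. This is a minimal sketch assuming the same `lines` stream and options as the step above; `StreamingQuery.stop()` is standard Spark Structured Streaming API.

```scala
// start() returns a StreamingQuery handle; keeping it lets you stop the write cleanly later.
val query = lines.filter("value = 'HiveSpark'").writeStream.format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource").option("database", "default").option("table","stream_table").option("metastoreUri",spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")).option("checkpointLocation","/tmp/checkpoint1").start()

// ...once you're done sending test data from the second SSH session:
query.stop()
```
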
## Next steps

* [HWC integration with Apache Spark and Apache Hive](./apache-hive-warehouse-connector.md)
* [Use Interactive Query with HDInsight](./apache-interactive-query-get-started.md)
* [HWC integration with Apache Zeppelin](./apache-hive-warehouse-connector-zeppelin.md)

articles/hdinsight/interactive-query/apache-hive-warehouse-connector-zeppelin.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: Hive Warehouse Connector - Apache Zeppelin using Livy - Azure HDInsight
description: Learn how to integrate Hive Warehouse Connector with Apache Zeppelin on Azure HDInsight.
author: nis-goel
ms.author: nisgoel
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 05/22/2020
---

# Integrate Apache Zeppelin with Hive Warehouse Connector in Azure HDInsight

HDInsight Spark clusters include Apache Zeppelin notebooks with different interpreters. In this article, we'll focus only on the Livy interpreter to access Hive tables from Spark using Hive Warehouse Connector.

## Prerequisite

Complete the [Hive Warehouse Connector setup](apache-hive-warehouse-connector.md#hive-warehouse-connector-setup) steps.

## Getting started

1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From your ssh session, execute the following command to note the versions for `hive-warehouse-connector-assembly` and `pyspark_hwc`:

    ```bash
    ls /usr/hdp/current/hive_warehouse_connector
    ```

    Save the output for later use when configuring Apache Zeppelin.

## Configure Livy

The following configurations are required to access Hive tables from Zeppelin with the Livy interpreter.

### Interactive Query Cluster

1. From a web browser, navigate to `https://LLAPCLUSTERNAME.azurehdinsight.net/#/main/services/HDFS/configs` where LLAPCLUSTERNAME is the name of your Interactive Query cluster.

1. Navigate to **Advanced** > **Custom core-site**. Select **Add Property...** to add the following configurations:

    | Configuration                | Value |
    |------------------------------|-------|
    | hadoop.proxyuser.livy.groups | *     |
    | hadoop.proxyuser.livy.hosts  | *     |

1. Save changes and restart all affected components.

### Spark Cluster

1. From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/#/main/services/SPARK2/configs` where CLUSTERNAME is the name of your Apache Spark cluster.

1. Expand **Custom livy2-conf**. Select **Add Property...** to add the following configuration:

    | Configuration                 | Value                                      |
    |-------------------------------|--------------------------------------------|
    | livy.file.local-dir-whitelist | /usr/hdp/current/hive_warehouse_connector/ |

1. Save changes and restart all affected components.

### Configure Livy Interpreter in Zeppelin UI (Spark Cluster)

1. From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/zeppelin/#/interpreter`, where `CLUSTERNAME` is the name of your Apache Spark cluster.

1. Navigate to **livy2**.

1. Add the following configurations:

    | Configuration | Value |
    |---|---|
    | livy.spark.hadoop.hive.llap.daemon.service.hosts | @llap0 |
    | livy.spark.security.credentials.hiveserver2.enabled | true |
    | livy.spark.sql.hive.llap | true |
    | livy.spark.yarn.security.credentials.hiveserver2.enabled | true |
    | livy.superusers | livy,zeppelin |
    | livy.spark.jars | `file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-VERSION.jar`.<br>Replace VERSION with the value you obtained from [Getting started](#getting-started), earlier. |
    | livy.spark.submit.pyFiles | `file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-VERSION.zip`.<br>Replace VERSION with the value you obtained from [Getting started](#getting-started), earlier. |
    | livy.spark.sql.hive.hiveserver2.jdbc.url | Set it to the HiveServer2 Interactive JDBC URL of the Interactive Query cluster. |
    | spark.security.credentials.hiveserver2.enabled | true |

1. For ESP clusters only, add the following configuration:

    | Configuration | Value |
    |---|---|
    | livy.spark.sql.hive.hiveserver2.jdbc.url.principal | `hive/<headnode-FQDN>@<AAD-DOMAIN>` |

    Replace `<headnode-FQDN>` with the fully qualified domain name of the head node of the Interactive Query cluster. Replace `<AAD-DOMAIN>` with the name of the Azure Active Directory (AAD) domain that the cluster is joined to. Use an uppercase string for the `<AAD-DOMAIN>` value; otherwise the credential won't be found. Check `/etc/krb5.conf` for the realm names if needed.

1. Save the changes and restart the Livy interpreter.

If the Livy interpreter isn't accessible, modify the `shiro.ini` file present within the Zeppelin component in Ambari. For more information, see [Configuring Apache Zeppelin Security](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-zeppelin-security/content/enabling_access_control_for_interpreter__configuration__and_credential_settings.html).

## Running Queries in Zeppelin

Launch a Zeppelin notebook using the Livy interpreter and execute the following code:

```scala
%livy2

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
import org.apache.spark.sql.SaveMode

// Initialize the Hive Warehouse Connector session
val hive = HiveWarehouseSession.session(spark).build()

// Create a database
hive.createDatabase("hwc_db", true)
hive.setDatabase("hwc_db")

// Create a Hive table
hive.createTable("testers").ifNotExists().column("id", "bigint").column("name", "string").create()

val dataDF = Seq( (1, "foo"), (2, "bar"), (8, "john")).toDF("id", "name")

// Validate writes to the table
dataDF.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("append").option("table", "hwc_db.testers").save()

// Validate reads
hive.executeQuery("select * from testers").show()
```
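
When you're done experimenting, you can optionally drop the test objects from the same notebook. This is a minimal sketch that assumes the `hwc_db` database and `testers` table created above; in the HWC API, `dropTable` takes (name, ifExists, purge) and `dropDatabase` takes (name, ifExists, cascade).

```scala
%livy2

// Clean up the test table and database created in the example above.
hive.dropTable("testers", true, true)
hive.dropDatabase("hwc_db", true, true)
```
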

## Next steps

* [HWC and Apache Spark operations](./apache-hive-warehouse-connector-operations.md)
* [HWC integration with Apache Spark and Apache Hive](./apache-hive-warehouse-connector.md)
* [Use Interactive Query with HDInsight](./apache-interactive-query-get-started.md)
