
Commit 42b24a4

Merge pull request #113660 from nis-goel/nisgoelwiki
Update Hive Warehouse Connector documentation
2 parents 9f274ca + 72e4044 commit 42b24a4

4 files changed: +378 -142 lines changed


articles/hdinsight/TOC.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -762,8 +762,12 @@
   href: ./hadoop/apache-hadoop-hive-pig-udf-dotnet-csharp.md
 - name: Use Python with Apache Hive and Apache Pig
   href: ./hadoop/python-udf-hdinsight.md
-- name: Apache Hive with Apache Spark
+- name: HWC integration with Apache Spark and Apache Hive
   href: ./interactive-query/apache-hive-warehouse-connector.md
+- name: HWC and Apache Spark operations
+  href: ./interactive-query/apache-hive-warehouse-connector-operations.md
+- name: HWC integration with Apache Zeppelin
+  href: ./interactive-query/apache-hive-warehouse-connector-zeppelin.md
 - name: Apache Hive with Hadoop
   href: ./hadoop/hdinsight-use-hive.md
 - name: Use the Apache Hive View
```

articles/hdinsight/interactive-query/apache-hive-warehouse-connector-operations.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
---
title: Apache Spark operations supported by Hive Warehouse Connector in Azure HDInsight
description: Learn about the different capabilities of Hive Warehouse Connector on Azure HDInsight.
author: nis-goel
ms.author: nisgoel
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 05/22/2020
---

# Apache Spark operations supported by Hive Warehouse Connector in Azure HDInsight

This article shows Spark-based operations supported by Hive Warehouse Connector (HWC). All of the examples shown below are executed through the Apache Spark shell.

## Prerequisite

Complete the [Hive Warehouse Connector setup](./apache-hive-warehouse-connector.md#hive-warehouse-connector-setup) steps.

## Getting started

To start a spark-shell session, do the following steps:

1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From your ssh session, execute the following command to note the `hive-warehouse-connector-assembly` version:

    ```bash
    ls /usr/hdp/current/hive_warehouse_connector
    ```

1. Edit the code below with the `hive-warehouse-connector-assembly` version identified above. Then execute the command to start the spark shell:

    ```bash
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<STACK_VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```

1. After you start spark-shell, a Hive Warehouse Connector instance can be started using the following commands:

    ```scala
    import com.hortonworks.hwc.HiveWarehouseSession
    val hive = HiveWarehouseSession.session(spark).build()
    ```

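
Optionally, run a quick check that the new session can reach HiveServer2 Interactive. This is a minimal sketch that assumes the `hive` session built in the previous step; `showDatabases()` is part of the HWC API and returns the visible databases as a DataFrame.

```scala
// Optional sanity check: list the databases visible through the connector.
hive.showDatabases().show()
```
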
## Creating Spark DataFrames using Hive queries

The results of all queries using the HWC library are returned as a DataFrame. The following example demonstrates how to run a basic Hive query.

```scala
hive.setDatabase("default")
val df = hive.executeQuery("select * from hivesampletable")
df.filter("state = 'Colorado'").show()
```

The results of the query are Spark DataFrames, which can be used with Spark libraries like MLlib and Spark SQL.

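
To illustrate that interoperability, the following sketch registers the DataFrame returned above as a temporary view and queries it with Spark SQL. The view name `hivesampletable_view` is only illustrative; `df` is the DataFrame from the previous example.

```scala
// Register the HWC result as a temporary view so it can be queried with Spark SQL.
df.createOrReplaceTempView("hivesampletable_view")

// Standard Spark SQL over the registered view; the result is again a Spark DataFrame.
spark.sql("select state, count(*) as cnt from hivesampletable_view group by state").show()
```
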
## Writing out Spark DataFrames to Hive tables

Spark doesn't natively support writing to Hive's managed ACID tables. However, using HWC, you can write out any DataFrame to a Hive table. You can see this functionality at work in the following example:

1. Create a table called `sampletable_colorado` and specify its columns using the following command:

    ```scala
    hive.createTable("sampletable_colorado").column("clientid","string").column("querytime","string").column("market","string").column("deviceplatform","string").column("devicemake","string").column("devicemodel","string").column("state","string").column("country","string").column("querydwelltime","double").column("sessionid","bigint").column("sessionpagevieworder","bigint").create()
    ```

1. Filter the table `hivesampletable` where the column `state` equals `Colorado`. This Hive query returns a Spark DataFrame that is then saved in the Hive table `sampletable_colorado` using the `write` function.

    ```scala
    hive.table("hivesampletable").filter("state = 'Colorado'").write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("append").option("table","sampletable_colorado").save()
    ```

1. View the results with the following command:

    ```scala
    hive.table("sampletable_colorado").show()
    ```

    ![hive warehouse connector show hive table](./media/apache-hive-warehouse-connector/hive-warehouse-connector-show-hive-table.png)

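
As a side note on the `format` string used above: after `import com.hortonworks.hwc.HiveWarehouseSession._`, HWC also exposes a `HIVE_WAREHOUSE_CONNECTOR` constant that resolves to the same data source name. The following is a sketch under that assumption, reusing the session and tables from the previous steps:

```scala
// Assumes HIVE_WAREHOUSE_CONNECTOR is brought into scope by this import.
import com.hortonworks.hwc.HiveWarehouseSession._

// Same append as above, written with the constant instead of the fully qualified class name.
hive.table("hivesampletable").filter("state = 'Colorado'").write.format(HIVE_WAREHOUSE_CONNECTOR).mode("append").option("table","sampletable_colorado").save()
```
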
## Structured streaming writes

Using Hive Warehouse Connector, you can use Spark streaming to write data into Hive tables.

> [!IMPORTANT]
> Structured streaming writes aren't supported in ESP-enabled Spark 4.0 clusters.

Follow the steps below to ingest data from a Spark stream on localhost port 9999 into a Hive table via Hive Warehouse Connector.

1. From your open Spark shell, begin a Spark stream with the following command:

    ```scala
    val lines = spark.readStream.format("socket").option("host", "localhost").option("port",9999).load()
    ```

1. Generate data for the Spark stream that you created by doing the following steps:
    1. Open a second SSH session on the same Spark cluster.
    1. At the command prompt, type `nc -lk 9999`. This command uses the netcat utility to send data from the command line to the specified port.

1. Return to the first SSH session and create a new Hive table to hold the streaming data. At the spark-shell, enter the following command:

    ```scala
    hive.createTable("stream_table").column("value","string").create()
    ```

1. Then write the streaming data to the newly created table using the following command (a sketch for stopping the query cleanly follows these steps):

    ```scala
    lines.filter("value = 'HiveSpark'").writeStream.format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource").option("database", "default").option("table","stream_table").option("metastoreUri",spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")).option("checkpointLocation","/tmp/checkpoint1").start()
    ```

    > [!IMPORTANT]
    > The `metastoreUri` and `database` options must currently be set manually due to a known issue in Apache Spark. For more information about this issue, see [SPARK-25460](https://issues.apache.org/jira/browse/SPARK-25460).

1. Return to the second SSH session and enter the following values:

    ```bash
    foo
    HiveSpark
    bar
    ```

1. Return to the first SSH session and note the brief activity. Use the following command to view the data:

    ```scala
    hive.table("stream_table").show()
    ```

Use **Ctrl + C** to stop netcat on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.

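
To stop the streaming write from inside spark-shell instead of exiting the shell, capture the handle that `start()` returns. This is a minimal sketch assuming the same `lines` stream and options as the step above; `StreamingQuery.stop()` is standard Spark Structured Streaming API.

```scala
// start() returns a StreamingQuery handle; keeping it lets you stop the write cleanly later.
val query = lines.filter("value = 'HiveSpark'").writeStream.format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource").option("database", "default").option("table","stream_table").option("metastoreUri",spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")).option("checkpointLocation","/tmp/checkpoint1").start()

// ...once you're done sending test data from the second SSH session:
query.stop()
```
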
## Next steps

* [HWC integration with Apache Spark and Apache Hive](./apache-hive-warehouse-connector.md)
* [Use Interactive Query with HDInsight](./apache-interactive-query-get-started.md)
* [HWC integration with Apache Zeppelin](./apache-hive-warehouse-connector-zeppelin.md)

articles/hdinsight/interactive-query/apache-hive-warehouse-connector-zeppelin.md

Lines changed: 134 additions & 0 deletions
@@ -0,0 +1,134 @@
---
title: Hive Warehouse Connector - Apache Zeppelin using Livy - Azure HDInsight
description: Learn how to integrate Hive Warehouse Connector with Apache Zeppelin on Azure HDInsight.
author: nis-goel
ms.author: nisgoel
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 05/22/2020
---

# Integrate Apache Zeppelin with Hive Warehouse Connector in Azure HDInsight

HDInsight Spark clusters include Apache Zeppelin notebooks with different interpreters. In this article, we'll focus only on the Livy interpreter to access Hive tables from Spark using Hive Warehouse Connector.

## Prerequisite

Complete the [Hive Warehouse Connector setup](apache-hive-warehouse-connector.md#hive-warehouse-connector-setup) steps.

## Getting started

1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

    ```cmd
    ssh sshuser@CLUSTERNAME-ssh.azurehdinsight.net
    ```

1. From your ssh session, execute the following command to note the versions for `hive-warehouse-connector-assembly` and `pyspark_hwc`:

    ```bash
    ls /usr/hdp/current/hive_warehouse_connector
    ```

    Save the output for later use when configuring Apache Zeppelin.

## Configure Livy

The following configurations are required to access Hive tables from Zeppelin with the Livy interpreter.

### Interactive Query Cluster

1. From a web browser, navigate to `https://LLAPCLUSTERNAME.azurehdinsight.net/#/main/services/HDFS/configs` where LLAPCLUSTERNAME is the name of your Interactive Query cluster.

1. Navigate to **Advanced** > **Custom core-site**. Select **Add Property...** to add the following configurations:

    | Configuration                | Value |
    |------------------------------|-------|
    | hadoop.proxyuser.livy.groups | *     |
    | hadoop.proxyuser.livy.hosts  | *     |

1. Save changes and restart all affected components.

### Spark Cluster

1. From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/#/main/services/SPARK2/configs` where CLUSTERNAME is the name of your Apache Spark cluster.

1. Expand **Custom livy2-conf**. Select **Add Property...** to add the following configuration:

    | Configuration                 | Value                                      |
    |-------------------------------|--------------------------------------------|
    | livy.file.local-dir-whitelist | /usr/hdp/current/hive_warehouse_connector/ |

1. Save changes and restart all affected components.

### Configure Livy Interpreter in Zeppelin UI (Spark Cluster)

1. From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/zeppelin/#/interpreter`, where `CLUSTERNAME` is the name of your Apache Spark cluster.

1. Navigate to **livy2**.

1. Add the following configurations:

    | Configuration | Value |
    |---|---|
    | livy.spark.hadoop.hive.llap.daemon.service.hosts | @llap0 |
    | livy.spark.security.credentials.hiveserver2.enabled | true |
    | livy.spark.sql.hive.llap | true |
    | livy.spark.yarn.security.credentials.hiveserver2.enabled | true |
    | livy.superusers | livy,zeppelin |
    | livy.spark.jars | `file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-VERSION.jar`.<br>Replace VERSION with the value you obtained from [Getting started](#getting-started), earlier. |
    | livy.spark.submit.pyFiles | `file:///usr/hdp/current/hive_warehouse_connector/pyspark_hwc-VERSION.zip`.<br>Replace VERSION with the value you obtained from [Getting started](#getting-started), earlier. |
    | livy.spark.sql.hive.hiveserver2.jdbc.url | Set it to the HiveServer2 Interactive JDBC URL of the Interactive Query cluster. |
    | spark.security.credentials.hiveserver2.enabled | true |

1. For ESP clusters only, add the following configuration:

    | Configuration | Value |
    |---|---|
    | livy.spark.sql.hive.hiveserver2.jdbc.url.principal | `hive/<headnode-FQDN>@<AAD-DOMAIN>` |

    Replace `<headnode-FQDN>` with the fully qualified domain name of the head node of the Interactive Query cluster. Replace `<AAD-DOMAIN>` with the name of the Azure Active Directory (AAD) domain that the cluster is joined to. Use an uppercase string for the `<AAD-DOMAIN>` value; otherwise the credential won't be found. Check `/etc/krb5.conf` for the realm names if needed.

1. Save the changes and restart the Livy interpreter.

If the Livy interpreter isn't accessible, modify the `shiro.ini` file present within the Zeppelin component in Ambari. For more information, see [Configuring Apache Zeppelin Security](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-zeppelin-security/content/enabling_access_control_for_interpreter__configuration__and_credential_settings.html).

## Running Queries in Zeppelin

Launch a Zeppelin notebook using the Livy interpreter and execute the following code:

```scala
%livy2

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
import org.apache.spark.sql.SaveMode

// Initialize the Hive Warehouse Connector session
val hive = HiveWarehouseSession.session(spark).build()

// Create a database
hive.createDatabase("hwc_db", true)
hive.setDatabase("hwc_db")

// Create a Hive table
hive.createTable("testers").ifNotExists().column("id", "bigint").column("name", "string").create()

val dataDF = Seq( (1, "foo"), (2, "bar"), (8, "john")).toDF("id", "name")

// Validate writes to the table
dataDF.write.format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector").mode("append").option("table", "hwc_db.testers").save()

// Validate reads
hive.executeQuery("select * from testers").show()
```
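
When you're done experimenting, you can optionally drop the test objects from the same notebook. This is a minimal sketch that assumes the `hwc_db` database and `testers` table created above; in the HWC API, `dropTable` takes (name, ifExists, purge) and `dropDatabase` takes (name, ifExists, cascade).

```scala
%livy2

// Clean up the test table and database created in the example above.
hive.dropTable("testers", true, true)
hive.dropDatabase("hwc_db", true, true)
```
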

## Next steps

* [HWC and Apache Spark operations](./apache-hive-warehouse-connector-operations.md)
* [HWC integration with Apache Spark and Apache Hive](./apache-hive-warehouse-connector.md)
* [Use Interactive Query with HDInsight](./apache-interactive-query-get-started.md)
