
Commit 50fe95a

Merge pull request #90914 from dagiro/freshness14
freshness14
2 parents 1b34685 + e9caa8b commit 50fe95a

File tree

1 file changed

articles/hdinsight/interactive-query/apache-hive-warehouse-connector.md

Lines changed: 69 additions & 52 deletions

ms.author: nakhanha
ms.reviewer: hrasheed
ms.service: hdinsight
ms.topic: conceptual
ms.date: 10/08/2019
---

# Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector

## Hive Warehouse Connector setup

Follow these steps to set up the Hive Warehouse Connector between a Spark and Interactive Query cluster in Azure HDInsight:

### Create clusters

1. Create an HDInsight Spark **4.0** cluster with a storage account and a custom Azure virtual network. For information on creating a cluster in an Azure virtual network, see [Add HDInsight to an existing virtual network](../../hdinsight/hdinsight-plan-virtual-network-deployment.md#existingvnet).

1. Create an HDInsight Interactive Query (LLAP) **4.0** cluster with the same storage account and Azure virtual network as the Spark cluster.

### Modify hosts file

Copy the node information from the `/etc/hosts` file on headnode0 of your Interactive Query cluster and append it to the `/etc/hosts` file on headnode0 of your Spark cluster. This step allows your Spark cluster to resolve the IP addresses of the nodes in the Interactive Query cluster. View the contents of the updated file with `cat /etc/hosts`. The final output should look something like the following screenshot.

![hive warehouse connector hosts file](./media/apache-hive-warehouse-connector/hive-warehouse-connector-hosts-file.png)
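
A minimal sketch of one way to do this over SSH is shown below; the SSH user, cluster name, and temporary file name are assumptions for illustration only:

```bash
# Sketch only: run from headnode0 of the Spark cluster. Adjust the SSH user and
# cluster name for your environment.
ssh sshuser@LLAPCLUSTERNAME-ssh.azurehdinsight.net \
  'grep "internal.cloudapp.net" /etc/hosts' > llap-hosts.txt

# Append the Interactive Query node entries to this node's hosts file, then verify.
sudo tee -a /etc/hosts < llap-hosts.txt
cat /etc/hosts
```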

### Gather preliminary information

#### From your Interactive Query cluster

1. Navigate to the cluster's Apache Ambari home page using `https://LLAPCLUSTERNAME.azurehdinsight.net` where `LLAPCLUSTERNAME` is the name of your Interactive Query cluster.

1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **Advanced hive-site** > **hive.zookeeper.quorum** and note the value. The value may be similar to: `zk0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181,zk1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181,zk4-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181`.

1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **General** > **hive.metastore.uris** and note the value. The value may be similar to: `thrift://hn0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:9083,thrift://hn1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:9083`.

#### From your Apache Spark cluster

1. Navigate to the cluster's Apache Ambari home page using `https://SPARKCLUSTERNAME.azurehdinsight.net` where `SPARKCLUSTERNAME` is the name of your Apache Spark cluster.

1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **Advanced hive-interactive-site** > **hive.llap.daemon.service.hosts** and note the value. The value may be similar to: `@llap0`.

### Configure Spark cluster settings

From your Spark Ambari web UI, navigate to **Spark2** > **CONFIGS** > **Custom spark2-defaults**.

![Apache Ambari Spark2 configuration](./media/apache-hive-warehouse-connector/hive-warehouse-connector-spark2-ambari.png)

Select **Add Property...** as needed to add or update the following:

| Key | Value |
|----|----|
|`spark.hadoop.hive.llap.daemon.service.hosts`|The value you obtained earlier from **hive.llap.daemon.service.hosts**.|
|`spark.sql.hive.hiveserver2.jdbc.url`|`jdbc:hive2://LLAPCLUSTERNAME.azurehdinsight.net:443/;user=admin;password=PWD;ssl=true;transportMode=http;httpPath=/hive2`. This is the JDBC connection string that connects to Hiveserver2 on the Interactive Query cluster. Replace `LLAPCLUSTERNAME` with the name of your Interactive Query cluster and `PWD` with the actual password.|
|`spark.datasource.hive.warehouse.load.staging.dir`|`wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp`. Set to a suitable HDFS-compatible staging directory. If you have two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace `STORAGE_ACCOUNT_NAME` with the name of the storage account being used by the cluster, and `STORAGE_CONTAINER_NAME` with the name of the storage container.|
|`spark.datasource.hive.warehouse.metastoreUri`|The value you obtained earlier from **hive.metastore.uris**.|
|`spark.security.credentials.hiveserver2.enabled`|`false` for YARN client deploy mode.|
|`spark.hadoop.hive.zookeeper.quorum`|The value you obtained earlier from **hive.zookeeper.quorum**.|

Save changes and restart components as needed.
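
For reference, the same settings can also be passed directly on the spark-shell command line instead of through Ambari. The following is only a sketch; every value is a placeholder drawn from the examples above, not a real endpoint:

```bash
# Sketch only: substitute the values you gathered from Ambari plus your own
# cluster name, password, storage account, and container.
spark-shell --master yarn \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.2.1-8.jar \
  --conf spark.hadoop.hive.llap.daemon.service.hosts='@llap0' \
  --conf spark.sql.hive.hiveserver2.jdbc.url='jdbc:hive2://LLAPCLUSTERNAME.azurehdinsight.net:443/;user=admin;password=PWD;ssl=true;transportMode=http;httpPath=/hive2' \
  --conf spark.datasource.hive.warehouse.load.staging.dir='wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp' \
  --conf spark.datasource.hive.warehouse.metastoreUri='thrift://hn0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:9083' \
  --conf spark.hadoop.hive.zookeeper.quorum='zk0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181' \
  --conf spark.security.credentials.hiveserver2.enabled=false
```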

## Using the Hive Warehouse Connector

All examples provided in this article will be executed through spark-shell.

To start a spark-shell session, do the following steps:

1. SSH into the headnode for your Apache Spark cluster. For more information about connecting to your cluster with SSH, see [Connect to HDInsight (Apache Hadoop) using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).

1. Enter the following command to start the spark shell:

    ```bash
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.2.1-8.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```

    You will see a welcome message and a `scala>` prompt where you can enter commands.

1. After starting the spark-shell, a Hive Warehouse Connector instance can be started using the following commands:
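
    A minimal sketch, assuming the standard `com.hortonworks.hwc` API that ships with HDInsight 4.0:

    ```scala
    // Build a Hive Warehouse Connector session from the active SparkSession (spark).
    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._

    val hive = HiveWarehouseSession.session(spark).build()
    ```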

The Enterprise Security Package (ESP) provides enterprise-grade capabilities like Active Directory-based authentication, multi-user support, and role-based access control for Apache Hadoop clusters in Azure HDInsight. For more information on ESP, see [Use Enterprise Security Package in HDInsight](../domain-joined/apache-domain-joined-architecture.md).

1. SSH into the headnode for your Apache Spark cluster. For more information about connecting to your cluster with SSH, see [Connect to HDInsight (Apache Hadoop) using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).

1. Type `kinit` and log in with a domain user.

1. Start spark-shell with the full list of configuration parameters as shown below. All of the values in all capital letters between angle brackets must be specified based on your cluster. If you need to find out the values to input for any of the parameters below, see the section on [Hive Warehouse Connector setup](#hive-warehouse-connector-setup):

    ```bash
    --conf spark.hadoop.hive.zookeeper.quorum='<ZOOKEEPER_QUORUM>'
    ```

### Creating Spark DataFrames from Hive queries

The results of all queries using the HWC library are returned as a DataFrame. The following examples demonstrate how to create a basic query.
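
A minimal sketch, assuming the `hive` session built earlier and the default `hivesampletable` sample table:

```scala
// Sketch only: run a Hive query through HWC; the result comes back as a Spark DataFrame.
hive.setDatabase("default")
val df = hive.executeQuery("select * from hivesampletable")
df.filter("state = 'Colorado'").show()
```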

The results of the query are Spark DataFrames, which can be used with Spark libraries like MLlib and SparkSQL.
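
For example, the DataFrame from the sketch above (`df`) can be registered as a temporary view and queried with Spark SQL; the view name here is illustrative:

```scala
// The HWC result is an ordinary Spark DataFrame, so standard Spark APIs apply.
df.createOrReplaceTempView("colorado_sessions")
spark.sql("SELECT devicemake, COUNT(*) AS total FROM colorado_sessions GROUP BY devicemake").show()
```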

### Writing out Spark DataFrames to Hive tables

Spark doesn't natively support writing to Hive's managed ACID tables. Using HWC, however, you can write out any DataFrame to a Hive table. You can see this functionality at work in the following example:

1. Create a table called `sampletable_colorado` with the following command:

    ```scala
    hive.createTable("sampletable_colorado").column("clientid","string").column("querytime","string").column("market","string").column("deviceplatform","string").column("devicemake","string").column("devicemodel","string").column("state","string").column("country","string").column("querydwelltime","double").column("sessionid","bigint").column("sessionpagevieworder","bigint").create()
    ```

1. Filter the table `hivesampletable` where the column `state` equals `Colorado`. This query of the Hive table is returned as a Spark DataFrame. Then the DataFrame is saved in the Hive table `sampletable_colorado` using the `write` function.

    ```scala
    hive.table("hivesampletable").filter("state = 'Colorado'").write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).option("table","sampletable_colorado").save()
    ```

1. View the results with the following command:

    ```scala
    hive.table("sampletable_colorado").show()
    ```

    ![hive warehouse connector show hive table](./media/apache-hive-warehouse-connector/hive-warehouse-connector-show-hive-table.png)

### Structured streaming writes

Using Hive Warehouse Connector, you can use Spark streaming to write data into Hive tables.

Follow the steps below to create a Hive Warehouse Connector example that ingests data from a Spark stream on localhost port 9999 into a Hive table.

1. Follow the steps under [Connecting and running queries](#connecting-and-running-queries).

1. Begin the spark stream with the following command:

    ```scala
    val lines = spark.readStream.format("socket").option("host", "localhost").option("port",9999).load()
    ```

1. Generate data for the Spark stream that you created, by doing the following steps:
    1. Open a second SSH session on the same Spark cluster.
    1. At the command prompt, type `nc -lk 9999`. This command uses the netcat utility to send data from the command line to the specified port.

1. Return to the first SSH session and create a new Hive table to hold the streaming data. At the spark-shell, enter the following command:

    ```scala
    hive.createTable("stream_table").column("value","string").create()
    ```

1. Then write the streaming data to the newly created table using the following command:

    ```scala
    lines.filter("value = 'HiveSpark'").writeStream.format(HiveWarehouseSession.STREAM_TO_STREAM).option("database", "default").option("table","stream_table").option("metastoreUri",spark.conf.get("spark.datasource.hive.warehouse.metastoreUri")).option("checkpointLocation","/tmp/checkpoint1").start()
    ```

    >[!Important]
    > The `metastoreUri` and `database` options must currently be set manually due to a known issue in Apache Spark. For more information about this issue, see [SPARK-25460](https://issues.apache.org/jira/browse/SPARK-25460).

1. Return to the second SSH session and enter the following values:

    ```bash
    foo
    HiveSpark
    bar
    ```

1. Return to the first SSH session and note the brief activity. Use the following command to view the data:

    ```scala
    hive.table("stream_table").show()
    ```

Use **Ctrl + C** to stop netcat on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.

### Securing data on Spark ESP clusters

1. Create a table `demo` with some sample data by entering the following commands:
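
    A sketch of one way to do this with the HWC session created earlier is shown below; the sample rows are illustrative assumptions, not the original values:

    ```scala
    // Sketch only: create the demo table and insert hypothetical sample rows through HWC.
    hive.createTable("demo").column("name", "string").create()
    hive.executeUpdate("INSERT INTO demo VALUES ('HDInsight')")
    hive.executeUpdate("INSERT INTO demo VALUES ('Spark')")
    ```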
