articles/hdinsight/interactive-query/apache-hive-warehouse-connector.md
ms.author: nakhanha
ms.reviewer: hrasheed
ms.service: hdinsight
ms.topic: conceptual
ms.date: 10/08/2019
---
# Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector
Some of the operations supported by the Hive Warehouse Connector are:
## Hive Warehouse Connector setup
Follow these steps to set up the Hive Warehouse Connector between a Spark and Interactive Query cluster in Azure HDInsight:
1. Create an HDInsight Spark **4.0** cluster with a storage account and a custom Azure virtual network. For information on creating a cluster in an Azure virtual network, see [Add HDInsight to an existing virtual network](../../hdinsight/hdinsight-plan-virtual-network-deployment.md#existingvnet).
1. Create an HDInsight Interactive Query (LLAP) **4.0** cluster with the same storage account and Azure virtual network as the Spark cluster.
1. Copy the node information from the `/etc/hosts` file on headnode0 of your Interactive Query cluster and concatenate the information to the `/etc/hosts` file on headnode0 of your Spark cluster. This step allows your Spark cluster to resolve IP addresses of the nodes in the Interactive Query cluster. View the contents of the updated file with `cat /etc/hosts`. The final output should look something like what is shown in the screenshot below.
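The concatenation in the last step can be sketched as follows. This is an illustration only: the file names stand in for `/etc/hosts` on each headnode, and the host entries are made-up placeholders.

```bash
# Illustration only: stand-in files instead of the real /etc/hosts.
# iq_hosts.txt plays the role of the entries copied from the Interactive
# Query cluster; spark_hosts.txt plays the role of /etc/hosts on the
# Spark headnode.
printf '10.0.0.18 wn0-iq.internal.cloudapp.net\n' > iq_hosts.txt
printf '127.0.0.1 localhost\n' > spark_hosts.txt

# Concatenate (append) rather than overwrite, so existing entries survive.
cat iq_hosts.txt >> spark_hosts.txt

# Verify the merged result, as the step suggests with `cat /etc/hosts`.
cat spark_hosts.txt
```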
#### From your Interactive Query cluster
1. Navigate to the cluster's Apache Ambari home page using `https://LLAPCLUSTERNAME.azurehdinsight.net`, where `LLAPCLUSTERNAME` is the name of your Interactive Query cluster.
1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **Advanced hive-site** > **hive.zookeeper.quorum** and note the value. The value may be similar to: `zk0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181,zk1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181,zk4-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:2181`.
1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **General** > **hive.metastore.uris** and note the value. The value may be similar to: `thrift://hn0-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:9083,thrift://hn1-iqgiro.rekufuk2y2cezcbowjkbwfnyvd.bx.internal.cloudapp.net:9083`.
#### From your Apache Spark cluster

1. Navigate to the cluster's Apache Ambari home page using `https://SPARKCLUSTERNAME.azurehdinsight.net`, where `SPARKCLUSTERNAME` is the name of your Apache Spark cluster.
1. Navigate to **Hive** > **CONFIGS** > **Advanced** > **Advanced hive-interactive-site** > **hive.llap.daemon.service.hosts** and note the value. The value may be similar to: `@llap0`.
### Configure Spark cluster settings
From your Spark Ambari web UI, navigate to **Spark2** > **CONFIGS** > **Custom spark2-defaults**.
Select **Add Property...** as needed to add/update the following:
| Key | Value |
|----|----|
|`spark.hadoop.hive.llap.daemon.service.hosts`|The value you obtained earlier from **hive.llap.daemon.service.hosts**.|
|`spark.sql.hive.hiveserver2.jdbc.url`|`jdbc:hive2://LLAPCLUSTERNAME.azurehdinsight.net:443/;user=admin;password=PWD;ssl=true;transportMode=http;httpPath=/hive2`. This JDBC connection string connects to Hiveserver2 on the Interactive Query cluster. Replace `LLAPCLUSTERNAME` with the name of your Interactive Query cluster and `PWD` with the actual password.|
|`spark.datasource.hive.warehouse.load.staging.dir`|`wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp`. Set to a suitable HDFS-compatible staging directory. If you have two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace `STORAGE_ACCOUNT_NAME` with the name of the storage account used by the cluster, and `STORAGE_CONTAINER_NAME` with the name of the storage container.|
|`spark.datasource.hive.warehouse.metastoreUri`|The value you obtained earlier from **hive.metastore.uris**.|
|`spark.security.credentials.hiveserver2.enabled`|`false` for YARN client deploy mode.|
|`spark.hadoop.hive.zookeeper.quorum`|The value you obtained earlier from **hive.zookeeper.quorum**.|
Save changes and restart components as needed.
## Using the Hive Warehouse Connector
All examples provided in this article will be executed through spark-shell.
To start a spark-shell session, do the following steps:
1. SSH into the headnode for your Apache Spark cluster. For more information about connecting to your cluster with SSH, see [Connect to HDInsight (Apache Hadoop) using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
1. Enter the following command to start the spark shell:
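    The exact command is not shown in this excerpt. A minimal invocation, assuming the HWC assembly jar ships under `/usr/hdp/current/hive_warehouse_connector` (the version suffix in the jar name varies by cluster), looks something like:

    ```bash
    # Sketch only -- replace <VERSION> with the jar version on your cluster.
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```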
You will see a welcome message and a `scala>` prompt where you can enter commands.
1. After starting the spark-shell, a Hive Warehouse Connector instance can be started using the following commands:
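    The commands themselves are elided in this excerpt; based on the Hive Warehouse Connector API, they take roughly this shape inside spark-shell:

    ```scala
    // Build a HiveWarehouseSession from spark-shell's implicit `spark` session.
    // Runs only inside spark-shell on a configured cluster.
    import com.hortonworks.hwc.HiveWarehouseSession
    import com.hortonworks.hwc.HiveWarehouseSession._

    val hive = HiveWarehouseSession.session(spark).build()
    ```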
The Enterprise Security Package (ESP) provides enterprise-grade capabilities like Active Directory-based authentication, multi-user support, and role-based access control for Apache Hadoop clusters in Azure HDInsight. For more information on ESP, see [Use Enterprise Security Package in HDInsight](../domain-joined/apache-domain-joined-architecture.md).
1. SSH into the headnode for your Apache Spark cluster. For more information about connecting to your cluster with SSH, see [Connect to HDInsight (Apache Hadoop) using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
1. Type `kinit` and log in with a domain user.
1. Start spark-shell with the full list of configuration parameters as shown below. All of the values in all capital letters between angle brackets must be specified based on your cluster. If you need to find out the values to input for any of the parameters below, see the section on [Hive Warehouse Connector setup](#hive-warehouse-connector-setup):
    ```bash
    # The full spark-shell command is truncated in this excerpt. It passes the
    # HWC assembly jar plus each configuration value gathered above, e.g.:
    spark-shell --master yarn \
    --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
    --conf spark.sql.hive.hiveserver2.jdbc.url=<JDBC_URL> \
    --conf spark.security.credentials.hiveserver2.enabled=false
    ```
The results of the query are Spark DataFrames, which can be used with Spark libraries like MLlib and SparkSQL.
### Writing out Spark DataFrames to Hive tables
Spark doesn’t natively support writing to Hive’s managed ACID tables. Using HWC, however, you can write out any DataFrame to a Hive table. You can see this functionality at work in the following example:
1. Filter the table `hivesampletable` where the column `state` equals `Colorado`. This query of the Hive table is returned as a Spark DataFrame. Then the DataFrame is saved in the Hive table `sampletable_colorado` using the `write` function.
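    The code for this step is elided here; a sketch using the HWC API, where the format string is the standard HWC data source class and `hive` is the session built earlier:

    ```scala
    // Query the Hive table as a Spark DataFrame, filter it, then write the
    // result to a new Hive table through the Hive Warehouse Connector.
    // Runs only inside a spark-shell session with `hive` already built.
    val colorado = hive.table("hivesampletable").filter("state = 'Colorado'")

    colorado.write
      .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
      .option("table", "sampletable_colorado")
      .save()
    ```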
### Structured streaming writes

Follow the steps below to create a Hive Warehouse Connector example that ingests data from a Spark structured stream into a Hive table.
> [!IMPORTANT]
> The `metastoreUri` and `database` options must currently be set manually due to a known issue in Apache Spark. For more information about this issue, see [SPARK-25460](https://issues.apache.org/jira/browse/SPARK-25460).
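The stream-writing code the note refers to is elided in this excerpt. A hedged sketch of where those options would be set, assuming the HWC streaming data source class and option keys (`<METASTORE_URI>` is the value gathered earlier):

```scala
// Read lines from a netcat socket and stream them into a Hive table.
// Illustrative only; runs inside spark-shell on a configured cluster.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

lines.writeStream
  .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
  .option("metastoreUri", "<METASTORE_URI>") // set manually (see SPARK-25460)
  .option("database", "default")             // set manually (see SPARK-25460)
  .option("table", "stream_table")
  .start()
```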
1. Return to the second SSH session and enter the following values:
    ```bash
    foo
    HiveSpark
    bar
    ```
1. Return to the first SSH session and note the brief activity. Use the following command to view the data:
    ```scala
    hive.table("stream_table").show()
    ```
Use **Ctrl + C** to stop netcat on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.
### Securing data on Spark ESP clusters
1. Create a table `demo` with some sample data by entering the following commands:
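    The commands are truncated at the end of this excerpt. One hypothetical way to create such a table through the HWC session (`executeUpdate` issues Hive DDL/DML; the table and sample value are illustrative):

    ```scala
    // Create a demo table and add a sample row via HiveServer2.
    // Runs only inside a configured spark-shell session with `hive` built.
    hive.executeUpdate("CREATE TABLE demo (name STRING)")
    hive.executeUpdate("INSERT INTO demo VALUES ('HDinsight')")
    hive.executeQuery("SELECT * FROM demo").show()
    ```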