Commit 4dd419b

Merge pull request #113126 from dagiro/freshness_c59

freshness_c59

2 parents 2cb2f66 + 403c8d8

1 file changed (+14 −13 lines changed)

articles/hdinsight/interactive-query/apache-hive-warehouse-connector.md

Lines changed: 14 additions & 13 deletions
@@ -6,16 +6,17 @@ ms.author: hrasheed
 ms.reviewer: hrasheed
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 03/02/2020
+ms.custom: seoapr2020
+ms.date: 04/28/2020
 ---

 # Integrate Apache Spark and Apache Hive with the Hive Warehouse Connector

-The Apache Hive Warehouse Connector (HWC) is a library that allows you to work more easily with Apache Spark and Apache Hive by supporting tasks such as moving data between Spark DataFrames and Hive tables, and also directing Spark streaming data into Hive tables. Hive Warehouse Connector works like a bridge between Spark and Hive. It supports Scala, Java, and Python for development.
+The Apache Hive Warehouse Connector (HWC) is a library that allows you to work more easily with Apache Spark and Apache Hive. It supports tasks such as moving data between Spark DataFrames and Hive tables, and directing Spark streaming data into Hive tables. Hive Warehouse Connector works like a bridge between Spark and Hive. It supports Scala, Java, and Python for development.

-The Hive Warehouse Connector allows you to take advantage of the unique features of Hive and Spark to build powerful big-data applications. Apache Hive offers support for database transactions that are Atomic, Consistent, Isolated, and Durable (ACID). For more information on ACID and transactions in Hive, see [Hive Transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions). Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing not available in Apache Spark.
+The Hive Warehouse Connector allows you to take advantage of the unique features of Hive and Spark to build powerful big-data applications. Apache Hive offers support for database transactions that are Atomic, Consistent, Isolated, and Durable (ACID). For more information on ACID and transactions in Hive, see [Hive Transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions). Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing (LLAP), neither of which is available in Apache Spark.

-Apache Spark, has a Structured Streaming API that gives streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 and Apache Hive 3.1.0 have separate metastores, which can make interoperability difficult. The Hive Warehouse Connector makes it easier to use Spark and Hive together. The HWC library loads data from LLAP daemons to Spark executors in parallel, making it more efficient and scalable than using a standard JDBC connection from Spark to Hive.
+Apache Spark has a Structured Streaming API that provides streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 and Apache Hive 3.1.0 have separate metastores, which can make interoperability difficult. The Hive Warehouse Connector makes it easier to use Spark and Hive together. The HWC library loads data from LLAP (Low Latency Analytical Processing) daemons to Spark executors in parallel, making it more efficient and scalable than a standard JDBC connection from Spark to Hive.

 ![hive warehouse connector architecture](./media/apache-hive-warehouse-connector/hive-warehouse-connector-architecture.png)
@@ -67,7 +68,7 @@ From your Spark Ambari web UI, navigate to **Spark2** > **CONFIGS** > **Custom s
 ![Apache Ambari Spark2 configuration](./media/apache-hive-warehouse-connector/hive-warehouse-connector-spark2-ambari.png)

-Select **Add Property...** as needed to add/update the following:
+Select **Add Property...** as needed to add or update the following values:

 | Key | Value |
 |----|----|
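The table's key/value rows are elided from this excerpt. As a hedged sketch only (property names follow the Hive Warehouse Connector documentation; every `<...>` placeholder is an assumption that depends on your cluster), the custom spark2-defaults entries typically look like:

```
spark.hadoop.hive.llap.daemon.service.hosts       @llap0
spark.sql.hive.hiveserver2.jdbc.url               jdbc:hive2://<LLAP-CLUSTER>.azurehdinsight.net:443/;user=<user>;password=<password>;ssl=true;transportMode=http;httpPath=/hive2
spark.datasource.hive.warehouse.load.staging.dir  wasbs://<container>@<account>.blob.core.windows.net/tmp
spark.datasource.hive.warehouse.metastoreUri      thrift://<METASTORE-HOST>:9083
```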
@@ -106,7 +107,7 @@ To start a spark-shell session, do the following steps:
     --conf spark.security.credentials.hiveserver2.enabled=false
     ```

-    You will see a welcome message and a `scala>` prompt where you can enter commands.
+    You'll see a welcome message and a `scala>` prompt where you can enter commands.

 1. After starting the spark-shell, a Hive Warehouse Connector instance can be started using the following commands:
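The HWC commands themselves are not visible in this excerpt. As a sketch (not necessarily the article's exact code, and runnable only on a cluster where the HWC assembly jar is on the spark-shell classpath), a Hive Warehouse Connector session is typically built from the shell's active `SparkSession` like this:

```scala
// Sketch: build an HWC session from the spark-shell's implicit SparkSession (`spark`).
// com.hortonworks.hwc.HiveWarehouseSession is the entry point shipped in the
// hive-warehouse-connector-assembly jar.
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Queries issued through `hive` run against Hive/LLAP rather than Spark's own metastore.
hive.showDatabases().show()
```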
@@ -117,13 +118,13 @@ To start a spark-shell session, do the following steps:

 ### Connecting and running queries on Enterprise Security Package (ESP) clusters

-The Enterprise Security Package (ESP) provides enterprise-grade capabilities like Active Directory-based authentication, multi-user support, and role-based access control for Apache Hadoop clusters in Azure HDInsight. For more information on ESP, see [Use Enterprise Security Package in HDInsight](../domain-joined/apache-domain-joined-architecture.md).
+The Enterprise Security Package (ESP) provides enterprise-grade capabilities such as Active Directory-based authentication, multi-user support, and role-based access control for Apache Hadoop clusters in Azure HDInsight. For more information on ESP, see [Use Enterprise Security Package in HDInsight](../domain-joined/apache-domain-joined-architecture.md).

-1. SSH into the headnode for your Apache Spark cluster. For more information about connecting to your cluster with SSH, see [Connect to HDInsight (Apache Hadoop) using SSH](../../hdinsight/hdinsight-hadoop-linux-use-ssh-unix.md).
+1. SSH into the headnode for your Apache Spark cluster.

 1. Type `kinit` and login with a domain user.

-1. Start spark-shell with the full list of configuration parameters as shown below. All of the values in all capital letters between angle brackets must be specified based on your cluster. If you need to find out the values to input for any of the parameters below, see the section on [Hive Warehouse Connector setup](#hive-warehouse-connector-setup).:
+1. Start spark-shell with the full list of configuration parameters as shown below. All of the values in all capital letters between angle brackets must be specified based on your cluster. If you need to find out the values to input for any of the parameters below, see the section on [Hive Warehouse Connector setup](#hive-warehouse-connector-setup).

     ```bash
     spark-shell --master yarn \
@@ -176,7 +177,7 @@ Spark doesn’t natively support writing to Hive’s managed ACID tables. Using

 Using Hive Warehouse Connector, you can use Spark streaming to write data into Hive tables.

-Follow the steps below to create a Hive Warehouse Connector example that ingests data from a Spark stream on localhost port 9999 into a Hive table.
+Follow the steps below to create a Hive Warehouse Connector example. The example ingests data from a Spark stream on localhost port 9999 into a Hive table.

 1. Follow the steps under [Connecting and running queries](#connecting-and-running-queries).
@@ -188,7 +189,7 @@ Follow the steps below to create a Hive Warehouse Connector example that ingests

 1. Generate data for the Spark stream that you created, by doing the following steps:
    1. Open a second SSH session on the same Spark cluster.
-   1. At the command prompt, type `nc -lk 9999`. This command uses the netcat utility to send data from the command line to the specified port.
+   1. At the command prompt, type `nc -lk 9999`. This command uses the `netcat` utility to send data from the command line to the specified port.

 1. Return to the first SSH session and create a new Hive table to hold the streaming data. At the spark-shell, enter the following command:
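The streaming write itself falls between the hunks shown here. As a hedged Scala sketch of the flow these steps describe (the socket source options are standard Spark Structured Streaming; the HWC streaming format string and its `database`/`table` option names are assumptions drawn from the HWC documentation, and `stream_table` matches the table referenced below):

```scala
// Read lines typed into `nc -lk 9999` as a streaming DataFrame (standard Spark API).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Append each line to the Hive table via the HWC streaming data source.
// The format string and option names below are assumptions based on HWC docs.
val query = lines.writeStream
  .format("com.hortonworks.spark.sql.hive.llap.streaming.HiveStreamingDataSource")
  .option("database", "default")
  .option("table", "stream_table")
  .option("checkpointLocation", "/tmp/hwc-checkpoint") // hypothetical path
  .start()
```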
@@ -219,7 +220,7 @@ Follow the steps below to create a Hive Warehouse Connector example that ingests

    hive.table("stream_table").show()
    ```

-Use **Ctrl + C** to stop netcat on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.
+Use **Ctrl + C** to stop `netcat` on the second SSH session. Use `:q` to exit spark-shell on the first SSH session.

 ### Securing data on Spark ESP clusters
@@ -248,7 +249,7 @@ Use **Ctrl + C** to stop `netcat` on the second SSH session. Use `:q` to exit spar

 ![hive warehouse connector ranger hive policy list](./media/apache-hive-warehouse-connector/hive-warehouse-connector-ranger-hive-policy-list.png)

-a. Provide a desired policy name. Select database: **Default**, Hive table: **demo**, Hive column: **name**, User: **rsadmin2**, Access Types: **select**, and **Partial mask: show last 4** from the **Select Masking Option** menu. Click **Add**.
+a. Provide a policy name. Select database: **Default**, Hive table: **demo**, Hive column: **name**, User: **rsadmin2**, Access Types: **select**, and **Partial mask: show last 4** from the **Select Masking Option** menu. Click **Add**.

    ![create policy](./media/apache-hive-warehouse-connector/hive-warehouse-connector-ranger-create-policy.png)
 1. View the table's contents again. After applying the ranger policy, we can see only the last four characters of the column.
