
Commit 5c073d5

Improved Acrolinx Score
1 parent 849dc5b commit 5c073d5

File tree

1 file changed: +9 −9 lines changed

articles/hdinsight/interactive-query/apache-hive-warehouse-connector.md

Lines changed: 9 additions & 9 deletions
@@ -5,7 +5,7 @@ author: reachnijel
 ms.author: nijelsf
 ms.service: hdinsight
 ms.topic: how-to
-ms.date: 07/18/2022
+ms.date: 12/09/2022
 ---

 # Integrate Apache Spark and Apache Hive with Hive Warehouse Connector in Azure HDInsight
@@ -16,7 +16,7 @@ The Hive Warehouse Connector allows you to take advantage of the unique features

 Apache Hive offers support for database transactions that are Atomic, Consistent, Isolated, and Durable (ACID). For more information on ACID and transactions in Hive, see [Hive Transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions). Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing (LLAP) not available in Apache Spark.

-Apache Spark, has a Structured Streaming API that gives streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 & above, and Apache Hive 3.1.0 have separate metastore catalogs which make interoperability difficult.
+Apache Spark, has a Structured Streaming API that gives streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 & above, and Apache Hive 3.1.0 have separate metastore catalogs, which make interoperability difficult.

 The Hive Warehouse Connector (HWC) makes it easier to use Spark and Hive together. The HWC library loads data from LLAP daemons to Spark executors in parallel. This process makes it more efficient and adaptable than a standard JDBC connection from Spark to Hive. This brings out two different execution modes for HWC:
 > - Hive JDBC mode via HiveServer2
@@ -44,7 +44,7 @@ Some of the operations supported by the Hive Warehouse Connector are:
 > - Hive Warehouse Connector (HWC) Library is not supported for use with Interactive Query Clusters where Workload Management (WLM) feature is enabled. <br>
 In a scenario where you only have Spark workloads and want to use HWC Library, ensure Interactive Query cluster doesn't have Workload Management feature enabled (`hive.server2.tez.interactive.queue` configuration is not set in Hive configs). <br>
 For a scenario where both Spark workloads (HWC) and LLAP native workloads exists, You need to create two separate Interactive Query Clusters with shared metastore database. One cluster for native LLAP workloads where WLM feature can be enabled on need basis and other cluster for HWC only workload where WLM feature shouldn't be configured.
-It is important to note that you can view the WLM resource plans from both the clusters even if it is enabled in only one cluster. Don't make any changes to resource plans in the cluster where WLM feature is disabled as it might impact the WLM functionality in other cluster.
+It's important to note that you can view the WLM resource plans from both the clusters even if it's enabled in only one cluster. Don't make any changes to resource plans in the cluster where WLM feature is disabled as it might impact the WLM functionality in other cluster.
 > - Although Spark supports R computing language for simplifying its data analysis, Hive Warehouse Connector (HWC) Library is not supported to be used with R. To execute HWC workloads, you can execute queries from Spark to Hive using the JDBC-style HiveWarehouseSession API that supports only Scala, Java, and Python.
 > - Executing queries (both read and write) through HiveServer2 via JDBC mode is not supported for complex data types like Arrays/Struct/Map types.
 > - HWC supports writing only in ORC file formats. Non-ORC writes (eg: parquet and text file formats) are not supported via HWC.
@@ -91,7 +91,7 @@ value. The value may be similar to: `thrift://iqgiro.rekufuk2y2cezcbowjkbwfnyvd.

 | Configuration | Value |
 |----|----|
-|`spark.datasource.hive.warehouse.load.staging.dir`| If you are using ADLS Gen2 Storage Account, use `abfss://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/tmp`<br>If you are using Azure Blob Storage Account, use `wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp`. <br> Set to a suitable HDFS-compatible staging directory. If you have two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace `STORAGE_ACCOUNT_NAME` with the name of the storage account being used by the cluster, and `STORAGE_CONTAINER_NAME` with the name of the storage container. |
+|`spark.datasource.hive.warehouse.load.staging.dir`| If you're using ADLS Gen2 Storage Account, use `abfss://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.dfs.core.windows.net/tmp`<br>If you're using Azure Blob Storage Account, use `wasbs://STORAGE_CONTAINER_NAME@STORAGE_ACCOUNT_NAME.blob.core.windows.net/tmp`. <br> Set to a suitable HDFS-compatible staging directory. If you've two different clusters, the staging directory should be a folder in the staging directory of the LLAP cluster's storage account so that HiveServer2 has access to it. Replace `STORAGE_ACCOUNT_NAME` with the name of the storage account being used by the cluster, and `STORAGE_CONTAINER_NAME` with the name of the storage container. |
 |`spark.sql.hive.hiveserver2.jdbc.url`| The value you obtained earlier from **HiveServer2 Interactive JDBC URL** |
 |`spark.datasource.hive.warehouse.metastoreUri`| The value you obtained earlier from **hive.metastore.uris**. |
 |`spark.security.credentials.hiveserver2.enabled`|`true` for YARN cluster mode and `false` for YARN client mode. |
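
For orientation only: a minimal sketch of how the four values in the table above might be passed as `--conf` flags when launching `spark-shell`, assuming they aren't set through Ambari instead. The storage account, container, host names, and connector jar version below are hypothetical placeholders, not values from this commit.

```bash
# Sketch only: all account, container, and host names are made up for illustration,
# and the assembly jar path should be verified on your cluster.
spark-shell \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
  --conf spark.datasource.hive.warehouse.load.staging.dir=abfss://mycontainer@mystorageacct.dfs.core.windows.net/tmp \
  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://zk0-llap.example.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" \
  --conf spark.datasource.hive.warehouse.metastoreUri="thrift://hn0-llap.example.com:9083" \
  --conf spark.security.credentials.hiveserver2.enabled=false
```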
@@ -114,7 +114,7 @@ Apart from the configurations mentioned in the previous section, add the followi
 |----|----|
 | `spark.sql.hive.hiveserver2.jdbc.url.principal` | `hive/<llap-headnode>@<AAD-Domain>` |

-* From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/#/main/services/HIVE/summary` where CLUSTERNAME is the name of your Interactive Query cluster. Click on **HiveServer2 Interactive**. You will see the Fully Qualified Domain Name (FQDN) of the head node on which LLAP is running as shown in the screenshot. Replace `<llap-headnode>` with this value.
+* From a web browser, navigate to `https://CLUSTERNAME.azurehdinsight.net/#/main/services/HIVE/summary` where CLUSTERNAME is the name of your Interactive Query cluster. Click on **HiveServer2 Interactive**. You'll see the Fully Qualified Domain Name (FQDN) of the head node on which LLAP is running as shown in the screenshot. Replace `<llap-headnode>` with this value.

 :::image type="content" source="./media/apache-hive-warehouse-connector/head-node-hive-server-interactive.png" alt-text="hive warehouse connector Head Node" border="true":::
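
As a rough illustration of that extra setting, the head-node FQDN and AAD domain below are hypothetical; substitute the values you find in Ambari and in your tenant.

```bash
# Hypothetical FQDN and domain, shown only to make the principal format concrete.
--conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/hn0-contoso.example.internal.cloudapp.net@CONTOSO.ONMICROSOFT.COM
```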

@@ -161,7 +161,7 @@ This is a way to run Spark interactively through a modified version of the Scala
 --conf spark.security.credentials.hiveserver2.enabled=false
 ```

-1. After starting the spark shell, a Hive Warehouse Connector instance can be started using the following commands:
+1. After you start the spark shell, a Hive Warehouse Connector instance can be started using the following commands:

 ```scala
 import com.hortonworks.hwc.HiveWarehouseSession
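
The hunk ends at the import line. As a non-authoritative sketch of the usual pattern that follows it inside spark-shell, building the session and issuing a query looks roughly like this; the table name in the query is only an example.

```scala
import com.hortonworks.hwc.HiveWarehouseSession

// Build a Hive Warehouse Connector session from the SparkSession that
// spark-shell already exposes as `spark`.
val hive = HiveWarehouseSession.session(spark).build()

// Example calls; `hivesampletable` is used here only as an illustrative table name.
hive.showDatabases().show(100)
hive.executeQuery("SELECT * FROM default.hivesampletable LIMIT 10").show()
```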
@@ -172,7 +172,7 @@ This is a way to run Spark interactively through a modified version of the Scala

 Spark-submit is a utility to submit any Spark program (or job) to Spark clusters.

-The spark-submit job will setup and configure Spark and Hive Warehouse Connector as per our instructions, execute the program we pass to it, then cleanly release the resources that were being used.
+The spark-submit job will set up and configure Spark and Hive Warehouse Connector as per our instructions, execute the program we pass to it, then cleanly release the resources that were being used.

 Once you build the scala/java code along with the dependencies into an assembly jar, use the below command to launch a Spark application. Replace `<VERSION>`, and `<APP_JAR_PATH>` with the actual values.
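
For context, a spark-submit invocation of that shape might look roughly like the following sketch. The main class and deploy mode are assumptions; `<VERSION>` and `/<APP_JAR_PATH>/myHwcAppProject.jar` are the same placeholders the article uses.

```bash
# Sketch of a YARN cluster-mode submission; com.example.MyHwcApp is a hypothetical main class,
# and the connector assembly jar name should be verified on your cluster.
spark-submit \
  --class com.example.MyHwcApp \
  --master yarn \
  --deploy-mode cluster \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
  --conf spark.security.credentials.hiveserver2.enabled=true \
  /<APP_JAR_PATH>/myHwcAppProject.jar
```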
@@ -201,7 +201,7 @@ Once you build the scala/java code along with the dependencies into an assembly

 This utility is also used when we have written the entire application in pySpark and packaged into py files (Python), so that we can submit the entire code to Spark cluster for execution.

-For Python applications, simply pass a .py file in the place of `/<APP_JAR_PATH>/myHwcAppProject.jar`, and add the below configuration (Python .zip) file to the search path with `--py-files`.
+For Python applications, pass a .py file in the place of `/<APP_JAR_PATH>/myHwcAppProject.jar`, and add the below configuration (Python .zip) file to the search path with `--py-files`.

 ```python
 --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<VERSION>.zip
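
Putting that flag in context, a hedged sketch of a full Python submission might look like the following; the script name `myHwcApp.py` and the assembly jar filename are assumptions, not part of this commit.

```bash
# Sketch only: verify the connector jar and zip versions on your cluster; myHwcApp.py is hypothetical.
spark-submit \
  --master yarn \
  --deploy-mode client \
  --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<VERSION>.jar \
  --py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<VERSION>.zip \
  --conf spark.security.credentials.hiveserver2.enabled=false \
  myHwcApp.py
```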
@@ -226,7 +226,7 @@ kinit USERNAME
 INSERT INTO demo VALUES ('InteractiveQuery');
 ```

-1. View the table's contents with the following command. Before applying the policy, the `demo` table shows the full column.
+1. View the table's contents with the following command. Before you apply the policy, the `demo` table shows the full column.

 ```scala
 hive.executeQuery("SELECT * FROM demo").show()

0 commit comments
