Commit 8de9a96

Merge pull request #185096 from warriersruthi/patch-1
Detailing 'spark-submit' utility & unsupportability of R
2 parents e0704d1 + 61e97e3 commit 8de9a96

File tree

1 file changed: +21 −2 lines changed


articles/hdinsight/interactive-query/apache-hive-warehouse-connector.md

Lines changed: 21 additions & 2 deletions
@@ -16,7 +16,14 @@ The Hive Warehouse Connector allows you to take advantage of the unique features
Apache Hive offers support for database transactions that are Atomic, Consistent, Isolated, and Durable (ACID). For more information on ACID and transactions in Hive, see [Hive Transactions](https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions). Hive also offers detailed security controls through Apache Ranger and Low Latency Analytical Processing (LLAP) not available in Apache Spark.
Apache Spark has a Structured Streaming API that provides streaming capabilities not available in Apache Hive. Beginning with HDInsight 4.0, Apache Spark 2.3.1 and above, and Apache Hive 3.1.0, have separate metastore catalogs, which makes interoperability difficult.

The Hive Warehouse Connector (HWC) makes it easier to use Spark and Hive together. The HWC library loads data from LLAP daemons to Spark executors in parallel. This process makes it more efficient and adaptable than a standard JDBC connection from Spark to Hive. This results in two different execution modes for HWC:

> - Hive JDBC mode via HiveServer2
> - Hive LLAP mode using LLAP daemons **[Recommended]**

By default, HWC is configured to use Hive LLAP daemons. To run Hive queries (both read and write) using the above modes with their respective APIs, see [HWC APIs](./hive-warehouse-connector-apis.md).
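To make the two modes concrete, here's a minimal, hypothetical Scala sketch of the HiveWarehouseSession flow (the `demo.employees` table is made up, and it assumes a cluster where the connector is already configured as described later in this article):

```scala
// Minimal sketch, assuming HWC is configured and its assembly jar is on the classpath.
// `spark` is the active SparkSession (available by default in spark-shell);
// the demo.employees table is a hypothetical example.
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Hive LLAP mode (default): executeQuery() reads through the LLAP daemons
// into Spark executors in parallel and returns a Spark DataFrame.
val df = hive.executeQuery("SELECT id, name FROM demo.employees WHERE id > 100")
df.show()

// Hive JDBC mode via HiveServer2: execute() sends the statement over JDBC.
hive.execute("DESCRIBE demo.employees").show()
```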
:::image type="content" source="./media/apache-hive-warehouse-connector/hive-warehouse-connector-architecture.png" alt-text="hive warehouse connector architecture" border="true":::

@@ -38,6 +45,9 @@ Some of the operations supported by the Hive Warehouse Connector are:
In a scenario where you only have Spark workloads and want to use the HWC library, ensure that the Interactive Query cluster doesn't have the Workload Management (WLM) feature enabled (the `hive.server2.tez.interactive.queue` configuration isn't set in Hive configs). <br>
For a scenario where both Spark workloads (HWC) and LLAP native workloads exist, you need to create two separate Interactive Query clusters with a shared metastore database: one cluster for native LLAP workloads, where the WLM feature can be enabled as needed, and another cluster for HWC-only workloads, where the WLM feature shouldn't be configured.
You can view the WLM resource plans from both clusters even if the feature is enabled in only one of them. Don't make any changes to resource plans in the cluster where the WLM feature is disabled, because it might impact the WLM functionality in the other cluster.

> - Although Spark supports the R computing language to simplify its data analysis, the Hive Warehouse Connector (HWC) library isn't supported for use with R. To run HWC workloads, you can run queries from Spark to Hive using the JDBC-style HiveWarehouseSession API, which supports only Scala, Java, and Python.
> - Running queries (both read and write) through HiveServer2 via JDBC mode isn't supported for complex data types like Arrays/Struct/Map types.
> - HWC supports writing only in ORC file formats. Non-ORC writes (for example, Parquet and text file formats) aren't supported via HWC; a write sketch follows this list.
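As a hedged illustration of the ORC-only write path, here's a minimal Scala sketch (table, column, and DataFrame names are hypothetical, and it assumes the connector is configured as described in this article); HWC persists the data as ORC under the hood:

```scala
// Hypothetical write sketch; HWC writes table data only in ORC format.
import com.hortonworks.hwc.HiveWarehouseSession

val hive = HiveWarehouseSession.session(spark).build()

// Create a managed Hive table if it doesn't already exist (names are illustrative).
hive.createTable("employees_copy")
  .ifNotExists()
  .column("id", "bigint")
  .column("name", "string")
  .create()

// Write a DataFrame through the HWC data source.
val df = spark.createDataFrame(Seq((1L, "alice"), (2L, "bob"))).toDF("id", "name")
df.write
  .format("com.hortonworks.spark.sql.hive.llap.HiveWarehouseConnector")
  .option("table", "employees_copy")
  .save()
```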
Hive Warehouse Connector needs separate clusters for Spark and Interactive Query workloads. Follow these steps to set up these clusters in Azure HDInsight.

@@ -122,6 +132,8 @@ Below are some examples to connect to HWC from Spark.
### Spark-shell
Spark-shell is a way to run Spark interactively through a modified version of the Scala shell.

1. Use [ssh command](../hdinsight-hadoop-linux-use-ssh-unix.md) to connect to your Apache Spark cluster. Edit the command below by replacing CLUSTERNAME with the name of your cluster, and then enter the command:

```cmd
@@ -151,6 +163,10 @@ Below are some examples to connect to HWC from Spark.
### Spark-submit
Spark-submit is a utility to submit any Spark program (or job) to Spark clusters.

The spark-submit job sets up and configures Spark and the Hive Warehouse Connector as per our instructions, executes the program we pass to it, and then cleanly releases the resources that were being used.
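For orientation, the program handed to spark-submit is an ordinary Spark application. A minimal, hypothetical Scala skeleton of the kind of code you might package into `myHwcAppProject.jar` (the object and table names are illustrative only) could look like this:

```scala
// Hypothetical skeleton of a Spark application using HWC, meant to be built
// into an assembly jar and launched with spark-submit.
import org.apache.spark.sql.SparkSession
import com.hortonworks.hwc.HiveWarehouseSession

object MyHwcApp {
  def main(args: Array[String]): Unit = {
    // spark-submit supplies the master, deploy mode, and HWC configuration,
    // so the application only needs to obtain a SparkSession.
    val spark = SparkSession.builder().appName("MyHwcApp").getOrCreate()

    val hive = HiveWarehouseSession.session(spark).build()
    hive.executeQuery("SELECT COUNT(*) FROM demo.employees").show()

    spark.stop()
  }
}
```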
Once you build the Scala/Java code along with the dependencies into an assembly jar, use the following command to launch a Spark application. Replace `<VERSION>` and `<APP_JAR_PATH>` with the actual values.
* YARN Client mode
@@ -176,7 +192,9 @@ Once you build the scala/java code along with the dependencies into an assembly
/<APP_JAR_PATH>/myHwcAppProject.jar
```
This utility is also used when the entire application is written in pySpark and packaged into .py files (Python), so that we can submit the entire code to the Spark cluster for execution.

For Python applications, pass a .py file in place of `/<APP_JAR_PATH>/myHwcAppProject.jar`, and add the following configuration (Python .zip) file to the search path with `--py-files`.
```python
--py-files /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-<VERSION>.zip
@@ -229,3 +247,4 @@ kinit USERNAME
* [Use Interactive Query with HDInsight](./apache-interactive-query-get-started.md)
* [HWC integration with Apache Zeppelin](./apache-hive-warehouse-connector-zeppelin.md)
* [Examples of interacting with Hive Warehouse Connector using Zeppelin, Livy, spark-submit, and pyspark](https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html)
* [Submitting Spark Applications via Spark-submit utility](https://spark.apache.org/docs/2.4.0/submitting-applications.html)
