
Commit 7329be0

Merge pull request #184165 from konjac/external-hms-ga
Remove preview flag of external Hive Metastore
2 parents 592f9ae + dd3b4a7 commit 7329be0

File tree

1 file changed: +37 -36 lines changed

articles/synapse-analytics/spark/apache-spark-external-metastore.md

Lines changed: 37 additions & 36 deletions
@@ -1,7 +1,7 @@
 ---
 title: Use external Hive Metastore for Azure Synapse Spark Pool
 description: Learn how to set up external Hive Metastore for Azure Synapse Spark Pool.
-keywords: external Hive metastore,share,Synapse
+keywords: external Hive Metastore,share,Synapse
 ms.service: synapse-analytics
 ms.topic: conceptual
 ms.subservice: spark
@@ -10,26 +10,25 @@ author: yanancai
 ms.date: 09/08/2021
 ---

-# Use external Hive Metastore for Synapse Spark Pool (Preview)
+# Use external Hive Metastore for Synapse Spark Pool

-Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore Service) compatible metastore as their catalog. When customers want to persist the Hive catalog outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. In this article, you learn how to connect Synapse Spark to an external Apache Hive Metastore.
+Azure Synapse Analytics allows Apache Spark pools in the same workspace to share a managed HMS (Hive Metastore) compatible metastore as their catalog. When customers want to persist the Hive catalog metadata outside of the workspace, and share catalog objects with other computational engines outside of the workspace, such as HDInsight and Azure Databricks, they can connect to an external Hive Metastore. In this article, you can learn how to connect Synapse Spark to an external Apache Hive Metastore.

-## Supported Hive metastore versions
-
-The feature works with both Spark 2.4 and Spark 3.0. The following table shows the supported Hive metastore service (HMS) versions for each Spark version.
+## Supported Hive Metastore versions

+The feature works with both Spark 2.4 and Spark 3.1. The following table shows the supported Hive Metastore versions for each Spark version.

 |Spark Version|HMS 0.13.X|HMS 1.2.X|HMS 2.1.X|HMS 2.3.x|HMS 3.1.X|
 |--|--|--|--|--|--|
 |2.4|Yes|Yes|Yes|Yes|No|
-|3|Yes|Yes|Yes|Yes|Yes|
+|3.1|Yes|Yes|Yes|Yes|Yes|

-## Set up Hive metastore linked service
+## Set up linked service to Hive Metastore

 > [!NOTE]
-> Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive metastore.
+> Only Azure SQL Database and Azure Database for MySQL are supported as an external Hive Metastore. And currently we only support User-Password authentication. If the provided database is blank, please provision it via [Hive Schema Tool](https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool) to create the database schema.

-Follow below steps to set up a linked service to the external Hive metastore in Synapse workspace.
+Follow below steps to set up a linked service to the external Hive Metastore in Synapse workspace.

 1. Open Synapse Studio, go to **Manage > Linked services** at left, click **New** to create a new linked service.

@@ -39,18 +38,18 @@ Follow below steps to set up a linked service to the external Hive metastore in

 3. Provide **Name** of the linked service. Record the name of the linked service, this info will be used to configure Spark shortly.

-4. You can either select **Azure SQL Database**/**Azure Database for MySQL** for the external Hive metastore from Azure subscription list, or enter the info manually.
+4. You can either select **Azure SQL Database**/**Azure Database for MySQL** for the external Hive Metastore from Azure subscription list, or enter the info manually.

-5. Currently we only support User-Password authentication. Provide **User name** and **Password** to set up the connection.
+5. Provide **User name** and **Password** to set up the connection.

 6. **Test connection** to verify the username and password.

 7. Click **Create** to create the linked service.

 ### Test connection and get the metastore version in notebook
-Some network security rule settings may block access from Spark pool to the external Hive metastore DB. Before you configure the Spark pool, run below code in any Spark pool notebook to test connection to the external Hive metastore DB.
+Some network security rule settings may block access from Spark pool to the external Hive Metastore DB. Before you configure the Spark pool, run below code in any Spark pool notebook to test connection to the external Hive Metastore DB.

-You can also get your Hive metastore version from the output results. The Hive metastore version will be used in the Spark configuration.
+You can also get your Hive Metastore version from the output results. The Hive Metastore version will be used in the Spark configuration.

 #### Connection testing code for Azure SQL
 ```scala
@@ -62,9 +61,9 @@ try {
 val connection = DriverManager.getConnection(url)
 val result = connection.createStatement().executeQuery("select t.SCHEMA_VERSION from VERSION t")
 result.next();
-println(s"Successful to test connection. Hive metastore version is ${result.getString(1)}")
+println(s"Successful to test connection. Hive Metastore version is ${result.getString(1)}")
 } catch {
-case ex: Throwable =>println(s"Failed to establish connection:\n $ex")
+case ex: Throwable => println(s"Failed to establish connection:\n $ex")
 }
 ```
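As an aside (not part of this commit): the same connectivity check can also be run from PySpark with the JDBC reader bundled on the pool. A minimal sketch, assuming an Azure SQL metastore database; the URL placeholders are hypothetical and follow the standard SQL Server JDBC format:

```python
# Hypothetical connection URL; fill in your server, database, and credentials.
url = ("jdbc:sqlserver://{your_server_name}.database.windows.net:1433;"
       "database={your_database_name};user={your_username};password={your_password}")

# Read the schema version straight from the HMS VERSION table over JDBC.
df = (spark.read.format("jdbc")
      .option("url", url)
      .option("query", "select t.SCHEMA_VERSION from VERSION t")
      .load())
print("Hive Metastore version is %s" % df.first()["SCHEMA_VERSION"])
```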

@@ -78,27 +77,27 @@ try {
 val connection = DriverManager.getConnection(url, "{your_username_here}", "{your_password_here}");
 val result = connection.createStatement().executeQuery("select t.SCHEMA_VERSION from VERSION t")
 result.next();
-println(s"Successful to test connection. Hive metastore version is ${result.getString(1)}")
+println(s"Successful to test connection. Hive Metastore version is ${result.getString(1)}")
 } catch {
-case ex: Throwable =>println(s"Failed to establish connection:\n $ex")
+case ex: Throwable => println(s"Failed to establish connection:\n $ex")
 }
 ```

-## Configure Spark to use the external Hive metastore
-After creating the linked service to the external Hive metastore successfully, you need to setup a few configurations in the Spark to use the external Hive metastore. You can both set up the configuration at Spark pool level, or at Spark session level.
+## Configure Spark to use the external Hive Metastore
+After creating the linked service to the external Hive Metastore successfully, you need to set up a few Spark configurations to use the external Hive Metastore. You can set up the configuration either at Spark pool level or at Spark session level.

 Here are the configurations and descriptions:

 > [!NOTE]
-> Synapse aims to works smoothly with computes from HDI. However HMS 3.1 in HDI 4.0 is not full compatible with the OSS HMS 3.1. For OSS HMS 3.1, please check [here](#hms-schema-change-for-oss-hms-31).
+> Synapse aims to work smoothly with computes from HDI. However, HMS 3.1 in HDI 4.0 is not fully compatible with the OSS HMS 3.1. For OSS HMS 3.1, please check [here](#hms-schema-change-for-oss-hms-31).

 |Spark config|Description|
 |--|--|
 |`spark.sql.hive.metastore.version`|Supported versions: <ul><li>`0.13`</li><li>`1.2`</li><li>`2.1`</li><li>`2.3`</li><li>`3.1`</li></ul> Make sure you use the first 2 parts without the 3rd part|
 |`spark.sql.hive.metastore.jars`|<ul><li>Version 0.13: `/opt/hive-metastore/lib-0.13/*:/usr/hdp/current/hadoop-client/lib/*` </li><li>Version 1.2: `/opt/hive-metastore/lib-1.2/*:/usr/hdp/current/hadoop-client/lib/*` </li><li>Version 2.1: `/opt/hive-metastore/lib-2.1/*:/usr/hdp/current/hadoop-client/lib/*` </li><li>Version 2.3: `/opt/hive-metastore/lib-2.3/*:/usr/hdp/current/hadoop-client/lib/*` </li><li>Version 3.1: `/opt/hive-metastore/lib-3.1/*:/usr/hdp/current/hadoop-client/lib/*`</li></ul>|
 |`spark.hadoop.hive.synapse.externalmetastore.linkedservice.name`|Name of your linked service|

-### Configure Spark pool
+### Configure at Spark pool level
 When creating the Spark pool, under **Additional Settings** tab, put below configurations in a text file and upload it in **Apache Spark configuration** section. You can also use the context menu for an existing Spark pool, choose Apache Spark configuration to add these configurations.

 :::image type="content" source="./media/use-external-metastore/config-spark-pool.png" alt-text="Configure the Spark pool":::
@@ -107,7 +106,7 @@ Update metastore version and linked service name, and save below configs in a te

 ```properties
 spark.sql.hive.metastore.version <your hms version, Make sure you use the first 2 parts without the 3rd part>
-spark.hadoop.hive.synapse.externalmetastore.linkedservice.name <your linked service name to Azure SQL DB>
+spark.hadoop.hive.synapse.externalmetastore.linkedservice.name <your linked service name>
 spark.sql.hive.metastore.jars /opt/hive-metastore/lib-<your hms version, 2 parts>/*:/usr/hdp/current/hadoop-client/lib/*
 ```

@@ -119,8 +118,8 @@ spark.hadoop.hive.synapse.externalmetastore.linkedservice.name HiveCatalog21
 spark.sql.hive.metastore.jars /opt/hive-metastore/lib-2.1/*:/usr/hdp/current/hadoop-client/lib/*
 ```

-### Configure a Spark session
-If you don't want to configure your Spark pool, you can also configure the Spark session in notebook using %%configure magic command. Here is the code. Same configuration can also be applied to a Spark batch job.
+### Configure at Spark session level
+For a notebook session, you can also configure the Spark session in notebook using the `%%configure` magic command. Here is the code.

 ```json
 %%configure -f
@@ -133,16 +132,18 @@ If you don't want to configure your Spark pool, you can also configure the Spark
 }
 ```

+For a batch job, the same configuration can also be applied via `SparkConf` (a sketch follows this hunk).
+
 ### Run queries to verify the connection
 After all these settings, try listing catalog objects by running below query in Spark notebook to check the connectivity to the external Hive Metastore.
 ```python
 spark.sql("show databases").show()
 ```

 ## Set up storage connection
-The linked service to Hive metastore database just provides access to Hive catalog metadata. To query the existing tables, you need to set up connection to the storage account that stores the underlying data for your Hive tables as well.
+The linked service to the Hive Metastore database just provides access to Hive catalog metadata. To query the existing tables, you need to set up connection to the storage account that stores the underlying data for your Hive tables as well.

-### Set up connection to ADLS Gen 2
+### Set up connection to Azure Data Lake Storage Gen 2
 #### Workspace primary storage account
 If the underlying data of your Hive tables is stored in the workspace primary storage account, you don't need to do extra settings. It will just work as long as you followed storage setting up instructions during workspace creation.
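The hunk above adds a note that a batch job can apply the same settings through `SparkConf`. A minimal PySpark sketch, not part of the commit, reusing the `HiveCatalog21` linked service name from the pool-level example; adjust the HMS version and jars path to your setup:

```python
from pyspark.sql import SparkSession

# Batch-job equivalent of the %%configure example: the same three settings
# passed to the session builder before the session starts.
spark = (
    SparkSession.builder
    .appName("external-hms-batch-job")
    .config("spark.sql.hive.metastore.version", "2.1")
    .config("spark.sql.hive.metastore.jars",
            "/opt/hive-metastore/lib-2.1/*:/usr/hdp/current/hadoop-client/lib/*")
    .config("spark.hadoop.hive.synapse.externalmetastore.linkedservice.name",
            "HiveCatalog21")
    .enableHiveSupport()
    .getOrCreate()
)

# Quick check that the external catalog is visible.
spark.sql("show databases").show()
```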

@@ -157,7 +158,7 @@ If the underlying data of your Hive tables are stored in Azure Blob storage acco
 :::image type="content" source="./media/use-external-metastore/connect-to-storage-account.png" alt-text="Connect to storage account" border="true":::

 2. Choose **Azure Blob Storage** and click **Continue**.
-3. Provide **Name** of the linked service. Record the name of the linked service, this info will be used in Spark session configuration shortly.
+3. Provide **Name** of the linked service. Record the name of the linked service, this info will be used in Spark configuration shortly.
 4. Select the Azure Blob Storage account. Make sure Authentication method is **Account key**. Currently Spark pool can only access Blob Storage account via account key.
 5. **Test connection** and click **Create**.
 6. After creating the linked service to Blob Storage account, when you run Spark queries, make sure you run below Spark code in the notebook to get access to the Blob Storage account for the Spark session. Learn more about why you need to do this [here](./apache-spark-secure-credentials-with-tokenlibrary.md).
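The Spark code step 6 refers to is not included in this diff. A hedged sketch of what it typically looks like (not part of the commit), fetching the SAS credential of the Blob Storage linked service with `mssparkutils`; the linked service, account, and container names below are placeholders:

```python
from notebookutils import mssparkutils

# Hypothetical names; replace with your Blob Storage linked service, account,
# and container.
blob_linked_service = "myBlobStorageLinkedService"
blob_account_name = "myblobaccount"
blob_container_name = "mycontainer"

# Fetch the SAS credential registered on the linked service and hand it to the
# Spark session so it can read the container.
blob_sas_token = mssparkutils.credentials.getConnectionStringOrCreds(blob_linked_service)
spark.conf.set(
    "fs.azure.sas.%s.%s.blob.core.windows.net" % (blob_container_name, blob_account_name),
    blob_sas_token,
)
```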
@@ -178,7 +179,7 @@ After setting up storage connections, you can query the existing tables in the H
 ## Known limitations

 - Synapse Studio object explorer will continue to show objects in managed Synapse metastore instead of the external HMS, we are improving the experience of this.
-- [SQL <-> spark synchronization](../sql/develop-storage-files-spark-tables.md) doesn't work when using external HMS.
+- [SQL <-> Spark synchronization](../sql/develop-storage-files-spark-tables.md) doesn't work when using external HMS.
 - Only Azure SQL Database and Azure Database for MySQL are supported as external Hive Metastore database. Only SQL authorization is supported.
 - Currently Spark only works on external Hive tables and non-transactional/non-ACID managed Hive tables. It doesn't support Hive ACID/transactional tables now.
 - Apache Ranger integration is not supported as of now.
@@ -203,11 +204,11 @@ spark.conf.set('fs.azure.sas.%s.%s.blob.core.windows.net' % (blob_container_name
 ```

 ### See below error when query a table stored in ADLS Gen2 account
-```
+```text
 Operation failed: "This request is not authorized to perform this operation using this permission.", 403, HEAD
 ```

-This could happen because the user who run Spark query doesn't have enough access to the underlying storage account. Make sure the users who run Spark queries have **Storage Blob Data Contributor** role on the ADLS Gen2 storage account. This step can be done later after creating the linked service.
+This could happen because the user who runs the Spark query doesn't have enough access to the underlying storage account. Make sure the user who runs Spark queries has the **Storage Blob Data Contributor** role on the ADLS Gen2 storage account. This step can be done after creating the linked service.

 ### HMS schema related settings
 To avoid changing HMS backend schema/version, following hive configs are set by system by default:
@@ -218,7 +219,7 @@ spark.hadoop.datanucleus.fixedDatastore true
 spark.hadoop.datanucleus.schema.autoCreateAll false
 ```

-If your HMS version is 1.2.1 or 1.2.2, there's an issue in Hive that claims requiring only 1.2.0 if you turn spark.hadoop.hive.metastore.schema.verification to true. Our suggestion is either you can modify your HMS version to 1.2.0, or overwrite below two configurations to work around:
+If your HMS version is `1.2.1` or `1.2.2`, there's an issue in Hive that claims only `1.2.0` is required if you turn `spark.hadoop.hive.metastore.schema.verification` to `true`. Our suggestion is to either modify your HMS version to `1.2.0`, or overwrite the below two configurations to work around it:

 ```properties
 spark.hadoop.hive.metastore.schema.verification false
@@ -228,17 +229,17 @@ spark.hadoop.hive.synapse.externalmetastore.schema.usedefault false
 If you need to migrate your HMS version, we recommend using [hive schema tool](https://cwiki.apache.org/confluence/display/Hive/Hive+Schema+Tool). And if the HMS has been used by HDInsight clusters, we suggest using [HDI provided version](../../hdinsight/interactive-query/apache-hive-migrate-workloads.md).

 ### HMS schema change for OSS HMS 3.1
-Synapse aims to works smoothly with computes from HDI. However HMS 3.1 in HDI 4.0 is not full compatible with the OSS HMS 3.1. So please apply the following manually to your HMS 3.1 if it’s not provisioned by HDI.
+Synapse aims to work smoothly with computes from HDI. However, HMS 3.1 in HDI 4.0 is not fully compatible with the OSS HMS 3.1. So please apply the following manually to your HMS 3.1 if it’s not provisioned by HDI.

 ```sql
 -- HIVE-19416
 ALTER TABLE TBLS ADD WRITE_ID bigint NOT NULL DEFAULT(0);
 ALTER TABLE PARTITIONS ADD WRITE_ID bigint NOT NULL DEFAULT(0);
 ```

-### When sharing the metastore with HDInsight 4.0 Spark clusters, I cannot see the tables
+### When sharing the metastore with HDInsight 4.0 Spark cluster, I cannot see the tables
 If you want to share the Hive catalog with a spark cluster in HDInsight 4.0, please ensure your property `spark.hadoop.metastore.catalog.default` in Synapse spark aligns with the value in HDInsight spark. The default value for HDI spark is `spark` and the default value for Synapse spark is `hive`.

-### When sharing the Hive metastore with HDInsight 4.0 Hive clusters, I can list the tables successfully, but only get empty result when I query the table
-As mentioned in the limitations, Synapse Spark pool only supports external hive tables and non-transactional/ACID managed tables, it doesn't support Hive ACID/transactional tables currently. By default in HDInsight 4.0 Hive clusters, all managed tables are created as ACID/transactional tables by default, that's why you get empty results when querying those tables.
+### When sharing the Hive Metastore with HDInsight 4.0 Hive cluster, I can list the tables successfully, but only get an empty result when I query the table
+As mentioned in the limitations, Synapse Spark pool only supports external hive tables and non-transactional/ACID managed tables; it doesn't support Hive ACID/transactional tables currently. In HDInsight 4.0 Hive clusters, all managed tables are created as ACID/transactional tables by default; that's why you get empty results when querying those tables.
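Two hedged illustrations for the last two troubleshooting items above; neither snippet is part of the commit, and all names are placeholders. First, you can check which catalog name the current session resolves, and align `spark.hadoop.metastore.catalog.default` at session start (via `%%configure` or the pool's Apache Spark configuration) so both sides match:

```python
# Prints the effective catalog setting; Synapse defaults to "hive",
# HDInsight 4.0 Spark defaults to "spark".
print(spark.sparkContext._jsc.hadoopConfiguration().get("metastore.catalog.default", "hive"))
```

Second, a table meant to be readable from both HDInsight 4.0 Hive and Synapse Spark can be created as an external table, which stays non-transactional; `shared_db.events` and the `abfss` path are hypothetical:

```python
# External tables avoid the ACID managed-table default of HDInsight 4.0 Hive,
# so Synapse Spark can query the data instead of returning empty results.
spark.sql("CREATE DATABASE IF NOT EXISTS shared_db")
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS shared_db.events (id BIGINT, payload STRING)
    STORED AS PARQUET
    LOCATION 'abfss://data@mystorageaccount.dfs.core.windows.net/shared/events'
""")
```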
