Commit ee4b0c0

Merge pull request #95816 from dagiro/hive1

2 parents 2b33500 + f5d92b8

File tree

1 file changed (+108, -24 lines)

articles/hdinsight/interactive-query/apache-hive-migrate-workloads.md

---
title: Migrate Azure HDInsight 3.6 Hive workloads to HDInsight 4.0
description: Learn how to migrate Apache Hive workloads on HDInsight 3.6 to HDInsight 4.0.
author: msft-tacox
ms.author: tacox
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 11/13/2019
---

# Migrate Azure HDInsight 3.6 Hive workloads to HDInsight 4.0

This document shows you how to migrate Apache Hive and LLAP workloads on HDInsight 3.6 to HDInsight 4.0. HDInsight 4.0 provides newer Hive and LLAP features such as materialized views and query result caching. When you migrate your workloads to HDInsight 4.0, you can use many newer features of Hive 3 that aren't available on HDInsight 3.6.

This article covers the following subjects:
* Preservation of Hive security policies across HDInsight versions
* Query execution and debugging from HDInsight 3.6 to HDInsight 4.0

One advantage of Hive is the ability to export metadata to an external database (referred to as the Hive Metastore). The **Hive Metastore** is responsible for storing table statistics, including the table storage location, column names, and table index information. The metastore database schema differs between Hive versions. The recommended way to upgrade the Hive metastore safely is to create a copy and upgrade the copy instead of the current production environment.

## Copy metastore

HDInsight 3.6 and HDInsight 4.0 require different metastore schemas and can't share a single metastore.

### External metastore

Create a new copy of your external metastore. If you're using an external metastore, one of the safest and easiest ways to copy it is to [restore the database](../../sql-database/sql-database-recovery-using-backups.md#point-in-time-restore) with a different name by using the SQL Database restore function. See [Use external metadata stores in Azure HDInsight](../hdinsight-use-external-metadata-stores.md) to learn more about attaching an external metastore to an HDInsight cluster.

### Internal metastore

If you're using the internal metastore, you can use queries to export object definitions from the Hive metastore and then import them into a new database.

1. Connect to the HDInsight cluster by using a [Secure Shell (SSH) client](../hdinsight-hadoop-linux-use-ssh-unix.md).

1. Connect to HiveServer2 with your [Beeline client](../hadoop/apache-hadoop-use-hive-beeline.md) from your open SSH session by entering the following command, which writes the DDL for every database and table to a file named **alltables.sql**:

    ```bash
    for d in `beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" --showHeader=false --silent=true --outputformat=tsv2 -e "show databases;"`; do
      echo "create database $d; use $d;" >> alltables.sql
      for t in `beeline -u "jdbc:hive2://localhost:10001/$d;transportMode=http" --showHeader=false --silent=true --outputformat=tsv2 -e "show tables;"`; do
        ddl=`beeline -u "jdbc:hive2://localhost:10001/$d;transportMode=http" --showHeader=false --silent=true --outputformat=tsv2 -e "show create table $t;"`
        echo "$ddl ;" >> alltables.sql
        # Partitioned tables also need their partition metadata rebuilt on import.
        echo "$ddl" | grep -q "PARTITIONED\s*BY" && echo "MSCK REPAIR TABLE $t ;" >> alltables.sql
      done
    done
    ```

    This command generates a file named **alltables.sql**. Because the default database can't be deleted and re-created, remove the `create database default;` statement from **alltables.sql**.

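
    One possible way to strip that statement is a quick `sed` edit. This sketch isn't from the original procedure; it assumes the line has the exact form the export loop writes (`create database default; use default;`), and it demonstrates the edit on a two-line stand-in for **alltables.sql**:

    ```shell
    # Stand-in for the generated alltables.sql: one "create database <db>; use <db>;"
    # line per database, as emitted by the export loop.
    printf 'create database default; use default;\ncreate database sales; use sales;\n' > alltables.sql

    # Drop the statement that re-creates the default database, but keep
    # "use default;" so the DDL that follows still targets the right database.
    sed -i 's/^create database default; use default;$/use default;/' alltables.sql

    cat alltables.sql
    ```

    On the cluster, you'd run only the `sed` line against the real **alltables.sql**.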
1. Exit your SSH session. Then enter an scp command to download **alltables.sql** to your local machine.

    ```bash
    scp sshuser@CLUSTERNAME-ssh.azurehdinsight.net:alltables.sql c:/hdi
    ```

1. Upload **alltables.sql** to the *new* HDInsight cluster.

    ```bash
    scp c:/hdi/alltables.sql sshuser@CLUSTERNAME-ssh.azurehdinsight.net:/home/sshuser/
    ```

1. Use SSH to connect to the *new* HDInsight cluster, and then run the following command from the SSH session to re-create the databases and tables:

    ```bash
    beeline -u "jdbc:hive2://localhost:10001/;transportMode=http" -i alltables.sql
    ```
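
For reference, the generated **alltables.sql** is just a HiveQL script. A roughly representative sketch (database, table, and column names here are made up) might look like:

```sql
create database default; use default;
CREATE TABLE `events`(
  `id` int,
  `payload` string) ;
create database sales; use sales;
CREATE TABLE `orders`(
  `orderid` int)
PARTITIONED BY (
  `orderdate` string) ;
MSCK REPAIR TABLE orders ;
```

Each partitioned table is followed by an `MSCK REPAIR TABLE` statement so that its partition metadata is rebuilt on the new cluster.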

## Upgrade metastore

After the metastore **copy** is complete, run a schema upgrade script as a [script action](../hdinsight-hadoop-customize-cluster-linux.md) on the existing HDInsight 3.6 cluster to upgrade the copied metastore to the Hive 3 schema. The upgraded database can then be attached as the HDInsight 4.0 metastore.

Use the values in the following table. Replace `SQLSERVERNAME DATABASENAME USERNAME PASSWORD` with the appropriate values for the **copied** Hive metastore, separated by spaces. Don't include ".database.windows.net" when specifying the SQL server name.

|Property | Value |
|---|---|
|Script type|- Custom|
|Name|Hive upgrade|
|Bash script URI|`https://hdiconfigactions.blob.core.windows.net/hivemetastoreschemaupgrade/launch-schema-upgrade.sh`|
|Node type(s)|Head|
|Parameters|SQLSERVERNAME DATABASENAME USERNAME PASSWORD|
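
As a purely hypothetical example, if the copied metastore is a database named `hivemeta40` on the server `contosometa.database.windows.net` with admin user `metaadmin`, the Parameters box would contain:

```
contosometa hivemeta40 metaadmin P@ssw0rdExample
```

Note that the server is given as `contosometa`, without the `.database.windows.net` suffix.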

> [!Warning]
> The upgrade, which converts the HDInsight 3.6 metadata schema to the HDInsight 4.0 schema, can't be reversed.

You can verify the upgrade by running the following SQL query against the database:

```sql
select * from dbo.version
```
## Migrate Hive tables to HDInsight 4.0

After you complete the previous steps to migrate the Hive metastore to HDInsight 4.0, the tables and databases recorded in the metastore are visible from within the HDInsight 4.0 cluster when you execute `show tables` or `show databases` from within the cluster. See [Query execution across HDInsight versions](#query-execution-across-hdinsight-versions) for information on query execution in HDInsight 4.0 clusters.

The actual data from the tables, however, isn't accessible until the cluster has access to the necessary storage accounts. To make sure your HDInsight 4.0 cluster can access the same data as your old HDInsight 3.6 cluster, complete the following steps:

1. Determine the Azure storage account of your table or database.

1. If your HDInsight 4.0 cluster is already running, attach the Azure storage account to the cluster via Ambari. If you haven't yet created the HDInsight 4.0 cluster, make sure the Azure storage account is specified as either the primary or a secondary cluster storage account. For more information about adding storage accounts to HDInsight clusters, see [Add additional storage accounts to HDInsight](../hdinsight-hadoop-add-storage.md).
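
One way to find that storage account (a suggestion, not part of the original procedure) is Hive's `DESCRIBE FORMATTED` statement, whose output includes a `Location:` field with the storage path. The table name here is hypothetical:

```sql
-- The Location field in the output shows where the data lives, in a form like:
--   wasbs://mycontainer@mystorageaccount.blob.core.windows.net/hive/warehouse/sales
describe formatted sales;
```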

## Deploy a new HDInsight 4.0 cluster and connect to the new metastore

After the schema upgrade is complete, deploy a new HDInsight 4.0 cluster and connect it to the upgraded metastore. If you've already deployed HDInsight 4.0, configure the cluster in Ambari so that it connects to the new metastore.

## Run the schema migration script from HDInsight 4.0

Tables are treated differently in HDInsight 3.6 and HDInsight 4.0. For this reason, you can't share the same tables for clusters of different versions. If you want to use HDInsight 3.6 at the same time as HDInsight 4.0, you must have separate copies of the data for each version.

Your Hive workload may include a mix of ACID and non-ACID tables. One key difference between Hive on HDInsight 3.6 (Hive 2) and Hive on HDInsight 4.0 (Hive 3) is ACID compliance for tables. In HDInsight 3.6, enabling Hive ACID compliance requires additional configuration, but in HDInsight 4.0 tables are ACID-compliant by default. The only action required before migration is to run a major compaction against each ACID table on the 3.6 cluster. From the Hive view or from Beeline, run the following query:

```sql
alter table myacidtable compact 'major';
```

This compaction is required because HDInsight 3.6 and HDInsight 4.0 ACID tables understand ACID deltas differently. Compaction enforces a clean slate that guarantees consistency. Section 4 of the [Hive migration documentation](https://docs.hortonworks.com/HDPDocuments/Ambari-2.7.3.0/bk_ambari-upgrade-major/content/prepare_hive_for_upgrade.html) contains guidance for bulk compaction of HDInsight 3.6 ACID tables.

Once you've completed the metastore migration and compaction steps, you can migrate the actual warehouse. After you complete the Hive warehouse migration, the HDInsight 4.0 warehouse will have the following properties:

|HDInsight 3.6 |HDInsight 4.0 |
|---|---|
|External tables|External tables|
|Non-transactional managed tables|External tables|
|Transactional managed tables|Managed tables|

You may need to adjust the properties of your warehouse before executing the migration. For example, if you expect some table to be accessed by a third party (such as an HDInsight 3.6 cluster), that table must be external once the migration is complete. In HDInsight 4.0, all managed tables are transactional. Therefore, managed tables in HDInsight 4.0 should be accessed only by HDInsight 4.0 clusters.
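
For example, one possible way (not prescribed by the original article) to keep a table consumable by HDInsight 3.6 is to convert it to an external table before running the migration. The table name here is hypothetical:

```sql
-- Mark the table as external so the migration tool leaves it external
-- instead of converting it into a transactional managed table.
alter table sales set tblproperties ('EXTERNAL'='TRUE');
```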

Once your table properties are set correctly, execute the Hive warehouse migration:

1. Connect to your cluster headnode using SSH. For instructions, see [Connect to HDInsight using SSH](../hdinsight-hadoop-linux-use-ssh-unix.md).
1. Open a login shell as the Hive user by running `sudo su - hive`.
1. Determine the data platform stack version by executing `ls /usr/hdp`. This displays a version string to use in the next command.
1. Execute the following command from the shell. Replace `STACK_VERSION` with the version string from the previous step:

    ```bash
    /usr/hdp/STACK_VERSION/hive/bin/hive --config /etc/hive/conf --service strictmanagedmigration --hiveconf hive.strict.managed.tables=true -m automatic --modifyManagedTables
    ```
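
If you'd rather not copy the version string by hand, you could capture it with command substitution. This sketch is an assumption, not part of the official procedure: it simulates the `/usr/hdp` layout in a local `hdp_demo` directory (the version string is made up), since `/usr/hdp` typically holds one version directory plus a `current` symlink. On a real headnode you'd list `/usr/hdp` itself:

```shell
# Simulate /usr/hdp: one version directory plus a "current" symlink.
mkdir -p hdp_demo/3.1.0.0-78
ln -sfn 3.1.0.0-78 hdp_demo/current

# Keep the first entry that isn't the "current" symlink.
STACK_VERSION=$(ls hdp_demo | grep -v '^current$' | head -n 1)
echo "$STACK_VERSION"
```

On the cluster, `STACK_VERSION=$(ls /usr/hdp | grep -v '^current$' | head -n 1)` would then let you write the migration command with `/usr/hdp/$STACK_VERSION/hive/bin/hive`.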

After the migration tool completes, your Hive warehouse will be ready for HDInsight 4.0.

> [!Important]
> Managed tables in HDInsight 4.0 (including tables migrated from 3.6) shouldn't be accessed by other services or applications, including HDInsight 3.6 clusters.

## Secure Hive across HDInsight versions

Since HDInsight 3.6, HDInsight integrates with Azure Active Directory using HDInsight Enterprise Security Package (ESP).

4. Navigate to the **Ranger Service Manager** panel in your HDInsight 4.0 cluster.
5. Navigate to the policy named **HIVE** and import the Ranger policy JSON from step 2.


## Check compatibility and modify code as needed in a test app

When migrating workloads such as existing programs and queries, check the release notes and documentation for changes, and apply the necessary changes. If your HDInsight 3.6 cluster uses a shared Spark and Hive metastore, [additional configuration using Hive Warehouse Connector](./apache-hive-warehouse-connector.md) is required.

## Deploy a new app for production

To switch to the new cluster, you can, for example, install a new client application and use it as your new production environment, or you can upgrade your existing client application and switch it to HDInsight 4.0.

## Switch HDInsight 4.0 to production

If differences accumulated in the metastore during testing, you'll need to apply those changes just before switching. In that case, you can export and import the metastore and then upgrade it again.

## Remove the old production environment

Once you've confirmed that the release is complete and fully operational, you can remove version 3.6 and the previous metastore. Make sure that everything is migrated before deleting the environment.
## Query execution across HDInsight versions

There are two ways to execute and debug Hive/LLAP queries within an HDInsight 3.6 cluster. HiveCLI provides a command-line experience, and the Tez view/Hive view provides a GUI-based workflow.

In HDInsight 4.0, HiveCLI has been replaced with Beeline. HiveCLI is a Thrift client for HiveServer1, and Beeline is a JDBC client that provides access to HiveServer2. Beeline can also be used to connect to any other JDBC-compatible database endpoint. Beeline is available out of the box on HDInsight 4.0 without any installation needed.

In HDInsight 3.6, the GUI client for interacting with the Hive server is the Ambari Hive View. HDInsight 4.0 replaces the Hive View with Hortonworks Data Analytics Studio (DAS). DAS doesn't ship with HDInsight clusters out of the box and isn't an officially supported package. However, DAS can be installed on the cluster by using a [script action](../hdinsight-hadoop-customize-cluster-linux.md) as follows:

|Property | Value |
|---|---|
|Script type|- Custom|
|Name|DAS|
|Bash script URI|`https://hdiconfigactions.blob.core.windows.net/dasinstaller/LaunchDASInstaller.sh`|
|Node type(s)|Head|

Wait 5 to 10 minutes, then launch Data Analytics Studio by using this URL: `https://CLUSTERNAME.azurehdinsight.net/das/`.

Once DAS is installed, if you don't see the queries you've run in the queries viewer, do the following steps:
