Commit e71b8a9

Merge pull request #92296 from dagiro/freshness25
freshness25
2 parents 4e53e23 + 8ecd1ba commit e71b8a9

3 files changed

+40
-25
lines changed

articles/hdinsight/hdinsight-use-external-metadata-stores.md

Lines changed: 40 additions & 25 deletions
@@ -2,12 +2,12 @@
22
title: Use external metadata stores - Azure HDInsight
33
description: Use external metadata stores with Azure HDInsight clusters, and best practices.
44
author: hrasheed-msft
5-
ms.reviewer: jasonh
65
ms.author: hrasheed
6+
ms.reviewer: jasonh
77
ms.service: hdinsight
88
ms.custom: hdinsightactive
99
ms.topic: conceptual
10-
ms.date: 05/27/2019
10+
ms.date: 10/17/2019
1111
---
1212

1313
# Use external metadata stores in Azure HDInsight
@@ -24,30 +24,37 @@ There are two ways you can set up a metastore for your HDInsight clusters:
2424
## Default metastore
2525

2626
By default, HDInsight creates a metastore with every cluster type. You can instead specify a custom metastore. The default metastore includes the following considerations:
27-
- No additional cost. HDInsight creates a metastore with every cluster type without any additional cost to you.
28-
- Each default metastore is part of the cluster lifecycle. When you delete a cluster, the corresponding metastore and metadata are also deleted.
29-
- You cannot share the default metastore with other clusters.
30-
- The default metastore uses the basic Azure SQL DB, which has a five DTU (database transaction unit) limit.
31-
This default metastore is typically used for relatively simple workloads that don't require multiple clusters and don’t need metadata preserved beyond the cluster's lifecycle.
3227

28+
* No additional cost. HDInsight creates a metastore with every cluster type without any additional cost to you.
29+
30+
* Each default metastore is part of the cluster lifecycle. When you delete a cluster, the corresponding metastore and metadata are also deleted.
31+
32+
* You can't share the default metastore with other clusters.
33+
34+
* The default metastore uses the basic Azure SQL DB, which has a five DTU (database transaction unit) limit.
35+
This default metastore is typically used for relatively simple workloads that don't require multiple clusters and don’t need metadata preserved beyond the cluster's lifecycle.
3336

3437
## Custom metastore
3538

3639
HDInsight also supports custom metastores, which are recommended for production clusters:
37-
- You specify your own Azure SQL Database as the metastore.
38-
- The lifecycle of the metastore is not tied to a clusters lifecycle, so you can create and delete clusters without losing metadata. Metadata such as your Hive schemas will persist even after you delete and re-create the HDInsight cluster.
39-
- A custom metastore lets you attach multiple clusters and cluster types to that metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.
40-
- You pay for the cost of a metastore (Azure SQL DB) according to the performance level you choose.
41-
- You can scale up the metastore as needed.
4240

43-
![HDInsight Hive Metadata Store Use Case](./media/hdinsight-use-external-metadata-stores/metadata-store-use-case.png)
41+
* You specify your own Azure SQL Database as the metastore.
42+
43+
* The lifecycle of the metastore isn't tied to a cluster's lifecycle, so you can create and delete clusters without losing metadata. Metadata such as your Hive schemas will persist even after you delete and re-create the HDInsight cluster.
44+
45+
* A custom metastore lets you attach multiple clusters and cluster types to that metastore. For example, a single metastore can be shared across Interactive Query, Hive, and Spark clusters in HDInsight.
46+
47+
* You pay for the cost of a metastore (Azure SQL DB) according to the performance level you choose.
4448

49+
* You can scale up the metastore as needed.
50+
51+
![HDInsight Hive Metadata Store Use Case](./media/hdinsight-use-external-metadata-stores/metadata-store-use-case.png)
4552

4653
### Select a custom metastore during cluster creation
4754

4855
You can point your cluster to a previously created Azure SQL Database during cluster creation, or you can configure the SQL Database after the cluster is created. This option is specified with the **Storage > Metastore settings** while creating a new Hadoop, Spark, or interactive Hive cluster from Azure portal.
4956

50-
![HDInsight Hive Metadata Store Azure portal](./media/hdinsight-use-external-metadata-stores/metadata-store-azure-portal.png)
57+
![HDInsight Hive Metadata Store Azure portal](./media/hdinsight-use-external-metadata-stores/azure-portal-cluster-storage-metastore.png)
5158

5259
You can also add additional clusters to a custom metastore from Azure portal or from Ambari configurations (Hive > Advanced).
5360

@@ -57,22 +64,30 @@ You can also add additional clusters to a custom metastore from Azure portal or
5764

5865
Here are some general HDInsight Hive metastore best practices:
5966

60-
- Use a custom metastore whenever possible, to help separate compute resources (your running cluster) and metadata (stored in the metastore).
61-
- Start with an S2 tier, which provides 50 DTU and 250 GB of storage. If you see a bottleneck, you can scale the database up.
62-
- If you intend multiple HDInsight clusters to access separate data, use a separate database for the metastore on each cluster. If you share a metastore across multiple HDInsight clusters, it means that the clusters use the same metadata and underlying user data files.
63-
- Back up your custom metastore periodically. Azure SQL Database generates backups automatically, but the backup retention timeframe varies. For more information, see [Learn about automatic SQL Database backups](../sql-database/sql-database-automated-backups.md).
64-
- Locate your metastore and HDInsight cluster in the same region, for highest performance and lowest network egress charges.
65-
- Monitor your metastore for performance and availability using Azure SQL Database Monitoring tools, such as the Azure portal or Azure Monitor logs.
66-
- When a new, higher version of Azure HDInsight is created against an existing custom metastore database, the system upgrades the schema of the metastore, which is irreversible without restoring the database from backup.
67-
- If you share a metastore across multiple clusters, ensure all the clusters are the same HDInsight version. Different Hive versions use different metastore database schemas. For example, you cannot share a metastore across Hive 1.2 and Hive 2.1 versioned clusters.
68-
- In HDInsight 4.0, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables. A table created by Spark resides in the Spark catalog. A table created by Hive resides in the Hive catalog. This is different than HDInsight 3.6 where Hive and Spark shared common catalog. Hive and Spark Integration in HDInsight 4.0 relies on Hive Warehouse Connector (HWC). HWC works as a bridge between Spark and Hive. [Learn about Hive Warehouse Connector](../hdinsight/interactive-query/apache-hive-warehouse-connector.md).
67+
* Use a custom metastore whenever possible, to help separate compute resources (your running cluster) and metadata (stored in the metastore).
68+
69+
* Start with an S2 tier, which provides 50 DTU and 250 GB of storage. If you see a bottleneck, you can scale the database up.
70+
71+
* If you intend multiple HDInsight clusters to access separate data, use a separate database for the metastore on each cluster. If you share a metastore across multiple HDInsight clusters, it means that the clusters use the same metadata and underlying user data files.
72+
73+
* Back up your custom metastore periodically. Azure SQL Database generates backups automatically, but the backup retention timeframe varies. For more information, see [Learn about automatic SQL Database backups](../sql-database/sql-database-automated-backups.md).
74+
75+
* Locate your metastore and HDInsight cluster in the same region, for highest performance and lowest network egress charges.
76+
77+
* Monitor your metastore for performance and availability using Azure SQL Database Monitoring tools, such as the Azure portal or Azure Monitor logs.
78+
79+
* When a new, higher version of Azure HDInsight is created against an existing custom metastore database, the system upgrades the schema of the metastore, which is irreversible without restoring the database from backup.
80+
81+
* If you share a metastore across multiple clusters, ensure all the clusters are the same HDInsight version. Different Hive versions use different metastore database schemas. For example, you can't share a metastore across Hive 1.2 and Hive 2.1 versioned clusters.
82+
83+
* In HDInsight 4.0, Spark and Hive use independent catalogs for accessing SparkSQL or Hive tables. A table created by Spark resides in the Spark catalog. A table created by Hive resides in the Hive catalog. This differs from HDInsight 3.6, where Hive and Spark shared a common catalog. Hive and Spark integration in HDInsight 4.0 relies on Hive Warehouse Connector (HWC). HWC works as a bridge between Spark and Hive. [Learn about Hive Warehouse Connector](../hdinsight/interactive-query/apache-hive-warehouse-connector.md).
6984

70-
## Apache Oozie Metastore
85+
## Apache Oozie metastore
7186

7287
Apache Oozie is a workflow coordination system that manages Hadoop jobs. Oozie supports Hadoop jobs for Apache MapReduce, Pig, Hive, and others. Oozie uses a metastore to store details about current and completed workflows. To increase performance when using Oozie, you can use Azure SQL Database as a custom metastore. The metastore can also provide access to Oozie job data after you delete your cluster.
7388

7489
For instructions on creating an Oozie metastore with Azure SQL Database, see [Use Apache Oozie for workflows](hdinsight-use-oozie-linux-mac.md).
7590

7691
## Next steps
7792

78-
- [Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more](./hdinsight-hadoop-provision-linux-clusters.md)
93+
* [Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more](./hdinsight-hadoop-provision-linux-clusters.md)
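
The best-practices list in this diff says that clusters sharing a metastore should all run the same HDInsight version, because different Hive versions use different metastore schemas. A minimal Python sketch of that check follows. The `Cluster` record and the `find_version_conflicts` helper are hypothetical names used only for illustration; they are not part of any HDInsight SDK, and the cluster inventory is assumed to come from wherever you already track your deployments.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Cluster:
    """Hypothetical inventory record for one HDInsight cluster."""
    name: str
    hdinsight_version: str  # for example "3.6" or "4.0"
    metastore_db: str       # name of the custom metastore database it uses


def find_version_conflicts(clusters: Iterable[Cluster]) -> Dict[str, List[str]]:
    """Return metastores that are shared by clusters of different versions.

    The result maps each conflicting metastore name to the sorted list of
    HDInsight versions attached to it. Metastores used by a single version
    are left out.
    """
    versions_by_store: Dict[str, set] = {}
    for cluster in clusters:
        versions_by_store.setdefault(cluster.metastore_db, set()).add(
            cluster.hdinsight_version
        )
    return {
        db: sorted(versions)
        for db, versions in versions_by_store.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    inventory = [
        Cluster("hive-a", "3.6", "meta1"),
        Cluster("spark-b", "4.0", "meta1"),
        Cluster("iq-c", "4.0", "meta2"),
    ]
    for db, versions in find_version_conflicts(inventory).items():
        print(f"{db} is shared across HDInsight versions: {', '.join(versions)}")
```

Running a check like this before attaching a new cluster is one way to avoid the irreversible schema upgrade described in the list; it only compares version strings and does not inspect the metastore database itself.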