Commit cac95cc

Merge pull request #251769 from v-akarnase/patch-24
Update hdinsight-high-availability-components.md
2 parents 7a9272d + c807353 commit cac95cc

1 file changed: +11 -11 lines changed

articles/hdinsight/hdinsight-high-availability-components.md

Lines changed: 11 additions & 11 deletions
@@ -3,11 +3,11 @@ title: High availability components in Azure HDInsight
 description: Overview of the various high availability components used by HDInsight clusters.
 ms.service: hdinsight
 ms.topic: conceptual
-ms.date: 04/28/2022
+ms.date: 09/28/2023
 ---
 # High availability services supported by Azure HDInsight

-In order to provide you with optimal levels of availability for your analytics components, HDInsight was developed with a unique architecture for ensuring high availability (HA) of critical services. Some components of this architecture were developed by Microsoft to provide automatic failover. Other components are standard Apache components that are deployed to support specific services. This article explains the architecture of the HA service model in HDInsight, how HDInsight supports failover for HA services, and best practices to recover from other service interruptions.
+In order to provide you with optimal levels of availability for your analytics components, HDInsight was developed with a unique architecture for ensuring high availability (HA) of critical services. Microsoft developed some components of this architecture to provide automatic failover. Other components are standard Apache components that are deployed to support specific services. This article explains the architecture of the HA service model in HDInsight, how HDInsight supports failover for HA services, and best practices to recover from other service interruptions.

 > [!NOTE]
 > This article contains references to the term *slave*, a term that Microsoft no longer uses. When the term is removed from the software, we'll remove it from this article.
@@ -21,7 +21,7 @@ HDInsight provides customized infrastructure to ensure that four primary service
 - Job History Server for Hadoop MapReduce
 - Apache Livy

-This infrastructure consists of a number of services and software components, some of which are designed by Microsoft. The following components are unique to the HDInsight platform:
+This infrastructure consists of many services and software components, some of which are designed by Microsoft. The following components are unique to the HDInsight platform:

 - Slave failover controller
 - Master failover controller
@@ -36,7 +36,7 @@ There are also other high availability services, which are supported by open-sou
 - YARN ResourceManager
 - HBase Master

-The following sections will provide more detail about how these services work together.
+The following sections provide more detail about how these services work together.

 ## HDInsight high availability services

@@ -56,15 +56,15 @@ Microsoft provides support for the four Apache services in the following table i

 Each HDInsight cluster has two headnodes in active and standby modes, respectively. The HDInsight HA services run on headnodes only. These services should always be running on the active headnode, and stopped and put in maintenance mode on the standby headnode.

-To maintain the correct states of HA services and provide a fast failover, HDInsight utilizes Apache ZooKeeper, which is a coordination service for distributed applications, to conduct active headnode election. HDInsight also provisions a few background Java processes, which coordinate the failover procedure for HDInsight HA services. These services are the following: the master failover controller, the slave failover controller, the *master-ha-service*, and the *slave-ha-service*.
+To maintain the correct states of HA services and provide a fast failover, HDInsight utilizes Apache ZooKeeper, which is a coordination service for distributed applications, to conduct active headnode election. HDInsight also provisions a few background Java processes, which coordinate the failover procedure for HDInsight HA services. These services are: the master failover controller, the slave failover controller, the *master-ha-service*, and the *slave-ha-service*.

 ### Apache ZooKeeper

-Apache ZooKeeper is a high-performance coordination service for distributed applications. In production, ZooKeeper usually runs in replicated mode where a replicated group of ZooKeeper servers form a quorum. Each HDInsight cluster has three ZooKeeper nodes that allow three ZooKeeper servers to form a quorum. HDInsight has two ZooKeeper quorums running in parallel with each other. One quorum decides the active headnode in a cluster on which HDInsight HA services should run. Another quorum is used to coordinate HA services provided by Apache, as detailed in later sections.
+Apache ZooKeeper is a high-performance coordination service for distributed applications. In production, ZooKeeper usually runs in replicated mode where a replicated group of ZooKeeper servers forms a quorum. Each HDInsight cluster has three ZooKeeper nodes that allow three ZooKeeper servers to form a quorum. HDInsight has two ZooKeeper quorums running in parallel with each other. One quorum decides the active headnode in a cluster on which HDInsight HA services should run. Another quorum is used to coordinate HA services provided by Apache, as detailed in later sections.

 ### Slave failover controller

-The slave failover controller runs on every node in an HDInsight cluster. This controller is responsible for starting the Ambari agent and *slave-ha-service* on each node. It periodically queries the first ZooKeeper quorum about the active headnode. When the active and standby headnodes change, the slave failover controller performs the following:
+The slave failover controller runs on every node in an HDInsight cluster. This controller is responsible for starting the Ambari agent and *slave-ha-service* on each node. It periodically queries the first ZooKeeper quorum about the active headnode. When the active and standby headnodes change, the slave failover controller performs the following steps:

 1. Updates the host configuration file.
 1. Restarts Ambari agent.
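
Note on the hunk above: the poll-and-react behavior described for the slave failover controller can be pictured with a small sketch. This is only an illustration under stated assumptions, not HDInsight code. The names `poll_once`, `query_active_headnode`, and `steps` are hypothetical, the quorum is replaced by an in-memory iterator, and only the two steps visible in this hunk are modeled (the hunk cuts the list off, so later steps are not shown). Skipping the reaction on the very first observation is also an assumption.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical stand-in type: a reaction step receives the new active headnode.
Step = Callable[[str], None]


def poll_once(last_active: Optional[str],
              query_active_headnode: Callable[[], str],
              steps: Iterable[Step]) -> str:
    """Run one polling iteration and return the headnode now seen as active."""
    current = query_active_headnode()
    if last_active is not None and current != last_active:
        # The active headnode changed: run the reaction steps in order.
        for step in steps:
            step(current)
    return current


# Demo: an iterator stands in for successive answers from the first quorum.
answers = iter(["hn0", "hn0", "hn1"])
log: List[str] = []
steps = [
    lambda hn: log.append(f"update host configuration -> {hn}"),
    lambda hn: log.append("restart Ambari agent"),
]

active: Optional[str] = None
for _ in range(3):
    active = poll_once(active, lambda: next(answers), steps)

print(active, log)
```

In this toy run the first two polls see the same headnode and do nothing, and the third poll sees a change and runs both steps once.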
@@ -87,15 +87,15 @@ The master-ha-service only runs on the active headnode, it stops the HDInsight H

 :::image type="content" source="./media/hdinsight-high-availability-components/failover-steps.png" alt-text="failover process" border="false":::

-A health monitor runs on each headnode along with the master failover controller to send heartbeat notifications to the Zookeeper quorum. The headnode is regarded as an HA service in this scenario. The health monitor checks to see if each high availability service is healthy and if it's ready to join in the leadership election. If yes, this headnode will compete in the election. If not, it will quit the election until it becomes ready again.
+A health monitor runs on each headnode along with the master failover controller to send heartbeat notifications to the Zookeeper quorum. The headnode is regarded as an HA service in this scenario. The health monitor checks to see if each high availability service is healthy and if it's ready to join in the leadership election. If yes, this headnode competes in the election. If not, it quits the election until it becomes ready again.

-If the standby headnode ever achieves leadership and becomes active (such as in the case of a failure with the previous active node), its master failover controller will start all HDInsight HA services on it. The master failover controller will also stop these services on the other headnode.
+If the standby headnode ever achieves leadership and becomes active (such as in the case of a failure with the previous active node), its master failover controller starts all HDInsight HA services on it. The master failover controller stops these services on the other headnode.

 For HDInsight HA service failures, such as a service being down or unhealthy, the master failover controller should automatically restart or stop the services according to the headnode status. Users shouldn't manually start HDInsight HA services on both head nodes. Instead, allow automatic or manual failover to help the service recover.

 ### Inadvertent manual intervention

-HDInsight HA services should only run on the active headnode, and will be automatically restarted when necessary. Since individual HA services don't have their own health monitor, failover can't be triggered at the level of the individual service. Failover is ensured at the node level and not at the service level.
+HDInsight HA services should only run on the active headnode, and they automatically restart when necessary. Since individual HA services don't have their own health monitor, failover can't be triggered at the level of the individual service. Failover is ensured at the node level and not at the service level.

 ### Some known issues

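Note on the hunk above: the node-level failover rule can be sketched as two small functions. This is a minimal model under assumptions, not the master failover controller itself. The function names and the `"started"`/`"stopped"` labels are hypothetical, and the service list uses only the two services named elsewhere in this diff.

```python
from typing import Dict, List


def ready_for_election(service_health: Dict[str, bool]) -> bool:
    """A headnode competes only if every HA service on it reports healthy."""
    return all(service_health.values())


def desired_service_state(active: str,
                          headnodes: List[str],
                          services: List[str]) -> Dict[str, Dict[str, str]]:
    """HA services run on the active headnode and are stopped on the other."""
    return {
        hn: {svc: ("started" if hn == active else "stopped") for svc in services}
        for hn in headnodes
    }


services = ["Job History Server", "Apache Livy"]
health = {
    "hn0": {"Job History Server": False, "Apache Livy": True},  # one unhealthy service
    "hn1": {"Job History Server": True, "Apache Livy": True},
}

# hn0 quits the election because one of its services is unhealthy.
candidates = [hn for hn in health if ready_for_election(health[hn])]
state = desired_service_state(candidates[0], list(health), services)
print(candidates, state)
```

The sketch reflects the point made in the text: a single unhealthy service removes the whole headnode from the election, so failover happens per node rather than per service.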
@@ -109,7 +109,7 @@ Apache provides high availability for HDFS NameNode, YARN ResourceManager, and H

 ### Hadoop Distributed File System (HDFS) NameNode

-HDInsight clusters based on Apache Hadoop 2.0 or higher provide NameNode high availability. There are two NameNodes running on the headnodes, which are configured for automatic failover. The NameNodes use the *ZKFailoverController* to communicate with Zookeeper to elect for active/standby status. The *ZKFailoverController* runs on both headnodes, and works in the same way as the master failover controller above.
+HDInsight clusters based on Apache Hadoop 2.0 or higher provide NameNode high availability. There are two NameNodes running on the headnodes, which are configured for automatic failover. The NameNodes use the *ZKFailoverController* to communicate with Zookeeper to elect for active/standby status. The *ZKFailoverController* runs on both headnodes, and works in the same way as the master failover controller.

 The second Zookeeper quorum is independent of the first quorum, so the active NameNode may not run on the active headnode. When the active NameNode is dead or unhealthy, the standby NameNode wins the election and becomes active.

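Note on the hunk above: because the two quorums run separate elections, the active NameNode and the active headnode can end up on different machines. The toy model below illustrates that outcome. It is an assumption-laden sketch, not how *ZKFailoverController* is implemented. The `elect` function and the node names `hn0` and `hn1` are hypothetical.

```python
from typing import Dict, Optional


def elect(current: Optional[str], healthy: Dict[str, bool]) -> Optional[str]:
    """Keep the current leader while healthy; otherwise pick a healthy candidate."""
    if current is not None and healthy.get(current, False):
        return current
    for node, ok in healthy.items():
        if ok:
            return node
    return None  # no healthy candidate available


# First quorum: headnode election. Both headnodes are healthy, so hn0 stays active.
active_headnode = elect("hn0", {"hn0": True, "hn1": True})

# Second quorum: NameNode election. The NameNode on hn0 is unhealthy,
# so the standby NameNode on hn1 wins and becomes active.
active_namenode = elect("hn0", {"hn0": False, "hn1": True})

print(active_headnode, active_namenode)
```

Each call sees only its own health inputs, which is why the two results can diverge.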