articles/hdinsight/high-availability-components.md
16 additions & 16 deletions
@@ -14,21 +14,21 @@ ms.date: 11/06/2019
## High availability infrastructure
-
HDInsight provides customized infrastructure to ensure that four primary services are high availability:
+
HDInsight provides customized infrastructure to ensure that four primary services are highly available, with automatic failover capabilities:
- Apache Ambari server
- Application Timeline Server for Apache YARN
- Job History Server for Hadoop MapReduce
- Apache Livy
-
To achieve this level of dependability, HDInsight has developed a unique reliability infrastructure to support these services and provide automatic failover capabilities. This infrastructure consists of a number of services and software components, some of which are designed by HDInsight. The following components are unique to the HDInsight platform:
+
This infrastructure consists of a number of services and software components, some of which are designed by HDInsight. The following components are unique to the HDInsight platform:
- Slave failover controller
- Master failover controller
- Slave high availability service
- Master high availability service
-
There are also other high availability services, which are supported by open source Apache reliability services. These components are also present on HDInsight clusters, but are not developed by HDInsight:
+
There are also other high availability services, which are supported by open source Apache reliability components. These components are also present on HDInsight clusters, but are not developed by HDInsight:
- Hadoop Distributed File System (HDFS) NameNode
- YARN Resource Manager
@@ -38,7 +38,7 @@ The following sections will provide more detail about how these services work to
## HDInsight High Availability Services
-
Microsoft provides support for the four Apache services in the following table in HDInsight clusters. To distinguish them from availability services provided by Apache, they are called HDInsight HA services.
+
Microsoft provides support for the four Apache services in the following table in HDInsight clusters. To distinguish them from availability services supported by components from Apache, they are called *HDInsight HA services*.
@@ -52,22 +52,22 @@ Microsoft provides support for the four Apache services in the following table i
### Architecture
-
Each HDInsight cluster has two headnodes in active/standby modes, respectively. The HDInsight HA services run on headnodes only. These services should always be running on the active headnode, and stopped and put in maintenance mode on the standby headnode.
+
Each HDInsight cluster has two headnodes in active and standby modes, respectively. The HDInsight HA services run on headnodes only. These services should always be running on the active headnode, and stopped and put in maintenance mode on the standby headnode.
-
To maintain the correct states of HA services and provide a fast failover, HDInsight utilizes Apache ZooKeeper, which is a coordination service for distributed applications, to conduct active headnode election. HDInsight also provisions master failover controller, slave failover controller, master-ha-service, and slave-ha-service, which are Java processes running in background to coordinate the failover procedure for HDInsight HA services.
+
To maintain the correct states of HA services and provide a fast failover, HDInsight uses Apache ZooKeeper, which is a coordination service for distributed applications, to conduct active headnode election. HDInsight also provisions a few background Java processes, which coordinate the failover procedure for HDInsight HA services. These processes are the following: the master failover controller, the slave failover controller, the *master-ha-service*, and the *slave-ha-service*.
### Apache ZooKeeper
Apache ZooKeeper is a high-performance coordination service for distributed applications. In production, ZooKeeper usually runs in replicated mode, where a replicated group of ZooKeeper servers forms a quorum. Each HDInsight cluster has three ZooKeeper nodes, which allow three ZooKeeper servers to form a quorum. HDInsight has two ZooKeeper quorums running in parallel with each other. One quorum decides the active headnode in a cluster on which HDInsight HA services should run. The other quorum is used to coordinate HA services provided by Apache, as detailed in later sections.
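A ZooKeeper ensemble stays available only while a strict majority of its servers can reach each other, which is why three servers are a common minimum. The following minimal sketch works through that arithmetic. The function names are illustrative assumptions and are not part of ZooKeeper or HDInsight.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Print ensemble size, majority size, and tolerated failures.
    for size in (1, 3, 5):
        print(size, quorum_size(size), tolerated_failures(size))
```

Under this rule, a three-server ensemble such as the one described above needs two servers to agree and can lose one server without losing the quorum.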
### Slave failover controller
-
The slave failover controller runs on every node in an HDInsight cluster. This controller is responsible for starting the Ambari agent and `slave-ha-service` on each node. It periodically queries the first ZooKeeper quorum about the active headnode. When the active and standby headnodes change, the slave failover controller performs the following:
+
The slave failover controller runs on every node in an HDInsight cluster. This controller is responsible for starting the Ambari agent and *slave-ha-service* on each node. It periodically queries the first ZooKeeper quorum about the active headnode. When the active and standby headnodes change, the slave failover controller performs the following:
1. Updates the host configuration file.
1. Restarts Ambari agent.
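The polling behavior and the two steps above can be sketched as follows. This is a hypothetical model, not the HDInsight implementation: the class name and the three injected callbacks are assumptions made for illustration, and the real controller's interfaces are not described in this article.

```python
from typing import Callable, Optional


class SlaveFailoverControllerSketch:
    """Hypothetical model of the polling logic; not the HDInsight code."""

    def __init__(
        self,
        query_active_headnode: Callable[[], str],
        update_host_config: Callable[[str], None],
        restart_ambari_agent: Callable[[], None],
    ) -> None:
        # All three callbacks are stand-ins supplied by the caller.
        self._query = query_active_headnode
        self._update = update_host_config
        self._restart = restart_ambari_agent
        self._last_active: Optional[str] = None

    def poll_once(self) -> bool:
        """Query the quorum once; return True if a headnode change was handled."""
        active = self._query()
        if active == self._last_active:
            return False  # Nothing changed since the previous poll.
        self._last_active = active
        self._update(active)  # Step 1: update the host configuration file.
        self._restart()       # Step 2: restart the Ambari agent.
        return True
```

In this sketch the first poll also counts as a change, because no active headnode has been recorded yet. A caller would invoke `poll_once()` on a timer to model the periodic query.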
-
The `slave-ha-service` is responsible for stopping the HDInsight HA services (except Ambari server) on the standby headnode.
+
The *slave-ha-service* is responsible for stopping the HDInsight HA services (except Ambari server) on the standby headnode.
### Master failover controller
@@ -83,15 +83,15 @@ The master-ha-service only runs on the active headnode, it stops the HDInsight H
### The failover process
-
A health monitor runs along with each master failover controller to perform heartbeats for the headnodes. The headnode is regarded as an HA service in this scenario. The health monitor checks if each high availability service is healthy and if it's ready to join in the leadership election. If yes, this HA service will compete in the election. If not, it will quit the election until it becomes ready again.
+
A health monitor runs on each headnode along with the master failover controller to send heartbeat notifications to the ZooKeeper quorum. The headnode is regarded as an HA service in this scenario. The health monitor checks whether each high availability service is healthy and whether it's ready to join the leadership election. If so, this headnode competes in the election. If not, it withdraws from the election until it becomes ready again.
-
For active headnode failures, such as a headnode crash or reboot, if the standby headnode achieves the leadership and becomes active, its master failover controller will start all HDInsight HA services on it. The master failover controller will also stop these services on the other headnode.
+
If the standby headnode ever achieves leadership and becomes active (such as in the case of a failure with the previous active node), its master failover controller will start all HDInsight HA services on it. The master failover controller will also stop these services on the other headnode.
For HDInsight HA service failures, such as a service being down or unhealthy, the master failover controller should automatically restart or stop the services according to the headnode status. Users shouldn't manually start HDInsight HA services on both headnodes. Instead, allow automatic or manual failover to help the service recover.
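The election and failover behavior described in this section can be modeled with a small in-memory simulation. It is a simplified sketch under stated assumptions: it does not talk to ZooKeeper, the service labels and function names are hypothetical, and the tie-breaking rule (the first healthy headnode wins) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

# Illustrative labels for the four HDInsight HA services.
HA_SERVICES = ("ambari-server", "timeline-server", "job-history-server", "livy")


@dataclass
class Headnode:
    name: str
    healthy: bool = True
    running: Set[str] = field(default_factory=set)


def elect_active(headnodes: List[Headnode], current: Optional[str]) -> Optional[str]:
    """Keep the current leader while it is healthy; otherwise pick a healthy headnode."""
    candidates = [node.name for node in headnodes if node.healthy]
    if current in candidates:
        return current
    return candidates[0] if candidates else None


def apply_failover(headnodes: List[Headnode], active: Optional[str]) -> None:
    """Run the HA services on the active headnode only and stop them elsewhere."""
    for node in headnodes:
        node.running = set(HA_SERVICES) if node.name == active else set()


if __name__ == "__main__":
    hn0, hn1 = Headnode("hn0"), Headnode("hn1")
    nodes = [hn0, hn1]
    active = elect_active(nodes, None)
    apply_failover(nodes, active)
    hn0.healthy = False  # Simulate a crash of the active headnode.
    active = elect_active(nodes, active)
    apply_failover(nodes, active)
    print(active, sorted(hn1.running), sorted(hn0.running))
```

Because only headnodes take part in the election, the sketch fails over at the node level: an unhealthy headnode withdraws, and a recovered headnode does not reclaim leadership from the current leader.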
### Inadvertent manual intervention
-
It's expected that HDInsight HA services should only be running on the active headnode, and automatically restarted when necessary. Since individual HA services don't have their own health monitor, failover can't be triggered at the level of the individual service. Failover is ensured at the node level and not at the service level.
+
HDInsight HA services should only run on the active headnode, and will be automatically restarted when necessary. Since individual HA services don't have their own health monitor, failover can't be triggered at the level of the individual service. Failover is ensured at the node level and not at the service level.
### Some known issues
@@ -101,21 +101,21 @@ It's expected that HDInsight HA services should only be running on the active he
## Apache High Availability Services
-
Apache provides high availability for HDFS NameNode, YARN Resource Manager, and HBase Master, which are also available in HDInsight clusters. Unlike HDInsight HA services, they are supported in ESP clusters. Apache HA services communicate with the second ZooKeeper quorum (described in the above section) to elect active/standby states and conduct automatic failover. Following sections detail how these services work.
+
Apache provides high availability for HDFS NameNode, YARN Resource Manager, and HBase Master, which are also available in HDInsight clusters. Unlike HDInsight HA services, they are supported in ESP clusters. Apache HA services communicate with the second ZooKeeper quorum (described in the above section) to elect active/standby states and conduct automatic failover. The following sections detail how these services work.
### Hadoop Distributed File System (HDFS) NameNode
-
HDInsight clusters based on Apache Hadoop 2.0 or higher provide NameNode high availability. There are two NameNodes running on two headnodes, respectively, which are configured for automatic failover. The NameNodes use ZKFailoverController to communicate with Zookeeper to elect for active/standby status. ZKFailoverController runs on both headnodes, and works in the same way as the master failover controller above.
+
HDInsight clusters based on Apache Hadoop 2.0 or higher provide NameNode high availability. There are two NameNodes running on the headnodes, which are configured for automatic failover. The NameNodes use the *ZKFailoverController* to communicate with ZooKeeper to elect the active and standby NameNodes. The *ZKFailoverController* runs on both headnodes, and works in the same way as the master failover controller above.
The second ZooKeeper quorum is independent of the first quorum, so the active NameNode may not run on the active headnode. When the active NameNode fails or becomes unhealthy, the standby NameNode wins the election and becomes active.
### YARN Resource Manager
-
HDInsight clusters based on Apache Hadoop 2.4 or higher support YARN Resource Manager high availability. There are two resource managers, rm1 and rm2, running on headnode-0 and headnode-1, respectively. Like NameNode, YARN Resource Manager is also configured for automatic failover. Another Resource Manager is automatically elected to be active when the current active resource manager goes down or unresponsive.
+
HDInsight clusters based on Apache Hadoop 2.4 or higher support YARN Resource Manager high availability. There are two resource managers, rm1 and rm2, running on headnode0 and headnode1, respectively. Like NameNode, YARN Resource Manager is also configured for automatic failover. The other Resource Manager is automatically elected to be active when the current active resource manager goes down or becomes unresponsive.
-
YARN Resource Manager uses its embedded ActiveStandbyElector as a failure detector and leader elector. Unlike HDFS NodeManager, YARN Resource Manager doesn't need a separate ZKFC daemon. The active resource manager writes its states into Apache Zookeeper.
+
YARN Resource Manager uses its embedded *ActiveStandbyElector* as a failure detector and leader elector. Unlike the HDFS NameNode, YARN Resource Manager doesn't need a separate ZKFC daemon. The active resource manager writes its state to Apache ZooKeeper.
-
YARN Resource Manager high availability is independent from NameNode and HDInsight HA services, the active resource manager may not run on active headnode or headnode that the active NameNode is running. For more information about YARN Resource Manager high availability, see [Resource Manager High Availability](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html).
+
The high availability of the YARN Resource Manager is independent of the NameNode and of the HDInsight HA services. The active resource manager may not run on the active headnode or on the headnode where the active NameNode is running. For more information about YARN Resource Manager high availability, see [Resource Manager High Availability](https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html).