articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md
Lines changed: 62 additions & 47 deletions
Original file line number
Diff line number
Diff line change
@@ -6,20 +6,19 @@ ms.topic: troubleshooting
6
6
author: hrasheed-msft
7
7
ms.author: hrasheed
8
8
ms.reviewer: jasonh
9
-
ms.date: 08/20/2019
9
+
ms.date: 05/20/2020
10
10
---
11
-
12
11
# Apache ZooKeeper server fails to form a quorum in Azure HDInsight
13
12
14
-
This article describes troubleshooting steps and possible resolutions for issues related to zookeepers in Azure HDInsight clusters.
13
+
This article describes troubleshooting steps and possible resolutions for issues related to ZooKeeper servers in Azure HDInsight clusters.
15
14
16
15
## Symptoms
17
16
18
-
* Both the resource managers go to standby mode
19
-
* Namenodes are both in standby mode
20
-
* Spark / hive / yarn jobs or hive queries fail because of zookeeper connection failures
21
-
* LLAP daemons fail to start on secure spark or secure interactive hive clusters
22
-
17
+
* Both the resource managers go to standby mode
18
+
* Namenodes are both in standby mode
19
+
* Spark, Hive, and YARN jobs or Hive queries fail because of ZooKeeper connection failures
20
+
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
21
+
23
22
## Sample log
24
23
25
24
You may see an error message similar to:
@@ -33,58 +32,74 @@ Message
33
32
2020-05-05 03:17:08.3890350|Transitioning to standby state
34
33
```
35
34
36
-
## Contra indicators
35
+
## Related issues
37
36
38
-
* HA services like Yarn / NameNode / Livy can go down due to many reasons.
39
-
* Please confirm from the logs that it is related to zookeeper connections
40
-
* Please make sure that the issue happens repeatedly (do not do these mitigations for one off cases)
41
-
* Jobs can fail temporarily due to zookeeper connection issues
37
+
* High availability services like Yarn, NameNode, and Livy can go down for many reasons.
38
+
* Confirm from the logs that the failure is related to ZooKeeper connections
39
+
* Make sure that the issue happens repeatedly (do not use these solutions for one-off cases)
40
+
* Jobs can fail temporarily due to Zookeeper connection issues
42
41
43
-
## Further reading
44
42
45
-
[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
46
-
47
-
## Common causes
43
+
## Common causes for Zookeeper failure
48
44
49
45
* High CPU usage on the zookeeper servers
50
46
* In the Ambari UI, if you see near 100% sustained CPU usage on the ZooKeeper servers, the ZooKeeper sessions open during that time can expire and time out
51
47
* Zookeeper clients are reporting frequent timeouts
52
-
* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks
48
+
* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks.
53
49
54
50
## Check for zookeeper status
55
-
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
56
-
* Run the following command
57
-
*echo stat | nc {zk_host_ip} 2181 (or 2182)
58
-
* Port 2181 is the apache zookeeper instance
59
-
* Port 2182 is used by the HDI zookeeper (to provide HA for services that are not natively HA)
51
+
52
+
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53
+
* Run the following command: `echo stat | nc {zk_host_ip} 2181` (replace 2181 with 2182 to query the HDInsight ZooKeeper)
54
+
* Port 2181 is the Apache ZooKeeper instance
55
+
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
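The `stat` check above can also be scripted when several servers need to be polled. The following is a minimal Python sketch, assuming the servers answer ZooKeeper's four-letter `stat` command on the ports listed above. The function names `parse_stat` and `zk_stat` are illustrative and are not part of any HDInsight tooling.

```python
import socket


def parse_stat(reply: str) -> dict:
    """Collect the 'Key: value' lines of a `stat` reply into a dict."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(": ")
        # Skip per-client lines (indented) and lines without a value.
        if sep and not key.startswith(" "):
            fields[key.strip()] = value.strip()
    return fields


def zk_stat(host: str, port: int = 2181, timeout: float = 5.0) -> dict:
    """Send `stat` to one server and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return parse_stat(b"".join(chunks).decode("utf-8", errors="replace"))
```

For example, `zk_stat(zk_host_ip, 2182).get("Mode")` would query the HDInsight ZooKeeper. A server that is part of a quorum typically reports `leader` or `follower`; a reply with no `Mode` field suggests the server is not currently serving requests.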
60
56
61
57
## CPU load peaks every hour
62
-
* Login to the zookeeper server and check the /etc/crontab
63
-
* Are there any hourly jobs running at this time?
64
-
* If so, randomize the start time across different zookeeper servers
58
+
59
+
* Log in to the ZooKeeper server and check /etc/crontab
60
+
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
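To illustrate the crontab check, here is a rough Python sketch that lists entries running once every hour and picks a random minute for staggering. It assumes the standard system crontab layout (minute, hour, day of month, month, day of week, user, command); the helper names `hourly_entries` and `random_minute` are hypothetical.

```python
import random


def hourly_entries(crontab_text: str) -> list:
    """Return (minute, command) pairs for jobs that run once every hour."""
    entries = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 6)
        if len(fields) < 7:
            continue  # environment assignments such as SHELL=/bin/sh
        minute, hour, dom, month, dow = fields[:5]
        # Hourly job: fixed minute, wildcard in every other time field.
        if minute.isdigit() and (hour, dom, month, dow) == ("*", "*", "*", "*"):
            entries.append((int(minute), fields[6]))
    return entries


def random_minute() -> int:
    """Pick a start minute (0-59) to use on one server."""
    return random.randrange(60)
```

Running `hourly_entries(open("/etc/crontab").read())` on each ZooKeeper server shows whether the same minute is used everywhere. Choosing a different `random_minute()` per server is one way to spread the load.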
65
61
66
62
## Purging old snapshots
67
-
* HDI Zookeepers are configured to auto purge old snapshots
68
-
* By default, last 30 snapshots are retained
69
-
* This is controlled by the configuration key autopurge.snapRetainCount
70
-
* /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
71
-
* /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
72
-
* Set this to a value =3 and restart the zookeeper servers
73
-
* Hadoop zookeeper can be restarted through Ambari
74
-
* HDI zookeeper has to be stopped manually and restarted manually
75
-
* sudo lsof -i :2182 will give you the process id to kill
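Before changing the retention setting, it can help to confirm what each configuration file currently contains. The sketch below is one possible way to read `autopurge.snapRetainCount` from the text of a `zoo.cfg` file, assuming the usual `key=value` layout. The function names are illustrative only.

```python
from typing import Optional


def read_zoo_cfg(text: str) -> dict:
    """Parse key=value lines, ignoring blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def snap_retain_count(text: str) -> Optional[int]:
    """Return autopurge.snapRetainCount, or None if the key is not set."""
    value = read_zoo_cfg(text).get("autopurge.snapRetainCount")
    return int(value) if value is not None else None
```

For example, `snap_retain_count(open("/etc/zookeeper/conf/zoo.cfg").read())` reads the Hadoop ZooKeeper setting, and the same call with `/etc/hdinsight-zookeeper/conf/zoo.cfg` reads the HDInsight ZooKeeper setting.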