# Apache ZooKeeper server fails to form a quorum in Azure HDInsight

This article describes troubleshooting steps and possible resolutions for ZooKeeper-related issues in Azure HDInsight clusters.

## Symptoms

* Both Resource Managers go to standby mode
* Both Name Nodes are in standby mode
* Spark, Hive, and Yarn jobs or Hive queries fail because of ZooKeeper connection failures
* LLAP daemons fail to start on secure Spark or secure Interactive Hive clusters

You may see an error message similar to:

```
2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
Message
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
...
2020-05-05 03:17:08.3890350|State store operation failed
2020-05-05 03:17:08.3890350|Transitioning to standby state
```

## Contraindications

* HA services like Yarn, NameNode, and Livy can go down for many reasons.
* Confirm from the logs that the failures are related to ZooKeeper connections.
* Make sure that the issue happens repeatedly (don't apply these mitigations for one-off cases).
* Jobs can fail temporarily because of transient ZooKeeper connection issues.
## Further reading

[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations)
## Common causes

* High CPU usage on the ZooKeeper servers
  * In the Ambari UI, if you see near 100% sustained CPU usage on the ZooKeeper servers, the ZooKeeper sessions open during that time can expire and time out
* ZooKeeper is so busy consolidating snapshots that it doesn't respond to client requests in time
* ZooKeeper servers have a sustained CPU load of 5 or above (as seen in the Ambari UI)

## Check the ZooKeeper status

* Find the ZooKeeper servers from the `/etc/hosts` file or from the Ambari UI
* Run the following command:
  * `echo stat | nc {zk_host_ip} 2181` (or 2182)
  * Port 2181 is the Apache ZooKeeper instance
  * Port 2182 is used by the HDInsight ZooKeeper (to provide HA for services that aren't natively HA)
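The `stat` check above can be scripted across the whole quorum. A minimal sketch, assuming three placeholder hostnames `zk1`, `zk2`, `zk3` (substitute the hosts you found in `/etc/hosts` or the Ambari UI):

```shell
#!/usr/bin/env bash
# Sketch: run the ZooKeeper "stat" four-letter command against every
# quorum member on both ports. zk1/zk2/zk3 are placeholder hostnames.

# Pull the "Mode:" line (leader / follower / standalone) out of stat output.
zk_mode() {
  printf '%s\n' "$1" | sed -n 's/^Mode: //p'
}

check_quorum() {
  for host in zk1 zk2 zk3; do        # replace with your ZooKeeper hosts
    for port in 2181 2182; do        # 2181 = Apache ZK, 2182 = HDInsight ZK
      out=$(echo stat | nc -w 5 "$host" "$port" 2>/dev/null)
      if [ -n "$out" ]; then
        echo "$host:$port -> $(zk_mode "$out")"
      else
        echo "$host:$port -> no response"
      fi
    done
  done
}
# check_quorum   # uncomment to run against a live cluster
```

A healthy quorum shows one `leader` and the rest `follower`; hosts that print `no response` are the ones to investigate.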
## CPU load peaks every hour

* Sign in to the ZooKeeper server and check `/etc/crontab`.
* Are there any hourly jobs running at this time?
* If so, randomize the start times across the different ZooKeeper servers.
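One way to stagger the jobs is to pick a different random minute on each server. A sketch, where `/usr/local/bin/hourly-task.sh` is a hypothetical stand-in for whatever job `/etc/crontab` shows:

```shell
#!/usr/bin/env bash
# Sketch: generate a per-server random minute offset for an hourly cron job
# so the ZooKeeper hosts don't all spike at the same time.
# /usr/local/bin/hourly-task.sh is a hypothetical placeholder.
minute=$(( RANDOM % 60 ))
echo "${minute} * * * * root /usr/local/bin/hourly-task.sh"
```

Run this once on each ZooKeeper server and paste the emitted line into that server's `/etc/crontab` in place of the existing entry.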
## Purging old snapshots

* HDInsight ZooKeepers are configured to auto-purge old snapshots.
* By default, the last 30 snapshots are retained.
* This is controlled by the configuration key `autopurge.snapRetainCount` in:
  * `/etc/zookeeper/conf/zoo.cfg` for the Hadoop ZooKeeper
  * `/etc/hdinsight-zookeeper/conf/zoo.cfg` for the HDInsight ZooKeeper
* Set `autopurge.snapRetainCount` to a value >= 3 and restart the ZooKeeper servers.
  * The Hadoop ZooKeeper can be restarted through Ambari.
  * The HDInsight ZooKeeper has to be stopped and restarted manually.
  * `sudo lsof -i :2182` will give you the process ID to kill.
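For reference, the relevant `zoo.cfg` entries might look like the following sketch. `autopurge.purgeInterval` is the standard ZooKeeper companion setting (in hours) that must be non-zero for auto purge to run at all; the exact values on your cluster may differ:

```
# zoo.cfg autopurge settings (sketch)
# Number of most recent snapshots (and matching transaction logs) to retain;
# the server enforces a minimum of 3
autopurge.snapRetainCount=3
# How often, in hours, to run the purge task; 0 disables auto purge
autopurge.purgeInterval=1
```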