
Commit 6e818ab

Zookeeper TSG update
Updated the symptoms, common causes, and the mitigations. Removed the steps around deleting the snapshot files.
1 parent 8e3e81b commit 6e818ab


articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 67 additions & 24 deletions
@@ -11,37 +11,80 @@ ms.date: 08/20/2019
# Apache ZooKeeper server fails to form a quorum in Azure HDInsight

-This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.
+This article describes troubleshooting steps and possible resolutions for ZooKeeper-related issues in Azure HDInsight clusters.

-## Issue
+## Symptoms

-Apache ZooKeeper server is unhealthy, symptoms could include: both Resource Managers/Name Nodes are in standby mode, simple HDFS operations do not work, `zkFailoverController` is stopped and cannot be started, Yarn/Spark/Livy jobs fail due to Zookeeper errors. LLAP Daemons may also fail to start on Secure Spark or Interactive Hive clusters. You may see an error message similar to:
+* Both Resource Managers go into standby mode
+* Both NameNodes go into standby mode
+* Spark, Hive, and YARN jobs or Hive queries fail because of ZooKeeper connection failures
+* LLAP daemons fail to start on secure Spark or secure Interactive Hive clusters

+## Sample log

-```
-19/06/19 08:27:08 ERROR ZooKeeperStateStore: Fatal Zookeeper error. Shutting down Livy server.
-19/06/19 08:27:08 INFO LivyServer: Shutting down Livy server.
-```

-In the Zookeeper Server logs on any Zookeeper host at /var/log/zookeeper/zookeeper-zookeeper-server-\*.out, you may also see the following error:
+You may see an error message similar to:

```
-2020-02-12 00:31:52,513 - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
-java.nio.channels.CancelledKeyException
+2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
+Message
+2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
+...
+2020-05-05 03:17:08.3890350|State store operation failed
+2020-05-05 03:17:08.3890350|Transitioning to standby state
```

-## Cause

-When the volume of snapshot files is large or snapshot files are corrupted, ZooKeeper server will fail to form a quorum, which causes ZooKeeper related services unhealthy. ZooKeeper server will not remove old snapshot files from its data directory, instead, it is a periodic task to be performed by users to maintain the healthiness of ZooKeeper. For more information, see [ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations).

-## Resolution

-Check ZooKeeper data directory `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2` to find out if the snapshots file size is large. Take the following steps if large snapshots exist:

-1. Check the status of other ZooKeeper servers in the same quorum to make sure they are working fine with the command “`echo stat | nc {zk_host_ip} 2181 (or 2182)`”.

-1. Login the problematic ZooKeeper host, backup snapshots and transaction logs in `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2`, then cleanup these files in the two directories.

-1. Restart the problematic ZooKeeper server in Ambari or the ZooKeeper host. Then restart the service which has problems.
+## Contraindications

+* HA services such as YARN, NameNode, and Livy can go down for many reasons.
+* Confirm from the logs that the failures are related to ZooKeeper connections (a grep sketch follows this list).
+* Make sure that the issue happens repeatedly; do not apply these mitigations for one-off cases.
+* Jobs can fail transiently because of ZooKeeper connection issues.
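
As a hedged illustration of confirming ZooKeeper involvement from the logs, the sketch below greps a Resource Manager log for ZooKeeper-related errors. The log directory and file-name pattern are assumptions and can differ per cluster and per service; adjust them before running.

```
# Assumed YARN Resource Manager log location on a head node; adjust for your
# cluster and for the service you are investigating (NameNode, Livy, and so on).
sudo grep -iE "zookeeper|KeeperException|STATE_STORE_FENCED" \
  /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager-*.log | tail -n 50
```

If the matching lines cluster around the time of the failover or job failure, the mitigations in this document are worth pursuing; otherwise look for a different root cause.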

+## Further reading

+[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations)

+## Common causes

+* High CPU usage on the ZooKeeper servers
+  * In the Ambari UI, if you see sustained CPU usage near 100% on the ZooKeeper servers, the sessions that are open during that time can expire and time out
+* ZooKeeper is so busy consolidating snapshots that it doesn't respond to client requests on time
+* ZooKeeper servers have a sustained CPU load of 5 or above, as seen in the Ambari UI (see the shell sketch after this list)
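
If you want to confirm the load from the shell as well as from Ambari, a minimal sketch to run on each ZooKeeper host; only standard tools are used, nothing HDInsight-specific is assumed:

```
# Load averages over 1, 5, and 15 minutes; sustained values of 5 or more on a
# ZooKeeper host match the pattern described above.
uptime

# One snapshot of the top CPU consumers, in batch mode.
top -b -n 1 | head -n 20
```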

+## Check ZooKeeper status

+* Find the ZooKeeper servers from the /etc/hosts file or from the Ambari UI
+* Run the following command (a sketch that checks every host and both ports follows this list)
+  * `echo stat | nc {zk_host_ip} 2181` (or 2182)
+  * Port 2181 is the Apache ZooKeeper instance
+  * Port 2182 is used by the HDI ZooKeeper (to provide HA for services that are not natively HA)
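
A minimal sketch of running that check against every ZooKeeper instance in one pass; the host names `zk1`, `zk2`, and `zk3` are placeholders for the names you found in /etc/hosts or Ambari:

```
# Replace zk1 zk2 zk3 with your ZooKeeper host names.
for host in zk1 zk2 zk3; do
  for port in 2181 2182; do
    echo "--- ${host}:${port} ---"
    # 'stat' reports connections, latency, and Mode (leader/follower/standalone).
    echo stat | nc "$host" "$port"
  done
done
```

A healthy quorum typically shows one server reporting `Mode: leader` and the rest `Mode: follower`; a server that does not answer, or answers with an error, is the one to investigate further.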

+## CPU load peaks every hour

+* Log in to the ZooKeeper server and check /etc/crontab
+* Are there any hourly jobs running at this time?
+* If so, randomize the start time across the different ZooKeeper servers (a sketch follows this list)
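
A minimal sketch of what that staggering can look like, assuming a cron entry that currently fires at minute 0 on every host; the script name `example-hourly-task.sh` is a placeholder, not a real HDInsight job:

```
# See which jobs run at the top of the hour.
sudo cat /etc/crontab

# Stagger the start minute per host instead of using minute 0 everywhere,
# for example on the second ZooKeeper server:
#   before: 0  * * * * root /usr/local/bin/example-hourly-task.sh
#   after:  20 * * * * root /usr/local/bin/example-hourly-task.sh
```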

+## Purging old snapshots

+* HDI ZooKeeper servers are configured to auto-purge old snapshots
+* By default, the last 30 snapshots are retained
+* This is controlled by the configuration key `autopurge.snapRetainCount` in:
+  * /etc/zookeeper/conf/zoo.cfg for Hadoop ZooKeeper
+  * /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI ZooKeeper
+* Set `autopurge.snapRetainCount` to a value >= 3 and restart the ZooKeeper servers (a sketch follows this list)
+  * Hadoop ZooKeeper can be restarted through Ambari
+  * HDI ZooKeeper has to be stopped and restarted manually
+    * `sudo lsof -i :2182` will give you the process ID to kill
+    * `sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py`
+* Manually purging snapshots
+  * DO NOT delete the snapshot files directly, as this could result in data loss
+  * zookeeper
+    * `sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/zookeeper/ /hadoop/zookeeper/ 3`
+  * hdinsight-zookeeper
+    * `sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/hdinsight-zookeeper/ /hadoop/hdinsight-zookeeper/ 3`
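
A minimal sketch of checking the auto-purge settings and performing the manual HDI ZooKeeper restart described above. `autopurge.purgeInterval` (the purge frequency in hours) is shown only as an illustration of how the auto-purge schedule is usually expressed; verify the actual keys and values in your own zoo.cfg.

```
# Inspect the auto-purge settings (use /etc/zookeeper/conf/zoo.cfg for the Hadoop ZooKeeper).
grep autopurge /etc/hdinsight-zookeeper/conf/zoo.cfg
# Typical entries look like:
#   autopurge.snapRetainCount=3
#   autopurge.purgeInterval=1

# Manually stop and restart the HDI ZooKeeper after changing the value.
sudo lsof -i :2182                      # shows the process listening on 2182
sudo kill "$(sudo lsof -t -i :2182)"    # -t prints only the PID
sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
```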

+## CancelledKeyException in the ZooKeeper server log

+* This exception usually means that the client is no longer active and the server is unable to send it a message
+* It is a symptom of the ZooKeeper client terminating sessions prematurely
+* Look for the other symptoms outlined in this document

## Next steps
