articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md
This article describes troubleshooting steps and possible resolutions for issues.

## Symptoms

* Both resource managers go to standby mode
* Both NameNodes are in standby mode
* Spark, Hive, and Yarn jobs or Hive queries fail because of ZooKeeper connection failures
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
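The both-standby NameNode symptom can be confirmed with `hdfs haadmin -getServiceState <serviceId>` on a cluster node. A minimal sketch of the check, using hypothetical captured states (`nn1`/`nn2` are placeholder service IDs; look up the real ones in hdfs-site.xml):

```shell
# Hypothetical states as returned by, e.g.:
#   hdfs haadmin -getServiceState nn1
#   hdfs haadmin -getServiceState nn2
state_nn1="standby"
state_nn2="standby"

# When neither NameNode is active, the symptom above applies
if [ "$state_nn1" = "standby" ] && [ "$state_nn2" = "standby" ]; then
  echo "Both NameNodes are in standby mode"
fi
```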
* HA services like Yarn, NameNode, and Livy can go down for many reasons.
* Confirm from the logs that the failures are related to ZooKeeper connections.
* Make sure that the issue happens repeatedly (don't apply these mitigations for one-off cases).
* Jobs can fail temporarily due to ZooKeeper connection issues.
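One way to confirm from the logs that failures are ZooKeeper-related is to grep the service logs for connection-loss messages. A sketch, using an illustrative sample log line (the log path in the comment is an assumption; adjust for your cluster):

```shell
# In practice, search the service logs, for example:
#   grep -ci "org.apache.zookeeper" /var/log/hadoop-yarn/yarn/*.log
# Sample of the kind of line to look for (illustrative, not real cluster output):
log_line="ERROR Session expired: KeeperErrorCode = ConnectionLoss for /rmstore (org.apache.zookeeper.ClientCnxn)"

# Count zookeeper-related lines in the sample
printf '%s\n' "$log_line" | grep -ci "zookeeper"
```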
## Further reading

[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
## Common causes

* High CPU usage on the ZooKeeper servers
  * In the Ambari UI, if you see near 100% sustained CPU usage on the ZooKeeper servers, the ZooKeeper sessions open during that time can expire and time out
* ZooKeeper clients are reporting frequent timeouts
* The transaction logs and the snapshots are being written to the same disk, which can cause I/O bottlenecks

## Check for ZooKeeper status

* Find the ZooKeeper servers from the /etc/hosts file or from the Ambari UI
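ZooKeeper answers four-letter admin commands on its client port; `stat` reports, among other things, the server's mode (leader, follower, or standalone). A sketch that parses a captured `stat` response (the sample output below is illustrative, not from a real cluster):

```shell
# In practice: stat_output=$(echo stat | nc <zk-host> 2181)
# Sample response, captured for illustration:
stat_output="Zookeeper version: 3.4.6
Latency min/avg/max: 0/1/10
Mode: follower
Node count: 120"

# Extract the server's role from the Mode line
mode=$(printf '%s\n' "$stat_output" | awk -F': ' '/^Mode:/ {print $2}')
echo "ZooKeeper mode: $mode"
```

A healthy ensemble shows exactly one leader; servers that refuse the connection or report no mode are the ones to investigate.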
* This is controlled by the configuration key `autopurge.snapRetainCount`
  * /etc/zookeeper/conf/zoo.cfg for Hadoop ZooKeeper
  * /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI ZooKeeper
* Set this to a value of 3 and restart the ZooKeeper servers
  * Hadoop ZooKeeper can be restarted through Ambari
  * HDI ZooKeeper has to be stopped manually and restarted manually
    * `sudo lsof -i :2182` will give you the process ID to kill
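The snapshot-retention settings live in zoo.cfg. A sketch of the relevant section, with illustrative values (`autopurge.purgeInterval` must be greater than 0 or autopurge never runs; the directory paths are assumptions):

```
# /etc/zookeeper/conf/zoo.cfg (Hadoop ZooKeeper) or
# /etc/hdinsight-zookeeper/conf/zoo.cfg (HDI ZooKeeper)

# Keep the 3 most recent snapshots and their transaction logs
autopurge.snapRetainCount=3
# Run the purge task every 24 hours (0 disables autopurge)
autopurge.purgeInterval=24

# To avoid the I/O bottleneck listed under common causes, point dataLogDir
# at a different disk from dataDir (paths are illustrative)
dataDir=/hadoop/zookeeper
dataLogDir=/hadoop/zookeeper-logs
```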