Commit 10c23d3

Update zookeeper-troubleshoot-quorum-fails.md
Addressed review comments
1 parent 6e818ab commit 10c23d3

1 file changed
+8 -8 lines changed

articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 8 additions & 8 deletions
@@ -16,7 +16,7 @@ This article describes troubleshooting steps and possible resolutions for issues
 ## Symptoms
 
 * Both the resource managers go to standby mode
-* Names nodes are both in standby mode
+* Namenodes are both in standby mode
 * Spark / hive / yarn jobs or hive queries fail because of zookeeper connection failures
 * LLAP daemons fail to start on secure spark or secure interactive hive clusters
 
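The standby symptoms in this hunk can be confirmed from a cluster headnode before digging into zookeeper. A minimal spot-check, not part of the commit; the service IDs nn1/nn2 and rm1/rm2 are the usual HDP defaults and may differ on a given cluster:

```bash
# Each pair should report "active" for exactly one member;
# "standby" on both sides confirms the symptom described above.
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
```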
@@ -38,18 +38,18 @@ Message
 * HA services like Yarn / NameNode / Livy can go down due to many reasons.
 * Please confirm from the logs that it is related to zookeeper connections
 * Please make sure that the issue happens repeatedly (do not do these mitigations for one off cases)
-* Job failures can fail temporarily due to zookeeper connection issues
+* Jobs can fail temporarily due to zookeeper connection issues
 
 ## Further reading
 
-[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations).
+[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
 
 ## Common causes
 
 * High CPU usage on the zookeeper servers
-* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the sessions open during that time can expire and timeout
-* Zookeeper is busy consolidating snapshots that it doesn't respond to clients / requests on time
-* Zookeeper servers have a sustained CPU load of 5 or above (as seen in Ambari UI)
+* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and timeout
+* Zookeeper clients are reporting frequent timeouts
+* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks
 
 ## Check for zookeeper status
 * Find the zookeeper servers from the /etc/hosts file or from Ambari UI
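The "Check for zookeeper status" step this hunk ends on can be done with zookeeper's four-letter-word commands. A minimal sketch, not part of the commit, assuming zk1.example.com stands in for one of the hosts found in /etc/hosts; 2181 is the standard hadoop zookeeper client port (the HDI zookeeper mentioned later in the file listens on 2182):

```bash
# "ruok" gets "imok" back from a healthy server; no reply suggests the
# server is down or too busy to answer.
echo ruok | nc zk1.example.com 2181

# "stat" prints the server's mode (leader/follower), latency figures,
# and client connection counts -- useful for spotting the frequent
# timeouts listed under "Common causes".
echo stat | nc zk1.example.com 2181
```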
@@ -69,13 +69,13 @@ Message
 * This controlled by the configuration key autopurge.snapRetainCount
 * /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
 * /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
-* Set this to a value >=3 and restart the zookeeper servers
+* Set this to a value =3 and restart the zookeeper servers
 * Hadoop zookeeper can be restarted through Ambari
 * HDI zookeeper has to be stopped manually and restarted manually
 * sudo lsof -i :2182 will give you the process id to kill
 * sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
 * Manually purging snapshots
-* DO NOT delete the snapshot files directly as this could result in data loss
+* **DO NOT delete the snapshot files directly as this could result in data loss**
 * zookeeper
 * sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/zookeeper/ /hadoop/zookeeper/ 3
 * hdinsight-zookeeper
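In the PurgeTxnLog invocation above, the two path arguments are the transaction log directory and the snapshot directory (the same directory here), and the trailing 3 is the number of most recent snapshots to retain; PurgeTxnLog refuses retain counts lower than 3, which is likely why the commit pins the value at 3. The mitigation for the HDI zookeeper case could look like the following sketch, not part of the commit: the zoo.cfg path, the startup script, and port 2182 come from the article itself, while autopurge.purgeInterval is the companion key (in hours) that enables the periodic purge task:

```bash
# 1. Keep only the 3 most recent snapshots and purge every 24 hours.
echo 'autopurge.snapRetainCount=3' | sudo tee -a /etc/hdinsight-zookeeper/conf/zoo.cfg
echo 'autopurge.purgeInterval=24'  | sudo tee -a /etc/hdinsight-zookeeper/conf/zoo.cfg

# 2. Stop HDI zookeeper manually (-t makes lsof print bare PIDs) ...
sudo kill "$(sudo lsof -t -i :2182)"

# 3. ... and restart it with the startup script named in the article.
sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
```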
