Commit 43a21dd

updates
1 parent c2df943 commit 43a21dd

1 file changed

+62
-47
lines changed

articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 62 additions & 47 deletions
@@ -6,20 +6,19 @@ ms.topic: troubleshooting
66
author: hrasheed-msft
77
ms.author: hrasheed
88
ms.reviewer: jasonh
9-
ms.date: 08/20/2019
9+
ms.date: 05/20/2020
1010
---
11-
1211
# Apache ZooKeeper server fails to form a quorum in Azure HDInsight
1312

14-
This article describes troubleshooting steps and possible resolutions for issues related to zookeepers in Azure HDInsight clusters.
13+
This article describes troubleshooting steps and possible resolutions for issues related to Zookeepers in Azure HDInsight clusters.
1514

1615
## Symptoms
1716

18-
* Both the resource managers go to standby mode
19-
* Namenodes are both in standby mode
20-
* Spark / hive / yarn jobs or hive queries fail because of zookeeper connection failures
21-
* LLAP daemons fail to start on secure spark or secure interactive hive clusters
22-
17+
* Both the resource managers go to standby mode
18+
* Namenodes are both in standby mode
19+
* Spark / Hive / Yarn jobs or Hive queries fail because of Zookeeper connection failures
20+
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
21+
2322
## Sample log
2423

2524
You may see an error message similar to:
@@ -33,58 +32,74 @@ Message
3332
2020-05-05 03:17:08.3890350|Transitioning to standby state
3433
```
3534

36-
## Contra indicators
35+
## Related issues
3736

38-
* HA services like Yarn / NameNode / Livy can go down due to many reasons.
39-
* Please confirm from the logs that it is related to zookeeper connections
40-
* Please make sure that the issue happens repeatedly (do not do these mitigations for one off cases)
41-
* Jobs can fail temporarily due to zookeeper connection issues
37+
* High availability services like Yarn, NameNode, and Livy can go down for many reasons.
38+
* Confirm from the logs that the failure is related to Zookeeper connections
39+
* Make sure that the issue happens repeatedly (do not use these solutions for one-off cases)
40+
* Jobs can fail temporarily due to Zookeeper connection issues
4241

43-
## Further reading
4442

45-
[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
46-
47-
## Common causes
43+
## Common causes for Zookeeper failure
4844

4945
* High CPU usage on the zookeeper servers
5046
* In the Ambari UI, if you see sustained CPU usage near 100% on the zookeeper servers, then the zookeeper sessions open during that time can expire and time out
5147
* Zookeeper clients are reporting frequent timeouts
52-
* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks
48+
* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks.
5349

5450
## Check for zookeeper status
55-
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
56-
* Run the following command
57-
* echo stat | nc {zk_host_ip} 2181 (or 2182)
58-
* Port 2181 is the apache zookeeper instance
59-
* Port 2182 is used by the HDI zookeeper (to provide HA for services that are not natively HA)
51+
52+
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53+
* Run the following command: `echo stat | nc {zk_host_ip} 2181` (or use port `2182`)
54+
* Port 2181 is the Apache Zookeeper instance
55+
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
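
The status check above can be repeated across several servers and both ports. The following is a minimal sketch rather than part of the original article: the function names `zk_mode` and `zk_check` and the host names in the usage comment are hypothetical, and it assumes that `nc` accepts the `-w` timeout option and that the `stat` command is enabled on the server.

```shell
# Read `stat` output on stdin and print the value of its "Mode:" line
# (for example leader, follower, or standalone).
zk_mode() {
  awk -F': ' '/^Mode:/ { print $2 }'
}

# Query one host on one port and print "host:port <mode>".
# Prints "no mode reported" when the server does not answer or is not
# currently serving requests.
zk_check() {
  host="$1"
  port="$2"
  mode="$(echo stat | nc -w 5 "$host" "$port" 2>/dev/null | zk_mode)"
  echo "$host:$port ${mode:-no mode reported}"
}

# Usage (zk0, zk1, zk2 are placeholder host names):
# for h in zk0 zk1 zk2; do zk_check "$h" 2181; zk_check "$h" 2182; done
```

In a healthy ensemble you would expect one server to report `leader` and the others `follower` on each port.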
6056

6157
## CPU load peaks up every hour
62-
* Login to the zookeeper server and check the /etc/crontab
63-
* Are there any hourly jobs running at this time?
64-
* If so, randomize the start time across different zookeeper servers
58+
59+
* Log in to the zookeeper server and check the /etc/crontab
60+
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
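
To spot hourly jobs quickly, the crontab can be filtered for entries with a fixed minute and a wildcard hour. This is a rough sketch under stated assumptions: `hourly_entries` and `random_minute` are hypothetical helper names, and the filter assumes the system crontab layout (minute, hour, day, month, weekday, user, command).

```shell
# Print crontab entries (read from stdin) that run once every hour:
# a numeric minute field and "*" in the hour field.
hourly_entries() {
  awk 'NF >= 7 && $1 ~ /^[0-9]+$/ && $2 == "*" { print }'
}

# Print a random minute between 0 and 59, which can be used as a
# different start minute on each zookeeper server.
random_minute() {
  awk 'BEGIN { srand(); print int(rand() * 60) }'
}

# Usage:
# hourly_entries < /etc/crontab
# random_minute
```

Editing the minute field of a matching entry to a different value on each server spreads the load so the hourly jobs do not start at the same moment.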
6561

6662
## Purging old snapshots
67-
* HDI Zookeepers are configured to auto purge old snapshots
68-
* By default, last 30 snapshots are retained
69-
* This controlled by the configuration key autopurge.snapRetainCount
70-
* /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
71-
* /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
72-
* Set this to a value =3 and restart the zookeeper servers
73-
* Hadoop zookeeper can be restarted through Ambari
74-
* HDI zookeeper has to be stopped manually and restarted manually
75-
* sudo lsof -i :2182 will give you the process id to kill
76-
* sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
77-
* Manually purging snapshots
78-
* **DO NOT delete the snapshot files directly as this could result in data loss**
79-
* zookeeper
80-
* sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/zookeeper/ /hadoop/zookeeper/ 3
81-
* hdinsight-zookeeper
82-
* sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/hdinsight-zookeeper/ /hadoop/hdinsight-zookeeper/ 3
83-
84-
## CancelledKeyException in the zookeeper server log
85-
* This exception usually means that the client is no longer active and the server is unable to send a message
86-
* This is indication of a symptom that the zookeeper client is terminating sessions prematurely
87-
* Look for the other symptoms outlined in this document
63+
64+
### Auto purging configuration
65+
66+
* HDInsight Zookeepers are configured to auto purge old snapshots
67+
* By default, the last 30 snapshots are retained
68+
* This is controlled by the configuration key `autopurge.snapRetainCount`
69+
* `/etc/zookeeper/conf/zoo.cfg` for hadoop zookeeper
70+
* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
71+
* Set `autopurge.snapRetainCount` to a value of 3 and restart the zookeeper servers
72+
* Hadoop zookeeper can be restarted through Ambari
73+
* HDInsight zookeeper has to be stopped and restarted manually
74+
* `sudo lsof -i :2182` will give you the process ID to kill
75+
* `sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py`
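
Before changing the setting, it can help to confirm the current value in each configuration file. The sketch below is illustrative only: `snap_retain_count` is a hypothetical helper name, and it assumes the key is written as `autopurge.snapRetainCount=<number>` with no spaces around the equals sign.

```shell
# Print the configured autopurge.snapRetainCount from a zoo.cfg file.
# Prints nothing if the key is absent from the file.
snap_retain_count() {
  awk -F= '$1 == "autopurge.snapRetainCount" { print $2 }' "$1"
}

# Usage, with the two paths listed above:
# snap_retain_count /etc/zookeeper/conf/zoo.cfg
# snap_retain_count /etc/hdinsight-zookeeper/conf/zoo.cfg
```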
76+
77+
> [!Note]
78+
> Don't delete the snapshot files directly as this could result in data loss.
79+
80+
### Manually purging snapshots
81+
82+
Use the following command to manually purge zookeeper snapshots.
83+
84+
```
85+
sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/zookeeper/ /hadoop/zookeeper/ 3
86+
```
87+
88+
Use the following command to manually purge hdinsight-zookeeper snapshots.
89+
90+
```
91+
sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/hdinsight-zookeeper/ /hadoop/hdinsight-zookeeper/ 3
92+
```
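
To check the effect of a purge, you can count the snapshot files before and after running the commands above. This is a small sketch, not an official procedure: `count_snapshots` is a hypothetical helper name, and it assumes snapshot files are named `snapshot.*` somewhere beneath the data directory, as in the standard ZooKeeper layout. It only reads the directory and does not delete anything.

```shell
# Count files named snapshot.* anywhere under the given data directory.
count_snapshots() {
  find "$1" -type f -name 'snapshot.*' | wc -l | tr -d ' '
}

# Usage, with the data directories used in the purge commands:
# sudo sh -c 'find /hadoop/zookeeper/ -type f -name "snapshot.*" | wc -l'
# count_snapshots /hadoop/hdinsight-zookeeper/
```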
93+
94+
## CancelledKeyException in the Zookeeper server log
95+
96+
* This exception usually means that the client is no longer active and the server is unable to send a message
97+
* This indicates that the zookeeper client is terminating sessions prematurely
98+
* Look for the other symptoms outlined in this document
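
To judge whether the exception is a one-off or a recurring pattern, you can count how often it appears in a server log. The following is a minimal sketch: `count_cancelled_key` is a hypothetical helper name, and the log file path is left as an argument because the article does not state where the server log is written.

```shell
# Print the number of lines in the given log file that mention
# CancelledKeyException.
count_cancelled_key() {
  grep -c 'CancelledKeyException' "$1"
}

# Usage (replace the placeholder with the actual zookeeper server log path):
# count_cancelled_key /path/to/zookeeper-server.log
```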
99+
100+
## Further reading
101+
102+
[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
88103

89104
## Next steps
90105
