Commit 219104a

Merge branch 'patch-1' of https://github.com/vijaysr/azure-docs-pr into hdi_zookeepertsg
2 parents 43a21dd + 4c96f84 commit 219104a

File tree

1 file changed

+57
-47
lines changed

articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 57 additions & 47 deletions
@@ -45,61 +45,71 @@ Message
4545
* High CPU usage on the zookeeper servers
4646
* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and time out
4747
* Zookeeper clients are reporting frequent timeouts
48-
* The transaction logs and the snapshots are being written to the same disk. This can cause I/O bottlenecks.
48+
* In the logs for resource manager, namenode and others, you will see frequent client connection timeouts
49+
* This could result in quorum loss, frequent failovers and other issues
4950

5051
## Check for zookeeper status
51-
52-
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53-
* Run the following command `echo stat | nc {zk_host_ip} 2181 (or 2182)`
54-
* Port 2181 is the apache zookeeper instance
55-
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
56-
52+
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53+
* Run the following command
54+
* `echo stat | nc {zk_host_ip} 2181` (or port `2182`)
55+
* Port 2181 is the apache zookeeper instance
56+
* Port 2182 is used by the HDI zookeeper (to provide HA for services that are not natively HA)
57+
* If the command shows no output, then it means that the zookeeper servers are not running
58+
* If the servers are running, the result will include client connection statistics and other server statistics
59+
````
60+
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
61+
Clients:
62+
/10.2.0.57:50988[1](queued=0,recved=715,sent=715)
63+
/10.2.0.57:46632[1](queued=0,recved=138340,sent=138347)
64+
/10.2.0.34:14688[1](queued=0,recved=264653,sent=353420)
65+
/10.2.0.52:49680[1](queued=0,recved=134812,sent=134814)
66+
/10.2.0.57:50614[1](queued=0,recved=19812,sent=19812)
67+
/10.2.0.56:35034[1](queued=0,recved=2586,sent=2586)
68+
/10.2.0.52:63982[1](queued=0,recved=72215,sent=72217)
69+
/10.2.0.57:53024[1](queued=0,recved=19805,sent=19805)
70+
/10.2.0.57:45126[1](queued=0,recved=19621,sent=19621)
71+
/10.2.0.56:41270[1](queued=0,recved=1348743,sent=1348788)
72+
/10.2.0.53:59097[1](queued=0,recved=72215,sent=72217)
73+
/10.2.0.56:41088[1](queued=0,recved=788,sent=802)
74+
/10.2.0.34:10246[1](queued=0,recved=19575,sent=19575)
75+
/10.2.0.56:40944[1](queued=0,recved=717,sent=717)
76+
/10.2.0.57:45466[1](queued=0,recved=19861,sent=19861)
77+
/10.2.0.57:59634[0](queued=0,recved=1,sent=0)
78+
/10.2.0.34:14704[1](queued=0,recved=264622,sent=353355)
79+
/10.2.0.57:42244[1](queued=0,recved=49245,sent=49248)
80+
81+
Latency min/avg/max: 0/3/14865
82+
Received: 238606078
83+
Sent: 239139381
84+
Connections: 18
85+
Outstanding: 0
86+
Zxid: 0x1004f99be
87+
Mode: follower
88+
Node count: 133212
89+
````
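The status check above can also be scripted. The following is a minimal sketch, assuming Python 3 is available on a node that can reach the zookeeper servers; the function names `query_stat` and `parse_stat` are hypothetical helpers, not part of HDInsight or ZooKeeper. It sends the same `stat` command that `nc` sends and pulls the summary fields (such as `Mode` and `Connections`) out of the reply.

```python
import socket


def query_stat(host, port=2181, timeout=5.0):
    """Send the `stat` four-letter command and return the raw reply text.

    Raises an OSError (for example ConnectionRefusedError) if nothing is
    listening on the port, which corresponds to `nc` printing no output.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Collect the `Key: value` summary lines of `stat` output into a dict."""
    summary = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("/"):
            continue  # per-client connection lines, skipped in this sketch
        key, sep, value = line.partition(":")
        if sep and value.strip():
            summary[key.strip()] = value.strip()
    if "Latency min/avg/max" in summary:
        # Keep the maximum latency as a number so it is easy to compare
        summary["latency_max"] = float(summary["Latency min/avg/max"].split("/")[2])
    return summary
```

For example, `parse_stat(query_stat("{zk_host_ip}", 2182))["Mode"]` would report whether that HDI zookeeper instance is a leader or a follower, with `{zk_host_ip}` replaced by a real address.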
5790
## CPU load peaks every hour
5891

5992
* Log in to the zookeeper server and check the /etc/crontab
6093
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
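One possible way to randomize the start times is to derive a stable minute from each server's hostname, so the same host always gets the same slot. This is only a sketch under that assumption: the helper names and the example hostname and command are hypothetical, and a hash does not guarantee that every host lands on a different minute.

```python
import hashlib


def staggered_minute(hostname, slots=60):
    """Map a hostname to a stable minute-of-hour in the range 0..slots-1."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % slots


def crontab_line(hostname, command, user="root"):
    """Build an hourly /etc/crontab entry (minute hour dom month dow user command)."""
    return f"{staggered_minute(hostname)} * * * * {user} {command}"


# Hypothetical hostname and script path, for illustration only
print(crontab_line("zk0-example", "/usr/local/bin/hourly-job.sh"))
```

Check the resulting minutes across all zookeeper servers and adjust by hand if two hosts collide.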
6194

6295
## Purging old snapshots
63-
64-
### Auto purging configuration
65-
66-
* HDInsight Zookeepers are configured to auto purge old snapshots
67-
* By default, the last 30 snapshots are retained
68-
* This controlled by the configuration key `autopurge.snapRetainCount`
69-
* `/etc/zookeeper/conf/zoo.cfg` for hadoop zookeeper
70-
* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
71-
* Set this to a value =3 and restart the zookeeper servers
72-
* Hadoop zookeeper can be restarted through Ambari
73-
* HDInsight zookeeper has to be stopped manually and restarted manually
74-
* `sudo lsof -i :2182` will give you the process ID to kill
75-
* `sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py`
76-
77-
> [!Note]
78-
> Don't delete the snapshot files directly as this could result in data loss.
79-
80-
### Manually purging snapshots
81-
82-
Use the following command to manually purge zookeeper snapshots.
83-
84-
```
85-
sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/zookeeper/ /hadoop/zookeeper/ 3
86-
```
87-
88-
Use the following command to manually purge hdinsight-zookeeper snapshots.
89-
90-
```
91-
sudo java -cp /usr/hdp/current/zookeeper-server/zookeeper.jar:/usr/hdp/current/zookeeper-server/lib/* org.apache.zookeeper.server.PurgeTxnLog /hadoop/hdinsight-zookeeper/ /hadoop/hdinsight-zookeeper/ 3
92-
```
93-
94-
## CancelledKeyException in the Zookeeper server log
95-
96-
* This exception usually means that the client is no longer active and the server is unable to send a message
97-
* This is indication of a symptom that the zookeeper client is terminating sessions prematurely
98-
* Look for the other symptoms outlined in this document
99-
100-
## Further reading
101-
102-
[ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_strengthsAndLimitations).
96+
* Zookeepers are configured to auto purge old snapshots
97+
* By default, the last 30 snapshots are retained
98+
* This is controlled by the configuration key `autopurge.snapRetainCount`
99+
* /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
100+
* /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
101+
* Set this to a value of 3 and restart the zookeeper servers
102+
* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
103+
* HDI zookeeper has to be stopped and restarted manually
104+
* `sudo lsof -i :2182` will give you the process ID to kill
105+
* sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
106+
* Manually purging snapshots is not required
107+
* **DO NOT delete the snapshot files directly as this could result in data loss**
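Before and after changing the setting, it can help to confirm what value each `zoo.cfg` actually contains. The following is a minimal sketch, assuming the file uses plain `key=value` lines with `#` comments; `read_zoo_cfg` and `retain_count` are hypothetical helper names.

```python
def read_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg text, ignoring comments and blank lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def retain_count(text, default=None):
    """Return autopurge.snapRetainCount as an int, or `default` if the key is absent."""
    value = read_zoo_cfg(text).get("autopurge.snapRetainCount")
    return int(value) if value is not None else default
```

For example, `retain_count(open("/etc/hdinsight-zookeeper/conf/zoo.cfg").read())` would return the value configured for the HDI zookeeper, or `None` if the key is not set in that file.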
108+
109+
## CancelledKeyException in the zookeeper server log doesn't require snapshot cleanup
110+
* This exception usually means that the client is no longer active and the server is unable to send a message
111+
* This indicates that the zookeeper client is terminating sessions prematurely
112+
* Look for the other symptoms outlined in this document
103113

104114
## Next steps
105115
