Commit b072b2a

Merge pull request #116456 from hrasheed-msft/hdi_zookeepertsg
HDInsight Zookeeper TSG
2 parents e87d5f0 + 9b2fdc1 commit b072b2a

1 file changed: +93 -24 lines changed

articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 93 additions & 24 deletions
@@ -6,49 +6,118 @@ ms.topic: troubleshooting
66
author: hrasheed-msft
77
ms.author: hrasheed
88
ms.reviewer: jasonh
9-
ms.date: 08/20/2019
9+
ms.date: 05/20/2020
1010
---
11-
1211
# Apache ZooKeeper server fails to form a quorum in Azure HDInsight
1312

14-
This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.
13+
This article describes troubleshooting steps and possible resolutions for issues related to ZooKeeper servers in Azure HDInsight clusters.
1514

16-
## Issue
15+
## Symptoms
1716

18-
Apache ZooKeeper server is unhealthy, symptoms could include: both Resource Managers/Name Nodes are in standby mode, simple HDFS operations do not work, `zkFailoverController` is stopped and cannot be started, Yarn/Spark/Livy jobs fail due to Zookeeper errors. LLAP Daemons may also fail to start on Secure Spark or Interactive Hive clusters. You may see an error message similar to:
17+
* Both the resource managers go to standby mode
18+
* Namenodes are both in standby mode
19+
* Spark, Hive, and Yarn jobs or Hive queries fail because of Zookeeper connection failures
20+
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
1921

20-
```
21-
19/06/19 08:27:08 ERROR ZooKeeperStateStore: Fatal Zookeeper error. Shutting down Livy server.
22-
19/06/19 08:27:08 INFO LivyServer: Shutting down Livy server.
23-
```
22+
## Sample log
2423

25-
In the Zookeeper Server logs on any Zookeeper host at /var/log/zookeeper/zookeeper-zookeeper-server-\*.out, you may also see the following error:
24+
You may see an error message similar to:
2625

27-
```
28-
2020-02-12 00:31:52,513 - ERROR [CommitProcessor:1:NIOServerCnxn@178] - Unexpected Exception:
29-
java.nio.channels.CancelledKeyException
26+
```output
27+
2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
28+
Message
29+
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
30+
...
31+
2020-05-05 03:17:08.3890350|State store operation failed
32+
2020-05-05 03:17:08.3890350|Transitioning to standby state
3033
```
3134

32-
## Cause
33-
34-
When the volume of snapshot files is large or snapshot files are corrupted, ZooKeeper server will fail to form a quorum, which causes ZooKeeper related services unhealthy. ZooKeeper server will not remove old snapshot files from its data directory, instead, it is a periodic task to be performed by users to maintain the healthiness of ZooKeeper. For more information, see [ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations).
35+
## Related issues
36+
37+
* High availability services like Yarn, NameNode, and Livy can go down for many reasons.
38+
* Confirm from the logs that it is related to Zookeeper connections
39+
* Make sure that the issue happens repeatedly (do not use these solutions for one-off cases)
40+
* Jobs can fail temporarily due to Zookeeper connection issues
41+
42+
## Common causes for Zookeeper failure
43+
44+
* High CPU usage on the zookeeper servers
45+
* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and time out
46+
* Zookeeper clients are reporting frequent timeouts
47+
* In the logs for Resource Manager, Namenode and others, you will see frequent client connection timeouts
48+
* This could result in quorum loss, frequent failovers, and other issues
49+
50+
## Check for zookeeper status
51+
52+
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53+
* Run the following command
54+
* `echo stat | nc <ZOOKEEPER_HOST_IP> 2181` (or 2182)
55+
* Port 2181 is the Apache ZooKeeper instance
56+
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
57+
* If the command shows no output, the zookeeper server on that host is not running
58+
* If the servers are running, the output lists the client connections along with other statistics
59+
60+
```output
61+
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
62+
Clients:
63+
/10.2.0.57:50988[1](queued=0,recved=715,sent=715)
64+
/10.2.0.57:46632[1](queued=0,recved=138340,sent=138347)
65+
/10.2.0.34:14688[1](queued=0,recved=264653,sent=353420)
66+
/10.2.0.52:49680[1](queued=0,recved=134812,sent=134814)
67+
/10.2.0.57:50614[1](queued=0,recved=19812,sent=19812)
68+
/10.2.0.56:35034[1](queued=0,recved=2586,sent=2586)
69+
/10.2.0.52:63982[1](queued=0,recved=72215,sent=72217)
70+
/10.2.0.57:53024[1](queued=0,recved=19805,sent=19805)
71+
/10.2.0.57:45126[1](queued=0,recved=19621,sent=19621)
72+
/10.2.0.56:41270[1](queued=0,recved=1348743,sent=1348788)
73+
/10.2.0.53:59097[1](queued=0,recved=72215,sent=72217)
74+
/10.2.0.56:41088[1](queued=0,recved=788,sent=802)
75+
/10.2.0.34:10246[1](queued=0,recved=19575,sent=19575)
76+
/10.2.0.56:40944[1](queued=0,recved=717,sent=717)
77+
/10.2.0.57:45466[1](queued=0,recved=19861,sent=19861)
78+
/10.2.0.57:59634[0](queued=0,recved=1,sent=0)
79+
/10.2.0.34:14704[1](queued=0,recved=264622,sent=353355)
80+
/10.2.0.57:42244[1](queued=0,recved=49245,sent=49248)
81+
82+
Latency min/avg/max: 0/3/14865
83+
Received: 238606078
84+
Sent: 239139381
85+
Connections: 18
86+
Outstanding: 0
87+
Zxid: 0x1004f99be
88+
Mode: follower
89+
Node count: 133212
90+
```
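
The check above can also be scripted. The following is a minimal Python sketch, not part of the original article, that sends the same `stat` command over a TCP socket and pulls a few fields out of the reply. The function names `query_stat` and `parse_stat` and the dictionary keys are illustrative assumptions; the field labels are the ones shown in the sample output.

```python
import socket


def query_stat(host, port=2181, timeout=5.0):
    """Send the four-letter `stat` command and return the raw text reply.

    An empty string means the server sent nothing back, which the article
    treats as "the zookeeper server is not running".
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Extract mode, connection count, node count, and latency from `stat` output."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Mode":
            info["mode"] = value
        elif key == "Connections":
            info["connections"] = int(value)
        elif key == "Node count":
            info["node_count"] = int(value)
        elif key == "Latency min/avg/max":
            low, avg, high = (float(x) for x in value.split("/"))
            info["latency"] = {"min": low, "avg": avg, "max": high}
    return info
```

For example, `parse_stat(query_stat("<ZOOKEEPER_HOST_IP>", 2181))` returns an empty dictionary when the server does not answer, and otherwise a dictionary whose `mode` entry is a value such as `follower`. Running it against every zookeeper host on both ports gives a quick view of which servers respond.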
3591

36-
## Resolution
92+
## CPU load peaks every hour
3793

38-
Check ZooKeeper data directory `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2` to find out if the snapshots file size is large. Take the following steps if large snapshots exist:
94+
* Log in to the zookeeper server and check the /etc/crontab
95+
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
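
One possible way to randomize the start time is to derive a stable minute offset from each host name, so that every zookeeper server keeps its own slot within the hour. The sketch below is an illustration only: the host name `zk0-example`, the script path, and the helper names are hypothetical, and hashing does not guarantee that two hosts never land on the same minute.

```python
import hashlib


def staggered_minute(hostname):
    """Map a host name to a stable minute (0-59) within the hour."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60


def crontab_line(hostname, command, user="root"):
    """Build an /etc/crontab entry that runs `command` hourly at the host's minute."""
    # /etc/crontab fields: minute hour day-of-month month day-of-week user command
    return f"{staggered_minute(hostname)} * * * * {user} {command}"


if __name__ == "__main__":
    # Hypothetical host name and script path, for illustration only.
    print(crontab_line("zk0-example", "/usr/local/bin/hourly-task.sh"))
```

Compare the generated minute values across the zookeeper hosts before editing `/etc/crontab`, and adjust by hand if two hosts collide.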
96+
97+
## Purging old snapshots
3998

40-
1. Check the status of other ZooKeeper servers in the same quorum to make sure they are working fine with the command “`echo stat | nc {zk_host_ip} 2181 (or 2182)`”.
99+
* Zookeepers are configured to auto purge old snapshots
100+
* By default, the last 30 snapshots are retained
101+
* The number of snapshots that are retained is controlled by the configuration key `autopurge.snapRetainCount`. This property can be found in the following files:
102+
* `/etc/zookeeper/conf/zoo.cfg` for Hadoop zookeeper
103+
* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
104+
* Set `autopurge.snapRetainCount` to a value of 3 and restart the zookeeper servers
105+
* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
106+
* Stop and restart HDInsight zookeeper manually
107+
* `sudo lsof -i :2182` will give you the process ID to kill
108+
* `sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py`
109+
* Do not purge snapshots manually - deleting snapshots manually could result in data loss
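
Before changing the setting, it can help to confirm what each `zoo.cfg` currently contains. The following Python sketch, which is an illustration and not part of the original article, parses the `key=value` lines of a `zoo.cfg` file and reports `autopurge.snapRetainCount`. The helper names are assumptions; the two file paths are the ones listed above.

```python
def read_zoo_cfg(text):
    """Parse zoo.cfg-style `key=value` lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def snap_retain_count(text):
    """Return autopurge.snapRetainCount as an int, or None if the key is absent."""
    value = read_zoo_cfg(text).get("autopurge.snapRetainCount")
    return int(value) if value is not None else None


if __name__ == "__main__":
    for path in ("/etc/zookeeper/conf/zoo.cfg",
                 "/etc/hdinsight-zookeeper/conf/zoo.cfg"):
        try:
            with open(path, encoding="utf-8") as handle:
                print(path, snap_retain_count(handle.read()))
        except OSError as err:
            print(path, "could not be read:", err)
```

The script only reads the files. Changing the value and restarting the servers still follows the steps listed above.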
41110

42-
1. Login the problematic ZooKeeper host, backup snapshots and transaction logs in `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2`, then cleanup these files in the two directories.
111+
## CancelledKeyException in the zookeeper server log doesn't require snapshot cleanup
43112

44-
1. Restart the problematic ZooKeeper server in Ambari or the ZooKeeper host. Then restart the service which has problems.
113+
* This exception usually means that the client is no longer active and the server is unable to send a message
114+
* This exception also indicates that the zookeeper client is ending sessions prematurely
115+
* Look for the other symptoms outlined in this document
45116

46117
## Next steps
47118

48119
If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
49120

50121
- Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
51-
52122
- Connect with [@AzureSupport](https://twitter.com/azuresupport) - the official Microsoft Azure account for improving customer experience. Connecting the Azure community to the right resources: answers, support, and experts.
53-
54-
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
123+
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
