articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md
93 additions & 24 deletions
@@ -6,49 +6,118 @@ ms.topic: troubleshooting
6
6
author: hrasheed-msft
7
7
ms.author: hrasheed
8
8
ms.reviewer: jasonh
9
-
ms.date: 08/20/2019
9
+
ms.date: 05/20/2020
10
10
---
11
-
12
11
# Apache ZooKeeper server fails to form a quorum in Azure HDInsight
13
12
14
-
This article describes troubleshooting steps and possible resolutions for issues when interacting with Azure HDInsight clusters.
13
+
This article describes troubleshooting steps and possible resolutions for issues related to ZooKeeper servers in Azure HDInsight clusters.
15
14
16
-
## Issue
15
+
## Symptoms
17
16
18
-
Apache ZooKeeper server is unhealthy, symptoms could include: both Resource Managers/Name Nodes are in standby mode, simple HDFS operations do not work, `zkFailoverController` is stopped and cannot be started, Yarn/Spark/Livy jobs fail due to Zookeeper errors. LLAP Daemons may also fail to start on Secure Spark or Interactive Hive clusters. You may see an error message similar to:
17
+
* Both Resource Managers go into standby mode
18
+
* Both NameNodes are in standby mode
19
+
* Spark, Hive, and Yarn jobs, including Hive queries, fail because of ZooKeeper connection failures
20
+
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
28
+
Message
29
+
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
30
+
...
31
+
2020-05-05 03:17:08.3890350|State store operation failed
32
+
2020-05-05 03:17:08.3890350|Transitioning to standby state
30
33
```
31
34
32
-
## Cause
33
-
34
-
When the volume of snapshot files is large or snapshot files are corrupted, ZooKeeper server will fail to form a quorum, which causes ZooKeeper related services unhealthy. ZooKeeper server will not remove old snapshot files from its data directory, instead, it is a periodic task to be performed by users to maintain the healthiness of ZooKeeper. For more information, see [ZooKeeper Strengths and Limitations](https://zookeeper.apache.org/doc/r3.3.5/zookeeperAdmin.html#sc_strengthsAndLimitations).
35
+
## Related issues
36
+
37
+
* High availability services like Yarn, NameNode, and Livy can go down for many reasons.
38
+
* Confirm from the logs that the failure is related to ZooKeeper connections
39
+
* Make sure that the issue happens repeatedly (do not use these solutions for one-off cases)
40
+
* Jobs can fail temporarily due to Zookeeper connection issues
41
+
42
+
## Common causes for Zookeeper failure
43
+
44
+
* High CPU usage on the zookeeper servers
45
+
* If the Ambari UI shows sustained CPU usage near 100% on the ZooKeeper servers, the ZooKeeper sessions that are open during that time can expire and time out
46
+
* Zookeeper clients are reporting frequent timeouts
47
+
* In the logs for Resource Manager, Namenode and others, you will see frequent client connection timeouts
48
+
* This could result in quorum loss, frequent failovers, and other issues
49
+
50
+
## Check for zookeeper status
51
+
52
+
* Find the ZooKeeper servers from the `/etc/hosts` file or from the Ambari UI
53
+
* Run the following command
54
+
* `echo stat | nc <ZOOKEEPER_HOST_IP> 2181` (or 2182)
55
+
* Port 2181 is used by the Apache ZooKeeper instance
56
+
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
57
+
* If the command shows no output, the ZooKeeper servers are not running
58
+
* If the servers are running, the result will include statistics on client connections and other server statistics
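
The status check above can be repeated for every ZooKeeper host on both ports. The following is a minimal sketch, not an official HDInsight script. It assumes that `nc` is installed and accepts the `-w` timeout option, and that the `ZK_HOSTS` variable (a hypothetical name introduced here) holds the space-separated host IPs that you found in `/etc/hosts` or Ambari. The helper name `check_zk` is also illustrative.

```shell
# ZK_HOSTS is a hypothetical variable: set it to the space-separated
# ZooKeeper host IPs taken from /etc/hosts or the Ambari UI.
ZK_HOSTS="${ZK_HOSTS:-}"

# Send the "stat" command to one host and port, then report the
# "Mode:" line of the reply (leader, follower, or standalone).
check_zk() {
  zk_out="$(echo stat | nc -w 5 "$1" "$2" 2>/dev/null)" || true
  if [ -z "$zk_out" ]; then
    # No output means nothing answered on that port.
    echo "$1:$2 no response"
  else
    zk_mode="$(printf '%s\n' "$zk_out" | grep '^Mode:')" || true
    echo "$1:$2 ${zk_mode:-Mode: unknown}"
  fi
}

# Port 2181 is the Apache ZooKeeper instance; 2182 is the HDInsight one.
for zk_host in $ZK_HOSTS; do
  for zk_port in 2181 2182; do
    check_zk "$zk_host" "$zk_port"
  done
done
```

For example, run it as `ZK_HOSTS="<ZOOKEEPER_HOST_IP_1> <ZOOKEEPER_HOST_IP_2>" sh check-zk.sh`, where the script file name is only an example. A host that prints `no response` on a port is the one to investigate first.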
59
+
60
+
```output
61
+
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
Check the ZooKeeper data directories `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2` to find out whether the snapshot files are large. Take the following steps if large snapshots exist:
94
+
* Log in to the ZooKeeper server and check `/etc/crontab`
95
+
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
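
To see which entries in `/etc/crontab` run every hour, and at which minute, something like the following sketch may help. It assumes the standard system crontab layout (minute, hour, day of month, month, day of week, user, command) and treats an entry as hourly when its hour field is `*`. The function name `list_hourly_jobs` is hypothetical.

```shell
# Print the minute, user, and first word of the command for every
# /etc/crontab entry whose hour field is "*" (it runs every hour).
# Comment lines and variable assignments are skipped.
list_hourly_jobs() {
  awk '!/^[[:space:]]*#/ && NF >= 7 && $2 == "*" {
         print "minute=" $1, "user=" $6, "cmd=" $7
       }' "${1:-/etc/crontab}"
}

if [ -r /etc/crontab ]; then
  list_hourly_jobs
fi
```

Comparing the `minute=` values across the ZooKeeper servers shows whether the hourly jobs start at the same time on every host.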
96
+
97
+
## Purging old snapshots
39
98
40
-
1. Check the status of other ZooKeeper servers in the same quorum to make sure they are working fine with the command “`echo stat | nc {zk_host_ip} 2181 (or 2182)`”.
99
+
* ZooKeeper servers are configured to automatically purge old snapshots
100
+
* By default, the last 30 snapshots are retained
101
+
* The number of snapshots that are retained is controlled by the configuration key `autopurge.snapRetainCount`. This property can be found in the following files:
102
+
* `/etc/zookeeper/conf/zoo.cfg` for Hadoop zookeeper
103
+
* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
104
+
* Set `autopurge.snapRetainCount` to a value of 3 and restart the zookeeper servers
105
+
* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
106
+
* Stop and restart HDInsight zookeeper manually
107
+
* `sudo lsof -i :2182` will give you the process ID to kill
* Do not purge snapshots manually. Deleting snapshots manually could result in data loss
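
Before and after changing the setting, it can be useful to confirm what each configuration file currently contains. The sketch below only reads the two `zoo.cfg` files named above and changes nothing. The function name `show_retain_count` is hypothetical.

```shell
# Print the autopurge.snapRetainCount line from a zoo.cfg file,
# or a short note if the key is absent or the file is unreadable.
show_retain_count() {
  grep -E '^[[:space:]]*autopurge\.snapRetainCount[[:space:]]*=' "$1" 2>/dev/null \
    || echo "autopurge.snapRetainCount not set in $1"
}

# Check the Hadoop ZooKeeper and HDInsight ZooKeeper configurations.
for zk_cfg in /etc/zookeeper/conf/zoo.cfg /etc/hdinsight-zookeeper/conf/zoo.cfg; do
  echo "$zk_cfg: $(show_retain_count "$zk_cfg")"
done
```

Run the check again after the restart to verify that both files report the value you set.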
41
110
42
-
1. Login the problematic ZooKeeper host, backup snapshots and transaction logs in `/hadoop/zookeeper/version-2` and `/hadoop/hdinsight-zookeeper/version-2`, then cleanup these files in the two directories.
111
+
## CancelledKeyException in the zookeeper server log doesn't require snapshot cleanup
43
112
44
-
1. Restart the problematic ZooKeeper server in Ambari or the ZooKeeper host. Then restart the service which has problems.
113
+
* This exception usually means that the client is no longer active and the server is unable to send a message
114
+
* This exception also indicates that the zookeeper client is ending sessions prematurely
115
+
* Look for the other symptoms outlined in this document
45
116
46
117
## Next steps
47
118
48
119
If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
49
120
50
121
- Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
51
-
52
122
- Connect with [@AzureSupport](https://twitter.com/azuresupport) - the official Microsoft Azure account for improving customer experience. Connecting the Azure community to the right resources: answers, support, and experts.
53
-
54
-
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
123
+
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).