articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md
+35 -34 (35 additions & 34 deletions)
@@ -16,14 +16,14 @@ This article describes troubleshooting steps and possible resolutions for issues
 
 * Both the resource managers go to standby mode
 * Namenodes are both in standby mode
-* Spark / Hive / Yarn jobs or Hive queries fail because of Zookeeper connection failures
+* Spark, Hive, and Yarn jobs or Hive queries fail because of Zookeeper connection failures
 * LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
 
 ## Sample log
 
 You may see an error message similar to:
 
-```
+```output
 2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
 Message
 2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
@@ -38,25 +38,26 @@ Message
 * Confirm from the logs that it is related to Zookeeper connections
 * Make sure that the issue happens repeatedly (do not use these solutions for one off cases)
 * Jobs can fail temporarily due to Zookeeper connection issues
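The triage checks above could be scripted roughly as follows. This is a minimal sketch, not part of the article: it assumes the log lines are already available as strings, it reuses two message signatures from the sample log as match patterns, and the `min_hits` threshold for "happens repeatedly" is a hypothetical value to tune.

```python
import re
from collections import Counter

# Signatures copied from the sample log in this article; extend as needed.
ZK_PATTERNS = {
    "lost_contact": re.compile(r"Lost contact with Zookeeper", re.IGNORECASE),
    "keeper_exception": re.compile(r"org\.apache\.zookeeper\.KeeperException"),
}


def count_zk_errors(lines):
    """Count how many log lines match each Zookeeper failure signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in ZK_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def is_recurring(counts, min_hits=3):
    """Hypothetical rule of thumb: min_hits or more matches means 'repeated'."""
    return sum(counts.values()) >= min_hits
```

A single match points to a one-off failure, which the list above says to leave alone; only a recurring pattern justifies the remaining steps.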
-
 
 ## Common causes for Zookeeper failure
 
 * High CPU usage on the zookeeper servers
-* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and timeout
+* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and time out
 * Zookeeper clients are reporting frequent timeouts
-* In the logs for resource manager, namenode and others, you will see frequent client connection timeouts
-* This could result in quorum loss, frequent failovers and other issues
+* In the logs for Resource Manager, Namenode and others, you will see frequent client connection timeouts
+* This could result in quorum loss, frequent failovers, and other issues
 
 ## Check for zookeeper status
-* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
-* Run the following command
-* echo stat | nc {zk_host_ip} 2181 (or 2182)
-* Port 2181 is the apache zookeeper instance
-* Port 2182 is used by the HDI zookeeper (to provide HA for services that are not natively HA)
-* If the command shows no output, then it means that the zookeeper servers are not running
-* If the servers are running, the result will include statics of client connections and other statistics
-````
+
+* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
+* Run the following command
+* `echo stat | nc <ZOOKEEPER_HOST_IP> 2181` (or 2182)
+* Port 2181 is the Apache zookeeper instance
+* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
+* If the command shows no output, then it means that the zookeeper servers are not running
+* If the servers are running, the result will include client connection statistics and other statistics
+
+```output
 Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
 Clients:
 /10.2.0.57:50988[1](queued=0,recved=715,sent=715)
@@ -86,37 +87,37 @@ Outstanding: 0
 Zxid: 0x1004f99be
 Mode: follower
 Node count: 133212
-````
+```
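The `stat` check could also be run from Python instead of `nc`. The sketch below is an illustration under assumptions: the helper names `query_stat` and `parse_stat` are hypothetical, and the parsing relies only on the `key: value` layout visible in the sample output.

```python
import socket


def query_stat(host, port=2181, timeout=5.0):
    """Send the `stat` command and return the raw reply ('' if the server sends nothing).

    Raises OSError (for example ConnectionRefusedError) if nothing listens on the port.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection: reply is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Split a `stat` reply into a dict of 'key: value' fields and a list of client lines."""
    fields, clients = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("/"):  # client entries look like /10.2.0.57:50988[1](...)
            clients.append(line)
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():  # skip bare section headers such as "Clients:"
            fields[key.strip()] = value.strip()
    return fields, clients
```

For example, `fields, clients = parse_stat(query_stat("<ZOOKEEPER_HOST_IP>", 2181))` gives `fields["Mode"]` as `follower` for a reply like the sample above. An empty reply corresponds to the "no output" case in the list.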
+
 ## CPU load peaks up every hour
 
 * Log in to the zookeeper server and check the /etc/crontab
 * If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
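One way to pick distinct start minutes is sketched below. Note the substitution: it uses evenly staggered offsets rather than random ones, so that no two servers share a minute. The host names and the job path in the usage note are hypothetical, and the generated line follows the `/etc/crontab` layout (minute, hour, day of month, month, day of week, user, command).

```python
def stagger_minutes(hosts):
    """Assign each host a distinct minute of the hour, spaced evenly across 0-59."""
    ordered = sorted(hosts)
    n = len(ordered)
    return {host: (i * 60) // n for i, host in enumerate(ordered)}


def crontab_line(minute, command, user="root"):
    """Build an hourly /etc/crontab entry that fires at the given minute."""
    return f"{minute} * * * * {user} {command}"
```

With three hypothetical servers `zk0`, `zk1`, and `zk2`, `stagger_minutes` yields minutes 0, 20, and 40, and `crontab_line(20, "/usr/local/bin/hourly-job.sh")` produces the entry to place on the second server.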
 
 ## Purging old snapshots
-* Zookeepers are configured to auto purge old snapshots
-* By default, last 30 snapshots are retained
-* This is controlled by the configuration key autopurge.snapRetainCount
-* /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
-* /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
-* Set this to a value of 3 and restart the zookeeper servers
-* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
-* HDI zookeeper has to be stopped manually and restarted manually
-* sudo lsof -i :2182 will give you the process id to kill
-* **DO NOT delete the snapshot files directly as this could result in data loss**
-
+
+* Zookeepers are configured to auto purge old snapshots
+* By default, the last 30 snapshots are retained
+* The number of snapshots that are retained is controlled by the configuration key `autopurge.snapRetainCount`. This property can be found in the following files:
+* `/etc/zookeeper/conf/zoo.cfg` for Hadoop zookeeper
+* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
+* Set `autopurge.snapRetainCount` to a value of 3 and restart the zookeeper servers
+* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
+* Stop and restart HDInsight zookeeper manually
+* `sudo lsof -i :2182` will give you the process ID to kill
+* Do not purge snapshots manually - deleting snapshots manually could result in data loss
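Before changing the setting, it can help to confirm what each `zoo.cfg` currently contains. The sketch below is a hedged illustration: `read_snap_retain_count` is a hypothetical helper, and it assumes the file uses plain `key=value` lines with `#` comments. It only reads the text it is given and never touches snapshot files.

```python
def read_snap_retain_count(cfg_text):
    """Return autopurge.snapRetainCount from zoo.cfg text, or None if it is not set.

    If the key appears more than once, the last occurrence wins.
    """
    result = None
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "autopurge.snapRetainCount":
            result = int(value.strip())
    return result
```

For example, passing the contents of `/etc/zookeeper/conf/zoo.cfg` or `/etc/hdinsight-zookeeper/conf/zoo.cfg` shows whether the value is already 3 or still needs the change and restart described above.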
+
 ## CancelledKeyException in the zookeeper server log doesn't require snapshot cleanup
-* This exception usually means that the client is no longer active and the server is unable to send a message
-* This is indication of a symptom that the zookeeper client is terminating sessions prematurely
-* Look for the other symptoms outlined in this document
+
+* This exception usually means that the client is no longer active and the server is unable to send a message
+* This exception also indicates that the zookeeper client is ending sessions prematurely
+* Look for the other symptoms outlined in this document
 
 ## Next steps
 
 If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
 
 - Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
-
 - Connect with [@AzureSupport](https://twitter.com/azuresupport) - the official Microsoft Azure account for improving customer experience. Connecting the Azure community to the right resources: answers, support, and experts.
-
-- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
+- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).