Commit 9b2fdc1

committed
updates
1 parent 219104a commit 9b2fdc1

1 file changed

+35
-34
lines changed

articles/hdinsight/spark/zookeeper-troubleshoot-quorum-fails.md

Lines changed: 35 additions & 34 deletions
Original file line number | Diff line number | Diff line change
@@ -16,14 +16,14 @@ This article describes troubleshooting steps and possible resolutions for issues
1616

1717
* Both the resource managers go to standby mode
1818
* Namenodes are both in standby mode
19-
* Spark / Hive / Yarn jobs or Hive queries fail because of Zookeeper connection failures
19+
* Spark, Hive, and Yarn jobs or Hive queries fail because of Zookeeper connection failures
2020
* LLAP daemons fail to start on secure Spark or secure interactive Hive clusters
2121

2222
## Sample log
2323

2424
You may see an error message similar to:
2525

26-
```
26+
```output
2727
2020-05-05 03:17:18.3916720|Lost contact with Zookeeper. Transitioning to standby in 10000 ms if connection is not reestablished.
2828
Message
2929
2020-05-05 03:17:07.7924490|Received RMFatalEvent of type STATE_STORE_FENCED, caused by org.apache.zookeeper.KeeperException$NoAuthException: KeeperErrorCode = NoAuth
@@ -38,25 +38,26 @@ Message
3838
* Confirm from the logs that it is related to Zookeeper connections
3939
* Make sure that the issue happens repeatedly (do not use these solutions for one-off cases)
4040
* Jobs can fail temporarily due to Zookeeper connection issues
41-
4241

4342
## Common causes for Zookeeper failure
4443

4544
* High CPU usage on the zookeeper servers
46-
* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and timeout
45+
* In the Ambari UI, if you see near 100% sustained CPU usage on the zookeeper servers, then the zookeeper sessions open during that time can expire and time out
4746
* Zookeeper clients are reporting frequent timeouts
48-
* In the logs for resource manager, namenode and others, you will see frequent client connection timeouts
49-
* This could result in quorum loss, frequent failovers and other issues
47+
* In the logs for Resource Manager, Namenode and others, you will see frequent client connection timeouts
48+
* This could result in quorum loss, frequent failovers, and other issues
5049

5150
## Check for zookeeper status
52-
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53-
* Run the following command
54-
* echo stat | nc {zk_host_ip} 2181 (or 2182)
55-
* Port 2181 is the apache zookeeper instance
56-
* Port 2182 is used by the HDI zookeeper (to provide HA for services that are not natively HA)
57-
* If the command shows no output, then it means that the zookeeper servers are not running
58-
* If the servers are running, the result will include statics of client connections and other statistics
59-
````
51+
52+
* Find the zookeeper servers from the /etc/hosts file or from Ambari UI
53+
* Run the following command
54+
* `echo stat | nc <ZOOKEEPER_HOST_IP> 2181` (or 2182)
55+
* Port 2181 is the Apache zookeeper instance
56+
* Port 2182 is used by the HDInsight zookeeper (to provide HA for services that are not natively HA)
57+
* If the command shows no output, then it means that the zookeeper servers are not running
58+
* If the servers are running, the result will include client connection details and other statistics
59+
60+
```output
6061
Zookeeper version: 3.4.6-8--1, built on 12/05/2019 12:55 GMT
6162
Clients:
6263
/10.2.0.57:50988[1](queued=0,recved=715,sent=715)
@@ -86,37 +87,37 @@ Outstanding: 0
8687
Zxid: 0x1004f99be
8788
Mode: follower
8889
Node count: 133212
89-
````
90+
```
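
The status check above can also be scripted. The following is a minimal sketch, not part of the original article, that sends the same `stat` command over a plain TCP socket and pulls out fields such as `Mode` and `Node count`. The function names `query_stat` and `parse_stat` are hypothetical, and the sketch assumes the server accepts the four-letter `stat` command on the port you pass in.

```python
import socket


def query_stat(host, port=2181, timeout=5.0):
    """Send 'stat' to a zookeeper server and return the raw reply.

    Returns an empty string when the connection fails or times out,
    which mirrors the 'no output' case described for the nc command.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"stat")
            chunks = []
            while True:
                data = sock.recv(4096)
                if not data:  # server closes the connection after replying
                    break
                chunks.append(data)
    except OSError:
        return ""
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Turn 'Key: value' lines of a stat reply into a dict.

    Per-client lines (they start with '/') are skipped.
    """
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields
```

For example, `parse_stat(query_stat("<ZOOKEEPER_HOST_IP>", 2182)).get("Mode")` would return the reported mode for the HDInsight zookeeper, or `None` if the server did not answer.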
91+
9092
## CPU load peaks up every hour
9193

9294
* Log in to the zookeeper server and check the /etc/crontab
9395
* If there are any hourly jobs running at this time, randomize the start time across different zookeeper servers.
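
One way to randomize the start time is to derive a different minute for each server from its hostname. The sketch below is an illustration only and is not from the original article: the helper names and the example hostname are hypothetical, and a hash does not guarantee that two servers never land on the same minute, so check the resulting values before using them.

```python
import hashlib


def hourly_minute_offset(hostname):
    """Map a hostname to a stable minute in the range 0-59."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 60


def crontab_line(hostname, command, user="root"):
    """Build an /etc/crontab entry that runs hourly at the host's minute.

    /etc/crontab fields: minute hour day-of-month month day-of-week user command
    """
    minute = hourly_minute_offset(hostname)
    return f"{minute} * * * * {user} {command}"
```

Calling `crontab_line("zk0-example", "/path/to/hourly-job.sh")` on each zookeeper server (with that server's own hostname) yields an entry whose minute is stable for that host.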
9496

9597
## Purging old snapshots
96-
* Zookeepers are configured to auto purge old snapshots
97-
* By default, last 30 snapshots are retained
98-
* This is controlled by the configuration key autopurge.snapRetainCount
99-
* /etc/zookeeper/conf/zoo.cfg for hadoop zookeeper
100-
* /etc/hdinsight-zookeeper/conf/zoo.cfg for HDI zookeeper
101-
* Set this to a value of 3 and restart the zookeeper servers
102-
* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
103-
* HDI zookeeper has to be stopped manually and restarted manually
104-
* sudo lsof -i :2182 will give you the process id to kill
105-
* sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py
106-
* Manually purging snapshots is not required
107-
* **DO NOT delete the snapshot files directly as this could result in data loss**
108-
98+
99+
* Zookeepers are configured to auto purge old snapshots
100+
* By default, the last 30 snapshots are retained
101+
* The number of snapshots that are retained is controlled by the configuration key `autopurge.snapRetainCount`. This property can be found in the following files:
102+
* `/etc/zookeeper/conf/zoo.cfg` for Hadoop zookeeper
103+
* `/etc/hdinsight-zookeeper/conf/zoo.cfg` for HDInsight zookeeper
104+
* Set `autopurge.snapRetainCount` to a value of 3 and restart the zookeeper servers
105+
* Hadoop zookeeper config can be updated and the service can be restarted through Ambari
106+
* Stop and restart HDInsight zookeeper manually
107+
* `sudo lsof -i :2182` will give you the process ID to kill
108+
* `sudo python /opt/startup_scripts/startup_hdinsight_zookeeper.py`
109+
* Do not purge snapshots manually - deleting snapshots manually could result in data loss
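
Before changing the setting, it can help to confirm the current value on each server. The following is a read-only sketch, not part of the original article, that looks up `autopurge.snapRetainCount` in the text of a `zoo.cfg` file. The function name `snap_retain_count` is hypothetical, and the sketch assumes the file uses plain `key=value` lines with `#` comments.

```python
def snap_retain_count(cfg_text):
    """Return autopurge.snapRetainCount from zoo.cfg text, or None if unset."""
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "autopurge.snapRetainCount":
            return int(value.strip())
    return None


def read_retain_count(path):
    """Read a zoo.cfg file from disk and return the retained snapshot count."""
    with open(path, encoding="utf-8") as handle:
        return snap_retain_count(handle.read())
```

For example, `read_retain_count("/etc/hdinsight-zookeeper/conf/zoo.cfg")` reports the value for the HDInsight zookeeper. The sketch does not modify the file; apply changes through Ambari or the manual restart steps listed above.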
110+
109111
## CancelledKeyException in the zookeeper server log doesn't require snapshot cleanup
110-
* This exception usually means that the client is no longer active and the server is unable to send a message
111-
* This is indication of a symptom that the zookeeper client is terminating sessions prematurely
112-
* Look for the other symptoms outlined in this document
112+
113+
* This exception usually means that the client is no longer active and the server is unable to send a message
114+
* This exception also indicates that the zookeeper client is ending sessions prematurely
115+
* Look for the other symptoms outlined in this document
113116

114117
## Next steps
115118

116119
If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
117120

118121
- Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
119-
120122
- Connect with [@AzureSupport](https://twitter.com/azuresupport) - the official Microsoft Azure account for improving customer experience. Connecting the Azure community to the right resources: answers, support, and experts.
121-
122-
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
123+
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-portal/supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
