Commit 39281d2

Merge pull request #85738 from dagiro/ts_hbase14
Ts hbase14
2 parents 7f61266 + 9129af6 commit 39281d2

3 files changed

+130
-34
lines changed

articles/hdinsight/hbase/apache-troubleshoot-hbase.md

Lines changed: 114 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -6,45 +6,139 @@ author: hrasheed-msft
66
ms.author: hrasheed
77
ms.custom: hdinsightactive, seodec18
88
ms.topic: troubleshooting
9-
ms.date: 08/14/2019
9+
ms.date: 08/16/2019
1010
---
1111

1212
# Troubleshoot Apache HBase by using Azure HDInsight
1313

1414
Learn about the top issues and their resolutions when working with Apache HBase payloads in Apache Ambari.
1515

16-
## How do I run hbck command reports with multiple unassigned regions?
16+
## How do I fix JDBC or SQLLine connectivity issues with Apache Phoenix?
1717

18-
A common error message that you might see when you run the `hbase hbck` command is "multiple regions being unassigned or holes in the chain of regions."
18+
### Resolution steps
19+
20+
To connect with Apache Phoenix, you must provide the IP address of an active Apache ZooKeeper node. Ensure that the ZooKeeper service to which sqlline.py is trying to connect is up and running.
21+
1. Sign in to the HDInsight cluster by using SSH.
22+
2. Enter the following command:
23+
24+
```apache
25+
/usr/hdp/current/phoenix-client/bin/sqlline.py <IP of machine where Active ZooKeeper is running>
26+
```
27+
28+
> [!Note]
29+
> You can get the IP address of the active ZooKeeper node from the Ambari UI. Go to **HBase** > **Quick Links** > **ZK\* (Active)** > **Zookeeper Info**.
30+
31+
3. If sqlline.py connects to Phoenix and does not time out, run the following commands to validate the availability and health of Phoenix:
32+
33+
```apache
34+
!tables
35+
!quit
36+
```
37+
4. If these commands work, Phoenix is healthy, and the IP address that was originally provided might have been incorrect. However, if the command pauses for an extended time and then displays the following error, continue to step 5.
1938

20-
In the HBase Master UI, you can see the number of regions that are unbalanced across all region servers. Then, you can run `hbase hbck` command to see holes in the region chain.
39+
```apache
40+
Error while connecting to sqlline.py (Hbase - phoenix) Setting property: [isolation, TRANSACTION_READ_COMMITTED] issuing: !connect jdbc:phoenix:10.2.0.7 none none org.apache.phoenix.jdbc.PhoenixDriver Connecting to jdbc:phoenix:10.2.0.7 SLF4J: Class path contains multiple SLF4J bindings.
41+
```
2142

22-
Holes might be caused by the offline regions, so fix the assignments first.
43+
5. Run the following commands from the head node (hn0) to diagnose the condition of the Phoenix SYSTEM.CATALOG table:
2344

24-
To bring the unassigned regions back to a normal state, complete the following steps:
45+
```apache
46+
hbase shell
47+
48+
count 'SYSTEM.CATALOG'
49+
```
2550

26-
1. Sign in to the HDInsight HBase cluster by using SSH.
27-
2. To connect with the Apache ZooKeeper shell, run the `hbase zkcli` command.
28-
3. Run the `rmr /hbase/regions-in-transition` command or the `rmr /hbase-unsecure/regions-in-transition` command.
29-
4. To exit from the `hbase zkcli` shell, use the `exit` command.
30-
5. Open the Apache Ambari UI, and then restart the Active HBase Master service.
31-
6. Run the `hbase hbck` command again (without any options). Check the output of this command to ensure that all regions are being assigned.
51+
The command should return an error similar to the following:
3252

53+
```apache
54+
ERROR: org.apache.hadoop.hbase.NotServingRegionException: Region SYSTEM.CATALOG,,1485464083256.c0568c94033870c517ed36c45da98129. is not online on 10.2.0.5,16020,1489466172189)
55+
```
56+
6. In the Apache Ambari UI, complete the following steps to restart the HMaster service on all ZooKeeper nodes:
3357

34-
## <a name="how-do-i-fix-timeout-issues-with-hbck-commands-for-region-assignments"></a>How do I fix timeout issues when using hbck commands for region assignments?
58+
1. In the **Summary** section of HBase, go to **HBase** > **Active HBase Master**.
59+
2. In the **Components** section, restart the HBase Master service.
60+
3. Repeat these steps for all remaining **Standby HBase Master** services.
61+
62+
It can take up to five minutes for the HBase Master service to stabilize and finish the recovery process. After a few minutes, repeat the sqlline.py commands to confirm that the SYSTEM.CATALOG table is up, and that it can be queried.
63+
64+
When the SYSTEM.CATALOG table is back to normal, the connectivity issue to Phoenix should be automatically resolved.
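
The ZooKeeper availability check that the first step relies on can be scripted before sqlline.py is launched. The following is a minimal sketch, not part of the original article. It assumes that `nc` is installed, that ZooKeeper listens on its default client port 2181, and that the `ruok` four-letter command is enabled on the cluster. The function names `zk_reply_ok` and `zk_is_up` are hypothetical.

```shell
# Hypothetical helper: succeeds only when the reply is ZooKeeper's "imok".
zk_reply_ok() {
  [ "$1" = "imok" ]
}

# Hypothetical helper: $1 = ZooKeeper IP, $2 = client port (assumed default: 2181).
# Sends the "ruok" four-letter command and checks the reply.
zk_is_up() {
  zk_reply_ok "$(echo ruok | nc -w 5 "$1" "${2:-2181}" 2>/dev/null)"
}
```

For example, `zk_is_up 10.2.0.7 && /usr/hdp/current/phoenix-client/bin/sqlline.py 10.2.0.7` would start sqlline.py only if the node answers, using the IP address from the error example above.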
65+
66+
## What causes a restart failure on a region server?
3567

3668
### Issue
3769

38-
A potential cause for timeout issues when you use the `hbck` command might be that several regions are in the "in transition" state for a long time. You can see those regions as offline in the HBase Master UI. Because a high number of regions are attempting to transition, HBase Master might timeout and be unable to bring those regions back online.
70+
A restart failure on a region server might be prevented by following best practices. We recommend that you pause heavy workload activity when you are planning to restart HBase region servers. If an application continues to connect with region servers when shutdown is in progress, the region server restart operation will be slower by several minutes. Also, it's a good idea to first flush all the tables. For a reference on how to flush tables, see [HDInsight HBase: How to improve the Apache HBase cluster restart time by flushing tables](https://web.archive.org/web/20190112153155/https://blogs.msdn.microsoft.com/azuredatalake/2016/09/19/hdinsight-hbase-how-to-improve-hbase-cluster-restart-time-by-flushing-tables/).
71+
72+
If you initiate the restart operation on HBase region servers from the Apache Ambari UI, you immediately see that the region servers went down, but they don't restart right away.
73+
74+
Here's what's happening behind the scenes:
75+
76+
1. The Ambari agent sends a stop request to the region server.
77+
2. The Ambari agent waits for 30 seconds for the region server to shut down gracefully.
78+
3. If your application continues to connect with the region server, the server won't shut down immediately. The 30-second timeout expires before shutdown occurs.
79+
4. After 30 seconds, the Ambari agent sends a force-kill (`kill -9`) command to the region server. You can see this in the ambari-agent log (in the /var/log/ directory of the respective worker node):
80+
81+
```apache
82+
2017-03-21 13:22:09,171 - Execute['/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/current/hbase-regionserver/conf stop regionserver'] {'only_if': 'ambari-sudo.sh -H -E t
83+
est -f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` >/dev/null 2>&1', 'on_timeout': '! ( ambari-sudo.sh -H -E test -
84+
f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` >/dev/null 2>&1 ) || ambari-sudo.sh -H -E kill -9 `ambari-sudo.sh -H
85+
-E cat /var/run/hbase/hbase-hbase-regionserver.pid`', 'timeout': 30, 'user': 'hbase'}
86+
2017-03-21 13:22:40,268 - Executing '! ( ambari-sudo.sh -H -E test -f /var/run/hbase/hbase-hbase-regionserver.pid && ps -p `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid` >
87+
/dev/null 2>&1 ) || ambari-sudo.sh -H -E kill -9 `ambari-sudo.sh -H -E cat /var/run/hbase/hbase-hbase-regionserver.pid`'. Reason: Execution of 'ambari-sudo.sh su hbase -l -s /bin/bash -c 'export
88+
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/var/lib/ambari-agent ; /usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh --config /usr/hdp/curre
89+
nt/hbase-regionserver/conf stop regionserver was killed due timeout after 30 seconds
90+
2017-03-21 13:22:40,285 - File['/var/run/hbase/hbase-hbase-regionserver.pid'] {'action': ['delete']}
91+
2017-03-21 13:22:40,285 - Deleting File['/var/run/hbase/hbase-hbase-regionserver.pid']
92+
```
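
To check whether a region server was force-killed in this way, the log can be searched for the timeout message shown above. This is a minimal sketch: the function name `count_forced_kills` is hypothetical, and the exact log file path is an assumption that may differ on your cluster.

```shell
# Hypothetical helper: prints how many lines of the log file ($1) report
# that a stop command was killed after the timeout.
# "|| true" keeps the exit status at 0 when there are no matches
# (grep -c prints 0 but exits with status 1 in that case).
count_forced_kills() {
  grep -c 'was killed due timeout' "$1" || true
}
```

A possible invocation, assuming the agent log is at `/var/log/ambari-agent/ambari-agent.log`, is `count_forced_kills /var/log/ambari-agent/ambari-agent.log`.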
93+
Because of the abrupt shutdown, the port associated with the process might not be released, even though the region server process is stopped. This situation can lead to a `java.net.BindException` ("Address already in use") when the region server is starting, as shown in the following logs. You can verify this in the region-server.log in the /var/log/hbase directory on the worker nodes where the region server fails to start.
94+
95+
```apache
96+
97+
2017-03-21 13:25:47,061 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
98+
java.lang.RuntimeException: Failed construction of Regionserver: class org.apache.hadoop.hbase.regionserver.HRegionServer
99+
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2636)
100+
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:64)
101+
at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
102+
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
103+
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
104+
at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
105+
106+
Caused by: java.lang.reflect.InvocationTargetException
107+
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
108+
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
109+
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
110+
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
111+
at org.apache.hadoop.hbase.regionserver.HRegionServer.constructRegionServer(HRegionServer.java:2634)
112+
... 5 more
113+
114+
Caused by: java.net.BindException: Problem binding to /10.2.0.4:16020 : Address already in use
115+
at org.apache.hadoop.hbase.ipc.RpcServer.bind(RpcServer.java:2497)
116+
at org.apache.hadoop.hbase.ipc.RpcServer$Listener.<init>(RpcServer.java:580)
117+
at org.apache.hadoop.hbase.ipc.RpcServer.<init>(RpcServer.java:1982)
118+
at org.apache.hadoop.hbase.regionserver.RSRpcServices.<init>(RSRpcServices.java:863)
119+
at org.apache.hadoop.hbase.regionserver.HRegionServer.createRpcServices(HRegionServer.java:632)
120+
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:532)
121+
... 10 more
122+
123+
Caused by: java.net.BindException: Address already in use
124+
at sun.nio.ch.Net.bind0(Native Method)
125+
at sun.nio.ch.Net.bind(Net.java:463)
126+
at sun.nio.ch.Net.bind(Net.java:455)
127+
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
128+
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
129+
at org.apache.hadoop.hbase.ipc.RpcServer.bind(RpcServer.java:2495)
130+
... 15 more
131+
```
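
Before you start the region server again, you can check whether the port from the log above (16020) is still bound. The following is a minimal sketch that assumes the `ss` utility is available on the worker node. The function name `port_is_listening` is hypothetical.

```shell
# Hypothetical helper: reads `ss -ltn` output on stdin and succeeds if any
# local address (fourth column) ends in ":<port>", where <port> is $1.
port_is_listening() {
  awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}
```

For example, `ss -ltn | port_is_listening 16020 && echo "port 16020 is still in use"` reports a port that has not been released yet.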
39132

40133
### Resolution steps
41134

42-
1. Sign in to the HDInsight HBase cluster by using SSH.
43-
2. To connect with the Apache ZooKeeper shell, run the `hbase zkcli` command.
44-
3. Run the `rmr /hbase/regions-in-transition` or the `rmr /hbase-unsecure/regions-in-transition` command.
45-
4. To exit the `hbase zkcli` shell, use the `exit` command.
46-
5. In the Ambari UI, restart the Active HBase Master service.
47-
6. Run the `hbase hbck -fixAssignments` command again.
135+
1. Try to reduce the load on the HBase region servers before you initiate a restart.
136+
2. Alternatively (if step 1 doesn't help), try to manually restart region servers on the worker nodes by using the following commands:
137+
138+
```apache
139+
sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh stop regionserver"
140+
sudo su - hbase -c "/usr/hdp/current/hbase-regionserver/bin/hbase-daemon.sh start regionserver"
141+
```
48142

49143
## Next steps
50144

articles/hdinsight/hbase/hbase-troubleshoot-timeouts-hbase-hbck.md

Lines changed: 10 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ ms.service: hdinsight
55
ms.topic: troubleshooting
66
author: hrasheed-msft
77
ms.author: hrasheed
8-
ms.date: 08/01/2019
8+
ms.date: 08/16/2019
99
---
1010

1111
# Scenario: Timeouts with 'hbase hbck' command in Azure HDInsight
@@ -18,28 +18,30 @@ Encounter timeouts with `hbase hbck` command when fixing region assignments.
1818

1919
## Cause
2020

21-
The potential cause here could be several regions under "in transition" state for a long time. Those regions can be seen as offline from Apache HBase Master UI. Due to high number of regions that are attempting to transition, HBase Master could time out and will be unable to bring those regions back to online state.
21+
A potential cause for timeout issues when you use the `hbck` command might be that several regions are in the "in transition" state for a long time. You can see those regions as offline in the HBase Master UI. Because a high number of regions are attempting to transition, HBase Master might time out and be unable to bring those regions back online.
2222

2323
## Resolution
2424

25-
1. Sign in to HDInsight HBase cluster using SSH.
25+
1. Sign in to the HDInsight HBase cluster using SSH.
2626

27-
1. Run `hbase zkcli` command to connect with zookeeper shell.
27+
1. Run the `hbase zkcli` command to connect to the Apache ZooKeeper shell.
2828

2929
1. Run `rmr /hbase/regions-in-transition` or `rmr /hbase-unsecure/regions-in-transition` command.
3030

3131
1. Exit from `hbase zkcli` shell by using `exit` command.
3232

33-
1. Open Ambari UI and restart Active HBase Master service from Ambari.
33+
1. From the Apache Ambari UI, restart the Active HBase Master service.
34+
35+
1. Run the `hbase hbck -fixAssignments` command.
3436

3537
1. Monitor the "region in transition" section of the HBase Master UI to make sure that no region gets stuck.
3638

3739
## Next steps
3840

3941
If you didn't see your problem or are unable to solve your issue, visit one of the following channels for more support:
4042

41-
* Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
43+
- Get answers from Azure experts through [Azure Community Support](https://azure.microsoft.com/support/community/).
4244

43-
* Connect with [@AzureSupport](https://twitter.com/azuresupport) - the official Microsoft Azure account for improving customer experience by connecting the Azure community to the right resources: answers, support, and experts.
45+
- Connect with [@AzureSupport](https://twitter.com/azuresupport), the official Microsoft Azure account for improving customer experience by connecting the Azure community to the right resources: answers, support, and experts.
4446

45-
* If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, please review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).
47+
- If you need more help, you can submit a support request from the [Azure portal](https://portal.azure.com/?#blade/Microsoft_Azure_Support/HelpAndSupportBlade/). Select **Support** from the menu bar or open the **Help + support** hub. For more detailed information, review [How to create an Azure support request](https://docs.microsoft.com/azure/azure-supportability/how-to-create-azure-support-request). Access to Subscription Management and billing support is included with your Microsoft Azure subscription, and Technical Support is provided through one of the [Azure Support Plans](https://azure.microsoft.com/support/plans/).

articles/hdinsight/hbase/hbase-troubleshoot-unassigned-regions.md

Lines changed: 6 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -5,7 +5,7 @@ ms.service: hdinsight
55
ms.topic: troubleshooting
66
author: hrasheed-msft
77
ms.author: hrasheed
8-
ms.date: 08/07/2019
8+
ms.date: 08/16/2019
99
---
1010

1111
# Issues with region servers in Azure HDInsight
@@ -22,7 +22,7 @@ When running `hbase hbck` command, you see an error message similar to:
2222
multiple regions being unassigned or holes in the chain of regions
2323
```
2424

25-
From the Apache HBase Master UI, it can be seen that the count of regions being unbalanced across all the region servers.
25+
From the Apache HBase Master UI, you can see the number of regions that are unbalanced across all region servers. Then, you can run the `hbase hbck` command to see holes in the region chain.
2626

2727
### Cause
2828

@@ -32,15 +32,15 @@ Holes may be the result of offline regions.
3232

3333
Fix the assignments. Follow the steps below to bring the unassigned regions back to normal state:
3434

35-
1. Sign in to HDInsight HBase cluster using SSH.
35+
1. Sign in to the HDInsight HBase cluster using SSH.
3636

37-
1. Run `hbase zkcli` command to connect with zookeeper shell.
37+
1. Run the `hbase zkcli` command to connect to the ZooKeeper shell.
3838

3939
1. Run `rmr /hbase/regions-in-transition` or `rmr /hbase-unsecure/regions-in-transition` command.
4040

4141
1. Exit zookeeper shell by using `exit` command.
4242

43-
1. Open Ambari UI and restart Active HBase Master service from Ambari.
43+
1. Open the Apache Ambari UI, and then restart the Active HBase Master service.
4444

4545
1. Run `hbase hbck` command again (without any further options). Check the output and ensure that all regions are being assigned.
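
The final check of the `hbase hbck` output can be automated. This is a minimal sketch that assumes `hbase hbck` ends its report with a `Status: OK` line when it finds no inconsistencies. Verify that assumption against the output of your HBase version. The function name `hbck_status_ok` is hypothetical.

```shell
# Hypothetical helper: reads hbck output on stdin and succeeds only if it
# contains a line that starts with "Status: OK" (assumed summary format).
hbck_status_ok() {
  grep -q '^Status: OK'
}
```

For example, `hbase hbck 2>/dev/null | hbck_status_ok && echo "no inconsistencies reported"`.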
4646

@@ -56,7 +56,7 @@ Region servers fail to start.
5656

5757
Multiple splitting WAL directories.
5858

59-
1. Get list of current wals: `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out`.
59+
1. Get a list of the current WALs: `hadoop fs -ls -R /hbase/WALs/ > /tmp/wals.out`.
6060

6161
1. Inspect the `wals.out` file. If there are too many splitting directories (names ending in `-splitting`), the region server is probably failing because of these directories.
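
Counting the splitting directories in `wals.out` can be scripted instead of done by eye. This is a minimal sketch that assumes the usual `hadoop fs -ls -R` listing format, in which the first field is the permission string and the last field is the path. The function name `count_splitting_dirs` is hypothetical.

```shell
# Hypothetical helper: prints the number of directory entries in the listing
# file ($1) whose path ends in "-splitting". Files inside those directories
# are not counted because their permission string does not start with "d".
count_splitting_dirs() {
  awk '$1 ~ /^d/ && $NF ~ /-splitting$/ { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_splitting_dirs /tmp/wals.out` prints the count for the file created in the first step.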
6262
