Commit 3d080dc

authored
Update troubleshoot-data-retention-issues-expired-data.md
1 parent 830726f commit 3d080dc

1 file changed

+21
-20
lines changed

articles/hdinsight/hbase/troubleshoot-data-retention-issues-expired-data.md

Lines changed: 21 additions & 20 deletions
@@ -15,83 +15,84 @@ Despite setting TTL, you may notice sometimes that you don't obtain the desired
1515
## Prerequisites
1616

1717
To prepare to follow the steps and commands below, open two SSH connections to the HBase cluster:
18-
* In one of the ssh sessions keeps the default bash shell.
18+
* In one of the ssh sessions keep the default bash shell.
1919
* In the second ssh session launch HBase shell by running the command below.
2020

2121
```
2222
hbase shell
2323
```
24+
2425
### Check if desired TTL is configured and if expired data is removed from query result
2526

26-
Follow the steps below to understand where is the issue. Start by checking if the behavior occurs for a specific table or for all the tables. If you're unsure whether the issue impacts all the tables or a specific table, just consider as example a specific table name for the start.
27-
1) Check first that TTL has been configured for ColumnFamily for the target tables. Run the command below in the ssh session where you launched HBase shell and observe example and output below. One column family has TTL set to 50 seconds, the other ColumnFamily has no value configured for TTL, thus it appears as "FOREVER" (data in this column family isn't configured to expire).
27+
Follow the steps below to understand where the issue is. Start by checking whether the behavior occurs for a specific table or for all tables. If you're unsure whether the issue affects all tables or only one, start with a specific table as an example.
28+
29+
1. Check first that TTL has been configured for ColumnFamily for the target tables. Run the command below in the ssh session where you launched HBase shell and observe example and output below. One column family has TTL set to 50 seconds, the other ColumnFamily has no value configured for TTL, thus it appears as "FOREVER" (data in this column family isn't configured to expire).
2830
```
2931
describe 'table_name'
3032
```
3133

32-
2) If not configured, default TTL is set to 'FOREVER'. There are two possibilities why data is not expired as expected and removed from query result.
33-
a) If TTL has any other value then 'FOREVER', observe the value for column family and note down the value in seconds(pay special attention to value correlated with the unit measure as cell TTL is in ms, but column family TTL is in seconds) to confirm if it is the expected one. If the observed value isn't correct, fix that first.
34-
b) If TTL value is 'FOREVER' for all column families, configure TTL as first step and afterwards monitor if data is expired as expected.
35-
3) If you establish that TTL is configured and has the correct value for the ColumnFamily, next step is to confirm that the expired data no longer shows up when doing table scans. When data expires, it should be removed and not show up in the scan table results. Run the below command in HBase shell to check.
34+
1. If not configured, default TTL is set to 'FOREVER'. There are two possibilities why data is not expired as expected and removed from query result.
35+
1. If TTL has any value other than 'FOREVER', note the column family's value in seconds and confirm that it is the expected one. Pay special attention to the unit: cell TTL is in ms, but column family TTL is in seconds. If the observed value isn't correct, fix that first.
36+
1. If TTL value is 'FOREVER' for all column families, configure TTL as first step and afterwards monitor if data is expired as expected.
37+
1. If you establish that TTL is configured and has the correct value for the ColumnFamily, next step is to confirm that the expired data no longer shows up when doing table scans. When data expires, it should be removed and not show up in the scan table results. Run the below command in HBase shell to check.
3638
```
3739
scan 'table_name'
3840
```
3941
### Check the number and size of StoreFiles per table per region to observe if any changes are visible after the compaction operation
4042

41-
1) Before moving to next step, from ssh session with bash shell, run the following command to check the current number of StoreFiles and size for each StoreFile currently showing up for the ColumnFamily for which the TTL has been configured. Note first the table and ColumnFamily for which you'll be doing the check, then run the following command in ssh session (bash).
43+
1. Before moving to next step, from ssh session with bash shell, run the following command to check the current number of StoreFiles and size for each StoreFile currently showing up for the ColumnFamily for which the TTL has been configured. Note first the table and ColumnFamily for which you'll be doing the check, then run the following command in ssh session (bash).
4244

4345
```
4446
hdfs dfs -ls -R /hbase/data/default/table_name/ | grep "column_family_name"
4547
```
4648

47-
2) Likely, there will be more results shown in the output, one result for each region ID that is part of the table and between 0 and more results for StoreFiles present under each region name, for the selected ColumnFamily. To count the overall number of rows in the result output above, run the following command.
49+
1. Likely, there will be more results shown in the output, one result for each region ID that is part of the table and between 0 and more results for StoreFiles present under each region name, for the selected ColumnFamily. To count the overall number of rows in the result output above, run the following command.
4850
```
4951
hdfs dfs -ls -R /hbase/data/default/table_name/ | grep "column_family_name" | wc -l
5052
```
5153

5254
### Check the number and size of StoreFiles per table per region after flush
5355

54-
1) Based on the TTL configured for each ColumnFamily and how much data is written in the table for the target ColumnFamily, part of the data may still exist in MemStore and isn't written as StoreFile to storage. Thus, to make sure that the data is written to storage as StoreFile, before the maximum configured MemStore size is reached, you can run the following command in HBase shell to write data from MemStore to StoreFile immediately.
56+
1. Based on the TTL configured for each ColumnFamily and how much data is written in the table for the target ColumnFamily, part of the data may still exist in MemStore and isn't written as StoreFile to storage. Thus, to make sure that the data is written to storage as StoreFile, before the maximum configured MemStore size is reached, you can run the following command in HBase shell to write data from MemStore to StoreFile immediately.
5557

5658
```
5759
flush 'table_name'
5860
```
5961

60-
2) Observe the result by running again in bash shell the command.
62+
1. Observe the result by running again in bash shell the command.
6163

6264
```
6365
hdfs dfs -ls -R /hbase/data/default/table_name/ | grep "column_family_name"
6466
```
6567

66-
3) An additional store file is created compared to previous result output for each region where data is modified, the StoreFile will include current content of MemStore for that region.
68+
1. An additional store file is created compared to previous result output for each region where data is modified, the StoreFile will include current content of MemStore for that region.
6769

6870
### Check the number and size of StoreFiles per table per region after major compaction
6971

70-
1) At this point, the data from MemStore has been written to StoreFile, in storage, but expired data may still exist in one or more of the current StoreFiles. Although minor compactions can help delete some of the expired entries, it isn't guaranteed that it will remove all of them as minor compaction will usually not select all the StoreFiles for compaction, while major compaction will select all the StoreFiles for compaction in that region.
72+
1. At this point, the data from MemStore has been written to StoreFile, in storage, but expired data may still exist in one or more of the current StoreFiles. Although minor compactions can help delete some of the expired entries, it isn't guaranteed that it will remove all of them as minor compaction will usually not select all the StoreFiles for compaction, while major compaction will select all the StoreFiles for compaction in that region.
7173

7274
Also, there's another situation in which minor compaction may not remove cells whose TTL has expired. A property named MIN_VERSIONS defaults to 0 (see MIN_VERSIONS=>'0' in the output of describe 'table_name' above). If this property is set to 0, minor compaction will remove cells whose TTL has expired. If the value is greater than 0, minor compaction may not remove those cells even if it touches the corresponding file as part of compaction. This property sets the minimum number of versions of a cell to keep, even if those versions have an expired TTL.
7375

74-
2) To make sure expired data is also deleted from storage, we need to run a major compaction operation. The major compaction operation, when completed, will leave behind a single StoreFile per region. In HBase shell, run the command to execute a major compaction operation on the table:
75-
76+
1. To make sure expired data is also deleted from storage, we need to run a major compaction operation. The major compaction operation, when completed, will leave behind a single StoreFile per region. In HBase shell, run the command to execute a major compaction operation on the table:
7677

7778
```
7879
major_compact 'table_name'
7980
```
8081

81-
3) Depending on the table size, major compaction operation can take some time. Use the command below in HBase shell to monitor progress. If the compaction is still running when you execute the command below, you'll see the output "MAJOR", but if the compaction is completed, you will see the output "NONE".
82+
1. Depending on the table size, major compaction operation can take some time. Use the command below in HBase shell to monitor progress. If the compaction is still running when you execute the command below, you'll see the output "MAJOR", but if the compaction is completed, you will see the output "NONE".
8283

8384
```
8485
compaction_state 'table_name'
8586
```
8687

87-
4) When the compaction status appears as "NONE" in hbase shell, if you switch quickly to bash and run command
88+
1. When the compaction status appears as "NONE" in hbase shell, if you switch quickly to bash and run command
8889

89-
```
90+
```
9091
hdfs dfs -ls -R /hbase/data/default/table_name/ | grep "column_family_name"
91-
```
92+
```
9293
You will notice that an extra StoreFile has been created in addition to previous ones per region per ColumnFamily and after several moments only the last created StoreFile is kept per region per column family.
9394

94-
5) For the example region above, once the extra moments elapse, we can notice that one single StoreFile remained and the size occupied by this file on the storage is reduced as major compaction occurred and at this point any expired data that has not been deleted before(by another major compaction), will be deleted after running current major compaction operation.
95+
1. For the example region above, once those moments elapse, a single StoreFile remains and it occupies less storage, because the major compaction deletes any expired data that an earlier major compaction had not already removed.
9596

9697
> [!NOTE]
9798
> For this troubleshooting exercise we triggered the major compaction manually. In practice, doing that manually for many tables can be time consuming. By default, major compaction is disabled on HDInsight clusters, mainly because table operations are slower while a major compaction is in progress. However, you can enable major compaction by setting the property hbase.hregion.majorcompaction (in ms), or use a cron job or another external system to schedule compaction at a convenient time with lower workload.
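
The article's `hdfs dfs -ls -R ... | grep` checks return a flat list, which makes it hard to see whether each region really ends up with a single StoreFile after major compaction. The sketch below groups that listing by region and prints the StoreFile count and total size for one column family. It is not part of the commit: the function name `summarize_storefiles` is hypothetical, and it assumes the path layout used in the article (`/hbase/data/default/<table>/<region>/<column_family>/<storefile>`) and the usual eight-column `hdfs dfs -ls` output with the size in column 5.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the article): summarize StoreFiles per region
# for one column family. It reads `hdfs dfs -ls -R` output on stdin, so it can
# also be run against a saved listing.
# Assumed layout: /hbase/data/default/<table>/<region>/<column_family>/<storefile>
summarize_storefiles() {
  local cf="$1"
  awk -v cf="$cf" '
    # Regular files start with "-" in the permissions column; skip directories.
    $1 ~ /^-/ {
      n = split($NF, parts, "/")
      # Count only files that sit directly under the column family directory.
      if (n >= 3 && parts[n-1] == cf) {
        region = parts[n-2]
        count[region]++
        bytes[region] += $5   # column 5 of the listing is the size in bytes
      }
    }
    END {
      # One line per region: <region> <storefile_count> <total_bytes>
      for (r in count) printf "%s %d %.0f\n", r, count[r], bytes[r]
    }
  ' | sort
}

# Example usage on the cluster (same placeholders as the article):
#   hdfs dfs -ls -R /hbase/data/default/table_name/ | summarize_storefiles "column_family_name"
```

Running it before the flush, after the flush, and after the major compaction gives three small tables to compare; once compaction has settled, each region should show a count of 1 for the column family.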
