Commit 6ba0c90

Improved Acrolinx Score
1 parent 02d0a72 commit 6ba0c90

1 file changed, +14 -12 lines changed

articles/hdinsight/hbase/Troubleshoot-data-retention-(TTL)-issues-with-expired-data-not-being-deleted-from-storage.md

Lines changed: 14 additions & 12 deletions
@@ -1,36 +1,38 @@
11
---
2-
title: Troubleshoot Apache HBase performance issues on Azure HDInsight
3-
description: Troubleshoot various Apache HBase performance tuning guidelines and tips for getting optimal performance on Azure HDInsight.
2+
title: Troubleshoot data retention (TTL) issues with expired data not being deleted from storage on Azure HDInsight
3+
description: Troubleshoot various data-retention (TTL) issues with expired data not being deleted from storage on Azure HDInsight
44
ms.service: hdinsight
55
ms.topic: troubleshooting
66
ms.date: 05/06/2022
77
---
88

9-
In HBase cluster, you may decide that you would like to remove data after it ages either to free some storage and save on costs as the older data is no longer needed, either to comply with regulations. When that is needed , you will usually set TTL in a table at the ColumnFamily level to expire and automatically delete older data. While TTL can be set as well at cell level, setting it at ColumnFamily level is usually a more convenient option because the ease of administration and because a cell TTLs (expressed in ms)cannot extend the effective lifetime of a cell beyond a ColumnFamily level TTL setting (expressed in seconds), so only required shorter retention times at cell level could benefit from setting cell level TTL.
9+
# Troubleshoot data retention (TTL) issues with expired data not being deleted from storage on Azure HDInsight
10+
11+
In an HBase cluster, you may decide to remove data after it ages, either to free storage and save on costs because the older data is no longer needed, or to comply with regulations. When that's needed, you'll usually set TTL on a table at the ColumnFamily level to expire and automatically delete older data. Although TTL can also be set at the cell level, setting it at the ColumnFamily level is usually more convenient, both because it's easier to administer and because a cell TTL (expressed in ms) can't extend the effective lifetime of a cell beyond the ColumnFamily-level TTL (expressed in seconds). Cell-level TTL is therefore useful only when a cell needs a shorter retention time than its ColumnFamily.
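The unit mismatch between the two TTL levels is easy to get wrong, so here is a minimal Python sketch of the rule described above. The function name `effective_ttl_ms` is hypothetical and isn't part of HBase; it only models the statement that a cell TTL (ms) can shorten, but never extend, the ColumnFamily TTL (seconds).

```python
def effective_ttl_ms(cf_ttl_seconds, cell_ttl_ms=None):
    """Return the effective lifetime of a cell in milliseconds.

    cf_ttl_seconds: ColumnFamily TTL in seconds, or None for 'FOREVER'.
    cell_ttl_ms:    optional per-cell TTL in milliseconds, or None if unset.
    Returns None when neither level sets a TTL (the data never expires).
    """
    # Convert the ColumnFamily TTL to the same unit as the cell TTL.
    cf_ttl_ms = None if cf_ttl_seconds is None else cf_ttl_seconds * 1000
    # The shorter of the configured TTLs wins; a cell TTL can't extend the CF TTL.
    candidates = [t for t in (cf_ttl_ms, cell_ttl_ms) if t is not None]
    return min(candidates) if candidates else None


# ColumnFamily TTL of 50 seconds, cell TTL of 20,000 ms: the cell TTL is shorter.
print(effective_ttl_ms(50, 20_000))
# ColumnFamily TTL of 50 seconds, cell TTL of 120,000 ms: capped by the CF TTL.
print(effective_ttl_ms(50, 120_000))
```

Under this model, a cell TTL larger than the ColumnFamily TTL has no effect, which is why cell-level TTL only helps for shorter retention times.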
1012

1113
Despite setting TTL, you may sometimes notice that you don't get the desired effect: some data hasn't expired, or storage size hasn't decreased.
1214

1315
## Prerequisites:
1416

1517
To prepare to follow the steps and commands below, open two SSH connections to the HBase cluster:
16-
1) In one of the ssh sessions keep the default bash shell;
18+
1) In one of the SSH sessions, keep the default bash shell;
1719
2) In the second SSH session, launch HBase shell by running the command below:
1820

1921
```
2022
hbase shell
2123
```
2224
### Check if desired TTL is configured and if expired data is removed from query result
2325

24-
Follow the steps below to understand where is the issue. Start by checking if he the behavior occurs for a specific table or for all the tables. If you are unsure whether the issue impacts all the tables or a specific table, just consider as example a specific table name for the start.
25-
1) Check first that TTL has been configured for ColumnFamily for the target tables. Run the command below in the ssh session where you launched HBase shell and observe example and output below. One column family has TTL set to 50 seconds, the other ColumnFamily has no value configured for TTL, thus it appears as "FOREVER" (data in this column family is not configured to expire):
26+
Follow the steps below to understand where the issue is. Start by checking whether the behavior occurs for a specific table or for all tables. If you're unsure whether the issue affects all tables or a specific table, start with one specific table as an example.
27+
1) First check that TTL has been configured for the ColumnFamily of the target tables. Run the command below in the SSH session where you launched HBase shell and observe the example and output below. In the example, one column family has TTL set to 50 seconds; the other has no TTL value configured, so it appears as "FOREVER" (data in this column family isn't configured to expire):
2628

2729
```
2830
describe 'table_name'
2931
```
3032

3133
2) If TTL isn't configured, it defaults to 'FOREVER'. There are two possible reasons why data isn't expired and removed from query results as expected:
32-
* a) If TTL has any other value than 'FOREVER', observe the value for column family and note down the value in seconds(pay special attention to value correlated with the unit measure as cell TTL is in ms, but column family TTL is in seconds) to confirm if it is the expected one. If the observed value is not correct, fix that first.
33-
* b) If TTL value is 'FOREVER' for all column families, configure TTL as first step and afterwards monitor if data is expired as expected.
34+
a) If TTL has any value other than 'FOREVER', observe the value for the column family and note it down in seconds to confirm that it's the expected one. Pay special attention to the unit of measure: cell TTL is in ms, but column family TTL is in seconds. If the observed value isn't correct, fix that first.
35+
b) If the TTL value is 'FOREVER' for all column families, configure TTL as the first step, and then monitor whether data expires as expected.
3436
3) Once you establish that TTL is configured and has the correct value for the ColumnFamily, the next step is to confirm that expired data no longer shows up in table scans. When data expires, it should be removed and shouldn't appear in the scan results. Run the command below in HBase shell to check:
3537

3638
```
@@ -52,7 +54,7 @@ Follow the steps below to understand where is the issue. Start by checking if he
5254

5355
### Check the number and size of StoreFiles per table per region after flush
5456

55-
6) Based on the TTL configured for each ColumnFamily and how much data is written in the table for the target ColumnFamily, part of the data may still exist in MemStore and is not written as StoreFile to storage. Thus, to make sure that the data is written to storage as StoreFile, before the maximum configured MemStore size is reached, you can run the following command in HBase shell to write data from MemStore to StoreFile immediately.
57+
6) Depending on the TTL configured for each ColumnFamily and how much data is written to the table for the target ColumnFamily, part of the data may still be in MemStore and not yet written to storage as a StoreFile. To make sure the data is written to storage as a StoreFile before the maximum configured MemStore size is reached, run the following command in HBase shell to flush data from MemStore to a StoreFile immediately.
5658

5759
```
5860
flush 'table_name'
@@ -64,9 +66,9 @@ flush 'table_name'
6466
### Check the number and size of StoreFiles per table per region after major compaction
6567

6668

67-
8) At this point, the data from MemStore has been written to StoreFile, in storage, but expired data may still exist in one or more of the current StoreFiles. Although minor compactions can help delete some of the expired entries, it is not guaranteed that it will remove all of them as minor compaction will usually not select all the StoreFiles for compaction, while major compaction will select all the StoreFiles for compaction in that region.
69+
8) At this point, the data from MemStore has been written to a StoreFile in storage, but expired data may still exist in one or more of the current StoreFiles. Although minor compactions can help delete some of the expired entries, they aren't guaranteed to remove all of them, because a minor compaction usually doesn't select all the StoreFiles for compaction, whereas a major compaction selects all the StoreFiles in that region.
6870

69-
Also, there is another situation when minor compaction may not remove cells with TTL expired. There is a property named MIN_VERSIONS and it defaults to 0 only (see in the above output from describe 'table_name' the property MIN_VERSIONS=>'0'). If this property is set to 0, the minor compaction will remove the cells with TTL expired. If this value is greater than 0, minor compaction may not remove the cells with TTL expired even if it touches the corresponding file as part of compaction. This property configures the min number of versions of a cell to keep, even if those versions have TTL expired.
71+
There's also another situation in which minor compaction may not remove cells whose TTL has expired. The MIN_VERSIONS property defaults to 0 (see the property MIN_VERSIONS=>'0' in the output of describe 'table_name' above). If this property is set to 0, minor compaction removes the cells whose TTL has expired. If the value is greater than 0, minor compaction may not remove those cells even if it touches the corresponding file as part of compaction. This property sets the minimum number of versions of a cell to keep, even if those versions have an expired TTL.
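To make the interaction between TTL and MIN_VERSIONS concrete, the following Python sketch models which versions of a single cell would survive. It's a simplified illustration of the rule stated above, not HBase code: the function name `versions_kept` and its parameters are hypothetical, and the model ignores other settings such as the maximum VERSIONS limit.

```python
def versions_kept(timestamps_ms, now_ms, ttl_seconds, min_versions=0):
    """Return the version timestamps of one cell that are retained.

    Simplified model (an assumption, not HBase's implementation):
    a version is kept if its TTL hasn't expired, or if it's among the
    newest `min_versions` versions, even when its TTL has expired.
    """
    ttl_ms = ttl_seconds * 1000  # ColumnFamily TTL is configured in seconds
    newest_first = sorted(timestamps_ms, reverse=True)
    kept = []
    for rank, ts in enumerate(newest_first):
        expired = (now_ms - ts) > ttl_ms
        if not expired or rank < min_versions:
            kept.append(ts)
    return kept


# Three versions of a cell; with a 50-second TTL, the two oldest have expired.
stamps = [10_000, 30_000, 90_000]
print(versions_kept(stamps, now_ms=100_000, ttl_seconds=50))                  # MIN_VERSIONS = 0
print(versions_kept(stamps, now_ms=100_000, ttl_seconds=50, min_versions=2))  # MIN_VERSIONS = 2
```

With MIN_VERSIONS at 0, only the unexpired version remains. With MIN_VERSIONS at 2, one expired version is also retained, which is the case where storage doesn't shrink as much as the TTL alone would suggest.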
7072

7173
9) To make sure expired data is also deleted from storage, you need to run a major compaction operation. When it completes, the major compaction leaves behind a single StoreFile per region. In HBase shell, run the following command to execute a major compaction operation on the table:
7274

@@ -75,7 +77,7 @@ flush 'table_name'
7577
major_compact 'table_name'
7678
```
7779

78-
10) Depending on the table size, major compaction operation can take some time. Use the command below in HBase shell to monitor progress. If the compaction is still running when you execute the command below, you will see the output "MAJOR", but if the compaction is completed, you will see the output "NONE":
80+
10) Depending on the table size, the major compaction operation can take some time. Use the command below in HBase shell to monitor progress. If the compaction is still running when you execute the command, you'll see the output "MAJOR"; once the compaction has completed, you'll see the output "NONE":
7981

8082
```
8183
compaction_state 'table_name'
