doc/modules/cassandra/pages/managing/operating/compaction/tombstones.adoc
78 additions & 16 deletions
@@ -4,17 +4,18 @@
 
 == What are tombstones?
 
-{cassandra}'s processes for deleting data are designed to improve performance, and to work with {cassandra}'s built-in properties for data distribution and fault-tolerance.
+{cassandra}'s processes for deleting data are designed to be efficient, and to work with {cassandra}'s native features for data distribution and fault-tolerance.
 
 {cassandra} treats a deletion as an insertion, and inserts a time-stamped deletion marker called a tombstone.
 The tombstones go through {cassandra}'s write path, and are written to SSTables on one or more nodes.
 The key feature difference of a tombstone is that it has a built-in expiration date/time.
-At the end of its expiration period, the grace period, the tombstone is deleted as part of {cassandra}'s normal compaction process.
+At the end of its expiration period, called the grace period, the tombstone is deleted as part of {cassandra}'s normal compaction process.
 
 [NOTE]
 ====
-You can also mark a {cassandra} row or column with a time-to-live (TTL) value.
-After this amount of time has ended, {cassandra} marks the object with a tombstone, and handles it like other tombstoned objects.
+In {cassandra}, you can assign a time-to-live (TTL) to a row or column. Once the TTL expires, the data is eligible for removal.
+During compaction, if the `gc_grace_seconds` period is still active, {cassandra} marks the data as expired, handling it like any other deleted item.
+After `gc_grace_seconds` has elapsed, the data is eligible for permanent removal.
 ====
 
 == Why tombstones?
@@ -23,14 +24,14 @@ The tombstone represents the deletion of an object, either a row or column value
 This approach is used instead of removing values because of the distributed nature of {cassandra}.
 Once an object is marked as a tombstone, queries will ignore all values that are time-stamped previous to the tombstone insertion.
 
-== Zombies
+== Preventing Data Resurrection
 
-In a multi-node cluster, {cassandra} may store replicas of the same data on two or more nodes.
-This helps prevent data loss, but it complicates the deletion process.
-If a node receives a delete command for data it stores locally, the node tombstones the specified object and tries to pass the tombstone to other nodes containing replicas of that object.
-But if one replica node is unresponsive at that time, it does not receive the tombstone immediately, so it still contains the pre-delete version of the object.
-If the tombstoned object has already been deleted from the rest of the cluster before that node recovers, {cassandra} treats the object on the recovered node as new data, and propagates it to the rest of the cluster.
-This kind of deleted but persistent object is called a https://cassandra.apache.org/_/glossary.html#zombie[zombie].
+In a multi-node {cassandra} cluster, data is often replicated across several nodes to safeguard against loss.
+However, this replication can make deletions more complex.
+When a node receives a request to delete data, it marks the item with a tombstone and attempts to share this tombstone with other nodes that hold copies of the same data.
+If one of these replica nodes is offline or unreachable during the deletion, it won’t get the tombstone right away and will continue to store the original, undeleted data.
+If the rest of the cluster purges the tombstoned data before the offline node comes back online, {cassandra} may mistakenly treat the data on the recovered node as live and repair may replicate it across the cluster again.
+This scenario, where deleted data reappears, is known as a https://cassandra.apache.org/_/glossary.html#zombie[zombie].
 
 == Grace period
 
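The resurrection scenario described in the hunk above can be sketched with a toy model. This is only an illustration under stated assumptions, not {cassandra}'s implementation: each replica is a plain dictionary, a tombstone is a cell whose value is `None`, and repair is a simple last-write-wins merge. The names `Cell`, `reconcile`, `repair`, `read`, and `run` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None stands for a tombstone in this toy model
    timestamp: int


def reconcile(a: Optional[Cell], b: Optional[Cell]) -> Optional[Cell]:
    """Last-write-wins merge of two cells; a missing cell loses."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a.timestamp >= b.timestamp else b


def repair(replicas: List[Dict[str, Cell]], key: str) -> None:
    """Copy the winning cell for `key` to every replica."""
    winner = None
    for replica in replicas:
        winner = reconcile(winner, replica.get(key))
    if winner is not None:
        for replica in replicas:
            replica[key] = winner


def read(replica: Dict[str, Cell], key: str) -> Optional[str]:
    """Return the live value, or None if absent or tombstoned."""
    cell = replica.get(key)
    return None if cell is None else cell.value


def run(purge_before_repair: bool) -> List[Optional[str]]:
    a, b, c = {}, {}, {}
    replicas = [a, b, c]
    for replica in replicas:
        replica["k"] = Cell("v1", timestamp=1)
    # Replica c is unreachable: the delete only reaches a and b.
    for replica in (a, b):
        replica["k"] = Cell(None, timestamp=2)
    if purge_before_repair:
        # The tombstone and the data it shadows are purged on a and b
        # before c has been repaired.
        for replica in (a, b):
            del replica["k"]
    repair(replicas, "k")
    return [read(replica, "k") for replica in replicas]
```

In this model, `run(False)` leaves the key deleted on all three replicas because the tombstone still exists and wins the merge, while `run(True)` brings `v1` back everywhere because nothing remains to shadow the stale copy on the recovered replica.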
@@ -52,10 +53,10 @@ After the tombstone's grace period ends, {cassandra} deletes the tombstone durin
 
 == Deletion
 
-After `gc_grace_seconds` has expired the tombstone may be removed (meaning there will no longer be any object that a certain piece of data was
-deleted).
-But one complication for deletion is that a tombstone can live in one SSTable and the data it marks for deletion in another, so a compaction must also remove both SSTables.
-More precisely, drop an actual tombstone the:
+Once the `gc_grace_seconds` period has passed, the tombstone can be removed, meaning there will no longer be any record indicating that a specific piece of data was deleted.
+However, deleting data can be complicated because the tombstone might exist in one SSTable while the data it marks for deletion is in another.
+Therefore, a compaction process must remove both SSTables.
+More specifically, a tombstone is only dropped when:
 
 * The tombstone must be older than `gc_grace_seconds`.
 Note that tombstones will not be removed until a compaction event even if `gc_grace_seconds` has elapsed.
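The age condition visible in the hunk above can be sketched as a small check. This is a simplified illustration, not {cassandra}'s code: the function name `past_grace_period` and its parameters are hypothetical, times are plain epoch seconds, and the default of 864000 seconds (10 days) is the default value of `gc_grace_seconds`. It covers only the age condition; passing it makes a tombstone eligible, and removal still waits for a compaction and for any further conditions not shown in this hunk.

```python
DEFAULT_GC_GRACE_SECONDS = 864_000  # 10 days, the default gc_grace_seconds


def past_grace_period(deletion_time: int, now: int,
                      gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """Return True if a tombstone written at `deletion_time` is older than
    the grace period at time `now` (both in epoch seconds).

    Hypothetical helper: it models only the age condition.
    """
    return deletion_time + gc_grace_seconds < now
```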
@@ -124,6 +125,67 @@ To avoid keeping tombstones forever, we set `gc_grace_seconds` for every table i
 
 If an SSTable contains only tombstones and it is guaranteed that SSTable is not shadowing data in any other SSTable, then the compaction can drop
 that SSTable.
-If you see SSTables with only tombstones (note that TTL'd data is considered tombstones once the time-to-live has expired), but it is not being dropped by compaction, it is likely that other SSTables contain older data.
+If you observe SSTables that contain only tombstones or expired TTL data, and compaction is not removing them, it likely indicates that older versions of the data still exist in other SSTables.
 There is a tool called `sstableexpiredblockers` that will list which SSTables are droppable and which are blocking them from being dropped.
 With `TimeWindowCompactionStrategy` it is possible to remove the guarantee (not check for shadowing data) by enabling `unsafe_aggressive_sstable_expiration`.
+
+
+== Examples
+
+Below is the sstabledump output showing a live row with expired flag as "false":