CNDB-14207: Don't break the whole flush if SAI fails to build #1770

pkolaczk · 2025-05-27T07:29:16Z

If the index fails to build, mark the index non-queryable,
but don't fail the C* flush. This way the node can continue
to run the other queries.

github-actions · 2025-05-27T07:29:31Z

eolivelli

So this may be "fine" because Compaction will pick up the sstable and "fix" the problem, at least in CNDB (I don't know how this will play in Cassandra).

When this happens the sstable is in a bad state and the index will suddenly become unavailable, so basically all the queries against that index will fail (probably most of the queries for that table?)

This decision probably deserves a larger discussion at the triage meeting

pkolaczk · 2025-05-27T09:49:58Z

> When this happens the sstable is in a bad state and the index will suddenly become unavailable, so basically all the queries against that index will fail (probably most of the queries for that table?)

This is what already happens in CNDB.
We do mark the index unavailable already, so this commit doesn't make the unavailability worse.

But I think I get what you're trying to say.
Sometimes it's better to crash fully than to work around the problem and leave things half-working. I can imagine that if a node/writer fails to flush because of some intermittent I/O error, it's better to not flush at all, crash the writer, and rely on replication to let the other writers take over. But in that case we should not mark the index as non-queryable: since we aborted the flush, the files on disk should still be in a good state (and we should just delete whatever was flushed so far for this sstable).

src/java/org/apache/cassandra/index/sai/disk/StorageAttachedIndexWriter.java

test/unit/org/apache/cassandra/index/sai/functional/FlushingTest.java

pkolaczk · 2025-05-27T13:55:09Z

Thinking about it more, I feel the inconsistency between flush and compaction behavior introduced by this PR is not good.
I think repeated flush failure and repeated compaction failure are both generally catastrophic in the long run: if either happens too many times, the node/writer will be in serious trouble. When flush fails too many times, the node runs out of memory and writes block. If it fails on startup during commitlog replay, the node will not start. Compaction is perhaps less critical, but if it doesn't run for too long, reads get slower and slower until they start to time out. The whole reason for creating this PR was to keep SAI from breaking other parts of the system, but I agree this needs some broader discussion.

I think there are two sensible ways of handling SAI build failures:

1. Not failing flush/compaction when building the index fails:

   - the other queries that don't need the index might still run fine
   - if the flush or compaction failed intermittently we can retry and the index can be brought back to normal
   - not losing the whole writer decreases the risk of losing the data
   - better choice for index build failures caused by bugs; this strategy tries to contain the damage and minimize the impact on the non-SAI parts of the system (or other SAI indexes)

2. All or nothing - failing the flush/compaction but leaving the index queryable (undoing any partial flush):

   - good for intermittent failures; indexes continue to work as long as there are enough other nodes/writers that can take over
   - the node is immediately considered broken
   - overall "simpler" - we don't end up with half-baked sstables with no indexes

However, I fail to see how the current behavior is useful.
Currently we do both: we mark the index as non-queryable and we abort the flush/compaction. So SAI queries stop working despite the index on disk being consistent, also on other nodes.
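
To make the comparison concrete, here is a minimal sketch of the three behaviors as a tiny Java model. Every name in it (`SaiFailurePolicySketch`, `Policy`, `Outcome`) is hypothetical and none of it is Cassandra or CNDB code; it only encodes what each option described above does to the sstable and to index queryability when an index build fails during flush.

```java
/** Hypothetical model of the options discussed in this thread; not Cassandra code. */
final class SaiFailurePolicySketch
{
    enum Policy { CONTAIN_DAMAGE, ALL_OR_NOTHING, CURRENT_BEHAVIOR }

    /** What the node is left with after an index build failure during flush. */
    static final class Outcome
    {
        final boolean sstableAdded;    // did the flushed sstable become live?
        final boolean indexQueryable;  // can index queries still be served?

        Outcome(boolean sstableAdded, boolean indexQueryable)
        {
            this.sstableAdded = sstableAdded;
            this.indexQueryable = indexQueryable;
        }
    }

    static Outcome onIndexBuildFailure(Policy policy)
    {
        switch (policy)
        {
            case CONTAIN_DAMAGE:
                // Option 1: the flush completes, the index is marked non-queryable.
                return new Outcome(true, false);
            case ALL_OR_NOTHING:
                // Option 2: the flush is aborted and undone, the index stays queryable.
                return new Outcome(false, true);
            case CURRENT_BEHAVIOR:
                // Both penalties at once: the flush is aborted and the index is non-queryable.
                return new Outcome(false, false);
            default:
                throw new IllegalArgumentException("unknown policy: " + policy);
        }
    }

    public static void main(String[] args)
    {
        for (Policy p : Policy.values())
        {
            Outcome o = onIndexBuildFailure(p);
            System.out.println(p + ": sstableAdded=" + o.sstableAdded + ", indexQueryable=" + o.indexQueryable);
        }
    }
}
```

In this model the current behavior is the only row where both flags are false, which is the point being made above: it pays the cost of both options and gets the benefit of neither.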

WDYT?

@adelapena @michaeljmarshall @JeremiahDJordan

JeremiahDJordan · 2025-05-28T16:40:29Z

> All or nothing - failing the flush/compaction but leaving the index queryable (undoing any partial flush):

I kind of think this is how flush is intended to be. Also, if a flush fails, we most likely want to quarantine the pod and move tenants away from it, because the reasons flushes fail are mostly not recoverable.

This commit changes the handling of SAI index flush failures. A flush failure no longer forces the index into the non-queryable state. We can do this because after a failure we roll back any partially flushed index components and abort the sstable flush as well. Therefore, both the memtable that failed to flush and its memtable indexes remain intact and can still serve queries. This change has several advantages:
- the flush failure could be temporary, and the flush may still succeed the next time
- even if the problem with flushing persists, reads will run fine
- if this is a node-local problem, the other nodes have a chance to take over; a failure of one node does not propagate to the rest of the cluster
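
The rollback idea in this description can be sketched in isolation. The code below is an illustrative model only, assuming a hypothetical in-memory `Component` type in place of real on-disk index components; it is not the code in `StorageAttachedIndexWriter`. It shows one way to get all-or-nothing behavior: track every component as soon as its write starts, and on failure delete everything tracked so far so nothing partial is left behind.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical all-or-nothing flush with rollback; names are illustrative only. */
final class FlushRollbackSketch
{
    /** In-memory stand-in for one on-disk index component. */
    static final class Component
    {
        final String name;
        final boolean failOnWrite;  // simulates an I/O error for this component
        boolean onDisk;

        Component(String name, boolean failOnWrite)
        {
            this.name = name;
            this.failOnWrite = failOnWrite;
        }

        void write() throws IOException
        {
            if (failOnWrite)
                throw new IOException("simulated failure writing " + name);
            onDisk = true;
        }

        void delete()
        {
            onDisk = false;
        }
    }

    /** Returns true if every component was written; otherwise rolls back and returns false. */
    static boolean flush(List<Component> components)
    {
        List<Component> started = new ArrayList<>();
        try
        {
            for (Component c : components)
            {
                started.add(c);  // track before writing so a partial write is cleaned up too
                c.write();
            }
            return true;
        }
        catch (IOException e)
        {
            // Undo in reverse order; the caller keeps its in-memory data and may retry later.
            for (int i = started.size() - 1; i >= 0; i--)
                started.get(i).delete();
            return false;
        }
    }

    public static void main(String[] args)
    {
        Component first = new Component("first", false);
        Component second = new Component("second", true);
        boolean ok = flush(Arrays.asList(first, second));
        System.out.println("flushed=" + ok + ", first.onDisk=" + first.onDisk);
    }
}
```

Because a failed `flush` leaves no component behind, the caller can keep serving reads from memory and simply retry, which is the property the description relies on.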

JeremiahDJordan · 2025-06-06T14:46:21Z

src/java/org/apache/cassandra/index/sai/disk/StorageAttachedIndexWriter.java

@@ -303,7 +303,7 @@ public void abort(Throwable accumulator, boolean fromIndex)

        // For non-compaction, make any indexes involved in this transaction non-queryable, as they will likely not match the backing table.
        // For compaction: the compaction task should be aborted and new sstables will not be added to tracker
-        if (fromIndex && opType != OperationType.COMPACTION)


So this change means if we abort during a flush we should just abort the index creation and not fail the index? Please update the comment above as well.

When abort happens during flush, does it also mean the sstable was aborted? Just want to understand that we are not creating a case where we have data which can be queried from a direct query, but would be missing from an index.

sonarqubecloud · 2025-06-06T15:04:28Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
100.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

cassci-bot · 2025-06-06T15:09:42Z

❌ Build ds-cassandra-pr-gate/PR-1770 rejected by Butler

1 new test failure(s) in 3 builds
See build details here

Found 1 new test failures

Test	Explanation	Branch history	Upstream history
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS...	regression	🔴🔴🔴	🔵🔵🔵🔵🔵🔵🔵

Found 4 known test failures

pkolaczk requested a review from adelapena May 27, 2025 08:33

eolivelli requested changes May 27, 2025

adelapena reviewed May 27, 2025

src/java/org/apache/cassandra/index/sai/disk/StorageAttachedIndexWriter.java (outdated)

adelapena reviewed May 27, 2025

test/unit/org/apache/cassandra/index/sai/functional/FlushingTest.java (outdated)

pkolaczk force-pushed the c14207 branch from a7eb258 to f6707c8 Compare May 27, 2025 13:33

pkolaczk force-pushed the c14207 branch from f6707c8 to 211b67f Compare June 6, 2025 14:23

JeremiahDJordan reviewed Jun 6, 2025
