Skip to content

CNDB-14207: Don't break the whole flush if SAI fails to build #1770

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

pkolaczk
Copy link

If the index fails to build, mark the index non-queryable,
but don't fail the C* flush. This way the node can continue
to run the other queries.

Copy link

github-actions bot commented May 27, 2025

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@pkolaczk pkolaczk requested a review from adelapena May 27, 2025 08:33
Copy link

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this may be "fine" because Compaction will pick up the sstable and "fix" the problem, at least in CNDB (I don't know how this will play in Cassandra).

When this happens the sstable is in a bad state and the index will become suddendly unavailable and basically all the queries against that index will fail (probably most of the queries for that table?)

Probably this decision deserves a larged discussion at the triage meeting

@pkolaczk
Copy link
Author

When this happens the sstable is in a bad state and the index will become suddendly unavailable and basically all the queries against that index will fail (probably most of the queries for that table?)

This is what already happens in CNDB.
We do mark the index unavailable already, so this commit doesn't make the unavailability worse.

But I think I get what you're trying to say.
Sometimes it's better to crash fully rather than try to workaround and let things work in half. I can imagine that if a node/writer fails to flush because of some intermittent I/O error, it's better to just not do flush, crash the writer, and rely on replication to let the other writers take over. But in that case we should not mark the index as non-queryable as we aborted the flush, so the files on disk should be still in a good state (and we should just delete whatever was flushed so far for this sstable).

@pkolaczk
Copy link
Author

pkolaczk commented May 27, 2025

Thinking about it more, I feel this inconsistency between flush and compaction behavior introduced by this PR is not good.
I think both repeated flush and compaction failure is generally catastrophic in the long run - if it happens too many times the node/writer will be in a serious trouble. When flush fails too many times, it runs out of memory and writes will block. If it fails on startup during the commitlog replay, the node will not start. Compaction is maybe kinda less critical, but if it doesn't run for too long, reads will get slower and slower until they start to timeout. The whole reason for creating this PR was to avoid SAI breaking the other stuff, but I agree this needs some broader discussion.

I think there are two sensible ways of handling SAI build failures:

  1. Not failing flush/compaction when building the index fails:
  • the other queries that don't need the index might still run fine
  • if the flush or compaction failed intermittently we can retry and the index can be brought back to normal
  • not losing the whole writer decreases the risk of losing the data
  • better choice for index build failures caused by bugs; this strategy tries to contain the damage and minimize the impact on the non-SAI parts of the system (or other SAI indexes)
  1. All or nothing - failing the flush/compaction but leaving the index queryable (undoing any partial flush):
  • good for intermittent failures; indexes continue to work as long as there are enough other nodes/writers that can take over
  • the node is immediately considered broken
  • overall "simpler" - we don't end up with half-baked sstables with no indexes

However, I fail to see how the current behavior is useful.
Currently we do both: we mark the index as non-queryable and we abort the flush/compaction. So SAI queries stop working despite the index on disk being consistent, also on other nodes.

WDYT?

@adelapena @michaeljmarshall @JeremiahDJordan

@JeremiahDJordan
Copy link
Member

All or nothing - failing the flush/compaction but leaving the index queryable (undoing any partial flush):

I kind of think this is how flush is intended to be. Also if a flush fails, most likely we probably want to be quarantining a pod and moving tenants away from it, because the reasons flushes fail are mostly not very recoverable.

This commit changes handling SAI index flush failures.

A flush does not force the index into the non-queryable state anymore.
We can do this because after failure we rollback any partially flushed
index components, and we abort the sstable flush as well. Therefore,
both the failed-to-flush memtable and memtable indexes remain intact
and can still serve queries.

This change has several advantages:
- the flush failure could be temporary and the flush may still succeed
  the next time,
- even if the problem with flushing persists, reads will run fine
- if this is a node-local problem, the other nodes have a chance to
  take over; a failure of one node does not propagate to the rest of
  the cluster
@@ -303,7 +303,7 @@ public void abort(Throwable accumulator, boolean fromIndex)

// For non-compaction, make any indexes involved in this transaction non-queryable, as they will likely not match the backing table.
// For compaction: the compaction task should be aborted and new sstables will not be added to tracker
if (fromIndex && opType != OperationType.COMPACTION)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this change means if we abort during a flush we should just abort the index creation and not fail the index? Please update the comment above as well.

Copy link
Member

@JeremiahDJordan JeremiahDJordan Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When abort happens during flush, does it also mean the sstable was aborted? Just want to understand that we are not creating a case where we have data which can be queried from a direct query, but would be missing from an index.

Copy link

sonarqubecloud bot commented Jun 6, 2025

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1770 rejected by Butler


1 new test failure(s) in 3 builds
See build details here


Found 1 new test failures

Test Explanation Branch history Upstream history
o.a.c.u.b.BinLogTest.testTruncationReleasesLogS... regression 🔴🔴🔴 🔵🔵🔵🔵🔵🔵🔵

Found 4 known test failures

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants