rowexec: fix deadlock when processors panic during execution #160348
4c4f51a to 7505b65
michae2
left a comment
@michae2 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @mw5h and @yuzefovich).
pkg/sql/rowexec/sample_aggregator_test.go line 475 at r1 (raw file):
func TestPanicDeadlock(t *testing.T) {
	defer leaktest.AfterTest(t)()
	skip.UnderStress(t, "test has a 10-second timeout to detect deadlock")
nit: I don't feel strongly, but it seems to only be a 10-second wait if the test fails, so not sure this is a reason to skip under stress.
This test fails consistently under stress, even with the 10s wait. I could try with an even longer timeout if you think that's preferable. I just did this because I had already gone from 2s -> 5s -> 10s.
yuzefovich
left a comment
I also wonder whether we should take this a step further and apply the same fix to other processors that could hit the same deadlock, in particular ingestFileProcessor, inspectProcessor, columnBackfiller, and ttlProcessor. It seems like these shouldn't have the same problem (they don't have inputs that need to be ConsumerClosed), but it might be nice to still ensure the deferred cleanup for them too.
@yuzefovich reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @mw5h).
pkg/sql/rowexec/sample_aggregator.go line 261 at r1 (raw file):
row, meta := s.input.Next()
// Testing knob to inject panics or other test behavior
nit: this comment seems obvious and redundant.
pkg/sql/rowexec/sample_aggregator.go line 262 at r1 (raw file):
// Testing knob to inject panics or other test behavior
if row != nil && s.FlowCtx.Cfg.TestingKnobs.SampleAggregatorTestingKnobRowHook != nil {
nit: I'd move this below the break when we have non-nil row.
michae2
left a comment
@michae2 made 1 comment.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @mw5h).
pkg/sql/rowexec/sample_aggregator_test.go line 475 at r1 (raw file):
Previously, mw5h (Matt White) wrote…
This test fails consistently under stress, even with the 10s wait. I could try with an even longer timeout if you think that's preferable. I just did this because I had already gone from 2s -> 5s -> 10s.
Ahh, got it. Fine with me to skip under stress.
I opened #160409 to do that - no need to backport those changes, so we should keep them separate. Relatedly, "backport" is the wrong label to apply if you want blathers to open backport PRs automatically ("backport" is actually applied by the blathers) - we need to use either
The first iteration of this patch was to make RowChannel context aware, which solves this particular problem a bit more generally. That seemed a bit more risky and maybe not something we would want to backport. I think applying this fix more generally is the way to go.
Before this change, when the sampleAggregator or sampler processors panicked during row processing, cleanup code (ConsumerClosed() and ProducerDone()) was never executed. This left producer goroutines blocked indefinitely on channel sends, which prevented the flow from completing. During cluster drain operations, this caused the drain to hang indefinitely waiting for flows to finish.
This change adds deferred cleanup to both processors' Run() methods, ensuring that ConsumerClosed() is called even when a panic occurs. This unblocks stuck producers and allows panics to be properly recovered without causing deadlocks.
A new test verifies the fix by injecting a panic via testing knob and confirming that producer goroutines complete successfully.
Fixes: cockroachdb#160337
Release note (bug fix): Fixed a deadlock that could occur when a statistics creation task panicked.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
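The deferred-cleanup pattern described in the commit message can be sketched in isolation as follows. This is a simplified model, not the rowexec code: `rowSource`, `produce`, `run`, and `demo` are hypothetical names, the rows are plain integers, and the panic is recovered inside `run` only so that the example can finish on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// rowSource is a hypothetical stand-in for a channel-backed input: a producer
// sends rows on an unbuffered channel until the consumer signals it is done.
type rowSource struct {
	rows   chan int
	closed chan struct{}
	once   sync.Once
}

func newRowSource() *rowSource {
	return &rowSource{rows: make(chan int), closed: make(chan struct{})}
}

// ConsumerClosed tells the producer to stop sending. Safe to call repeatedly.
func (s *rowSource) ConsumerClosed() {
	s.once.Do(func() { close(s.closed) })
}

// produce sends up to n rows and returns how many were delivered. It stops
// early once the consumer has called ConsumerClosed.
func (s *rowSource) produce(n int) int {
	defer close(s.rows)
	sent := 0
	for i := 0; i < n; i++ {
		select {
		case s.rows <- i:
			sent++
		case <-s.closed:
			return sent
		}
	}
	return sent
}

// run consumes rows and panics when it sees panicAt. The deferred
// ConsumerClosed runs while the panic unwinds, before the recover below,
// so the producer is released on both the normal and the panic path.
func run(s *rowSource, panicAt int) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	defer s.ConsumerClosed()
	for row := range s.rows {
		if row == panicAt {
			panic("injected panic")
		}
	}
	return nil
}

// demo wires a producer of 10 rows to run and waits for the producer to exit.
func demo(panicAt int) (int, error) {
	s := newRowSource()
	done := make(chan int)
	go func() { done <- s.produce(10) }()
	err := run(s, panicAt)
	return <-done, err
}

func main() {
	sent, err := demo(3)
	fmt.Println(sent, err)
}
```

Without the `defer s.ConsumerClosed()` line, the producer in this model would stay blocked on its next send after the panic, and `demo` would never return.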
7505b65 to a5197e0
bors r+
Build succeeded:
Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches. Issue #160337: branch-release-25.3, branch-release-25.4, branch-release-26.1. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.