Skip to content

Conversation

@mw5h
Copy link
Contributor

@mw5h mw5h commented Dec 31, 2025

Before this change, when the sampleAggregator or sampler processors panicked during row processing, cleanup code (ConsumerClosed() and ProducerDone()) was never executed. This left producer goroutines blocked indefinitely on channel sends, which prevented the flow from completing. During cluster drain operations, this caused the drain to hang indefinitely waiting for flows to finish.

This change adds deferred cleanup to both processors' Run() methods, ensuring that ConsumerClosed() is called even when a panic occurs. This unblocks stuck producers and allows panics to be properly recovered without causing deadlocks.

A new test verifies the fix by injecting a panic via testing knob and confirming that producer goroutines complete successfully.

Fixes: #160337

Release note (bug fix): Fixed a deadlock that could occur when a statistics creation task panicked.

🤖 Generated with Claude Code

@mw5h mw5h requested a review from a team as a code owner December 31, 2025 23:50
@mw5h mw5h requested review from yuzefovich and removed request for a team December 31, 2025 23:50
@mw5h mw5h added the backport Label PR's that are backports to older release branches label Dec 31, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@mw5h mw5h force-pushed the rowexec-sampler-cleanup branch 3 times, most recently from 4c4f51a to 7505b65 Compare January 1, 2026 01:35
@michae2 michae2 self-requested a review January 2, 2026 17:30
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: Nice fix!

@michae2 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mw5h and @yuzefovich).


pkg/sql/rowexec/sample_aggregator_test.go line 475 at r1 (raw file):

func TestPanicDeadlock(t *testing.T) {
	defer leaktest.AfterTest(t)()
	skip.UnderStress(t, "test has a 10-second timeout to detect deadlock")

nit: I don't feel strongly, but it seems to only be a 10-second wait if the test fails, so not sure this is a reason to skip under stress.

@mw5h
Copy link
Contributor Author

mw5h commented Jan 2, 2026

:lgtm: Nice fix!

@michae2 reviewed 4 files and all commit messages, and made 2 comments.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @mw5h and @yuzefovich).

pkg/sql/rowexec/sample_aggregator_test.go line 475 at r1 (raw file):

func TestPanicDeadlock(t *testing.T) {
	defer leaktest.AfterTest(t)()
	skip.UnderStress(t, "test has a 10-second timeout to detect deadlock")

nit: I don't feel strongly, but it seems to only be a 10-second wait if the test fails, so not sure this is a reason to skip under stress.

This test fails consistently under stress, even with the 10s wait. I could try with an even longer timeout if you think that's preferable. I just did this because I had already gone from 2s -> 5s -> 10s.

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice fix! :lgtm:

I also wonder we should take this a step further and apply the same fix to other processors that could result in the same deadlock, in particular, ingestFileProcessor, inspectProcessor, columnBackfiller, ttlProcessor. It seems like these shouldn't have the same problem (they don't have inputs that need to be ConsumerClosed), but it might be nice to still ensure the deferred cleanup for them too.

@yuzefovich reviewed 4 files and all commit messages, and made 3 comments.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @michae2 and @mw5h).


pkg/sql/rowexec/sample_aggregator.go line 261 at r1 (raw file):

		row, meta := s.input.Next()

		// Testing knob to inject panics or other test behavior

nit: this comment seems obvious and redundant.


pkg/sql/rowexec/sample_aggregator.go line 262 at r1 (raw file):

		// Testing knob to inject panics or other test behavior
		if row != nil && s.FlowCtx.Cfg.TestingKnobs.SampleAggregatorTestingKnobRowHook != nil {

nit: I'd move this below the break when we have non-nil row.

@yuzefovich yuzefovich added backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 backport-26.1.x Flags PRs that need to be backported to 26.1 and removed backport Label PR's that are backports to older release branches labels Jan 2, 2026
Copy link
Collaborator

@michae2 michae2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michae2 made 1 comment.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @mw5h).


pkg/sql/rowexec/sample_aggregator_test.go line 475 at r1 (raw file):

Previously, mw5h (Matt White) wrote…

This test fails consistently under stress, even with the 10s wait. I could try with an even longer timeout if you think that's preferable. I just did this because I had already gone from 2s -> 5s -> 10s.

Ahh, got it. Fine with me to skip under stress.

@yuzefovich
Copy link
Member

I also wonder we should take this a step further

I opened #160409 to do that - no need to backport those changes, so we should keep them separate.

Relatedly, "backport" is the wrong label to apply if you want blathers to open backport PRs automatically ("backport" is actually applied by the blathers) - we need to use either backport-all or backport-25... labels. I fixed that.

@mw5h
Copy link
Contributor Author

mw5h commented Jan 2, 2026

Nice fix! :lgtm:

I also wonder we should take this a step further and apply the same fix to other processors that could result in the same deadlock, in particular, ingestFileProcessor, inspectProcessor, columnBackfiller, ttlProcessor. It seems like these shouldn't have the same problem (they don't have inputs that need to be ConsumerClosed), but it might be nice to still ensure the deferred cleanup for them too.

The first iteration of this patch was to make RowChannel context aware, which solves this particular problem a bit more generally. That seemed a bit more risky and maybe not something we would want to backport. I think applying this fix more generally is the way to go.

Before this change, when the sampleAggregator or sampler processors
panicked during row processing, cleanup code (ConsumerClosed() and
ProducerDone()) was never executed. This left producer goroutines
blocked indefinitely on channel sends, which prevented the flow from
completing. During cluster drain operations, this caused the drain to
hang indefinitely waiting for flows to finish.

This change adds deferred cleanup to both processors' Run() methods,
ensuring that ConsumerClosed() is called even when a panic occurs.
This unblocks stuck producers and allows panics to be properly recovered
without causing deadlocks.

A new test verifies the fix by injecting a panic via testing knob and
confirming that producer goroutines complete successfully.

Fixes: cockroachdb#160337

Release note (bug fix): Fixed a deadlock that could occur when a
statistics creation task panicked.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@mw5h mw5h force-pushed the rowexec-sampler-cleanup branch from 7505b65 to a5197e0 Compare January 3, 2026 00:25
@mw5h
Copy link
Contributor Author

mw5h commented Jan 3, 2026

bors r+

@craig
Copy link
Contributor

craig bot commented Jan 3, 2026

@craig craig bot merged commit e764212 into cockroachdb:master Jan 3, 2026
25 checks passed
@blathers-crl
Copy link

blathers-crl bot commented Jan 3, 2026

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.


Issue #160337: branch-release-25.3, branch-release-25.4, branch-release-26.1.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport-25.2.x Flags PRs that need to be backported to 25.2 backport-25.3.x Flags PRs that need to be backported to 25.3 backport-25.4.x Flags PRs that need to be backported to 25.4 backport-26.1.x Flags PRs that need to be backported to 26.1 target-release-26.2.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

rowexec: a panic in sampleAggregator.Run() can cause a goroutine deadlock

4 participants