logpuller: fix potential stuck #4358

Merged
ti-chi-bot[bot] merged 3 commits into master from ldz/fix-puller0305
Mar 5, 2026

Conversation

@lidezhu lidezhu commented Mar 5, 2026

What problem does this PR solve?

Issue Number: close #4359

What is changed and how it works?

This pull request addresses a potential deadlock or stuck state within the logpuller's subscription client. It enhances the robustness of the pushRegionEventToDS mechanism by integrating context cancellation checks and ensuring proper unblocking during client shutdown, thereby preventing indefinite waits and improving the client's lifecycle management.

Highlights

  • Prevented potential stuck state in pushRegionEventToDS: The pushRegionEventToDS method now checks for context cancellation (s.ctx.Err()) before and during its waiting loop, ensuring it can exit gracefully if the context is canceled while the client is paused.
  • Ensured thread safety for paused state updates: Mutex locks (s.mu.Lock() and s.mu.Unlock()) were added around s.paused.Store() operations within handleDSFeedBack to prevent race conditions when modifying the client's paused state.
  • Guaranteed unblocking on client closure: The Close method now explicitly sets s.paused.Store(false) and broadcasts to the condition variable (s.cond.Broadcast()), which ensures any goroutines blocked in pushRegionEventToDS are released during shutdown.
  • Added new test case for unblocking on close: A new unit test, TestPushRegionEventToDSUnblocksOnClose, was introduced to verify that the pushRegionEventToDS method correctly unblocks when the subscriptionClient is closed.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes

    • Subscriptions now stop promptly when cancelled and unblock operations on close to avoid hangs.
    • Improved thread-safety around pause/resume transitions to prevent race conditions during subscription handling.
  • Tests

    • Added tests to verify pause/unpause behavior and that blocked operations are unblocked on close.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 5, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a potential deadlock or stuck state within the logpuller's subscription client. It enhances the robustness of the pushRegionEventToDS mechanism by integrating context cancellation checks and ensuring proper unblocking during client shutdown, thereby preventing indefinite waits and improving the client's lifecycle management.

Highlights

  • Prevented potential stuck state in pushRegionEventToDS: The pushRegionEventToDS method now checks for context cancellation (s.ctx.Err()) before and during its waiting loop, ensuring it can exit gracefully if the context is canceled while the client is paused.
  • Ensured thread safety for paused state updates: Mutex locks (s.mu.Lock() and s.mu.Unlock()) were added around s.paused.Store() operations within handleDSFeedBack to prevent race conditions when modifying the client's paused state.
  • Guaranteed unblocking on client closure: The Close method now explicitly sets s.paused.Store(false) and broadcasts to the condition variable (s.cond.Broadcast()), which ensures any goroutines blocked in pushRegionEventToDS are released during shutdown.
  • Added new test case for unblocking on close: A new unit test, TestPushRegionEventToDSUnblocksOnClose, was introduced to verify that the pushRegionEventToDS method correctly unblocks when the subscriptionClient is closed.

Changelog
  • logservice/logpuller/subscription_client.go
    • Added context cancellation checks to pushRegionEventToDS to prevent blocking indefinitely when paused.
    • Introduced mutex locks around s.paused.Store() calls in handleDSFeedBack for thread safety.
    • Modified Close method to unblock any waiting goroutines by broadcasting to the condition variable.
  • logservice/logpuller/subscription_client_test.go
    • Added a new test TestPushRegionEventToDSUnblocksOnClose to verify the unblocking behavior of pushRegionEventToDS on client closure.
    • Implemented a mockDynamicStream to facilitate testing of the subscriptionClient.
Activity
  • The pull request description contains boilerplate text, indicating no specific human activity or detailed problem description has been provided yet.

@ti-chi-bot ti-chi-bot bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 5, 2026

coderabbitai bot commented Mar 5, 2026

Walkthrough

Added context-cancellation checks and mutex protection to subscription client's pause/wakeup and event-pushing paths; Close now clears paused under lock and broadcasts to wake waiters. Tests add a mock dynamic stream and a test verifying pushRegionEventToDS unblocks on Close (duplicate test/mock blocks present).

Changes

| Cohort / File(s) | Summary |
|---|---|
| Subscription Client Core Logic: logservice/logpuller/subscription_client.go | Added context-cancel checks in the pushRegionEventToDS slow path; tightened the pause-wait loop to exit on context cancellation; wrapped PauseArea/ResumeArea state changes with the cond mutex; Close now locks, sets paused = false, and broadcasts to wake waiting goroutines. |
| Subscription Client Tests (mocks & test): logservice/logpuller/subscription_client_test.go | Added an unexported mockDynamicStream (no-op dynstream) and TestPushRegionEventToDSUnblocksOnClose to assert waiters unblock on Close. Note: the mock and test blocks were inserted multiple times (duplications present). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

lgtm, approved, release-note-none

Suggested reviewers

  • hongyunyan
  • asddongmen
  • wk989898

Poem

🐇 I nibble bugs in quiet code,
I guard the pause where waiters doze,
A gentle lock, a timely wake,
Close calls — no goroutine aches,
Hooray — the stream hops on its road!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'logpuller: fix potential stuck' is vague and incomplete, using informal language ('stuck') without clarifying what issue is being fixed. | Improve the title to be more specific, such as 'logpuller: fix potential stuck goroutines in subscription client' or 'logpuller: fix missed wakeups in cond pause/resume'. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The PR description follows the template structure, includes a linked issue (#4359), detailed highlights of changes, test coverage, and sections for questions and release notes. |
| Linked Issues check | ✅ Passed | All coding objectives from issue #4359 are addressed: context cancellation checks in pushRegionEventToDS, mutex-protected paused state updates, unblocking on Close via cond.Broadcast, and unit tests verifying the fixes. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the stuck/missed wakeup issue: context cancellation, mutex synchronization, condition variable signaling, and corresponding tests. |

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

The pull request effectively addresses a potential stuck issue in the logpuller by integrating context cancellation checks and ensuring correct synchronization with condition variables. The addition of mutex locks around s.cond.Broadcast() and the new test case TestPushRegionEventToDSUnblocksOnClose significantly improve the robustness and correctness of the subscriptionClient's pause/resume and shutdown mechanisms.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@logservice/logpuller/subscription_client_test.go`:
- Around line 37-39: The import ordering is incorrect: the import
"github.com/pingcap/ticdc/utils/dynstream" is in a separate group causing gci
formatting failure; move that import into the same block as the other
"github.com/pingcap/ticdc/*" imports in subscription_client_test.go (so it
groups with existing ticdc imports), then run make fmt (or gofmt/gci) to
reformat and verify the import groups are correct.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a77cd4a-c622-47be-8406-46d6fbd19920

📥 Commits

Reviewing files that changed from the base of the PR and between 039417c and ded5322.

📒 Files selected for processing (2)
  • logservice/logpuller/subscription_client.go
  • logservice/logpuller/subscription_client_test.go

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
logservice/logpuller/subscription_client_test.go (1)

225-235: Harden the pre-close blocking assertion to avoid flaky pass conditions.

Line 231-Line 235 infers “blocked” via a fixed 100ms timeout. If the goroutine is late-scheduled, this can pass without truly validating the paused wait path. Prefer explicit synchronization before calling Close().

♻️ Suggested refactor
-	done := make(chan struct{})
+	entered := make(chan struct{})
+	done := make(chan struct{})
 	go func() {
+		close(entered)
 		client.pushRegionEventToDS(SubscriptionID(1), regionEvent{})
 		close(done)
 	}()
+	<-entered
 
 	select {
 	case <-done:
 		t.Fatal("pushRegionEventToDS should block when paused")
-	case <-time.After(100 * time.Millisecond):
+	default:
 	}

As per coding guidelines: "Applies to **/*_test.go : Use unit test files named *_test.go in Go; favor deterministic tests and use testify/require".
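One way to make the "blocked before Close" assertion deterministic is sketched below. It is not the test from the PR: `pauser`, its `waiters` counter, and `blockedThenReleased` are hypothetical names introduced for illustration. The sketch relies on the fact that `sync.Cond.Wait` registers the caller before it releases the lock, so once the test observes a non-zero waiter count under the same lock, the goroutine is known to be parked and a later broadcast will reach it. A signal sent just before the blocking call, by contrast, only shows that the goroutine has started.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pauser is a hypothetical stand-in for the client. The waiters field is an
// illustrative test hook and is not part of the PR.
type pauser struct {
	mu      sync.Mutex
	cond    *sync.Cond
	paused  bool
	waiters int
}

func newPauser() *pauser {
	p := &pauser{paused: true}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// waitWhilePaused blocks until the flag is cleared and counts itself as a
// waiter while it holds the lock.
func (p *pauser) waitWhilePaused() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.waiters++
	for p.paused {
		p.cond.Wait()
	}
	p.waiters--
}

// waiterCount reads the hook under the lock.
func (p *pauser) waiterCount() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.waiters
}

// release clears the flag and wakes all waiters, as Close is described to do.
func (p *pauser) release() {
	p.mu.Lock()
	p.paused = false
	p.cond.Broadcast()
	p.mu.Unlock()
}

// blockedThenReleased reports whether the goroutine was still blocked once
// it had registered as a waiter, and whether release unblocked it.
func blockedThenReleased() (bool, bool) {
	p := newPauser()
	done := make(chan struct{})
	go func() {
		p.waitWhilePaused()
		close(done)
	}()
	// Poll the hook instead of assuming a fixed sleep is long enough.
	for p.waiterCount() == 0 {
		time.Sleep(time.Millisecond)
	}
	blocked := true
	select {
	case <-done:
		blocked = false
	default:
	}
	p.release()
	select {
	case <-done:
		return blocked, true
	case <-time.After(5 * time.Second):
		return blocked, false
	}
}

func main() {
	blocked, released := blockedThenReleased()
	fmt.Println("blocked:", blocked, "released:", released)
}
```

In a real test the two booleans would presumably become testify/require assertions, in line with the guideline quoted above.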

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/subscription_client_test.go` around lines 225 - 235, The
test's assertion that pushRegionEventToDS is blocked uses a fixed 100ms timeout
and can be flaky; change the test to deterministically synchronize with the
goroutine before calling Close() by adding an explicit "entered wait" signal
channel (e.g., started := make(chan struct{})) in the test wrapper goroutine
that sends on started immediately before calling
client.pushRegionEventToDS(SubscriptionID(1), regionEvent{}), then have the main
test wait for <-started to guarantee the goroutine reached the paused state,
assert that it hasn't completed, call client.Close() to unblock, and finally
wait for completion; also replace t.Fatal/time.After checks with testify/require
assertions for clarity.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cd34b234-2ff4-4dd3-ae08-b536f650a745

📥 Commits

Reviewing files that changed from the base of the PR and between ded5322 and a8abdef.

📒 Files selected for processing (1)
  • logservice/logpuller/subscription_client_test.go

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Mar 5, 2026

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
logservice/logpuller/subscription_client.go (1)

477-486: Proper shutdown unblocking with graceful Push() handling.

The mutex-protected broadcast correctly ensures blocked goroutines wake up during shutdown. There is a race window where a woken goroutine may call ds.Push() (line 423) after or concurrent with ds.Close(), but this is safely handled: the DynamicStream.Push() implementation checks if the stream is closed and silently drops the event via handler.OnDrop() rather than panicking.

Adding a defensive context check after exiting the wait loop (if s.ctx.Err() != nil { return }) would avoid the unnecessary Push() call when the context is already cancelled, but it is not required for correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@logservice/logpuller/subscription_client.go` around lines 477 - 486, The
Close method currently broadcasts to wake blocked goroutines but can let a woken
goroutine continue past the wait loop and call ds.Push(); add a defensive
context cancellation check (check s.ctx.Err() != nil) immediately after the wait
loop in the goroutine (the same loop that currently calls ds.Push()) so that if
the subscriptionClient context is cancelled the goroutine returns early and
skips calling DynamicStream.Push(); update the goroutine that references
ds.Push() and the wait/cond logic to perform this check before performing any
Push() or other work to avoid unnecessary calls during shutdown.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 60e8eb3d-ed3e-4b6a-8929-3226bb1963b8

📥 Commits

Reviewing files that changed from the base of the PR and between a8abdef and f0d8625.

📒 Files selected for processing (1)
  • logservice/logpuller/subscription_client.go

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Mar 5, 2026

ti-chi-bot bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: asddongmen, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [asddongmen,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented Mar 5, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-03-05 09:05:41.534991894 +0000 UTC m=+435386.113071088: ☑️ agreed by wk989898.
  • 2026-03-05 11:30:31.586099224 +0000 UTC m=+444076.164178408: ☑️ agreed by asddongmen.

@ti-chi-bot ti-chi-bot bot merged commit 82a73ed into master Mar 5, 2026
26 checks passed
@ti-chi-bot ti-chi-bot bot deleted the ldz/fix-puller0305 branch March 5, 2026 12:32
lidezhu added a commit that referenced this pull request Mar 5, 2026

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

logpuller: subscriptionClient cond pause/resume may miss wakeups

3 participants