rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720

ulascansenturk · 2025-11-20T19:16:51Z

Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY.

RELEASE NOTES: N/A

Fixes #8693

codecov · 2025-11-20T19:20:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 83.29%. Comparing base (50c6321) to head (5dcc02c).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #8720      +/-   ##
==========================================
+ Coverage   83.21%   83.29%   +0.07%     
==========================================
  Files         419      419              
  Lines       32427    32415      -12     
==========================================
+ Hits        26985    27000      +15     
+ Misses       4054     4040      -14     
+ Partials     1388     1375      -13

Files with missing lines	Coverage Δ
balancer/rls/control_channel.go	`90.00% <100.00%> (+3.04%)`	⬆️

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Fix control channel connectivity monitoring to track TRANSIENT_FAILURE state explicitly. Only reset backoff timers when transitioning from TRANSIENT_FAILURE to READY, not for benign state changes like READY → IDLE → READY. Fixes grpc#8693

eshitachandwani · 2025-11-21T08:30:28Z

balancer/rls/control_channel.go

+				cc.backToReadyFunc()
+				seenTransientFailure = false
+			} else {
+				cc.logger.Infof("Control channel back to READY (no prior failure)")


I think this comment can be improved and made a little more explicit for ease of users.

eshitachandwani · 2025-11-21T08:44:11Z

balancer/rls/control_channel_test.go

+			}
+
+			// Give extra time for any pending callbacks
+			time.Sleep(100 * time.Millisecond)


I dont think adding time.Sleep() to wait for state changes is a good idea, since it can make tests flake if sometimes the state transitions take longer, we should look for better guarantees to make sure the states have transitioned.
cc : @easwars

- Add testOnlyInitialReadyDone channel for proper test synchronization - Signal when monitoring goroutine processes initial READY state - Tests wait for this signal instead of using time.Sleep - All synchronization now uses channels/callbacks - no arbitrary sleeps - Tests pass consistently with race detector Addresses review feedback about removing time.Sleep for state transitions.

Copilot

Pull request overview

This PR fixes control channel connectivity monitoring in the RLS balancer to only reset backoff timers when genuinely recovering from a TRANSIENT_FAILURE state, not during benign state changes like READY → IDLE → READY.

Adds explicit tracking of TRANSIENT_FAILURE state with a boolean flag
Updates callback invocation logic to only trigger after recovery from TRANSIENT_FAILURE
Adds comprehensive test coverage for various state transition scenarios

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
balancer/rls/control_channel.go	Implements TRANSIENT_FAILURE tracking with a boolean flag, adds nil check for callback, and includes test synchronization channel
balancer/rls/control_channel_test.go	Adds comprehensive test cases covering different connectivity state transition scenarios

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

easwars · 2025-11-21T22:43:41Z

balancer/rls/control_channel.go

+	// testOnlyInitialReadyDone is closed when the monitoring goroutine
+	// processes the initial READY state. Only used in tests.
+	testOnlyInitialReadyDone chan struct{}


I have we have test-only code/hooks in some parts of the code. But it would be nice to avoid these.

balancer/rls/control_channel.go

eshitachandwani · 2025-11-24T14:00:48Z

balancer/rls/control_channel.go

+	case connectivity.TransientFailure:
+		// Track that we've entered TRANSIENT_FAILURE state so we know to reset
+		// backoffs when we recover to READY.
+		cc.logger.Infof("Control channel is TRANSIENT_FAILURE")


According to the log_level docs, I feel this could be a warning instead of Info.

eshitachandwani · 2025-11-24T14:10:41Z

balancer/rls/control_channel_test.go

+		wantCallbackCount int
+	}{
+		{
+			name: "READY → TRANSIENT_FAILURE → READY triggers callback",


The subtest names show up in logs with the test names and can be used to filter test, we prefer names that can be easily used to filter and put the extra text (if we want any) in description, and also we try to avoid using spaces in the names. So for example for the 1st test case,
name : ready_after_transient_failure
description : ready after transient failure triggers callback to reset the timer.

Then the way we will see the name in logs will be Test/ControlChannelConnectivityStateTransitions/ready_after_transient_failure

eshitachandwani · 2025-11-24T14:48:34Z

balancer/rls/control_channel_test.go

+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {


I think there might be some way to improve the test. Maybe we can use waitGroups , but I will defer to @easwars for his opinion on this.

To be able to test this in an e2e style, we would need to do make it possible for the test to see the connectivity state changes on the control channel, but without adding hooks into the test code. I propose the following:

Add a new variable that can be overridden by the test. This variable will point to a function that returns the subscriber to the passed to the SubscribeToConnectivityStateChanges API

var newConnectivityStateSubscriber = connStateSubscriber

Set the above variable to the following piece of code in control_channel.go

func connStateSubscriber(sub grpcsync.Subscriber) grpcsync.Subscriber { return sub }

When calling the SubscribeToConnectivityStateChanges API, use the above function as follows:

ctrlCh.unsubscribe = internal.SubscribeToConnectivityStateChanges.(func(cc *grpc.ClientConn, s grpcsync.Subscriber) func())(ctrlCh.cc, newConnectivityStateSubscriber(ctrlCh))

In the test, create an implementation of the grpcsync.Subscriber interface that contains a delegate. This makes it possible for the test to delegate to the subscriber set by the non-test code, but make the connectivity state changes available to the test:

type wrappingConnectivityStateSubscriber struct { delegate grpcsync.Subscriber connStateCh chan connectivity.State } func (w *wrappingConnectivityStateSubscriber) OnMessage(msg any) { w.delegate.OnMessage(msg) w.connStateCh <- msg.(connectivity.State) }

Modify TestControlChannelConnectivityStateMonitoring or add a new test where we do the override. You would probably want to use a `testutils.

// Override the connectivity state subscriber. wrappedSubscriber := &wrappingConnectivityStateSubscriber{connStateCh: make(chan connectivity.State, 1)} origConnectivityStateSubscriber := newConnectivityStateSubscriber newConnectivityStateSubscriber = func(delegate grpcsync.Subscriber) grpcsync.Subscriber { wrappedSubscriber.delegate = delegate return wrappedSubscriber } defer func() { newConnectivityStateSubscriber = origConnectivityStateSubscriber }()

In the test body, we need to check the appropriate state transitions in a bunch of places:

// Make sure an RLS request is sent out. verifyRLSRequest(t, rlsReqCh, true) // Verify that the control channel moves to READY. wantStates := []connectivity.State{ connectivity.Connecting, connectivity.Ready, } for _, wantState := range wantStates { select { case gotState := <-wrappedSubscriber.connStateCh: if gotState != wantState { t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState) } case <-ctx.Done(): t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState) } } // Stop the RLS server. lis.Stop() // Verify that the control channel moves to IDLE. wantStates = []connectivity.State{ connectivity.Idle, } for _, wantState := range wantStates { select { case gotState := <-wrappedSubscriber.connStateCh: if gotState != wantState { t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState) } case <-ctx.Done(): t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState) } } // Make another RPC similar to the first one. Since the above cache entry // would have expired by now, this should trigger another RLS request. And // since the RLS server is down, RLS request will fail and the cache entry // will enter backoff, and we have overridden the default backoff strategy to // return a value which will keep this entry in backoff for the whole duration // of the test. makeTestRPCAndVerifyError(ctx, t, cc, codes.Unavailable, nil) // Verify that the control channel moves to TRANSIENT_FAILURE. wantStates = []connectivity.State{ connectivity.Connecting, connectivity.TransientFailure, } for _, wantState := range wantStates { select { case gotState := <-wrappedSubscriber.connStateCh: if gotState != wantState { t.Fatalf("Unexpected connectivity state: got %v, want %v", gotState, wantState) } case <-ctx.Done(): t.Fatalf("Timeout waiting for RLS control channel to become %q", wantState) } } // Restart the RLS server. lis.Restart()

The above will test the READY --> TF --> READY transition.

For the READY --> IDLE --> READY, we need to restart the RLS server once the control channel goes IDLE, and then wait for it to go READY before attempting another RPC and verifying that backoffs are not reset.

Let me know what you think about this approach.

Thanks

easwars · 2025-11-25T07:04:40Z

balancer/rls/control_channel.go

+		// proceed immediately. We skip benign transitions like READY → IDLE → READY
+		// since those don't represent actual failures.
+		if cc.seenTransientFailure {
+			cc.logger.Infof("Control channel back to READY after TRANSIENT_FAILURE")


Nit: Please guard INFO logs with a verbosity check of 2. We weren't doing that when most of the code in this file was written, but have been trying to do that diligently of late. So, something like:

if cc.logger.V(2) { cc.logger.Infof("Control channel back to READY after TRANSIENT_FAILURE") }

Here and elsewhere in this method. Thanks.

easwars · 2025-11-25T07:05:34Z

balancer/rls/control_channel.go

+	seenTransientFailure bool
+	mu                   sync.Mutex


Nit: Please group the mutex and the fields that it guards as a separate block. It is a generally used convention that any fields that come right below a mutex are to be guarded by that mutex.

ulascansenturk force-pushed the fix/8693-rls-control-channel-state-monitoring branch from e2406a9 to a6fcb7e Compare November 20, 2025 20:42

eshitachandwani added this to the 1.78 Release milestone Nov 21, 2025

eshitachandwani added the Type: Bug label Nov 21, 2025

eshitachandwani reviewed Nov 21, 2025

View reviewed changes

eshitachandwani requested a review from easwars November 21, 2025 08:44

eshitachandwani assigned easwars and ulascansenturk Nov 21, 2025

ulascansenturk added 4 commits November 21, 2025 12:13

address reviews

ca49ae7

address reviews

c7eb618

Shorter comment

943240d

ulascansenturk force-pushed the fix/8693-rls-control-channel-state-monitoring branch from 135b43d to ed5ab2c Compare November 21, 2025 11:01

ulascansenturk requested a review from eshitachandwani November 21, 2025 11:16

easwars requested a review from Copilot November 21, 2025 18:58

Copilot started reviewing on behalf of easwars November 21, 2025 18:59 View session

Copilot finished reviewing on behalf of easwars November 21, 2025 19:00

Copilot AI reviewed Nov 21, 2025

View reviewed changes

easwars reviewed Nov 21, 2025

View reviewed changes

easwars removed their assignment Nov 21, 2025

Fix comments

2ad8249

ulascansenturk requested a review from easwars November 23, 2025 11:19

eshitachandwani assigned easwars and eshitachandwani and unassigned ulascansenturk Nov 24, 2025

eshitachandwani reviewed Nov 24, 2025

View reviewed changes

eshitachandwani assigned ulascansenturk and unassigned eshitachandwani Nov 24, 2025

address reviews

faee5a0

gofmt

5dcc02c

ulascansenturk requested a review from eshitachandwani November 24, 2025 17:52

easwars reviewed Nov 25, 2025

View reviewed changes

easwars removed their assignment Nov 25, 2025

rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720

Are you sure you want to change the base?

rls: only reset backoff on recovery from TRANSIENT_FAILURE #8720

Conversation

ulascansenturk commented Nov 20, 2025

Uh oh!

codecov bot commented Nov 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

codecov bot commented Nov 20, 2025 •

edited

Loading