stats/opentelemetry: restore the changes from #8342 and fix the flaky test. #8923

Open
Pranjali-2501 wants to merge 2 commits into grpc:master from Pranjali-2501:fix-otel-changes

Conversation

Pranjali-2501 (Contributor) commented Feb 20, 2026

Fixes #8700

This PR re-lands the changes from #8342, which were reverted because of a flaky test. It cherry-picks the original commits and adds a fix for the race condition that caused TestTraceSpan_WithRetriesAndNameResolutionDelay to flake.

Problem

As mentioned here, the test was flaky because of a race condition between the load balancing policy creating a ready picker and the RPC attempting to pick a connection. If the picker became ready before the first Pick attempt, the RPC would not be delayed, the "Delayed LB pick complete" event would not be emitted, and the test would fail.

Fix

To remove the race, the test now uses a custom stub balancer that first returns a blocking picker, which guarantees that the RPC waits for a connection. Once the RPC has attempted a Pick and is confirmed to be waiting, the balancer provides a valid, non-blocking picker, allowing the RPC to succeed. This sequence reliably triggers the DelayedPickComplete event.
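
The ordering described above can be modeled without gRPC. The following is a minimal stdlib-only sketch, not the test code from this PR: `picker`, `channel`, `blockingPicker`, `readyPicker`, and `errNoSubConnAvailable` are hypothetical stand-ins for `balancer.Picker`, the channel's pick loop, and `balancer.ErrNoSubConnAvailable`. It only illustrates why waiting for the first Pick before publishing a ready picker makes the delayed path deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errNoSubConnAvailable is a stand-in for balancer.ErrNoSubConnAvailable,
// which tells the channel to wait for a new picker.
var errNoSubConnAvailable = errors.New("no SubConn available")

// picker is a simplified, hypothetical stand-in for balancer.Picker.
type picker interface {
	Pick() (string, error)
}

// blockingPicker never returns a connection and signals its first Pick call.
type blockingPicker struct {
	once        sync.Once
	pickInvoked chan struct{}
}

func (p *blockingPicker) Pick() (string, error) {
	p.once.Do(func() { close(p.pickInvoked) }) // safe under concurrent picks
	return "", errNoSubConnAvailable
}

// readyPicker always succeeds with a fixed address.
type readyPicker struct{ addr string }

func (p readyPicker) Pick() (string, error) { return p.addr, nil }

// channel models only the picker slot and the wait-for-new-picker loop.
type channel struct {
	mu      sync.Mutex
	cur     picker
	changed chan struct{} // closed whenever cur is replaced
}

func newChannel(p picker) *channel {
	return &channel{cur: p, changed: make(chan struct{})}
}

func (c *channel) updatePicker(p picker) {
	c.mu.Lock()
	old := c.changed
	c.cur, c.changed = p, make(chan struct{})
	c.mu.Unlock()
	close(old)
}

// pick returns the chosen address and whether the pick had to wait.
func (c *channel) pick() (string, bool) {
	delayed := false
	for {
		c.mu.Lock()
		p, changed := c.cur, c.changed
		c.mu.Unlock()
		if addr, err := p.Pick(); err == nil {
			return addr, delayed
		}
		delayed = true
		<-changed // block until a new picker is published
	}
}

func runScenario() (string, bool) {
	bp := &blockingPicker{pickInvoked: make(chan struct{})}
	ch := newChannel(bp)
	type result struct {
		addr    string
		delayed bool
	}
	done := make(chan result, 1)
	go func() {
		addr, delayed := ch.pick()
		done <- result{addr, delayed}
	}()
	<-bp.pickInvoked // the "RPC" has attempted a pick and must now wait
	ch.updatePicker(readyPicker{addr: "backend-1"})
	r := <-done
	return r.addr, r.delayed
}

func main() {
	addr, delayed := runScenario()
	fmt.Printf("addr=%s delayed=%t\n", addr, delayed)
}
```

Because the ready picker is published only after `pickInvoked` is closed, the pick in this model always takes the delayed path, regardless of goroutine scheduling.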

RELEASE NOTES:

  • stats/opentelemetry: Retry attempts (grpc.previous-rpc-attempts) are now recorded as span attributes for non-transparent client retries.

vinothkumarr227 and others added 2 commits February 20, 2026 09:42
Fixes: grpc#8299

RELEASE NOTES:

- stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are
now recorded as span attributes for non-transparent client retries.
Pranjali-2501 added this to the 1.80 Release milestone Feb 20, 2026
Pranjali-2501 added the Type: Bug, Area: Observability, and Area: Testing labels Feb 20, 2026
codecov bot commented Feb 20, 2026

Codecov Report

❌ Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.23%. Comparing base (b6f89f7) to head (0d0d007).
⚠️ Report is 1 commits behind head on master.

Files with missing lines | Patch % | Lines
stats/opentelemetry/client_tracing.go | 78.57% | 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8923      +/-   ##
==========================================
+ Coverage   83.14%   83.23%   +0.08%     
==========================================
  Files         417      417              
  Lines       32937    32943       +6     
==========================================
+ Hits        27387    27420      +33     
+ Misses       4117     4098      -19     
+ Partials     1433     1425       -8     
Files with missing lines | Coverage Δ
stats/opentelemetry/opentelemetry.go | 75.12% <ø> (ø)
stats/opentelemetry/trace.go | 90.32% <ø> (-1.99%) ⬇️
stats/opentelemetry/client_tracing.go | 86.36% <78.57%> (-2.73%) ⬇️

... and 22 files with indirect coverage changes


Pranjali-2501 (Contributor, Author) commented

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request successfully re-lands the changes to record retry attempts as OpenTelemetry span attributes and fixes a flaky test that caused the original revert. The fix for the race condition in the test TestTraceSpan_WithRetriesAndNameResolutionDelay using a custom blocking balancer is well-implemented and should reliably prevent flakiness. The refactoring of the tracing logic to move attempt counting to callInfo and handle client-specific attributes in client_tracing.go is a good improvement. I've found one issue in the test validation logic that should be addressed.

Pranjali-2501 changed the title from "stats/opentelemetry: re-lands PR #8342(with changes): Restore the existing changes from #8342 and fix the flaky test." to "stats/opentelemetry: restore the changes from #8342 and fix the flaky test." Feb 21, 2026

const delayedResolutionEventName = "Delayed name resolution complete"

type blockingPicker struct {
sc atomic.Pointer[subConnWrapper]

Instead of atomically swapping out the subchannel, a new picker object should be created and used while calling UpdateState.

}

// Create a SubConn with the addresses from the resolver.
if len(ccs.ResolverState.Addresses) > 0 {

Instead of implementing a partial LB policy, we should wrap an existing LB policy such as pickfirst and delegate calls to it. We can intercept the calls that pickfirst makes to UpdateState and wrap the picker to detect calls to Pick.
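
One way to read this suggestion is sketched below using only the standard library. The `Picker` and `ClientConn` interfaces here are simplified, hypothetical stand-ins for the `balancer` package types, and `childPolicyUpdate` merely imitates a child policy such as pickfirst publishing a picker; none of this is the grpc-go API or the code eventually used in the PR. It also shows a fresh wrapper picker being created on every UpdateState call instead of mutating a shared one.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Picker and ClientConn are simplified, hypothetical stand-ins for
// balancer.Picker and balancer.ClientConn.
type Picker interface {
	Pick() (string, error)
}

type ClientConn interface {
	UpdateState(p Picker)
}

// notifyingPicker delegates to the child's picker and reports each Pick call.
type notifyingPicker struct {
	Picker
	onPick func()
}

func (p *notifyingPicker) Pick() (string, error) {
	p.onPick()
	return p.Picker.Pick()
}

// interceptingCC is handed to the child policy in place of the real
// ClientConn. Every UpdateState call produces a new wrapper picker.
type interceptingCC struct {
	ClientConn
	onPick func()
}

func (cc *interceptingCC) UpdateState(p Picker) {
	cc.ClientConn.UpdateState(&notifyingPicker{Picker: p, onPick: cc.onPick})
}

// staticPicker and recordingCC exist only to drive the example.
type staticPicker struct{ addr string }

func (p staticPicker) Pick() (string, error) { return p.addr, nil }

type recordingCC struct{ last Picker }

func (cc *recordingCC) UpdateState(p Picker) { cc.last = p }

// childPolicyUpdate imitates a child policy reporting a ready picker.
func childPolicyUpdate(cc ClientConn, addr string) {
	cc.UpdateState(staticPicker{addr: addr})
}

func demo() (string, int64) {
	var picks atomic.Int64
	parent := &recordingCC{}
	wrapped := &interceptingCC{ClientConn: parent, onPick: func() { picks.Add(1) }}

	childPolicyUpdate(wrapped, "10.0.0.1:443")

	addr, _ := parent.last.Pick()
	parent.last.Pick() // a second pick is observed as well
	return addr, picks.Load()
}

func main() {
	addr, picks := demo()
	fmt.Printf("addr=%s picks=%d\n", addr, picks)
}
```

The child policy keeps all connection management logic, and the test only adds observation, so the wrapper stays small and does not need to reimplement SubConn handling.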

Comment on lines +1417 to +1419
case <-p.pickInvoked:
default:
close(p.pickInvoked)

The Pick method can be called multiple times concurrently. This implementation can panic, because two concurrent calls can both reach the default case and close the channel twice. Instead, you can use an Event.
https://github.com/grpc/grpc-go/blob/master/internal/grpcsync/event.go
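
To illustrate the point, here is a minimal stand-in for a fire-once event built on `sync.Once`. It is modeled loosely on the idea behind `grpcsync.Event` but is not that implementation; the `event` type and the `concurrentFires` helper are assumptions made for this sketch. The key property is that any number of goroutines may call `Fire`, and the channel is closed exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a hypothetical fire-once signal. Unlike a select/default guard
// around close(), the check and the close happen atomically inside sync.Once.
type event struct {
	once sync.Once
	c    chan struct{}
}

func newEvent() *event { return &event{c: make(chan struct{})} }

// Fire closes the channel on the first call only. It returns true for the
// single call that actually fired the event.
func (e *event) Fire() bool {
	fired := false
	e.once.Do(func() {
		close(e.c)
		fired = true
	})
	return fired
}

// Done returns a channel that is closed once the event has fired.
func (e *event) Done() <-chan struct{} { return e.c }

// concurrentFires calls Fire from n goroutines and reports how many of
// those calls actually fired the event.
func concurrentFires(n int) int {
	e := newEvent()
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		wins int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if e.Fire() {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	<-e.Done() // returns immediately: the event has fired
	return wins
}

func main() {
	fmt.Println("calls that fired the event:", concurrentFires(50))
}
```

With the select/default pattern quoted above, two goroutines can both observe the channel as still open and both call `close`, which panics. Here the program reports exactly one firing call no matter how the goroutines interleave.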

Labels

Area: Observability (Includes Stats, Tracing, Channelz, Healthz, Binlog, Reflection, Admin, GCP Observability)
Area: Testing (Includes tests and testing utilities that we have for unit and e2e tests within our repo.)
Type: Bug

Development

Successfully merging this pull request may close these issues.

otel: retries must be tracked per-call and not per-attempt

5 participants