stats/opentelemetry: restore the changes from #8342 and fix the flaky test.#8923
stats/opentelemetry: restore the changes from #8342 and fix the flaky test.#8923Pranjali-2501 wants to merge 2 commits intogrpc:masterfrom
Conversation
Fixes: grpc#8299 RELEASE NOTES: - stats/opentelemetry: Retry attempts (`grpc.previous-rpc-attempts`) are now recorded as span attributes for non-transparent client retries.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #8923 +/- ##
==========================================
+ Coverage 83.14% 83.23% +0.08%
==========================================
Files 417 417
Lines 32937 32943 +6
==========================================
+ Hits 27387 27420 +33
+ Misses 4117 4098 -19
+ Partials 1433 1425 -8
🚀 New features to boost your workflow:
|
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request successfully re-lands the changes to record retry attempts as OpenTelemetry span attributes and fixes a flaky test that caused the original revert. The fix for the race condition in the test TestTraceSpan_WithRetriesAndNameResolutionDelay using a custom blocking balancer is well-implemented and should reliably prevent flakiness. The refactoring of the tracing logic to move attempt counting to callInfo and handle client-specific attributes in client_tracing.go is a good improvement. I've found one issue in the test validation logic that should be addressed.
| const delayedResolutionEventName = "Delayed name resolution complete" | ||
|
|
||
| type blockingPicker struct { | ||
| sc atomic.Pointer[subConnWrapper] |
There was a problem hiding this comment.
Instead of atomically swapping out the subchannel, a new picker object should be created and used while calling UpdateState.
| } | ||
|
|
||
| // Create a SubConn with the addresses from the resolver. | ||
| if len(ccs.ResolverState.Addresses) > 0 { |
There was a problem hiding this comment.
Instead of implementing a partial LB policy, we should wrap an existing LB policy like pickfirst and delegate calls to it. We can intercept call from pickfirst to UpdateState wrap the picker to detect calls to pick.
| case <-p.pickInvoked: | ||
| default: | ||
| close(p.pickInvoked) |
There was a problem hiding this comment.
The Pick method can be called multiple times concurrently. This implementation can result in a panic since it can result in closing a channel multiple times. Instead of this, you can use an Event.
https://github.com/grpc/grpc-go/blob/master/internal/grpcsync/event.go
Fixes #8700
This PR re-lands the changes from #8342 , which were reverted due to a flaky test. It cherry-picks the original commits and adds a fix for the underlying race condition that caused the test
TestTraceSpan_WithRetriesAndNameResolutionDelayto flake.Problem
As mentioned here, the test was flaky because of a race condition between the load balancing policy creating a ready picker and the RPC attempting to pick a connection. If the picker became ready before the first Pick attempt, the RPC would not be delayed, the "Delayed LB pick complete" event would not be emitted, and the test would fail.
Fix
To solve the race condition, the test now uses a custom stub balancer and returns a blocking picker that guarantees the RPC will wait for a connection. As soon as the RPC attempts to Pick and is confirmed to be in a waiting state, the balancer then provides a valid, "non-blocking" picker, allowing the RPC to succeed. This sequence reliably triggers the
DelayedPickCompleteevent.RELEASE NOTES: