-
Notifications
You must be signed in to change notification settings - Fork 4.6k
stats/opentelemetry: record retry attempts from clientStream #8342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 50 commits
3ba457e
5d19779
4245950
5347db1
99e88d8
586cf63
1bdad7e
39c5f0d
11523b6
3720f4e
b97a2da
a0ef86c
ac79ad2
10c6a90
1048040
05e2cc8
ba08688
fe1831f
4f76f19
65da5d8
d489c91
8de4d5e
1654ba1
06f350c
b944353
daef268
33f89a5
2b6ff10
08b5e7c
e84407b
fcb1279
c6a254b
a3da45c
41a5eb4
7db3821
6fc9f84
eec064f
99a1b6e
bb239e3
72ff228
4144b46
2bec1a5
abf0f8b
8bc283f
9e2647d
c9c6f4c
919caa0
2141cd5
9b13b21
9f59e51
1660fcb
992343e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,6 +21,7 @@ import ( | |
"log" | ||
"strings" | ||
|
||
"go.opentelemetry.io/otel/attribute" | ||
otelcodes "go.opentelemetry.io/otel/codes" | ||
"go.opentelemetry.io/otel/trace" | ||
"google.golang.org/grpc" | ||
|
@@ -83,7 +84,10 @@ func (h *clientTracingHandler) finishTrace(err error, ts trace.Span) { | |
// It creates a new outgoing carrier which serializes information about this | ||
// span into gRPC Metadata, if TextMapPropagator is provided in the trace | ||
// options. if TextMapPropagator is not provided, it returns the context as is. | ||
func (h *clientTracingHandler) traceTagRPC(ctx context.Context, ai *attemptInfo, nameResolutionDelayed bool) (context.Context, *attemptInfo) { | ||
// | ||
// Note: The passed attemptInfo pointer (ai) is mutated in-place. Fields such as | ||
// ai.traceSpan are updated directly. No new attemptInfo is returned. | ||
func (h *clientTracingHandler) traceTagRPC(ctx context.Context, ai *attemptInfo, nameResolutionDelayed bool) context.Context { | ||
// Add a "Delayed name resolution complete" event to the call span | ||
// if there was name resolution delay. In case of multiple retry attempts, | ||
// ensure that event is added only once. | ||
|
@@ -98,7 +102,7 @@ func (h *clientTracingHandler) traceTagRPC(ctx context.Context, ai *attemptInfo, | |
carrier := otelinternaltracing.NewOutgoingCarrier(ctx) | ||
h.options.TraceOptions.TextMapPropagator.Inject(ctx, carrier) | ||
ai.traceSpan = span | ||
return carrier.Context(), ai | ||
return carrier.Context() | ||
} | ||
|
||
// createCallTraceSpan creates a call span to put in the provided context using | ||
|
@@ -121,7 +125,13 @@ func (h *clientTracingHandler) HandleConn(context.Context, stats.ConnStats) {} | |
// TagRPC implements per RPC attempt context management for traces. | ||
func (h *clientTracingHandler) TagRPC(ctx context.Context, info *stats.RPCTagInfo) context.Context { | ||
ctx, ai := getOrCreateRPCAttemptInfo(ctx) | ||
ctx, ai = h.traceTagRPC(ctx, ai, info.NameResolutionDelay) | ||
ci := getCallInfo(ctx) | ||
if ci == nil { | ||
logger.Error("context passed into client side stats handler (TagRPC) has no call info") | ||
return ctx | ||
} | ||
ai.previousRPCAttempts = ci.previousRPCAttempts | ||
ctx = h.traceTagRPC(ctx, ai, info.NameResolutionDelay) | ||
return setRPCInfo(ctx, &rpcInfo{ai: ai}) | ||
} | ||
|
||
|
@@ -132,5 +142,22 @@ func (h *clientTracingHandler) HandleRPC(ctx context.Context, rs stats.RPCStats) | |
logger.Error("ctx passed into client side tracing handler trace event handling has no client attempt data present") | ||
return | ||
} | ||
|
||
// Client-specific Begin attributes. | ||
var previousRPCAttempts int64 | ||
if ri.ai.previousRPCAttempts != nil { | ||
previousRPCAttempts = int64(ri.ai.previousRPCAttempts.Load()) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are we loading this here but only using it inside the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No need for the condition — I’ve removed the nil check as well. |
||
} | ||
if begin, ok := rs.(*stats.Begin); ok { | ||
ri.ai.traceSpan.SetAttributes( | ||
attribute.Bool("Client", begin.Client), | ||
attribute.Bool("FailFast", begin.FailFast), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's remove these now since they are not part of the spec There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done |
||
attribute.Int64("previous-rpc-attempts", previousRPCAttempts), | ||
attribute.Bool("transparent-retry", begin.IsTransparentRetryAttempt), | ||
) | ||
if !begin.IsTransparentRetryAttempt && ri.ai.previousRPCAttempts != nil { | ||
ri.ai.previousRPCAttempts.Add(1) | ||
} | ||
} | ||
populateSpan(rs, ri.ai) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for this to be false?
And is there any reason this is a pointer instead of just a value? Then we wouldn't need nil checks, or to construct it explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not possible for it to be nil — we always initialize it in getOrCreateCallInfo with ci.previousRPCAttempts = new(atomic.Uint32). I’ve removed the nil check as well. The pointer is used so that updates in ai are automatically reflected in ci.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh this is in the attemptInfo. Why are we keeping it here instead of just loading it out of the callInfo?
Also I realized that with hedging, the previous attempts accounting is racy:
Begin
simultaneouslyIt's possible for any of them to have any value now. You need to instead increment and read together:
Then they will all see unique values.
(We don't implement hedging yet, but we need to keep in mind that we will one day.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previousRPCAttempts represents the number of retries before the current attempt. We should record the existing retry count in attemptInfo before incrementing it. After that, we increment the retry count.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we need to atomically read and increment it, as I explained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Doug, I’m a bit confused. Initially, I had the increment here — https://github.com/grpc/grpc-go/blob/master/stats/opentelemetry/client_metrics.go#L78 — but after the review comments, I’m moved to trace. Could you clarify what exactly you’d like to change here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do the load and add separately, there is a race if multiple attempts are happening at once, with hedging. So we need to do the add and use the value it returns. It returns the incremented value, so we need to subtract 1 from it to get the original value. This way if multiple attempts are happening at the same time, then each one will get a unique value.
I think we should also see about removing the previousRPCAttempts pointer from the attemptInfo, and instead read it out of the callInfo, which is stored in the context too, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I'll update the code accordingly. I'll also double-check how it's stored in the context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback! I've updated the code.