-
Notifications
You must be signed in to change notification settings - Fork 262
Add distributed tracing and structured error handling for extensions #6321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add distributed tracing and structured error handling for extensions #6321
Conversation
wbreza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I worry that adding these tracing fields directly to all our messages will cause long term maintenance issues and require many updates and each new service will need to need to include the same values.
We should consider leveraging gRPC metadata that are passing along via HTTP2 headers similar to the approach we are taking with authorization.
Authorization is sent transparently to/from extensions via this method and so far has been easy to maintain/update.
cli/azd/grpc/proto/event.proto
Outdated
| // W3C traceparent format for distributed tracing propagation. | ||
| // Format: "00-{traceId}-{spanId}-{flags}" | ||
| string trace_parent = 98; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not liking the embedding of trace_parent as a message field, echoing what @wbreza mentioned earlier. This would make a lot of sense as a grpc header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@weikanglim @wbreza Can you take another look? I pushed some updates to use metadata, although it isn't strictly needed to fix #6292/#6316.
Setting the TRACEPARENT env var and updating the extension to hydrate the context from it is the actual fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't mind removing the grpc request-based tracing entirely in favor of ONLY TRACEPARENT process invocation tracing.
56fbab3 to
eea682a
Compare
Azure Dev CLI Install InstructionsInstall scriptsMacOS/Linux
bash: pwsh: WindowsPowerShell install MSI install Standalone Binary
MSI
Documentationlearn.microsoft.com documentationtitle: Azure Developer CLI reference
|
Fixes #6292
This PR enhances the extension framework by adding support for distributed tracing propagation and improving error handling telemetry. It ensures that trace contexts are preserved across the gRPC boundary between
azdand extensions, and that service-specific errors from extensions are correctly captured in telemetry.A follow-up PR will be made to update the agents extension to use these new capabilities.
Changes
Distributed Tracing
TRACEPARENTandTRACESTATEenvironment variables when launching extension process and addedazdext.NewContext()helper to hydrate extension context fromTRACEPARENT/TRACESTATE.traceparent/tracestateheader injection and extraction using gRPC client and server interceptors.azdext.NewContext()to hydrate the extension's root context fromTRACEPARENT/TRACESTATEenvironment variablese.Structured Error Handling
ExtensionResponseErrorto capture detailed error information includingErrorCode,StatusCode, andServiceName.MapErrorto recognize these structured errors and emit precise telemetry signals (e.g.,ext.service.openai.429instead of generic errors).FrameworkServiceErrorMessageandServiceTargetErrorMessagedefinitions to transport structured error data.Bug Fixes
SendAndWaitandSendAndWaitWithProgressinMessageBrokerto prevent race conditions during concurrent stream writes.Validation
Tracing Comparison:
{"Status": {
"Code": "Error",
"Description": "UnknownError"
},
"Attributes": { ... }
...
}
{"Status": {
"Code": "Error",
"Description": "ext.service.ai.500"
},
"Attributes": {
"error.service.name": "ai",
"error.service.host": "services.ai.azure.com",
"error.service.statusCode": "500",
...
}
...
}