feat: instrument controller with distributed tracing and A2A trace propagation#1433
Conversation
| docker rm -f jaeger-desktop || true | ||
| docker run -d --name jaeger-desktop --restart=always -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/jaeger:2.7.0 | ||
| open http://localhost:16686/ | ||
| @echo "Jaeger UI available at http://localhost:16686/" |
There was a problem hiding this comment.
open is MacOS specific
6b19792 to
7c92926
Compare
| assert instrument_calls["google_instrumented"] is True | ||
|
|
||
|
|
||
| def test_otel_sdk_default_propagator_includes_w3c_tracecontext(): |
There was a problem hiding this comment.
Added this test to ensure that text propagation remains in place across underlying SDK upgrades - without needing any set_global_textmap call.
| exporter: | ||
| otlp: | ||
| endpoint: "" | ||
| protocol: "grpc" |
There was a problem hiding this comment.
autoexport defaults to http/protobuf so might be worth explicitly calling this out in release notes as anyone who uses http will need to ensure this is set correctly.
74f325d to
cc49458
Compare
|
Don't think |
There was a problem hiding this comment.
Pull request overview
Adds end-to-end OpenTelemetry tracing to the Go controller HTTP/API and improves trace context propagation across controller→agent A2A proxy calls, while reducing trace noise in the Python agent by excluding high-frequency endpoints.
Changes:
- Initialize an OTEL TracerProvider in the Go controller and instrument incoming HTTP requests with
otelhttp. - Inject W3C TraceContext headers into outbound A2A proxy requests and add an
invoke_agenttracing middleware with GenAI semantic attributes. - Update Helm OTEL configuration (protocol env vars) and exclude
/.well-known/agent-card.jsonfrom Python traces.
Reviewed changes
Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| python/packages/kagent-core/tests/test_tracing_configure.py | Adds a regression test asserting W3C TraceContext is present in default propagators. |
| python/packages/kagent-core/src/kagent/core/tracing/_utils.py | Excludes agent-card endpoint from HTTPX/FastAPI instrumentation to reduce trace noise. |
| helm/kagent/values.yaml | Adds default OTLP protocol for tracing exporter values. |
| helm/kagent/templates/controller-deployment.yaml | Injects pod/node env vars for richer OTEL resource attributes. |
| helm/kagent/templates/controller-configmap.yaml | Wires OTLP protocol env vars into controller config. |
| go/go.work.sum | Updates workspace dependency sums. |
| go/core/pkg/app/app.go | Initializes tracing on controller startup and flushes on shutdown. |
| go/core/internal/telemetry/tracing.go | New OTEL tracer provider setup using autoexport + resource attributes. |
| go/core/internal/httpserver/server.go | Wraps router with otelhttp handler and filters health checks. |
| go/core/internal/a2a/trace_test.go | Adds tests for trace header injection and A2A middleware span attributes. |
| go/core/internal/a2a/trace.go | Adds A2A tracing middleware, outbound TraceContext injection, provider name resolution. |
| go/core/internal/a2a/a2a_registrar.go | Wraps outbound A2A request handler with TraceContext injection; registers tracing middleware. |
| go/core/internal/a2a/a2a_handler_mux.go | Extends handler registration to accept optional tracing middleware. |
| go/core/go.sum | Updates Go module sums for new/updated dependencies. |
| go/core/go.mod | Updates/expands indirect OTEL and related dependencies. |
| Makefile | Adjusts local Jaeger target output to avoid OS-specific open. |
Comments suppressed due to low confidence (2)
helm/kagent/values.yaml:410
otel.tracing.exporter.otlp.protocolwas added, butotel.logging.exporter.otlphas no matchingprotocolfield. Since the controller ConfigMap now wires protocol env vars, consider adding aotel.logging.exporter.otlp.protocolvalue (defaulting to the same as tracing) so users can keep logs aligned with their endpoint type (gRPC vs HTTP/protobuf), especially when using separate endpoints.
otel:
tracing:
enabled: false
exporter:
otlp:
endpoint: ""
protocol: "grpc"
timeout: 15
insecure: true
logging:
enabled: false
exporter:
otlp:
endpoint: ""
timeout: 15
insecure: true
helm/kagent/templates/controller-configmap.yaml:50
- The chart sets
OTEL_EXPORTER_OTLP_PROTOCOL/OTEL_EXPORTER_OTLP_TRACES_PROTOCOLfrom the tracing config, but there is no correspondingOTEL_EXPORTER_OTLP_LOGS_PROTOCOLwhen traces/logs use separate endpoints. This can leave logs exporting with the SDK default protocol (often http/protobuf) while the endpoint is gRPC (4317), causing log export failures. Consider adding aotel.logging.exporter.otlp.protocolvalue and wiring it toOTEL_EXPORTER_OTLP_LOGS_PROTOCOL(or reusing the tracing protocol explicitly) in the separate-endpoints branch.
OTEL_EXPORTER_OTLP_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }}
{{- else }}
# Using separate endpoints for traces and logs
{{- if $tracesEndpoint }}
OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: {{ $tracesEndpoint | quote }}
OTEL_EXPORTER_OTLP_TRACES_INSECURE: {{ .Values.otel.tracing.exporter.otlp.insecure | quote }}
OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }}
OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: {{ .Values.otel.tracing.exporter.otlp.timeout | quote }}
{{- end }}
{{- if $logsEndpoint }}
OTEL_EXPORTER_OTLP_LOGS_ENDPOINT: {{ $logsEndpoint | quote }}
OTEL_EXPORTER_OTLP_LOGS_INSECURE: {{ .Values.otel.logging.exporter.otlp.insecure | quote }}
OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: {{ .Values.otel.logging.exporter.otlp.timeout | quote }}
{{- end }}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
cc49458 to
a271019
Compare
krisztianfekete
left a comment
There was a problem hiding this comment.
Very nice work, added a couple of minor comments!
| otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string { | ||
| return r.Method + " " + r.URL.Path | ||
| }), | ||
| otelhttp.WithFilter(func(r *http.Request) bool { |
There was a problem hiding this comment.
Would it make sense to filter A2A spans as well to reduce redundancy? Not sure how noisy this can get.
There was a problem hiding this comment.
Hm checking the middleware, looks like if we do this the invoke_agent span would become a root span instead of a child.
There was a problem hiding this comment.
Yeah, we want the nesting in this case.
| "os" | ||
|
|
||
| "github.com/google/uuid" | ||
| "go.opentelemetry.io/contrib/exporters/autoexport" |
There was a problem hiding this comment.
Cannot we just initiate an exporter based on OTEL_EXPORTER_OTLP_PROTOCOL? The rest would would be unused and using them would require code changes anyway due to how we set OTEL_EXPORTER_OTLP_PROTOCOL in Helm.
There was a problem hiding this comment.
I guess we could, although I feel like we're just going to end up duplicating what autoexport does since users will probably expect us to fully comply with the standard hierarchy of env var configuration . E.g. OTEL_EXPORTER_OTLP_PROTOCOL applying to both logs and traces vs the use of OTEL_EXPORTER_OTLP_TRACES_PROTOCOL and OTEL_EXPORTER_OTLP_LOGS_PROTOCOL individually. See also https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/ and https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#environment-variables
Personally, I'd prefer relying on the upstream here rather than doing our own thing.
| return genAIProviderName(mc.Spec.Provider) | ||
| } | ||
|
|
||
| // genAIProviderName maps kagent's ModelProvider values to the standard |
| OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: {{ .Values.otel.tracing.exporter.otlp.timeout | quote }} | ||
| OTEL_EXPORTER_OTLP_LOGS_INSECURE: {{ .Values.otel.logging.exporter.otlp.insecure | quote }} | ||
| OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: {{ .Values.otel.logging.exporter.otlp.timeout | quote }} | ||
| OTEL_EXPORTER_OTLP_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }} |
There was a problem hiding this comment.
Can we move this inside {{- if $tracesEndpoint }}?
There was a problem hiding this comment.
It is already there albeit slightly differently named as per the current convention of either handling traces and logs together or individually.
61153ee to
6864802
Compare
…opagation Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
6864802 to
7a1ad86
Compare
krisztianfekete
left a comment
There was a problem hiding this comment.
CI seems to failing (might be transient), otherwise LGTM, thank you for the PR!
This PR adds OpenTelemetry distributed tracing to the kagent controller API, fixes trace context propagation across A2A agent calls, and cleans up noise in the existing Python agent tracing.
Fixes #1295 essentially replacing a chunk of #1297 (it does not address being able to set the
appProtocolon theAgentService- that is a separate concern IMO).