
feat: instrument controller with distributed tracing and A2A trace propagation#1433

Merged
EItanya merged 1 commit into kagent-dev:main from onematchfox:fix-trace-propagation
Mar 10, 2026

Conversation


@onematchfox onematchfox commented Mar 5, 2026

This PR adds OpenTelemetry distributed tracing to the kagent controller API, fixes trace context propagation across A2A agent calls, and cleans up noise in the existing Python agent tracing.

Fixes #1295, essentially replacing a chunk of #1297 (it does not address being able to set the appProtocol on the Agent Service; that is a separate concern IMO).
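For context, the trace context being propagated here travels in a W3C traceparent header of the form version-traceid-spanid-flags. A minimal, language-agnostic sketch of injecting and parsing such a header (plain Python for illustration; the actual controller is Go and uses OpenTelemetry's propagators rather than hand-rolled code):

```python
# Sketch of W3C TraceContext propagation. The header layout is from the W3C
# Trace Context spec: 2-digit version, 32-hex trace id, 16-hex span id, 2-hex flags.

def inject_traceparent(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Inject a W3C traceparent header into an outbound request's headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str):
    """Split a traceparent header into its four fields, with basic validation."""
    version, trace_id, span_id, flags = header.split("-")
    assert len(trace_id) == 32 and len(span_id) == 16
    return version, trace_id, span_id, flags

headers = {}
inject_traceparent(headers, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
print(headers["traceparent"])
# -> 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```

The fix in this PR is essentially making sure the controller emits such a header on its outbound A2A calls so agent pods can continue the trace.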

docker rm -f jaeger-desktop || true
docker run -d --name jaeger-desktop --restart=always -p 16686:16686 -p 4317:4317 -p 4318:4318 jaegertracing/jaeger:2.7.0
open http://localhost:16686/
@echo "Jaeger UI available at http://localhost:16686/"
Contributor Author

open is macOS-specific.

@onematchfox onematchfox force-pushed the fix-trace-propagation branch from 6b19792 to 7c92926 Compare March 5, 2026 09:49
assert instrument_calls["google_instrumented"] is True


def test_otel_sdk_default_propagator_includes_w3c_tracecontext():
Contributor Author

Added this test to ensure that trace context propagation remains in place across underlying SDK upgrades, without needing any set_global_textmap call.
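In the same spirit, here is a toy version of the invariant such a regression test guards, in plain Python rather than the real opentelemetry SDK APIs: a carrier round-trip through a tracecontext-style propagator must preserve the trace and span ids.

```python
# Toy text-map propagator, illustrating the invariant the regression test
# protects: trace context survives inject -> extract across a request boundary.
# (Illustrative only; the real test exercises the SDK's default propagators.)

class ToyTraceContextPropagator:
    HEADER = "traceparent"

    def inject(self, carrier: dict, trace_id: str, span_id: str) -> None:
        carrier[self.HEADER] = f"00-{trace_id}-{span_id}-01"

    def extract(self, carrier: dict):
        _, trace_id, span_id, _ = carrier[self.HEADER].split("-")
        return trace_id, span_id

prop = ToyTraceContextPropagator()
carrier = {}
prop.inject(carrier, "a" * 32, "b" * 16)
assert prop.extract(carrier) == ("a" * 32, "b" * 16)
print("round-trip ok")
```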

exporter:
otlp:
endpoint: ""
protocol: "grpc"
Contributor Author

autoexport defaults to http/protobuf, so it might be worth explicitly calling this out in the release notes: anyone who uses HTTP will need to ensure this is set correctly.
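One quick sanity check when reviewing values: the conventional OTLP default ports imply the protocol (4317 for grpc, 4318 for http/protobuf). A small hypothetical helper, purely illustrative, that flags a likely endpoint/protocol mismatch:

```python
# Hypothetical lint for OTLP config: the conventional default ports are
# 4317 (grpc) and 4318 (http/protobuf), so an endpoint on one port with the
# other protocol configured is usually a mistake.
from urllib.parse import urlparse

PORT_TO_PROTOCOL = {4317: "grpc", 4318: "http/protobuf"}

def likely_mismatch(endpoint: str, protocol: str) -> bool:
    port = urlparse(endpoint).port
    expected = PORT_TO_PROTOCOL.get(port)
    # Only flag when the port is one of the well-known defaults.
    return expected is not None and expected != protocol

print(likely_mismatch("http://jaeger:4317", "grpc"))  # False
print(likely_mismatch("http://jaeger:4318", "grpc"))  # True: 4318 expects http/protobuf
```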

@onematchfox onematchfox force-pushed the fix-trace-propagation branch 3 times, most recently from 74f325d to cc49458 Compare March 5, 2026 10:08
@onematchfox onematchfox marked this pull request as ready for review March 5, 2026 10:31
Copilot AI review requested due to automatic review settings March 5, 2026 10:31
@onematchfox
Contributor Author

I don't think the test-e2e failure is related to this PR, but let me know if there's something I need to fix.

Contributor

Copilot AI left a comment

Pull request overview

Adds end-to-end OpenTelemetry tracing to the Go controller HTTP/API and improves trace context propagation across controller→agent A2A proxy calls, while reducing trace noise in the Python agent by excluding high-frequency endpoints.

Changes:

  • Initialize an OTEL TracerProvider in the Go controller and instrument incoming HTTP requests with otelhttp.
  • Inject W3C TraceContext headers into outbound A2A proxy requests and add an invoke_agent tracing middleware with GenAI semantic attributes.
  • Update Helm OTEL configuration (protocol env vars) and exclude /.well-known/agent-card.json from Python traces.
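The invoke_agent middleware mentioned above attaches GenAI semantic attributes to each agent-call span. A sketch of the kind of attribute set involved, in Python for illustration (the attribute keys follow my reading of the OpenTelemetry GenAI semantic conventions and are assumptions here, not a transcript of the PR's Go code):

```python
# Illustrative span attributes for an invoke_agent span, per the OpenTelemetry
# GenAI semantic conventions. Keys and values are assumed for illustration.

def invoke_agent_attributes(agent_name: str, provider: str) -> dict:
    return {
        "gen_ai.operation.name": "invoke_agent",  # the operation kind
        "gen_ai.agent.name": agent_name,          # which agent was called
        "gen_ai.provider.name": provider,         # resolved model provider
    }

attrs = invoke_agent_attributes("kagent/my-agent", "openai")
print(attrs["gen_ai.operation.name"])  # invoke_agent
```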

Reviewed changes

Copilot reviewed 15 out of 16 changed files in this pull request and generated 3 comments.

Summary per file:

  • python/packages/kagent-core/tests/test_tracing_configure.py: Adds a regression test asserting W3C TraceContext is present in default propagators.
  • python/packages/kagent-core/src/kagent/core/tracing/_utils.py: Excludes agent-card endpoint from HTTPX/FastAPI instrumentation to reduce trace noise.
  • helm/kagent/values.yaml: Adds default OTLP protocol for tracing exporter values.
  • helm/kagent/templates/controller-deployment.yaml: Injects pod/node env vars for richer OTEL resource attributes.
  • helm/kagent/templates/controller-configmap.yaml: Wires OTLP protocol env vars into controller config.
  • go/go.work.sum: Updates workspace dependency sums.
  • go/core/pkg/app/app.go: Initializes tracing on controller startup and flushes on shutdown.
  • go/core/internal/telemetry/tracing.go: New OTEL tracer provider setup using autoexport + resource attributes.
  • go/core/internal/httpserver/server.go: Wraps router with otelhttp handler and filters health checks.
  • go/core/internal/a2a/trace_test.go: Adds tests for trace header injection and A2A middleware span attributes.
  • go/core/internal/a2a/trace.go: Adds A2A tracing middleware, outbound TraceContext injection, provider name resolution.
  • go/core/internal/a2a/a2a_registrar.go: Wraps outbound A2A request handler with TraceContext injection; registers tracing middleware.
  • go/core/internal/a2a/a2a_handler_mux.go: Extends handler registration to accept optional tracing middleware.
  • go/core/go.sum: Updates Go module sums for new/updated dependencies.
  • go/core/go.mod: Updates/expands indirect OTEL and related dependencies.
  • Makefile: Adjusts local Jaeger target output to avoid OS-specific open.
Comments suppressed due to low confidence (2)

helm/kagent/values.yaml:410

  • otel.tracing.exporter.otlp.protocol was added, but otel.logging.exporter.otlp has no matching protocol field. Since the controller ConfigMap now wires protocol env vars, consider adding an otel.logging.exporter.otlp.protocol value (defaulting to the same as tracing) so users can keep logs aligned with their endpoint type (gRPC vs HTTP/protobuf), especially when using separate endpoints.
otel:
  tracing:
    enabled: false
    exporter:
      otlp:
        endpoint: ""
        protocol: "grpc"
        timeout: 15
        insecure: true
  logging:
    enabled: false
    exporter:
      otlp:
        endpoint: ""
        timeout: 15
        insecure: true

helm/kagent/templates/controller-configmap.yaml:50

  • The chart sets OTEL_EXPORTER_OTLP_PROTOCOL / OTEL_EXPORTER_OTLP_TRACES_PROTOCOL from the tracing config, but there is no corresponding OTEL_EXPORTER_OTLP_LOGS_PROTOCOL when traces/logs use separate endpoints. This can leave logs exporting with the SDK default protocol (often http/protobuf) while the endpoint is gRPC (4317), causing log export failures. Consider adding an otel.logging.exporter.otlp.protocol value and wiring it to OTEL_EXPORTER_OTLP_LOGS_PROTOCOL (or reusing the tracing protocol explicitly) in the separate-endpoints branch.
  OTEL_EXPORTER_OTLP_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }}
  {{- else }}
  # Using separate endpoints for traces and logs
  {{- if $tracesEndpoint }}
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: {{ $tracesEndpoint | quote }}
  OTEL_EXPORTER_OTLP_TRACES_INSECURE: {{ .Values.otel.tracing.exporter.otlp.insecure | quote }}
  OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }}
  OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: {{ .Values.otel.tracing.exporter.otlp.timeout | quote }}
  {{- end }}
  {{- if $logsEndpoint }}
  OTEL_EXPORTER_OTLP_LOGS_ENDPOINT: {{ $logsEndpoint | quote }}
  OTEL_EXPORTER_OTLP_LOGS_INSECURE: {{ .Values.otel.logging.exporter.otlp.insecure | quote }}
  OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: {{ .Values.otel.logging.exporter.otlp.timeout | quote }}
  {{- end }}


@onematchfox onematchfox force-pushed the fix-trace-propagation branch from cc49458 to a271019 Compare March 5, 2026 10:53
Contributor

@krisztianfekete krisztianfekete left a comment

Very nice work, added a couple of minor comments!

otelhttp.WithSpanNameFormatter(func(_ string, r *http.Request) string {
return r.Method + " " + r.URL.Path
}),
otelhttp.WithFilter(func(r *http.Request) bool {
Contributor

Would it make sense to filter A2A spans as well to reduce redundancy? Not sure how noisy this can get.

Contributor

Hm, checking the middleware, it looks like if we do this the invoke_agent span would become a root span instead of a child.

Contributor Author

Yeah, we want the nesting in this case.

"os"

"github.com/google/uuid"
"go.opentelemetry.io/contrib/exporters/autoexport"
Contributor

Can't we just initialize an exporter based on OTEL_EXPORTER_OTLP_PROTOCOL? The rest would be unused, and using them would require code changes anyway due to how we set OTEL_EXPORTER_OTLP_PROTOCOL in Helm.

Contributor Author

I guess we could, although I feel like we're just going to end up duplicating what autoexport does, since users will probably expect us to fully comply with the standard hierarchy of env var configuration, e.g. OTEL_EXPORTER_OTLP_PROTOCOL applying to both logs and traces vs the use of OTEL_EXPORTER_OTLP_TRACES_PROTOCOL and OTEL_EXPORTER_OTLP_LOGS_PROTOCOL individually. See also https://opentelemetry.io/docs/languages/sdk-configuration/otlp-exporter/ and https://github.com/open-telemetry/opentelemetry-specification/blob/main/spec-compliance-matrix.md#environment-variables

Personally, I'd prefer relying on the upstream here rather than doing our own thing.
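The hierarchy being referenced can be sketched as a precedence chain: the per-signal variable wins over the generic one, which wins over the SDK default. A minimal Python model of that lookup order (the env var names come from the OTLP exporter spec; the http/protobuf fallback matches the default mentioned above):

```python
# Sketch of the spec-defined OTLP protocol precedence:
#   per-signal var > generic var > SDK default (http/protobuf).
import os

def resolve_otlp_protocol(signal: str, env=None) -> str:
    """Resolve the OTLP protocol for a signal such as 'TRACES' or 'LOGS'."""
    env = os.environ if env is None else env
    return (
        env.get(f"OTEL_EXPORTER_OTLP_{signal}_PROTOCOL")
        or env.get("OTEL_EXPORTER_OTLP_PROTOCOL")
        or "http/protobuf"
    )

env = {"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc"}
print(resolve_otlp_protocol("TRACES", env))  # grpc (inherited from the generic var)
env["OTEL_EXPORTER_OTLP_LOGS_PROTOCOL"] = "http/protobuf"
print(resolve_otlp_protocol("LOGS", env))    # http/protobuf (per-signal override wins)
```

This is the duplication concern: an exporter hand-rolled around only OTEL_EXPORTER_OTLP_PROTOCOL would have to reimplement exactly this chain, which autoexport already provides.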

return genAIProviderName(mc.Spec.Provider)
}

// genAIProviderName maps kagent's ModelProvider values to the standard
Contributor

Very nice!
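For readers outside the diff, the mapping being praised can be sketched as follows, in Python rather than the PR's Go. The ModelProvider keys and GenAI provider names below are assumptions for illustration, not copied from the PR:

```python
# Hypothetical mapping from kagent ModelProvider values to well-known provider
# names in the GenAI semantic conventions. Keys/values are illustrative.

PROVIDER_NAMES = {
    "OpenAI": "openai",
    "Anthropic": "anthropic",
    "Ollama": "ollama",
}

def gen_ai_provider_name(model_provider: str) -> str:
    # Fall back to a lowercased form for providers without a standard name.
    return PROVIDER_NAMES.get(model_provider, model_provider.lower())

print(gen_ai_provider_name("Anthropic"))  # anthropic
print(gen_ai_provider_name("CustomLLM"))  # customllm
```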

OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: {{ .Values.otel.tracing.exporter.otlp.timeout | quote }}
OTEL_EXPORTER_OTLP_LOGS_INSECURE: {{ .Values.otel.logging.exporter.otlp.insecure | quote }}
OTEL_EXPORTER_OTLP_LOGS_TIMEOUT: {{ .Values.otel.logging.exporter.otlp.timeout | quote }}
OTEL_EXPORTER_OTLP_PROTOCOL: {{ .Values.otel.tracing.exporter.otlp.protocol | quote }}
Contributor

Can we move this inside {{- if $tracesEndpoint }}?

Contributor Author

It is already there, albeit slightly differently named, as per the current convention of handling traces and logs either together or individually.

…opagation

Signed-off-by: Brian Fox <878612+onematchfox@users.noreply.github.com>
@onematchfox onematchfox force-pushed the fix-trace-propagation branch from 6864802 to 7a1ad86 Compare March 10, 2026 11:35
Contributor

@krisztianfekete krisztianfekete left a comment

CI seems to be failing (might be transient); otherwise LGTM, thank you for the PR!

@EItanya EItanya merged commit 413d688 into kagent-dev:main Mar 10, 2026
41 of 42 checks passed


Development

Successfully merging this pull request may close these issues.

Trace context (traceparent) not propagated from controller to agent pods
