Improve distributed tracing instrumentation coverage #9701
Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
Feature request
After the OpenCensus to OpenTelemetry migration (#9043), the tracing infrastructure is solid but the instrumentation coverage has significant gaps. This umbrella issue tracks the work to bring tracing from ~40% coverage to comprehensive end-to-end observability.
Completed
- chore: bump OpenTelemetry semconv to match SDK version #9697 - Bump semconv v1.12.0 to v1.40.0
- fix: propagate trace span context to child PipelineRuns and CustomRuns #9698 - Propagate span context to child PipelineRuns (PiP) and CustomRuns
- fix: end root tracing span at reconciliation completion #9699 - Fix root span lifecycle (end at reconciliation completion, not init)
- feat: add sampling-ratio configuration for distributed tracing #9700 - Add sampling-ratio configuration for production deployments
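The sampling-ratio work in #9700 implies parsing an operator-supplied config value into a ratio in [0, 1] before handing it to a ratio-based sampler such as OTel's `TraceIDRatioBased`. A minimal, stdlib-only sketch of that parsing/clamping step (the function name and fallback behavior are illustrative, not Tekton's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSamplingRatio converts a config value such as "0.1" into a ratio
// usable by a TraceIDRatioBased-style sampler. Invalid values fall back
// to a default; out-of-range values are clamped. Illustrative sketch only.
func parseSamplingRatio(raw string, def float64) float64 {
	r, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return def
	}
	// Clamp into [0, 1]: 0 disables sampling, 1 samples every trace.
	if r < 0 {
		return 0
	}
	if r > 1 {
		return 1
	}
	return r
}

func main() {
	fmt.Println(parseSamplingRatio("0.25", 1.0)) // valid value passes through
	fmt.Println(parseSamplingRatio("bogus", 1.0)) // unparseable: use default
	fmt.Println(parseSamplingRatio("7", 1.0))     // out of range: clamped
}
```

In production a ratio like 0.1 keeps tracing overhead bounded while still surfacing representative slow runs.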
Remaining work
Medium effort:
- Record errors on trace spans in reconcilers - ~30 error paths across the TaskRun and PipelineRun reconcilers only log errors without calling `span.RecordError()` or `span.SetStatus(Error)`. Traces show all spans as OK even when runs fail.
- Add tracing to the resolver framework - the git, hub, cluster, and HTTP resolvers have zero OTel instrumentation. When a TaskRun sits in `ResolvingTaskRef` for 30 seconds, there is no trace visibility into why.
- Add spans to `cancelPipelineRun` and `timeoutPipelineRun` - these standalone functions patch N child resources in sequence with zero trace visibility.
Large effort:
- Add tracing to entrypoint step execution - the entrypoint binary (`cmd/entrypoint/`) has zero OTel instrumentation. Step execution (waiting, running commands, collecting results) is a complete gap in traces. Requires trace context injection via pod environment variables, OTel SDK initialization in the entrypoint, and a span export path.
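Trace context injection via pod environment variables usually means serializing the W3C `traceparent` value into an env var on the step container and re-parsing it inside the entrypoint. A stdlib-only sketch of the parse side (in a real implementation an OTel propagator would consume this value; the helper here is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent value of the form
// "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>" into its trace ID
// and parent span ID. Illustrative sketch: a real entrypoint would hand
// the raw value to an OTel TextMapPropagator instead of parsing by hand.
func parseTraceparent(v string) (traceID, spanID string, ok bool) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	// The controller would set an env var like this on the step container,
	// letting entrypoint spans join the controller's trace.
	tp := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	traceID, spanID, ok := parseTraceparent(tp)
	fmt.Println(traceID, spanID, ok)
}
```

With the parent span ID recovered, the entrypoint can start its step spans as children, closing the data-plane gap described above.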
Minor improvements:
- Add span attributes for outcome (status, failure reason, step count)
- Link metrics to traces via exemplars
- Remove duplicate root span pattern (initTracing root + ReconcileKind root)
Use case
As a platform operator running Tekton in production, I want comprehensive distributed traces so that when a PipelineRun takes longer than expected, I can use Jaeger/Tempo to identify exactly which stage (resolution, pod creation, step execution, result extraction) is the bottleneck. Currently, traces show controller-level decisions but are blind to the data plane (step execution) and resolution pipeline.
Related
- feat(metrics): Migrate from OpenCensus to OpenTelemetry #9043 - OpenCensus to OpenTelemetry migration (completed)
- fix: update default tracing endpoint to http protobuf endpoint #9141 - Fix default tracing endpoint (completed)
- Help configuring Distributed Tracing using env var #8535 - Help configuring distributed tracing
- Results, TerminationMessage and Containers #4808 - Results, TerminationMessage and Containers (related to entrypoint tracing)