Improve distributed tracing instrumentation coverage #9701
Open
Labels
kind/feature: Categorizes issue or PR as related to a new feature.
Description
Feature request
After the OpenCensus to OpenTelemetry migration (#9043), the tracing infrastructure is solid but the instrumentation coverage has significant gaps. This umbrella issue tracks the work to bring tracing from ~40% coverage to comprehensive end-to-end observability.
Completed
- chore: bump OpenTelemetry semconv to match SDK version #9697 - Bump semconv v1.12.0 to v1.40.0
- fix: propagate trace span context to child PipelineRuns and CustomRuns #9698 - Propagate span context to child PipelineRuns (PiP) and CustomRuns
- fix: end root tracing span at reconciliation completion #9699 - Fix root span lifecycle (end at reconciliation completion, not init)
- feat: add sampling-ratio configuration for distributed tracing #9700 - Add sampling-ratio configuration for production deployments
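The sampling-ratio work in #9700 implies parsing an operator-supplied config value into a ratio in [0, 1] before handing it to a ratio-based sampler such as OTel's `TraceIDRatioBased`. A minimal, stdlib-only sketch of that parsing/clamping step (the function name and fallback behavior are illustrative, not Tekton's actual code):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseSamplingRatio converts a config value such as "0.1" into a ratio
// usable by a TraceIDRatioBased-style sampler. Invalid values fall back
// to a default; out-of-range values are clamped. Illustrative sketch only.
func parseSamplingRatio(raw string, def float64) float64 {
	r, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return def
	}
	// Clamp into [0, 1]: 0 disables sampling, 1 samples every trace.
	if r < 0 {
		return 0
	}
	if r > 1 {
		return 1
	}
	return r
}

func main() {
	fmt.Println(parseSamplingRatio("0.25", 1.0)) // valid value passes through
	fmt.Println(parseSamplingRatio("bogus", 1.0)) // unparseable: use default
	fmt.Println(parseSamplingRatio("7", 1.0))     // out of range: clamped
}
```

In production a ratio like 0.1 keeps tracing overhead bounded while still surfacing representative slow runs.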
Remaining work
Medium effort:
- Record errors on trace spans in reconcilers - ~30 error paths across the TaskRun and PipelineRun reconcilers only log errors without calling `span.RecordError()` or `span.SetStatus(Error)`. Traces show all spans as OK even when runs fail.
- Add tracing to the resolver framework - the git, hub, cluster, and HTTP resolvers have zero OTel instrumentation. When a TaskRun sits in `ResolvingTaskRef` for 30 seconds, there is no trace visibility into why.
- Add spans to `cancelPipelineRun` and `timeoutPipelineRun` - these standalone functions patch N child resources in sequence with zero trace visibility.
Large effort:
- Add tracing to entrypoint step execution - the entrypoint binary (`cmd/entrypoint/`) has zero OTel instrumentation. Step execution (waiting, running commands, collecting results) is a complete gap in traces. Requires trace context injection via pod environment variables, OTel SDK initialization in the entrypoint, and a span export path.
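Trace context injection via pod environment variables usually means serializing the W3C `traceparent` value into an env var on the step container and re-parsing it inside the entrypoint. A stdlib-only sketch of the parse side (in a real implementation an OTel propagator would consume this value; the helper here is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent value of the form
// "00-<32 hex trace-id>-<16 hex span-id>-<2 hex flags>" into its trace ID
// and parent span ID. Illustrative sketch: a real entrypoint would hand
// the raw value to an OTel TextMapPropagator instead of parsing by hand.
func parseTraceparent(v string) (traceID, spanID string, ok bool) {
	parts := strings.Split(v, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", false
	}
	return parts[1], parts[2], true
}

func main() {
	// The controller would set an env var like this on the step container,
	// letting entrypoint spans join the controller's trace.
	tp := "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
	traceID, spanID, ok := parseTraceparent(tp)
	fmt.Println(traceID, spanID, ok)
}
```

With the parent span ID recovered, the entrypoint can start its step spans as children, closing the data-plane gap described above.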
Minor improvements:
- Add span attributes for outcome (status, failure reason, step count)
- Link metrics to traces via exemplars
- Remove duplicate root span pattern (initTracing root + ReconcileKind root)
Use case
As a platform operator running Tekton in production, I want comprehensive distributed traces so that when a PipelineRun takes longer than expected, I can use Jaeger/Tempo to identify exactly which stage (resolution, pod creation, step execution, result extraction) is the bottleneck. Currently, traces show controller-level decisions but are blind to the data plane (step execution) and resolution pipeline.
Related
- feat(metrics): Migrate from OpenCensus to OpenTelemetry #9043 - OpenCensus to OpenTelemetry migration (completed)
- fix: update default tracing endpoint to http protobuf endpoint #9141 - Fix default tracing endpoint (completed)
- Help configuring Distributed Tracing using env var #8535 - Help configuring distributed tracing
- Results, TerminationMessage and Containers #4808 - Results, TerminationMessage and Containers (related to entrypoint tracing)