Scalability and reconcile behavior for completed PipelineRuns

## Summary
Today we retain `PipelineRun` objects for about 24 hours, which keeps roughly 2k runs in scope. We would like to extend retention to at least 7 days, which we expect to bring the steady-state count to about 14k `PipelineRun`s (same workload profile, longer history).
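
As a back-of-the-envelope check on that estimate, assume a roughly constant creation rate $\lambda$ and a retention window $T$ (both symbols are introduced here for illustration). The steady-state count is then about $N \approx \lambda T$:

$$
\lambda \approx \frac{2000\ \text{runs}}{1\ \text{day}}, \qquad N_{7\,\text{days}} \approx \lambda \cdot 7\ \text{days} \approx 14000\ \text{runs}
$$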

The main driver is operational: we want completed runs to remain visible longer so teams can use the Tekton Dashboard for debugging. The need is amplified because Tekton Dashboard is not compatible with Tekton Results, so we have to rely on in-cluster `PipelineRun` retention for historical visibility.

That retention change amplifies concerns we already have about controller and watch scalability (memory, reconcile churn, etcd pressure) as finished runs accumulate.

## Operational need
- Move from roughly 2k to about 14k `PipelineRun`s (with a longer TTL, e.g. 7 days).
- Keep steady-state controller cost and memory under control as completed runs accumulate.
- Avoid unnecessary work on runs that are finished and no longer need controller reconciliation.

## What we observed (reconcile + "done")
- `IsDone()` is based on the `Succeeded` condition no longer being `Unknown` (terminal success/failure from a pipeline perspective).
- Post-completion work (e.g. Affinity Assistant / PVC cleanup and related updates) runs in the `IsDone()` branch of `ReconcileKind`, after the run is already terminal from a user/API perspective.
- The PipelineRun informer effectively tracks all `PipelineRun`s in scope. Completed runs remain in the cache; events (labels, annotations, resyncs, related object updates) can still enqueue work. For large numbers of historical runs, that means a lot of reconcile churn on objects that are already terminal, with limited benefit.
- Filtering events so that terminal runs are never enqueued is subtle: on cold start, runs that are already done may only appear as adds to the informer. If the filter drops "done" runs entirely, the controller might never run the post-completion path for those objects unless another mechanism guarantees it—so "done" (pipeline) and "controller finished housekeeping" are not the same concept.
- Separately, approaches like FilteredInformer / label selectors usually need a stable criterion on the object that the API server can select on (often labels), while clear semantics for "pipeline finished" vs "controller finished post-processing" often point to status (e.g. an additional condition or field).

⚠️ We do not claim the above is exhaustive; it reflects our reading of `pkg/reconciler/pipelinerun` and related code paths.
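
To make the cold-start concern concrete, here is a minimal sketch of what happens when a filter drops every terminal run. The `Run` type, the `CleanedUp` field, and the `naiveFilter` and `enqueueAll` helpers are simplified stand-ins invented for illustration; they are not Tekton or Knative types, and the real informer machinery is more involved.

```go
package main

import "fmt"

// ConditionStatus mirrors the three values a Kubernetes-style condition can take.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

// Run is a simplified stand-in for a PipelineRun; it is not the Tekton type.
type Run struct {
	Name      string
	Succeeded ConditionStatus // empty means the condition has not been set yet
	CleanedUp bool            // hypothetical: post-completion housekeeping already ran
}

// IsDone reports whether the run is terminal: Succeeded is set and not Unknown.
func (r Run) IsDone() bool {
	return r.Succeeded != "" && r.Succeeded != ConditionUnknown
}

// naiveFilter drops every terminal run before it reaches the work queue.
func naiveFilter(r Run) bool { return !r.IsDone() }

// enqueueAll simulates a cold start: every existing object arrives as an add
// event and is enqueued only if the filter accepts it.
func enqueueAll(runs []Run, filter func(Run) bool) []string {
	var queued []string
	for _, r := range runs {
		if filter(r) {
			queued = append(queued, r.Name)
		}
	}
	return queued
}

func main() {
	runs := []Run{
		{Name: "running", Succeeded: ConditionUnknown},
		{Name: "done-cleaned", Succeeded: ConditionTrue, CleanedUp: true},
		{Name: "done-not-cleaned", Succeeded: ConditionFalse, CleanedUp: false},
	}
	// "done-not-cleaned" is terminal but still needs housekeeping. The naive
	// filter never enqueues it, so the post-completion path would not run.
	fmt.Println(enqueueAll(runs, naiveFilter))
}
```

In this sketch only the in-flight run is enqueued. The terminal run whose housekeeping is still pending is indistinguishable from the one that is fully settled, which is the gap a separate "post-run complete" signal would close.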

## Proposal direction (for discussion)
We see three related ideas:
1. Distinguish "pipeline terminal" from "controller post-run work complete" in the API, e.g.:
   - Keep `Succeeded` as the user-facing terminal outcome, with `completionTime` meaning pipeline completion.
   - Add an explicit status signal (e.g. a new condition or a dedicated field) meaning post-run controller work (cleanup, etc.) is done (or failed in a defined way).
2. Use that signal to drive safe filtering: e.g. do not enqueue, and do not keep in a hot watch path, objects that are both terminal and post-run complete, while still handling migration and first-time processing for older objects without the new status.
3. Optionally mirror that state to a controller-managed label (e.g. under tekton.dev/…) only if we need an efficient labelSelector for informers or watches: status as source of truth, label as a denormalized index if required.
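
A rough sketch of how these pieces could fit together, under stated assumptions: the condition type `PostRunComplete`, the label key `example.dev/post-run-complete`, and the `Run`, `shouldEnqueue`, and `syncLabel` names are all hypothetical placeholders chosen for illustration, not existing Tekton API. The point is only to show that objects lacking the new signal (older runs) keep being processed.

```go
package main

import "fmt"

type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

const (
	condSucceeded       = "Succeeded"
	condPostRunComplete = "PostRunComplete"               // hypothetical condition type
	labelPostRun        = "example.dev/post-run-complete" // hypothetical label key
	// activeSelector is a label selector that matches objects without the
	// label as well as objects whose label value is not "true".
	activeSelector = labelPostRun + "!=true"
)

// Run is a simplified stand-in for a PipelineRun; it is not the Tekton type.
type Run struct {
	Name       string
	Conditions map[string]ConditionStatus
	Labels     map[string]string
}

// settled reports whether the condition exists and is no longer Unknown.
// True means success; False means failure in a defined way.
func (r Run) settled(condType string) bool {
	s, ok := r.Conditions[condType]
	return ok && s != ConditionUnknown
}

// shouldEnqueue skips only runs that are terminal AND post-run complete.
// Older objects without the new condition are still enqueued.
func shouldEnqueue(r Run) bool {
	return !(r.settled(condSucceeded) && r.settled(condPostRunComplete))
}

// syncLabel mirrors the status signal into a label so that a selector can
// exclude the object. Status stays the source of truth.
func syncLabel(r *Run) {
	if r.settled(condSucceeded) && r.settled(condPostRunComplete) {
		if r.Labels == nil {
			r.Labels = map[string]string{}
		}
		r.Labels[labelPostRun] = "true"
	}
}

func main() {
	runs := []Run{
		{Name: "running", Conditions: map[string]ConditionStatus{condSucceeded: ConditionUnknown}},
		{Name: "legacy-done", Conditions: map[string]ConditionStatus{condSucceeded: ConditionTrue}},
		{Name: "fully-settled", Conditions: map[string]ConditionStatus{
			condSucceeded: ConditionTrue, condPostRunComplete: ConditionTrue}},
	}
	for i := range runs {
		syncLabel(&runs[i])
		fmt.Printf("%s enqueue=%v labels=%v\n", runs[i].Name, shouldEnqueue(runs[i]), runs[i].Labels)
	}
	fmt.Println("selector for the hot path:", activeSelector)
}
```

Using a `!=` selector rather than an equality match means unlabeled objects, including runs created before such a change, stay in the watch until the controller has explicitly marked them.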

We are open to alternatives if maintainers prefer not to extend the API or have a better solution.

### Questions for maintainers / community
- Is scalability of completed PipelineRun reconciliation a problem Tekton wants to address in the core controller?
- Would a status-level distinction between pipeline complete and controller post-run complete be acceptable, or is it considered too heavy for a TEP?
- Are there preferred patterns already planned (e.g. similar to other Tekton controllers) we should align with?
- What compatibility constraints (version skew, downgrade) should any design respect?

Thank you for reading; we appreciate any guidance on whether to turn this into a TEP, a smaller design sketch, or incremental PRs, and on what the maintainers consider in scope for `tektoncd/pipeline`. 🙇


Scalability and reconcile behavior for completed PipelineRuns #9686
