Scalability and reconcile behavior for completed PipelineRuns #9686

@Maximilien-R

Summary

Today we retain PipelineRun objects for about 24 hours, which keeps roughly 2k runs in scope. We would like to extend retention to at least 7 days, which we expect to bring the steady-state count to around 14k PipelineRuns (same workload profile, longer history).

The main driver is operational: we want completed runs to remain visible longer so teams can use the Tekton Dashboard for debugging. Because Tekton Dashboard is not compatible with Tekton Results, we have to rely on in-cluster PipelineRun retention for historical visibility.

That retention change amplifies concerns we already have about controller and watch scalability (memory, reconcile churn, etcd pressure) as finished runs accumulate.

Operational need

  • Move from roughly 2k to roughly 14k PipelineRuns (with a longer TTL, e.g. 7 days).
  • Keep steady-state controller cost and memory under control as completed runs accumulate.
  • Avoid unnecessary work on runs that are finished and no longer need controller reconciliation.

What we observed (reconcile + "done")

  • IsDone() is based on the Succeeded condition no longer being Unknown (terminal success/failure from a pipeline perspective).
  • Post-completion work (e.g. Affinity Assistant / PVC cleanup and related updates) runs in the IsDone() branch of ReconcileKind, after the run is already terminal from a user/API perspective.
  • The PipelineRun informer effectively tracks all PipelineRuns in scope. Completed runs remain in the cache; events (labels, annotations, resyncs, related object updates) can still enqueue work. For large numbers of historical runs, that means a lot of reconcile churn on objects that are already terminal, with limited benefit.
  • Filtering events so that terminal runs are never enqueued is subtle: on cold start, runs that are already done may only appear as adds to the informer. If the filter drops "done" runs entirely, the controller might never run the post-completion path for those objects unless another mechanism guarantees it. In other words, "done" (pipeline) and "controller finished housekeeping" are not the same concept.
  • Separately, approaches like FilteredInformer / label selectors usually need a stable criterion on the object that the API server can select on (often labels), while clear semantics for "pipeline finished" vs "controller finished post-processing" often point to status (e.g. an additional condition or field).
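
To make the "done" vs "housekeeping finished" distinction concrete, here is a minimal, self-contained sketch. It does not use the Tekton or Knative types: `Run` is a simplified stand-in, and the `PostRunComplete` condition is hypothetical (it does not exist in Tekton today). `isDone` follows our reading of IsDone() above (Succeeded is set and no longer Unknown).

```go
package main

import "fmt"

// ConditionStatus mirrors the three Kubernetes condition states.
type ConditionStatus string

const (
	ConditionTrue    ConditionStatus = "True"
	ConditionFalse   ConditionStatus = "False"
	ConditionUnknown ConditionStatus = "Unknown"
)

const (
	condSucceeded = "Succeeded"
	// Hypothetical condition type, used only for illustration.
	condPostRunComplete = "PostRunComplete"
)

// Run is a simplified stand-in for a PipelineRun, not the Tekton type.
type Run struct {
	Name       string
	Conditions map[string]ConditionStatus // condition type -> status
}

// isDone reports whether Succeeded is set and no longer Unknown.
func isDone(r Run) bool {
	s, ok := r.Conditions[condSucceeded]
	return ok && s != ConditionUnknown
}

// postRunComplete reports whether the hypothetical housekeeping
// condition has been set to True by the controller.
func postRunComplete(r Run) bool {
	return r.Conditions[condPostRunComplete] == ConditionTrue
}

// naiveFilter enqueues only runs that are not terminal. On cold start it
// would also drop terminal runs whose cleanup never ran.
func naiveFilter(r Run) bool { return !isDone(r) }

// safeFilter drops a run only when it is terminal AND housekeeping is done.
// Objects without the new condition are still enqueued (migration case).
func safeFilter(r Run) bool { return !(isDone(r) && postRunComplete(r)) }

func main() {
	runs := []Run{
		{Name: "running", Conditions: map[string]ConditionStatus{condSucceeded: ConditionUnknown}},
		{Name: "done-needs-cleanup", Conditions: map[string]ConditionStatus{condSucceeded: ConditionTrue}},
		{Name: "done-and-cleaned", Conditions: map[string]ConditionStatus{
			condSucceeded:       ConditionFalse,
			condPostRunComplete: ConditionTrue,
		}},
	}
	for _, r := range runs {
		fmt.Printf("%-20s naive=%-5t safe=%t\n", r.Name, naiveFilter(r), safeFilter(r))
	}
}
```

The interesting row is `done-needs-cleanup`: the naive filter drops it, while the filter keyed on the extra signal still enqueues it once so the post-completion path can run.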

⚠️ We do not claim the above is exhaustive; it reflects our reading of pkg/reconciler/pipelinerun and related code paths.

Proposal direction (for discussion)

We see three related ideas, the last one optional:

  1. Distinguish "pipeline terminal" from "controller post-run work complete" in the API, e.g.:
     • Keep Succeeded as the user-facing terminal outcome and completionTime meaning pipeline completion.
     • Add an explicit status signal (e.g. a new condition or a dedicated field) meaning post-run controller work (cleanup, etc.) is done (or failed in a defined way).
  2. Use that signal to:
     • Drive safe filtering: e.g. do not enqueue / do not keep in a hot watch path objects that are terminal + post-run complete, while still handling migration and first-time processing for older objects without the new status.
  3. Optionally mirror that state to a controller-managed label (e.g. under tekton.dev/…) only if we need an efficient labelSelector for informers or watches: status as source of truth, label as a denormalized index if required.
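
As a rough illustration of the "status as source of truth, label as index" idea, the sketch below derives a label from status and models a `key!=true` selector. Everything here is an assumption for discussion: the label key `example.dev/post-run-complete` is a placeholder (not a proposed name), `RunState` is a simplified stand-in for PipelineRun status, and the selector is modeled in plain Go rather than through client-go.

```go
package main

import "fmt"

// indexLabel is a placeholder key used only for illustration.
const indexLabel = "example.dev/post-run-complete"

// RunState is a simplified stand-in for the relevant PipelineRun status.
type RunState struct {
	Terminal        bool // Succeeded is no longer Unknown
	PostRunComplete bool // hypothetical "housekeeping finished" signal
}

// syncIndexLabel returns a copy of labels with the index label derived
// from status. Status stays the source of truth; the label is recomputed.
func syncIndexLabel(labels map[string]string, s RunState) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	if s.Terminal && s.PostRunComplete {
		out[indexLabel] = "true"
	} else {
		delete(out, indexLabel)
	}
	return out
}

// selectedByWatch models the label selector "example.dev/post-run-complete!=true".
// Kubernetes "!=" selectors also match objects that lack the key entirely,
// so older objects without the label stay in the watch.
func selectedByWatch(labels map[string]string) bool {
	return labels[indexLabel] != "true"
}

func main() {
	legacy := map[string]string{"app": "ci"} // created before the label existed
	cleaned := syncIndexLabel(legacy, RunState{Terminal: true, PostRunComplete: true})
	pending := syncIndexLabel(legacy, RunState{Terminal: true, PostRunComplete: false})

	fmt.Println("legacy watched: ", selectedByWatch(legacy))
	fmt.Println("pending watched:", selectedByWatch(pending))
	fmt.Println("cleaned watched:", selectedByWatch(cleaned))
}
```

With a selector of this shape, unlabeled legacy runs remain visible to the controller, which covers the migration case; only runs the controller has explicitly marked drop out of the hot watch path.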

We are open to alternatives if maintainers prefer not to extend the API or have a better solution.

Questions for maintainers / community

  • Is scalability of completed PipelineRun reconciliation a problem Tekton wants to address in the core controller?
  • Would a status-level distinction between pipeline complete and controller post-run complete be acceptable, or is it considered too heavy for a TEP?
  • Are there preferred patterns already planned (e.g. similar to other Tekton controllers) we should align with?
  • What compatibility constraints (version skew, downgrade) should any design respect?

Thank you for reading; we appreciate any guidance on whether to turn this into a TEP, a smaller design sketch, or incremental PRs, and what the maintainers consider in scope for tektoncd/pipeline 🙇 .
