Open
Description
When a SandboxClaim cannot successfully provision a Sandbox or Pod, the controller continues to reconcile the resource indefinitely. In large-scale environments, this causes several problems:
- Stale SandboxClaims accumulate in etcd and consume kube-apiserver bandwidth.
- There is no clear way to distinguish between a transient error and a persistent failure without a time-bound threshold.
- Users must manually identify and delete reconciling claims that have stalled.
Proposed Solution:
Introduce a timeout mechanism for the SandboxClaim lifecycle:
- Add a configurable timeout after which a claim is considered failed.
- If the Sandbox and Pod are not ready within the timeout period, the controller should take action to stop the reconciliation loop.
Open Questions:
Possible options for how to configure a timeout:
- Controller Flag: A global timeout applied to all SandboxClaims managed by the controller. This is simpler to implement and manage globally but lacks flexibility for specific workloads.
- CRD Field: Adding a timeout field to the SandboxClaim specification. This allows users to define custom timeouts per request, which may be useful for workloads with varying startup characteristics.
Possible options for desired behavior once a timeout is reached:
- Hard Deletion: The controller deletes the Sandbox, Pod, and the SandboxClaim to immediately free up resources.
- Failed Status: The controller retains the SandboxClaim but updates its status to Failed to provide a clear signal for debugging and observability.
- Failed Status, then Deletion: Same as "Failed Status", but the SandboxClaim is deleted after it has spent X amount of time in the Failed status.