> - Introduce a `workloadType` field to distinguish `Inference` (default, no behavior change) from `Training` workloads.
> - Add a `phase` field to `PodCliqueSetStatus` representing the workload lifecycle: `Pending`, `Running`, `Succeeded`, `Failed`.
> - Add `maxRuntime` and `maxRestarts` controls under a `trainingSpec` stanza.
> - Default `terminationDelay` to `0` for training workloads for immediate failure response.
If we default `terminationDelay` to 0, does that mean gang termination is triggered as soon as one pod fails or is deleted? Shouldn't we wait for the failed/deleted pod to be recreated first?
> A new `workloadType` field is added to `PodCliqueSetSpec`. When set to `Training`, the operator changes its behavior in the following ways:
>
> - Pods that exit with code 0 are not recreated. When all pods in a replica complete successfully, the replica is marked `Succeeded`.
What happens if we support in-place updates in the near future? If we update a pod in place, would that count as a success or a failure? For that kind of action we shouldn't block recreation, right?
> `maxRestarts: 2`
>
> A "restart" means deleting and recreating all pods across all PodCliques in the affected replica. The counter is a single total across all replicas. The PodCliqueSet is marked `Failed` when the total restart count exceeds `maxRestarts`.
Can we also provide a restart action at the PCSG and PCLQ levels?
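For context, the `trainingSpec` fragment quoted above might look like this in a full manifest. Only `workloadType`, `trainingSpec`, `maxRuntime`, and `maxRestarts` come from the proposal; the `apiVersion`, metadata, and value choices are placeholders, not part of the GREP:

```yaml
apiVersion: grove.example/v1alpha1   # placeholder; use the project's actual API group/version
kind: PodCliqueSet
metadata:
  name: pretrain-job                 # illustrative name
spec:
  workloadType: Training             # opt-in; Inference remains the default
  trainingSpec:
    maxRuntime: 24h                  # wall-clock deadline from first start
    maxRestarts: 2                   # total full-replica restarts across all replicas
```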
> ### terminationDelay
>
> `terminationDelay` (default: 4 hours) is the grace period grove gives after `MinAvailableBreached` before tearing down the gang. For inference workloads, this allows time for transient pod failures to recover. For training workloads, a missing worker means the job cannot make progress — a 4-hour wait wastes reserved GPU capacity and delays the retry.
A training job can have several parts, and a failure in one part may not affect the others; all we need to do is restart that part of the job. A missing worker does not necessarily mean the job can't make progress.
> When `workloadType: Training`, `terminationDelay` defaults to `0`. The operator immediately tears down the replica and begins a restart (or marks it `Failed` if `maxRestarts` is exhausted). Users can override this with an explicit value if their framework tolerates brief pod gaps.
Could we add multiple levels of teardown, at the PCS/PCSG/PCLQ replica level?
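A minimal sketch of the override described in the quoted text, assuming `terminationDelay` sits alongside `workloadType` at the spec level (its exact placement is not shown in the excerpt):

```yaml
spec:
  workloadType: Training
  terminationDelay: 30s   # override the Training default of 0 if the framework tolerates brief pod gaps
```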
> #### Validating Webhook
>
> - Reject pod template spec changes that would produce a new generation hash (i.e., would trigger a rolling update) for Training workloads.
We should support in-place updates, which can change the spec. Maybe we should allow this action.
> - **Phase computation** — on every reconcile, aggregate PodClique statuses to derive the phase: all pods pending → `Pending`; any replica running → `Running`; all PodCliques have `Succeeded` condition → `Succeeded`; restart budget exhausted or `maxRuntime` exceeded → `Failed`.
> - **maxRuntime enforcement** — record `startTime` in PCS status when phase first transitions to `Running`. `startTime` is never reset on restarts — `maxRuntime` is a wall-clock deadline from first start, inclusive of any time spent failing and restarting. This matches the behavior of Kubernetes Job's `activeDeadlineSeconds` and Kubeflow's `runPolicy.activeDeadlineSeconds`, and bounds total GPU consumption regardless of retry count. On each reconcile, check elapsed time; if elapsed > `maxRuntime`, terminate all pods and set phase `Failed`. Use `requeueAfter(remainingDuration)` to avoid busy-polling.
> - **Restart orchestration** — when a PodClique has `MinAvailableBreached: True`, check `status.restartCount` vs `maxRestarts`. If budget remains, delete all pods across all PodCliques in the replica, increment `restartCount`, and emit a `ReplicaRestarting` event. If exhausted, terminate all remaining pods, set phase `Failed`, and emit `MaxRestartsExceeded`.
When one PodClique has `MinAvailableBreached: True`, we should avoid deleting all pods across all PodCliques.
/kind documentation
/kind feature
What this PR does / why we need it:
Adds GREP-285 for training job support in PodCliqueSet. The document covers the proposed API changes, lifecycle phase semantics, `trainingSpec` fields (`maxRuntime`, `maxRestarts`), `terminationDelay` defaulting, webhook enforcement, conditions, and events needed to support finite training workloads alongside existing inference workloads.
Which issue(s) this PR fixes:
Fixes #285
Special notes for your reviewer:
This is a design/proposal PR only; no code changes. Feedback welcome on API naming, `maxRestarts` counting semantics, and the `terminationDelay` defaulting behavior for training workloads.
Does this PR introduce an API change?
Extends the `PodCliqueSet` API with a `workloadType` field (Inference/Training), a `phase` field in status, and a `trainingSpec` stanza (`maxRuntime`, `maxRestarts`) to support finite training workloads alongside existing inference workloads.