-
Notifications
You must be signed in to change notification settings - Fork 42
GREP 285: Training job support #446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
00c6e1d
99c29d0
d41a872
0eea29e
9a8b154
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,234 @@ | ||
| # GREP-0285: Training Job Support for PodCliqueSet | ||
|
|
||
| <!-- toc --> | ||
| - [Summary](#summary) | ||
| - [Motivation](#motivation) | ||
| - [Goals](#goals) | ||
| - [Non-Goals](#non-goals) | ||
| - [Proposal](#proposal) | ||
| - [Limitations/Risks & Mitigations](#limitationsrisks--mitigations) | ||
| - [Design Details](#design-details) | ||
| - [WorkloadType Field](#workloadtype-field) | ||
| - [Phase](#phase) | ||
| - [trainingSpec Fields](#trainingspec-fields) | ||
| - [terminationDelay](#terminationdelay) | ||
| - [Controller Changes](#controller-changes) | ||
| - [Defaulting Webhook](#defaulting-webhook) | ||
| - [Validating Webhook](#validating-webhook) | ||
| - [PodClique Controller](#podclique-controller) | ||
| - [PodCliqueSet Controller](#podcliqueset-controller) | ||
| - [PodCliqueScalingGroup Controller](#podcliquescalinggroup-controller) | ||
| - [New Status Fields](#new-status-fields) | ||
| - [State Persistence](#state-persistence) | ||
| - [Monitoring](#monitoring) | ||
| - [Test Plan](#test-plan) | ||
| - [Graduation Criteria](#graduation-criteria) | ||
| <!-- /toc --> | ||
|
|
||
| ## Summary | ||
|
|
||
| Grove currently supports inference workloads exclusively. This proposal extends the `PodCliqueSet` API to support training jobs — finite workloads that complete normally once work is done. It introduces a `workloadType` field to distinguish training from inference, adds lifecycle phase tracking, and provides controls for runtime limits, restart budgets, and gang-aware failure handling. | ||
|
|
||
| ## Motivation | ||
|
|
||
| Interest in grove and its adoption is growing in the community. Grove's unique capabilities — topology-aware scheduling, gang scheduling, clique startup ordering, and MNNVL support — are valuable not just for inference but also for distributed training workloads. The current `PodClique` controller assumes pods run indefinitely and recreates them when they exit, which is incompatible with training jobs where pods exiting successfully is the expected terminal state. | ||
|
|
||
| ### Goals | ||
|
|
||
| - Introduce a `workloadType` field to distinguish `Inference` (default, no behavior change) from `Training` workloads. | ||
| - Add a `phase` field to `PodCliqueSetStatus` representing the workload lifecycle: `Pending`, `Running`, `Succeeded`, `Failed`. | ||
| - Add `maxRuntime` and `maxRestarts` controls under a `trainingSpec` stanza. | ||
| - Default `terminationDelay` to `0` for training workloads for immediate failure response. | ||
| - Disable rolling updates and replica count changes for training workloads via webhook enforcement. | ||
|
|
||
| ### Non-Goals | ||
|
|
||
| - Per-clique failure or success policies (e.g., tolerate worker failures but not launcher failures). | ||
| - Suspend/resume support. | ||
|
|
||
| ## Proposal | ||
|
|
||
| A new `workloadType` field is added to `PodCliqueSetSpec`. When set to `Training`, the operator changes its behavior in the following ways: | ||
|
|
||
| - Pods that exit with code 0 are not recreated. When all pods in a replica complete successfully, the replica is marked `Succeeded`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What happens if we support in-place update in the near future. If we in-place update the pod, would we consider it as success or fail? For this kind of action/job, we shouldn't block the recreate, right? |
||
| - A `trainingSpec` stanza provides `maxRuntime` (maximum allowed duration) and `maxRestarts` (total restart budget across all replicas). Exceeding either terminates all pods and marks the workload `Failed`. | ||
| - `terminationDelay` defaults to `0` instead of 4 hours, so a pod failure triggers immediate teardown rather than waiting for the scheduler to reschedule. | ||
| - Rolling updates and replica count changes are rejected by the validation webhook for the lifetime of the workload. | ||
|
|
||
| The `phase` field in `PodCliqueSetStatus` reflects the current lifecycle state. A `Failed` condition with a structured `Reason` is maintained for programmatic failure cause disambiguation. | ||
|
|
||
| ### Limitations/Risks & Mitigations | ||
|
|
||
| - **Backward compatibility:** `workloadType` defaults to `Inference`, so all existing `PodCliqueSet` resources are unaffected. | ||
| - **`terminationDelay` defaulting:** The change to default `terminationDelay: 0` only applies when `workloadType: Training`. Inference workloads retain the existing 4-hour default. | ||
|
|
||
| ## Design Details | ||
|
|
||
| ### WorkloadType Field | ||
|
|
||
| ```go | ||
| type WorkloadType string | ||
|
|
||
| const ( | ||
| // WorkloadTypeInference is the default. Pods run indefinitely; failed pods are recreated. | ||
| WorkloadTypeInference WorkloadType = "Inference" | ||
|
|
||
| // WorkloadTypeTraining marks a finite workload. Pods that exit successfully are not recreated. | ||
| // Rolling updates and replica count changes are disabled. | ||
| WorkloadTypeTraining WorkloadType = "Training" | ||
| ) | ||
| ``` | ||
|
|
||
| ```yaml | ||
| spec: | ||
| workloadType: Training # default: Inference | ||
| replicas: 1 | ||
| template: | ||
| ... | ||
| ``` | ||
|
|
||
| ### Phase | ||
|
|
||
| ```go | ||
| type PodCliqueSetPhase string | ||
|
|
||
| const ( | ||
| PodCliqueSetPhasePending PodCliqueSetPhase = "Pending" | ||
| PodCliqueSetPhaseRunning PodCliqueSetPhase = "Running" | ||
| PodCliqueSetPhaseSucceeded PodCliqueSetPhase = "Succeeded" | ||
| PodCliqueSetPhaseFailed PodCliqueSetPhase = "Failed" | ||
| ) | ||
| ``` | ||
|
|
||
| Phase semantics: | ||
| - **Pending** — Pods not yet scheduled. | ||
| - **Running** — At least one replica's pods are running. Due to grove's gang-scheduling, if any pod is running then all pods in that replica are running. | ||
| - **Succeeded** — All replicas completed successfully. (Training only.) | ||
| - **Failed** — `maxRestarts` exhausted or `maxRuntime` exceeded. All pods are terminated before the phase is set. (Training only.) | ||
|
|
||
| `Pending` and `Running` apply to both workload types. `Succeeded` and `Failed` are only meaningful for `WorkloadTypeTraining`. | ||
|
|
||
| ### trainingSpec Fields | ||
|
|
||
| ```go | ||
| type TrainingSpec struct { | ||
| // MaxRuntime is the maximum duration the workload is allowed to run. | ||
| // If exceeded, all pods across all PodCliques are terminated and the workload is marked Failed. | ||
| // +optional | ||
| MaxRuntime *metav1.Duration `json:"maxRuntime,omitempty"` | ||
|
|
||
| // MaxRestarts is the total number of times the operator will restart failed replicas | ||
| // across the entire PodCliqueSet before marking it as Failed. | ||
| // A restart recreates all pods across all PodCliques in the affected replica. | ||
| // Default: 0 (no restarts — any failure immediately fails the workload). | ||
| // +optional | ||
| MaxRestarts *int32 `json:"maxRestarts,omitempty"` | ||
| } | ||
| ``` | ||
|
|
||
| ```yaml | ||
| spec: | ||
| workloadType: Training | ||
| trainingSpec: | ||
| maxRuntime: 24h | ||
| maxRestarts: 2 | ||
| ``` | ||
|
|
||
| A "restart" means deleting and recreating all pods across all PodCliques in the affected replica. The counter is a single total across all replicas. The PodCliqueSet is marked `Failed` when the total restart count exceeds `maxRestarts`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we provide |
||
|
|
||
| ### terminationDelay | ||
|
|
||
| `terminationDelay` (default: 4 hours) is the grace period grove gives after `MinAvailableBreached` before tearing down the gang. For inference workloads, this allows time for transient pod failures to recover. For training workloads, a missing worker means the job cannot make progress — a 4-hour wait wastes reserved GPU capacity and delays the retry. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A training job can have several part, each part fail may don't affect other parts, all we need to do is to restart the part of job. A missing worker which may not mean the job can't make progress. |
||
|
|
||
| When `workloadType: Training`, `terminationDelay` defaults to `0`. The operator immediately tears down the replica and begins a restart (or marks it `Failed` if `maxRestarts` is exhausted). Users can override this with an explicit value if their framework tolerates brief pod gaps. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Could we add multiple level of Tear down the replica of PCS/PCSG/PCLQ |
||
|
|
||
| ### Controller Changes | ||
|
|
||
| #### Defaulting Webhook | ||
|
|
||
| - Default `terminationDelay` to `0` when `workloadType: Training` (instead of 4 hours). | ||
| - Default pod `restartPolicy` to `Never` when `workloadType: Training` (instead of `Always`). For training, the operator controls restarts at the job level; kubelet-level pod restarts would interfere with the operator's failure detection and restart counting. | ||
|
|
||
| #### Validating Webhook | ||
|
|
||
| - Reject pod template spec changes that would produce a new generation hash (i.e., would trigger a rolling update) for Training workloads. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should support in-place update which could change spec. Maybe we should allow this action |
||
| - `spec.replicas` is immutable at all levels (PodCliqueSet, PodCliqueScalingGroup, PodClique) for Training workloads. The webhook reads `workloadType` from the parent PodCliqueSet via owner reference at validation time. | ||
| - Reject any `ScaleConfig` on PodClique or PodCliqueScalingGroup specs when `workloadType: Training`. HPA creation is driven entirely by the presence of `ScaleConfig` in `computeExpectedHPAs` — rejecting it at admission means no HPA will ever be created for training workloads, with no special-casing needed in the controller. | ||
|
|
||
| #### PodClique Controller | ||
|
|
||
| - Before recreating a pod, check its exit code. If `exitCode == 0`, do not recreate it. | ||
| - When all pods in the PodClique have exited with code 0, set a `Succeeded` condition on `PodClique.Status.Conditions`. This must be written **before** any pod cleanup, so the state is durable across operator restarts. | ||
|
|
||
| #### PodCliqueSet Controller | ||
|
|
||
| The PCS controller is responsible for lifecycle orchestration in training mode: | ||
|
|
||
| - **Phase computation** — on every reconcile, aggregate PodClique statuses to derive the phase: all pods pending → `Pending`; any replica running → `Running`; all PodCliques have `Succeeded` condition → `Succeeded`; restart budget exhausted or `maxRuntime` exceeded → `Failed`. | ||
| - **maxRuntime enforcement** — record `startTime` in PCS status when phase first transitions to `Running`. `startTime` is never reset on restarts — `maxRuntime` is a wall-clock deadline from first start, inclusive of any time spent failing and restarting. This matches the behavior of Kubernetes Job's `activeDeadlineSeconds` and Kubeflow's `runPolicy.activeDeadlineSeconds`, and bounds total GPU consumption regardless of retry count. On each reconcile, check elapsed time; if elapsed > `maxRuntime`, terminate all pods and set phase `Failed`. Use `requeueAfter(remainingDuration)` to avoid busy-polling. | ||
| - **Restart orchestration** — when a PodClique has `MinAvailableBreached: True`, check `status.restartCount` vs `maxRestarts`. If budget remains, delete all pods across all PodCliques in the replica, increment `restartCount`, and emit a `ReplicaRestarting` event. If exhausted, terminate all remaining pods, set phase `Failed`, and emit `MaxRestartsExceeded`. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. One PodClique has |
||
| - **No HPA** — HPA creation requires `ScaleConfig` to be present on a PodClique or PodCliqueScalingGroup. Since the validating webhook rejects `ScaleConfig` for Training workloads, `computeExpectedHPAs` will always return an empty list — no controller-level special-casing needed. | ||
| - **Disable rolling updates** — skip generation hash / rolling update progression logic when `workloadType: Training`. | ||
|
|
||
| #### PodCliqueScalingGroup Controller | ||
|
|
||
| - Skip rolling update processing when `workloadType: Training`. | ||
|
|
||
| #### New Status Fields | ||
|
|
||
| ```go | ||
| type PodCliqueSetStatus struct { | ||
| // existing fields... | ||
| Phase PodCliqueSetPhase // Pending / Running / Succeeded / Failed | ||
| RestartCount int32 // total restarts consumed across all replicas | ||
| StartTime *metav1.Time // when phase first became Running; used for maxRuntime tracking | ||
| } | ||
| ``` | ||
|
|
||
| ### State Persistence | ||
|
|
||
| Grove can be restarted at any time; no in-memory state can be relied upon. The following persistence chain ensures the workload lifecycle is always recoverable from the Kubernetes API: | ||
|
|
||
| - **Pod level** — `pod.Status.Phase` and `pod.Status.ContainerStatuses[].State.Terminated.ExitCode` are stored in etcd by Kubernetes. On any reconcile, the controller can re-derive which pods have succeeded by listing owned pods. | ||
| - **PodClique level** — the PodClique controller writes a `Succeeded` condition to `PodClique.Status.Conditions` (etcd-backed) when all its pods exit with code 0. This must happen before any pod deletion so the state is not lost if pods are cleaned up. | ||
| - **PodCliqueSet level** — the PCS controller aggregates PodClique conditions into `PodCliqueSet.Status.Phase` (etcd-backed). `RestartCount` and `StartTime` are also stored in PCS status, making the full restart budget and runtime calculation recoverable after a restart. | ||
|
|
||
| The full chain is: **pod status (etcd) → PodClique `Succeeded` condition (etcd) → PodCliqueSet phase + restartCount + startTime (etcd)**. | ||
|
|
||
| ### Monitoring | ||
|
|
||
| **Condition:** | ||
|
|
||
| A single `Failed` condition is maintained on `PodCliqueSetStatus.Conditions`, adding detail the phase field cannot express: | ||
|
|
||
| | Type | Status | Reason | Set when | | ||
| |---|---|---|---| | ||
| | `Failed` | `True` | `MaxRestartsExceeded` | Restart budget exhausted | | ||
| | `Failed` | `True` | `MaxRuntimeExceeded` | Runtime limit elapsed | | ||
|
|
||
| The `Reason` field disambiguates the cause of failure, and `LastTransitionTime` records when the workload failed — neither of which the phase field carries. | ||
|
|
||
| **Events:** | ||
|
|
||
| | Reason | Type | When | | ||
| |---|---|---| | ||
| | `PodCliqueFailed` | `Warning` | A PodClique's `MinAvailable` is breached in a training workload | | ||
| | `ReplicaRestarting` | `Normal` | A replica restart is triggered; message includes current restart count | | ||
| | `MaxRestartsExceeded` | `Warning` | Restart budget exhausted; workload transitioning to `Failed` | | ||
| | `MaxRuntimeExceeded` | `Warning` | `maxRuntime` elapsed; all pods being terminated | | ||
| | `WorkloadSucceeded` | `Normal` | All replicas completed successfully | | ||
|
|
||
| ### Test Plan | ||
|
|
||
| E2e tests should cover: | ||
| - Successful completion: all pods exit 0 → phase `Succeeded`. | ||
| - Single pod failure with `maxRestarts: 0` → immediate `Failed`. | ||
| - Restart and eventual success within `maxRestarts` budget. | ||
| - `maxRuntime` exceeded → all pods terminated, phase `Failed`, event `MaxRuntimeExceeded`. | ||
| - Webhook rejection of rolling updates and replica count changes for training workloads. | ||
|
|
||
| ### Graduation Criteria | ||
|
|
||
| **Alpha:** MVP feature set implemented and covered by e2e tests — `workloadType`, phase, `trainingSpec`, `terminationDelay` defaulting, webhook enforcement. | ||
|
|
||
| **GA:** Stable after community feedback and at least one release of alpha usage. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we set terminationDelay default to 0, does it means that gang termination be triggered when one pod fail/deleted? Wouldn't we wait for the failed/deleted pod to be created again?