# GREP-0285: Training Job Support for PodCliqueSet

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Limitations/Risks &amp; Mitigations](#limitationsrisks--mitigations)
- [Design Details](#design-details)
- [WorkloadType Field](#workloadtype-field)
- [Phase](#phase)
- [trainingSpec Fields](#trainingspec-fields)
- [terminationDelay](#terminationdelay)
- [Controller Changes](#controller-changes)
- [Defaulting Webhook](#defaulting-webhook)
- [Validating Webhook](#validating-webhook)
- [PodClique Controller](#podclique-controller)
- [PodCliqueSet Controller](#podcliqueset-controller)
- [PodCliqueScalingGroup Controller](#podcliquescalinggroup-controller)
- [New Status Fields](#new-status-fields)
- [State Persistence](#state-persistence)
- [Monitoring](#monitoring)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
<!-- /toc -->

## Summary

Grove currently supports inference workloads exclusively. This proposal extends the `PodCliqueSet` API to support training jobs — finite workloads that complete normally once work is done. It introduces a `workloadType` field to distinguish training from inference, adds lifecycle phase tracking, and provides controls for runtime limits, restart budgets, and gang-aware failure handling.

## Motivation

Community interest in Grove and its adoption are growing. Grove's distinctive capabilities — topology-aware scheduling, gang scheduling, clique startup ordering, and MNNVL support — are valuable not just for inference but also for distributed training workloads. The current `PodClique` controller assumes pods run indefinitely and recreates them when they exit, which is incompatible with training jobs, where pods exiting successfully is the expected terminal state.

### Goals

- Introduce a `workloadType` field to distinguish `Inference` (default, no behavior change) from `Training` workloads.
- Add a `phase` field to `PodCliqueSetStatus` representing the workload lifecycle: `Pending`, `Running`, `Succeeded`, `Failed`.
- Add `maxRuntime` and `maxRestarts` controls under a `trainingSpec` stanza.
- Default `terminationDelay` to `0` for training workloads so failures trigger an immediate response.
- Disable rolling updates and replica count changes for training workloads via webhook enforcement.

### Non-Goals

- Per-clique failure or success policies (e.g., tolerate worker failures but not launcher failures).
- Suspend/resume support.

## Proposal

A new `workloadType` field is added to `PodCliqueSetSpec`. When set to `Training`, the operator changes its behavior in the following ways:

- Pods that exit with code 0 are not recreated. When all pods in a replica complete successfully, the replica is marked `Succeeded`.
- A `trainingSpec` stanza provides `maxRuntime` (maximum allowed duration) and `maxRestarts` (total restart budget across all replicas). Exceeding either terminates all pods and marks the workload `Failed`.
- `terminationDelay` defaults to `0` instead of 4 hours, so a pod failure triggers immediate teardown rather than waiting for the scheduler to reschedule.
- Rolling updates and replica count changes are rejected by the validation webhook for the lifetime of the workload.

The `phase` field in `PodCliqueSetStatus` reflects the current lifecycle state. A `Failed` condition with a structured `Reason` is maintained so the failure cause can be disambiguated programmatically.

### Limitations/Risks & Mitigations

- **Backward compatibility:** `workloadType` defaults to `Inference`, so all existing `PodCliqueSet` resources are unaffected.
- **`terminationDelay` defaulting:** The change to default `terminationDelay: 0` only applies when `workloadType: Training`. Inference workloads retain the existing 4-hour default.

## Design Details

### WorkloadType Field

```go
type WorkloadType string

const (
	// WorkloadTypeInference is the default. Pods run indefinitely; failed pods are recreated.
	WorkloadTypeInference WorkloadType = "Inference"

	// WorkloadTypeTraining marks a finite workload. Pods that exit successfully are not recreated.
	// Rolling updates and replica count changes are disabled.
	WorkloadTypeTraining WorkloadType = "Training"
)
```

```yaml
spec:
  workloadType: Training # default: Inference
  replicas: 1
  template:
    ...
```

### Phase

```go
type PodCliqueSetPhase string

const (
	PodCliqueSetPhasePending   PodCliqueSetPhase = "Pending"
	PodCliqueSetPhaseRunning   PodCliqueSetPhase = "Running"
	PodCliqueSetPhaseSucceeded PodCliqueSetPhase = "Succeeded"
	PodCliqueSetPhaseFailed    PodCliqueSetPhase = "Failed"
)
```

Phase semantics:
- **Pending** — Pods not yet scheduled.
- **Running** — At least one replica's pods are running. Due to Grove's gang scheduling, if any pod is running then all pods in that replica are running.
- **Succeeded** — All replicas completed successfully. (Training only.)
- **Failed** — `maxRestarts` exhausted or `maxRuntime` exceeded. All pods are terminated before the phase is set. (Training only.)

`Pending` and `Running` apply to both workload types. `Succeeded` and `Failed` are only meaningful for `WorkloadTypeTraining`.
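
The aggregation can be read as a pure derivation over child PodClique statuses. A minimal sketch, assuming hypothetical helper predicates (`budgetExhausted`, `runtimeExceeded`, `allCliquesSucceeded`, `anyReplicaRunning`); none of these names exist in Grove today:

```go
// derivePhase aggregates child PodClique state into a PodCliqueSetPhase.
// Illustrative sketch only; the helper predicates are assumed, not real Grove code.
func derivePhase(pcs *PodCliqueSet, cliques []PodClique) PodCliqueSetPhase {
	if pcs.Spec.WorkloadType == WorkloadTypeTraining {
		if budgetExhausted(pcs) || runtimeExceeded(pcs) {
			return PodCliqueSetPhaseFailed
		}
		if allCliquesSucceeded(cliques) {
			return PodCliqueSetPhaseSucceeded
		}
	}
	// Gang scheduling guarantees a replica is either fully running or not running at all.
	if anyReplicaRunning(cliques) {
		return PodCliqueSetPhaseRunning
	}
	return PodCliqueSetPhasePending
}
```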

### trainingSpec Fields

```go
type TrainingSpec struct {
	// MaxRuntime is the maximum duration the workload is allowed to run.
	// If exceeded, all pods across all PodCliques are terminated and the workload is marked Failed.
	// +optional
	MaxRuntime *metav1.Duration `json:"maxRuntime,omitempty"`

	// MaxRestarts is the total number of times the operator will restart failed replicas
	// across the entire PodCliqueSet before marking it as Failed.
	// A restart recreates all pods across all PodCliques in the affected replica.
	// Default: 0 (no restarts — any failure immediately fails the workload).
	// +optional
	MaxRestarts *int32 `json:"maxRestarts,omitempty"`
}
```

```yaml
spec:
  workloadType: Training
  trainingSpec:
    maxRuntime: 24h
    maxRestarts: 2
```

A "restart" means deleting and recreating all pods across all PodCliques in the affected replica. The counter is a single total across all replicas. The PodCliqueSet is marked `Failed` when the total restart count exceeds `maxRestarts`.

### terminationDelay

`terminationDelay` (default: 4 hours) is the grace period Grove grants after `MinAvailableBreached` before tearing down the gang. For inference workloads, this allows time for transient pod failures to recover. For training workloads, a missing worker means the job cannot make progress — a 4-hour wait wastes reserved GPU capacity and delays the retry.

When `workloadType: Training`, `terminationDelay` defaults to `0`. The operator immediately tears down the replica and begins a restart (or marks it `Failed` if `maxRestarts` is exhausted). Users can override this with an explicit value if their framework tolerates brief pod gaps.

### Controller Changes

#### Defaulting Webhook

- Default `terminationDelay` to `0` when `workloadType: Training` (instead of 4 hours).
- Default pod `restartPolicy` to `Never` when `workloadType: Training` (instead of `Always`). For training, the operator controls restarts at the job level; kubelet-level pod restarts would interfere with the operator's failure detection and restart counting.
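
A minimal sketch of both defaults. The field paths (`TerminationDelay`, `Template.Cliques[].Spec.PodSpec`) are assumptions about the Grove API shape, not the actual type layout:

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// defaultForTraining applies the training-specific defaults described above.
// Field paths are illustrative assumptions about the Grove API shape.
func defaultForTraining(spec *PodCliqueSetSpec) {
	if spec.WorkloadType != WorkloadTypeTraining {
		return
	}
	if spec.TerminationDelay == nil {
		spec.TerminationDelay = &metav1.Duration{Duration: 0} // instead of the 4h inference default
	}
	for i := range spec.Template.Cliques {
		ps := &spec.Template.Cliques[i].Spec.PodSpec
		if ps.RestartPolicy == "" {
			ps.RestartPolicy = corev1.RestartPolicyNever // kubelet must not restart; the operator counts restarts
		}
	}
}
```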

#### Validating Webhook

- Reject pod template spec changes that would produce a new generation hash (i.e., would trigger a rolling update) for Training workloads.
- `spec.replicas` is immutable at all levels (PodCliqueSet, PodCliqueScalingGroup, PodClique) for Training workloads. The webhook reads `workloadType` from the parent PodCliqueSet via owner reference at validation time.
- Reject any `ScaleConfig` on PodClique or PodCliqueScalingGroup specs when `workloadType: Training`. HPA creation is driven entirely by the presence of `ScaleConfig` in `computeExpectedHPAs` — rejecting it at admission means no HPA will ever be created for training workloads, with no special-casing needed in the controller.
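
A sketch of the update checks, assuming an admission hook that receives old and new specs; `generationHash` is a stand-in for Grove's pod-template hashing, not a real function name:

```go
import "errors"

// validateTrainingUpdate sketches the admission rejections above.
// generationHash is an assumed stand-in for Grove's pod-template hashing.
func validateTrainingUpdate(oldSpec, newSpec *PodCliqueSetSpec) error {
	if newSpec.WorkloadType != WorkloadTypeTraining {
		return nil
	}
	if generationHash(newSpec.Template) != generationHash(oldSpec.Template) {
		return errors.New("pod template is immutable for Training workloads")
	}
	if newSpec.Replicas != oldSpec.Replicas {
		return errors.New("spec.replicas is immutable for Training workloads")
	}
	return nil
}
```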

#### PodClique Controller

- Before recreating a pod, check its exit code. If `exitCode == 0`, do not recreate it.
- When all pods in the PodClique have exited with code 0, set a `Succeeded` condition on `PodClique.Status.Conditions`. This must be written **before** any pod cleanup, so the state is durable across operator restarts.
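
A minimal sketch of the exit-code gate, using only standard `corev1` pod status fields; the helper name is illustrative:

```go
import corev1 "k8s.io/api/core/v1"

// podSucceeded reports whether a pod terminated with exit code 0 in every container.
// Illustrative sketch; in practice pod.Status.Phase == Succeeded captures the same condition.
func podSucceeded(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodSucceeded {
		return true
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated == nil || cs.State.Terminated.ExitCode != 0 {
			return false
		}
	}
	return len(pod.Status.ContainerStatuses) > 0
}
```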

#### PodCliqueSet Controller

The PCS controller is responsible for lifecycle orchestration in training mode:

- **Phase computation** — on every reconcile, aggregate PodClique statuses to derive the phase: all pods pending → `Pending`; any replica running → `Running`; all PodCliques have `Succeeded` condition → `Succeeded`; restart budget exhausted or `maxRuntime` exceeded → `Failed`.
- **maxRuntime enforcement** — record `startTime` in PCS status when phase first transitions to `Running`. `startTime` is never reset on restarts — `maxRuntime` is a wall-clock deadline from first start, inclusive of any time spent failing and restarting. This matches the behavior of Kubernetes Job's `activeDeadlineSeconds` and Kubeflow's `runPolicy.activeDeadlineSeconds`, and bounds total GPU consumption regardless of retry count. On each reconcile, check elapsed time; if elapsed > `maxRuntime`, terminate all pods and set phase `Failed`. Use `requeueAfter(remainingDuration)` to avoid busy-polling. A sketch of this check follows the list.
- **Restart orchestration** — when a PodClique has `MinAvailableBreached: True`, check `status.restartCount` vs `maxRestarts`. If budget remains, delete all pods across all PodCliques in the replica, increment `restartCount`, and emit a `ReplicaRestarting` event. If exhausted, terminate all remaining pods, set phase `Failed`, and emit `MaxRestartsExceeded`.
- **No HPA** — HPA creation requires `ScaleConfig` to be present on a PodClique or PodCliqueScalingGroup. Since the validating webhook rejects `ScaleConfig` for Training workloads, `computeExpectedHPAs` will always return an empty list — no controller-level special-casing needed.
- **Disable rolling updates** — skip generation hash / rolling update progression logic when `workloadType: Training`.
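
A minimal sketch of the `maxRuntime` check with controller-runtime's requeue mechanism; `enforceMaxRuntime` is illustrative, and the caller is assumed to terminate pods and set phase `Failed` when `expired` is true:

```go
import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// enforceMaxRuntime implements the wall-clock deadline check described above.
// Illustrative sketch; the caller handles termination when expired is true.
func enforceMaxRuntime(pcs *PodCliqueSet, now time.Time) (res ctrl.Result, expired bool) {
	ts := pcs.Spec.TrainingSpec
	if ts == nil || ts.MaxRuntime == nil || pcs.Status.StartTime == nil {
		return ctrl.Result{}, false
	}
	elapsed := now.Sub(pcs.Status.StartTime.Time) // StartTime is never reset on restarts
	if elapsed > ts.MaxRuntime.Duration {
		return ctrl.Result{}, true
	}
	// Requeue exactly when the deadline elapses instead of busy-polling.
	return ctrl.Result{RequeueAfter: ts.MaxRuntime.Duration - elapsed}, false
}
```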

#### PodCliqueScalingGroup Controller

- Skip rolling update processing when `workloadType: Training`.

#### New Status Fields

```go
type PodCliqueSetStatus struct {
	// existing fields...
	Phase        PodCliqueSetPhase // Pending / Running / Succeeded / Failed
	RestartCount int32             // total restarts consumed across all replicas
	StartTime    *metav1.Time      // when phase first became Running; used for maxRuntime tracking
}
```

### State Persistence

Grove can be restarted at any time; no in-memory state can be relied upon. The following persistence chain ensures the workload lifecycle is always recoverable from the Kubernetes API:

- **Pod level** — `pod.Status.Phase` and `pod.Status.ContainerStatuses[].State.Terminated.ExitCode` are stored in etcd by Kubernetes. On any reconcile, the controller can re-derive which pods have succeeded by listing owned pods.
- **PodClique level** — the PodClique controller writes a `Succeeded` condition to `PodClique.Status.Conditions` (etcd-backed) when all its pods exit with code 0. This must happen before any pod deletion so the state is not lost if pods are cleaned up.
- **PodCliqueSet level** — the PCS controller aggregates PodClique conditions into `PodCliqueSet.Status.Phase` (etcd-backed). `RestartCount` and `StartTime` are also stored in PCS status, making the full restart budget and runtime calculation recoverable after a restart.

The full chain is: **pod status (etcd) → PodClique `Succeeded` condition (etcd) → PodCliqueSet phase + restartCount + startTime (etcd)**.
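
Because every link in the chain is etcd-backed, a freshly restarted operator can re-derive clique success from the API server alone, as long as the pods still exist (which is exactly why the `Succeeded` condition is written before any cleanup). A minimal sketch with a controller-runtime client; the `grove.io/podclique` label selector is an assumption:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cliqueSucceeded re-derives success for a PodClique purely from the API server,
// e.g. right after an operator restart. The label selector is an assumption.
func cliqueSucceeded(ctx context.Context, c client.Client, pclq *PodClique) (bool, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods,
		client.InNamespace(pclq.Namespace),
		client.MatchingLabels{"grove.io/podclique": pclq.Name}); err != nil {
		return false, err
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodSucceeded {
			return false, nil
		}
	}
	return len(pods.Items) > 0, nil
}
```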

### Monitoring

**Condition:**

A single `Failed` condition is maintained on `PodCliqueSetStatus.Conditions`, adding detail the phase field cannot express:

| Type | Status | Reason | Set when |
|---|---|---|---|
| `Failed` | `True` | `MaxRestartsExceeded` | Restart budget exhausted |
| `Failed` | `True` | `MaxRuntimeExceeded` | Runtime limit elapsed |

The `Reason` field disambiguates the cause of failure, and `LastTransitionTime` records when the workload failed — neither of which the phase field carries.
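
A minimal sketch of writing this condition with apimachinery's standard helper, which fills in `LastTransitionTime` automatically; `markFailed` is illustrative and assumes `Conditions` is a `[]metav1.Condition`:

```go
import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// markFailed records the Failed condition with a structured reason.
// Assumes PodCliqueSetStatus.Conditions is a []metav1.Condition.
func markFailed(pcs *PodCliqueSet, reason, message string) {
	meta.SetStatusCondition(&pcs.Status.Conditions, metav1.Condition{
		Type:    "Failed",
		Status:  metav1.ConditionTrue,
		Reason:  reason, // "MaxRestartsExceeded" or "MaxRuntimeExceeded"
		Message: message,
	})
}
```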

**Events:**

| Reason | Type | When |
|---|---|---|
| `PodCliqueFailed` | `Warning` | A PodClique's `MinAvailable` is breached in a training workload |
| `ReplicaRestarting` | `Normal` | A replica restart is triggered; message includes current restart count |
| `MaxRestartsExceeded` | `Warning` | Restart budget exhausted; workload transitioning to `Failed` |
| `MaxRuntimeExceeded` | `Warning` | `maxRuntime` elapsed; all pods being terminated |
| `WorkloadSucceeded` | `Normal` | All replicas completed successfully |

### Test Plan

E2e tests should cover:
- Successful completion: all pods exit 0 → phase `Succeeded`.
- Single pod failure with `maxRestarts: 0` → immediate `Failed`.
- Restart and eventual success within `maxRestarts` budget.
- `maxRuntime` exceeded → all pods terminated, phase `Failed`, event `MaxRuntimeExceeded`.
- Webhook rejection of rolling updates and replica count changes for training workloads.

### Graduation Criteria

**Alpha:** MVP feature set implemented and covered by e2e tests — `workloadType`, phase, `trainingSpec`, `terminationDelay` defaulting, webhook enforcement.

**GA:** Stable after community feedback and at least one release of alpha usage.