
GREP 285: Training job support#446

Draft
shayasoolin wants to merge 5 commits into ai-dynamo:main from shayasoolin:training-job-support

Conversation

@shayasoolin

/kind documentation
/kind feature

What this PR does / why we need it:

Adds GREP-285 for training job support in PodCliqueSet. The document covers the
proposed API changes, lifecycle phase semantics, trainingSpec fields (maxRuntime, maxRestarts),
terminationDelay defaulting, webhook enforcement, conditions, and events needed to support finite training
workloads alongside existing inference workloads.

Which issue(s) this PR fixes:

Fixes #285

Special notes for your reviewer:

This is a design/proposal PR only — no code changes. Feedback welcome on API naming, maxRestarts counting
semantics, and the terminationDelay defaulting behavior for training workloads.

Does this PR introduce an API change?

Extends the PodCliqueSet API with a workloadType field (Inference/Training),
a phase field in status, and a trainingSpec stanza (maxRuntime, maxRestarts)
to support finite training workloads alongside existing inference workloads.

@copy-pr-bot

copy-pr-bot bot commented Feb 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shayasoolin marked this pull request as draft February 25, 2026 11:20
- Introduce a `workloadType` field to distinguish `Inference` (default, no behavior change) from `Training` workloads.
- Add a `phase` field to `PodCliqueSetStatus` representing the workload lifecycle: `Pending`, `Running`, `Succeeded`, `Failed`.
- Add `maxRuntime` and `maxRestarts` controls under a `trainingSpec` stanza.
- Default `terminationDelay` to `0` for training workloads for immediate failure response.
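Putting the proposed fields together, a PodCliqueSet manifest for a training workload might look like the following sketch (the field names follow this proposal; the `apiVersion`, metadata, and concrete values are illustrative, not taken from the codebase):

```yaml
apiVersion: grove.io/v1alpha1   # illustrative group/version
kind: PodCliqueSet
metadata:
  name: llama-finetune
spec:
  workloadType: Training        # default is Inference (no behavior change)
  trainingSpec:
    maxRuntime: 12h             # wall-clock deadline from first start
    maxRestarts: 2              # total gang restarts before Failed
  # terminationDelay defaults to 0 for Training workloads
```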

If we default `terminationDelay` to 0, does that mean gang termination is triggered as soon as one pod fails or is deleted? Wouldn't we wait for the failed/deleted pod to be recreated first?


A new `workloadType` field is added to `PodCliqueSetSpec`. When set to `Training`, the operator changes its behavior in the following ways:

- Pods that exit with code 0 are not recreated. When all pods in a replica complete successfully, the replica is marked `Succeeded`.

What happens if we support in-place updates in the near future? If we update a pod in place, would that count as a success or a failure? For that kind of action, we shouldn't block recreation, right?

maxRestarts: 2
```

A "restart" means deleting and recreating all pods across all PodCliques in the affected replica. The counter is a single total across all replicas. The PodCliqueSet is marked `Failed` when the total restart count exceeds `maxRestarts`.

Can we also provide a restart action at the PCSG and PCLQ levels?


### terminationDelay

`terminationDelay` (default: 4 hours) is the grace period grove gives after `MinAvailableBreached` before tearing down the gang. For inference workloads, this allows time for transient pod failures to recover. For training workloads, a missing worker means the job cannot make progress — a 4-hour wait wastes reserved GPU capacity and delays the retry.

@kangclzjc commented Mar 3, 2026


A training job can have several parts, and a failure in one part may not affect the others; all we need to do is restart the failed part. A missing worker does not necessarily mean the job can't make progress.



When `workloadType: Training`, `terminationDelay` defaults to `0`. The operator immediately tears down the replica and begins a restart (or marks it `Failed` if `maxRestarts` is exhausted). Users can override this with an explicit value if their framework tolerates brief pod gaps.
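For a framework that tolerates brief pod gaps, the override might look like this (values illustrative):

```yaml
spec:
  workloadType: Training
  terminationDelay: 30s   # override the Training default of 0
  trainingSpec:
    maxRestarts: 3
```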

Could we support multiple levels of teardown, at the PCS/PCSG/PCLQ level?


#### Validating Webhook

- Reject pod template spec changes that would produce a new generation hash (i.e., would trigger a rolling update) for Training workloads.

We should support in-place updates, which could change the spec. Maybe we should allow this action.


- **Phase computation** — on every reconcile, aggregate PodClique statuses to derive the phase: all pods pending → `Pending`; any replica running → `Running`; all PodCliques have `Succeeded` condition → `Succeeded`; restart budget exhausted or `maxRuntime` exceeded → `Failed`.
- **maxRuntime enforcement** — record `startTime` in PCS status when phase first transitions to `Running`. `startTime` is never reset on restarts — `maxRuntime` is a wall-clock deadline from first start, inclusive of any time spent failing and restarting. This matches the behavior of Kubernetes Job's `activeDeadlineSeconds` and Kubeflow's `runPolicy.activeDeadlineSeconds`, and bounds total GPU consumption regardless of retry count. On each reconcile, check elapsed time; if elapsed > `maxRuntime`, terminate all pods and set phase `Failed`. Use `requeueAfter(remainingDuration)` to avoid busy-polling.
- **Restart orchestration** — when a PodClique has `MinAvailableBreached: True`, check `status.restartCount` vs `maxRestarts`. If budget remains, delete all pods across all PodCliques in the replica, increment `restartCount`, and emit a `ReplicaRestarting` event. If exhausted, terminate all remaining pods, set phase `Failed`, and emit `MaxRestartsExceeded`.

When one PodClique has `MinAvailableBreached: True`, we should avoid deleting all pods across all PodCliques.



Development

Successfully merging this pull request may close these issues.

Enhance grove to support training jobs

2 participants