
GREP 285: Training job support#446

Draft
shayasoolin wants to merge 5 commits into ai-dynamo:main from shayasoolin:training-job-support

Conversation

@shayasoolin

/kind documentation
/kind feature

What this PR does / why we need it:

Adds GREP-285 for training job support in PodCliqueSet. The document covers the
proposed API changes, lifecycle phase semantics, trainingSpec fields (maxRuntime, maxRestarts),
terminationDelay defaulting, webhook enforcement, conditions, and events needed to support finite training
workloads alongside existing inference workloads.

Which issue(s) this PR fixes:

Fixes #285

Special notes for your reviewer:

This is a design/proposal PR only — no code changes. Feedback welcome on API naming, maxRestarts counting
semantics, and the terminationDelay defaulting behavior for training workloads.

Does this PR introduce an API change?

Extends the PodCliqueSet API with a workloadType field (Inference/Training),
a phase field in status, and a trainingSpec stanza (maxRuntime, maxRestarts)
to support finite training workloads alongside existing inference workloads.

@copy-pr-bot

copy-pr-bot bot commented Feb 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shayasoolin marked this pull request as draft February 25, 2026 11:20
- Introduce a `workloadType` field to distinguish `Inference` (default, no behavior change) from `Training` workloads.
- Add a `phase` field to `PodCliqueSetStatus` representing the workload lifecycle: `Pending`, `Running`, `Succeeded`, `Failed`.
- Add `maxRuntime` and `maxRestarts` controls under a `trainingSpec` stanza.
- Default `terminationDelay` to `0` for training workloads for immediate failure response.
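Putting the proposed fields together, a PodCliqueSet manifest for a training workload might look like the following sketch (the field names follow this proposal; the `apiVersion`, metadata, and concrete values are illustrative, not taken from the codebase):

```yaml
apiVersion: grove.io/v1alpha1   # illustrative group/version
kind: PodCliqueSet
metadata:
  name: llama-finetune
spec:
  workloadType: Training        # default is Inference (no behavior change)
  trainingSpec:
    maxRuntime: 12h             # wall-clock deadline from first start
    maxRestarts: 2              # total gang restarts before Failed
  # terminationDelay defaults to 0 for Training workloads
```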

If we default `terminationDelay` to 0, does that mean gang termination is triggered as soon as one pod fails or is deleted? Wouldn't we wait for the failed/deleted pod to be recreated first?


A new `workloadType` field is added to `PodCliqueSetSpec`. When set to `Training`, the operator changes its behavior in the following ways:

- Pods that exit with code 0 are not recreated. When all pods in a replica complete successfully, the replica is marked `Succeeded`.

What happens if we support in-place updates in the near future? If we update a pod in place, would that count as a success or a failure? For that kind of action, we shouldn't block recreation, right?

maxRestarts: 2
```

A "restart" means deleting and recreating all pods across all PodCliques in the affected replica. The counter is a single total across all replicas. The PodCliqueSet is marked `Failed` when the total restart count exceeds `maxRestarts`.

Can we also provide a restart action at the PCSG and PCLQ levels?


### terminationDelay

`terminationDelay` (default: 4 hours) is the grace period grove gives after `MinAvailableBreached` before tearing down the gang. For inference workloads, this allows time for transient pod failures to recover. For training workloads, a missing worker means the job cannot make progress — a 4-hour wait wastes reserved GPU capacity and delays the retry.

@kangclzjc commented Mar 3, 2026


A training job can have several parts, and a failure in one part may not affect the others; all we need to do is restart the failed part. A missing worker does not necessarily mean the job can't make progress.



When `workloadType: Training`, `terminationDelay` defaults to `0`. The operator immediately tears down the replica and begins a restart (or marks it `Failed` if `maxRestarts` is exhausted). Users can override this with an explicit value if their framework tolerates brief pod gaps.
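For a framework that tolerates brief pod gaps, the override might look like this (values illustrative):

```yaml
spec:
  workloadType: Training
  terminationDelay: 30s   # override the Training default of 0
  trainingSpec:
    maxRestarts: 3
```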

Could we support multiple levels of teardown, at the PCS/PCSG/PCLQ level?


#### Validating Webhook

- Reject pod template spec changes that would produce a new generation hash (i.e., would trigger a rolling update) for Training workloads.

We should support in-place updates, which could change the spec. Maybe we should allow this action.


- **Phase computation** — on every reconcile, aggregate PodClique statuses to derive the phase: all pods pending → `Pending`; any replica running → `Running`; all PodCliques have `Succeeded` condition → `Succeeded`; restart budget exhausted or `maxRuntime` exceeded → `Failed`.
- **maxRuntime enforcement** — record `startTime` in PCS status when phase first transitions to `Running`. `startTime` is never reset on restarts — `maxRuntime` is a wall-clock deadline from first start, inclusive of any time spent failing and restarting. This matches the behavior of Kubernetes Job's `activeDeadlineSeconds` and Kubeflow's `runPolicy.activeDeadlineSeconds`, and bounds total GPU consumption regardless of retry count. On each reconcile, check elapsed time; if elapsed > `maxRuntime`, terminate all pods and set phase `Failed`. Use `requeueAfter(remainingDuration)` to avoid busy-polling.
- **Restart orchestration** — when a PodClique has `MinAvailableBreached: True`, check `status.restartCount` vs `maxRestarts`. If budget remains, delete all pods across all PodCliques in the replica, increment `restartCount`, and emit a `ReplicaRestarting` event. If exhausted, terminate all remaining pods, set phase `Failed`, and emit `MaxRestartsExceeded`.

When one PodClique has `MinAvailableBreached: True`, we should avoid deleting all pods across all PodCliques.



Development

Successfully merging this pull request may close these issues.

Enhance grove to support training jobs

2 participants