Description
What happened?
TrainJob suspend/resume functionality is broken when using JobSet runtime. Attempting to suspend or resume a TrainJob fails with a webhook validation error, preventing the suspend state from propagating to the underlying JobSet.
Error message:
admission webhook "vjobset.kb.io" denied the request:
spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob(nil): field is immutable
Root Cause
The controller uses Server-Side Apply (SSA) to update the JobSet. SSA sends an ApplyConfiguration containing all fields, including immutable ones like spec.replicatedJobs. The JobSet webhook validates that immutable fields haven't changed, but it can't distinguish "unchanged" from "changed" in ApplyConfigurations, so it rejects the update.
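The rejection can be illustrated with a simplified immutability check. This is only a sketch: `ReplicatedJob`, `validateImmutable`, and `errImmutable` are hypothetical stand-ins, not the actual JobSet webhook code. It assumes the check is a plain deep comparison between the stored and incoming lists, which is consistent with the `(nil)` value reported in the error above.

```go
package main

import (
	"errors"
	"fmt"
	"reflect"
)

// ReplicatedJob is a hypothetical, trimmed-down stand-in for the JobSet type.
type ReplicatedJob struct {
	Name     string
	Replicas int32
}

// errImmutable is a hypothetical sentinel error used only in this sketch.
var errImmutable = errors.New("field is immutable")

// validateImmutable rejects any update whose replicatedJobs differ from the
// stored ones. A nil incoming list counts as a change when the stored list
// is populated.
func validateImmutable(stored, incoming []ReplicatedJob) error {
	if !reflect.DeepEqual(stored, incoming) {
		return fmt.Errorf("spec.replicatedJobs: Invalid value: %#v: %w", incoming, errImmutable)
	}
	return nil
}

func main() {
	stored := []ReplicatedJob{{Name: "node", Replicas: 2}}

	// Incoming object without replicatedJobs: the check reports a change.
	fmt.Println(validateImmutable(stored, nil))

	// Incoming object with identical replicatedJobs: the check passes.
	fmt.Println(validateImmutable(stored, stored))
}
```

Under this assumption, any request whose decoded `spec.replicatedJobs` does not exactly match the stored value is rejected, even if the caller only intended to toggle `spec.suspend`.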
Proposed Solution
When only the suspend field is changing, use a JSON merge patch (which is what client.MergeFrom produces) instead of SSA:
    if suspendChanged {
        // Snapshot the current object so the patch contains only the diff.
        patch := client.MergeFrom(oldJobSet.DeepCopy())
        oldJobSet.Spec.Suspend = ptr.To(newSuspend)
        if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
            return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
        }
        // Nothing left to apply via SSA for this reconcile.
        return nil, nil
    }
Because the patch body contains only spec.suspend, spec.replicatedJobs is left untouched and the immutable-field validation passes.
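To make the size of such a patch concrete, here is a minimal sketch of a request body that carries only the suspend field, built with the Go standard library. The names `suspendPatch`, `buildSuspendPatch`, and `suspendChanged` are hypothetical helpers for illustration and are not part of the Trainer or JobSet code. Treating an unset suspend value as "not suspended" is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// suspendPatch mirrors the shape of a patch body that touches only spec.suspend.
type suspendPatch struct {
	Spec struct {
		Suspend bool `json:"suspend"`
	} `json:"spec"`
}

// buildSuspendPatch returns the patch body for the desired suspend state.
func buildSuspendPatch(suspend bool) ([]byte, error) {
	var p suspendPatch
	p.Spec.Suspend = suspend
	return json.Marshal(p)
}

// suspendChanged reports whether the desired suspend state differs from the
// current one. A nil pointer is treated as false (assumption: unset means
// not suspended).
func suspendChanged(current, desired *bool) bool {
	cur := current != nil && *current
	des := desired != nil && *desired
	return cur != des
}

func main() {
	body, err := buildSuspendPatch(true)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body)) // prints {"spec":{"suspend":true}}
}
```

The body has the same shape as the kubectl merge patch used in the reproduction steps, and it never mentions spec.replicatedJobs.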
What did you expect to happen?
- Setting spec.suspend: true on a running TrainJob should suspend the JobSet and terminate pods
- Setting spec.suspend: false on a suspended TrainJob should resume the JobSet and recreate pods
- Multiple suspend/resume cycles should work reliably
How can we reproduce it (as minimally and precisely as possible)?
- Create a TrainJob with JobSet runtime
- Wait for pods to start running
- Update TrainJob:
    kubectl patch trainjob my-job --type=merge -p '{"spec":{"suspend":true}}'
- Observe the error in the controller logs; the JobSet remains unsuspended
Environment
- Kubernetes version: v1.29
- Trainer version: v2.1
- JobSet version: v0.9+
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.