TrainJob suspend/resume fails with JobSet webhook validation error #3008

@abhijeet-dhumal

What happened?

TrainJob suspend/resume is broken when using the JobSet runtime. Attempting to suspend or resume a TrainJob fails with a webhook validation error, so the suspend state never propagates to the underlying JobSet.

Error message:

admission webhook "vjobset.kb.io" denied the request:
spec.replicatedJobs: Invalid value: []v1alpha2.ReplicatedJob(nil): field is immutable

Root Cause

The controller uses Server-Side Apply (SSA) to update the JobSet. SSA sends an ApplyConfiguration containing all fields, including immutable ones like spec.replicatedJobs. The JobSet webhook validates that immutable fields haven't changed, but it can't distinguish "unchanged" from "changed" in ApplyConfigurations, so it rejects the update.
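
To make the failure mode concrete, here is a simplified model of an immutability check. It is not the JobSet webhook's actual code: replicatedJob and validateImmutable are hypothetical stand-ins, and the comparison uses plain reflect.DeepEqual. The "(nil)" in the error message is consistent with the webhook seeing an empty list in the incoming object, though nothing in this issue confirms that.

```go
package main

import "reflect"

// replicatedJob is a hypothetical, trimmed-down stand-in for a JobSet
// ReplicatedJob; the real type has many more fields.
type replicatedJob struct {
	Name     string
	Replicas int32
}

// validateImmutable is a simplified model of an immutable-field check.
// It returns an empty string when the field is unchanged, otherwise a
// message shaped like the one in the error above.
func validateImmutable(stored, incoming []replicatedJob) string {
	if reflect.DeepEqual(stored, incoming) {
		return ""
	}
	return "spec.replicatedJobs: field is immutable"
}
```

Under this model, an update whose replicatedJobs list arrives as nil is rejected even though the caller only meant to change suspend, while an update that carries an identical list passes.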

Proposed Solution

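When only the suspend field is changing, use a JSON merge patch instead of SSA (client.MergeFrom in controller-runtime builds a JSON merge patch; strategic merge patches are not supported for custom resources such as JobSet):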

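if suspendChanged {
    // Snapshot the current object so the patch carries only the difference.
    patch := client.MergeFrom(oldJobSet.DeepCopy())
    oldJobSet.Spec.Suspend = ptr.To(newSuspend)
    if err := j.client.Patch(ctx, oldJobSet, patch); err != nil {
        return nil, fmt.Errorf("failed to patch JobSet suspend field: %w", err)
    }
    // Nothing left to apply: the suspend change has already been sent.
    return nil, nil
}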

Because the patch body contains only the suspend field, spec.replicatedJobs is left untouched and the webhook's immutability check should pass.
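
For illustration, here is a standard-library sketch of why a merge patch ends up so small. It diffs two JSON documents in the style of RFC 7386 and is only an approximation of what client.MergeFrom produces; jobSet, jobSetSpec, replicatedJob, toMap, diff, and suspendPatch are all hypothetical names, and the types are heavily trimmed compared to the real JobSet API.

```go
package main

import (
	"encoding/json"
	"reflect"
)

// Hypothetical, trimmed-down stand-ins for the JobSet types.
type replicatedJob struct {
	Name     string `json:"name"`
	Replicas int32  `json:"replicas"`
}

type jobSetSpec struct {
	Suspend        *bool           `json:"suspend,omitempty"`
	ReplicatedJobs []replicatedJob `json:"replicatedJobs,omitempty"`
}

type jobSet struct {
	Spec jobSetSpec `json:"spec"`
}

// toMap round-trips a value through JSON so it can be diffed generically.
func toMap(v any) (map[string]any, error) {
	raw, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	out := map[string]any{}
	err = json.Unmarshal(raw, &out)
	return out, err
}

// diff returns a JSON merge patch that turns oldM into newM: changed or
// added keys carry the new value, removed keys carry null.
func diff(oldM, newM map[string]any) map[string]any {
	patch := map[string]any{}
	for k, nv := range newM {
		ov, ok := oldM[k]
		if !ok {
			patch[k] = nv
			continue
		}
		om, oIsMap := ov.(map[string]any)
		nm, nIsMap := nv.(map[string]any)
		if oIsMap && nIsMap {
			// Recurse into nested objects and keep only non-empty diffs.
			if sub := diff(om, nm); len(sub) > 0 {
				patch[k] = sub
			}
			continue
		}
		if !reflect.DeepEqual(ov, nv) {
			patch[k] = nv
		}
	}
	for k := range oldM {
		if _, ok := newM[k]; !ok {
			patch[k] = nil
		}
	}
	return patch
}

// suspendPatch computes the merge patch body for setting spec.suspend.
func suspendPatch(old jobSet, suspend bool) ([]byte, error) {
	updated := old // struct copy; only the Suspend pointer is replaced
	updated.Spec.Suspend = &suspend
	oldM, err := toMap(old)
	if err != nil {
		return nil, err
	}
	newM, err := toMap(updated)
	if err != nil {
		return nil, err
	}
	return json.Marshal(diff(oldM, newM))
}
```

For a JobSet with one replicated job and no suspend value set, suspendPatch(js, true) yields {"spec":{"suspend":true}}: replicatedJobs is identical on both sides, so it never appears in the request body.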

What did you expect to happen?

  • Setting spec.suspend: true on a running TrainJob should suspend the JobSet and terminate pods
  • Setting spec.suspend: false on a suspended TrainJob should resume the JobSet and recreate pods
  • Multiple suspend/resume cycles should work reliably

How can we reproduce it (as minimally and precisely as possible)?

  1. Create a TrainJob with the JobSet runtime
  2. Wait for pods to start running
  3. Update TrainJob: kubectl patch trainjob my-job --type=merge -p '{"spec":{"suspend":true}}'
  4. Observe the webhook error in the controller logs; the JobSet remains unsuspended

Environment

  • Kubernetes version: v1.29
  • Trainer version: v2.1
  • JobSet version: v0.9+

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
