
JobSet completed but Trainer Controller keeps recreating Pods (V2.1.0) #2981


Description

@guojin1037832414

I am using the latest kubeflow/trainer master branch (trainer-controller-manager digest ad75c388dd0f, built ~Nov 15, 2025) with the built-in JobSet v1alpha2 API.
Important note: with the officially released v2.0 (or v2.0.1) images this problem did not exist: after successful completion and TTL cleanup, the JobSet was deleted and the TrainJob controller did NOT recreate it.
The issue only started appearing after I upgraded to the current master/nightly build.
When I configure a custom ClusterTrainingRuntime with:

spec:
  template:
    spec:
      failurePolicy:
        restartStrategy: Recreate
        maxRestarts: 0          # successful completion does NOT trigger recreate
      ttlSecondsAfterFinished: 60
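
For completeness, this is roughly where those fields sit in the full manifest. The runtime name and the replicatedJobs contents are placeholders for my actual setup; failurePolicy and ttlSecondsAfterFinished are plain JobSet v1alpha2 spec fields under spec.template.spec:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: my-custom-runtime        # placeholder name
spec:
  template:                      # JobSet template
    spec:                        # JobSetSpec (jobset.x-k8s.io/v1alpha2)
      failurePolicy:
        restartStrategy: Recreate
        maxRestarts: 0           # successful completion does NOT trigger recreate
      ttlSecondsAfterFinished: 60
      replicatedJobs:
        - name: node             # trainer Job template elided for brevity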

The training finishes normally → JobSet becomes Completed → no restart (thanks to maxRestarts: 0).
60 seconds later the JobSet is correctly deleted by TTL.
Immediately after deletion, the TrainJob controller reconciles, logs "JobSet not found", and recreates an identical JobSet. The new JobSet instantly becomes Completed again → TTL deletes it → infinite loop.
This causes Pods to be repeatedly created/destroyed every ~60 seconds even though training finished long ago.
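
The churn is easy to observe from the CLI (the namespace and deployment names below are placeholders for my install):

# watch the JobSet and Pods get recreated and deleted every TTL period
kubectl get jobsets,pods -n <trainjob-namespace> -w

# tail the controller logs for the repeated "JobSet not found" reconcile message
kubectl logs -n <controller-namespace> deploy/<trainer-controller-manager> -f | grep "JobSet not found"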
Desired behavior (which worked perfectly in v2.0/v2.0.1):
1. Training finishes successfully.
2. After the configurable TTL, the JobSet and Pods are automatically cleaned up to release GPUs/resources.
3. The TrainJob controller does not recreate the JobSet after TTL cleanup.
4. The TrainJob object can remain for record-keeping.
Question:
Was this behavior changed intentionally after v2.1.0, or is it a regression in master? And is there any way to get the old (desired) behavior back while still using the latest master images?
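
A possible interim workaround I am considering (not yet verified against master): drop ttlSecondsAfterFinished from the runtime and instead delete the TrainJob itself once it finishes, so the JobSet and Pods are garbage-collected through the owner reference. The resource names below are placeholders, and the Complete condition type is an assumption; it should match whatever the TrainJob status actually reports:

# wait for the TrainJob to finish, then delete it to cascade-delete the JobSet and Pods
kubectl wait trainjob/my-trainjob -n <trainjob-namespace> --for=condition=Complete --timeout=24h
kubectl delete trainjob/my-trainjob -n <trainjob-namespace>

The downside is that this loses the TrainJob object I would like to keep for record-keeping, so I would still prefer the old TTL behavior.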
Thank you!
