
JobSet completed but Trainer Controller keeps recreating Pods (V2.1.0) #2981


Description

@guojin1037832414

I am using the latest kubeflow/trainer master branch (trainer-controller-manager digest ad75c388dd0f, built ~Nov 15, 2025) with the built-in JobSet v1alpha2 API.
Important note: with the officially released v2.0 (or v2.0.1) images this problem did not exist: after successful completion and TTL cleanup, the JobSet was deleted and the TrainJob controller did NOT recreate it.
The issue only started appearing after I upgraded to the current master/nightly build.
When I configure a custom ClusterTrainingRuntime with:

spec:
  template:
    spec:
      failurePolicy:
        restartStrategy: Recreate
        maxRestarts: 0          # successful completion does NOT trigger recreate
      ttlSecondsAfterFinished: 60
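
For completeness, this is roughly where those fields sit in the full manifest. The runtime name and the replicatedJobs contents are placeholders for my actual setup; failurePolicy and ttlSecondsAfterFinished are plain JobSet v1alpha2 spec fields under spec.template.spec:

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: my-custom-runtime        # placeholder name
spec:
  template:                      # JobSet template
    spec:                        # JobSetSpec (jobset.x-k8s.io/v1alpha2)
      failurePolicy:
        restartStrategy: Recreate
        maxRestarts: 0           # successful completion does NOT trigger recreate
      ttlSecondsAfterFinished: 60
      replicatedJobs:
        - name: node             # trainer Job template elided for brevity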

The training finishes normally → JobSet becomes Completed → no restart (thanks to maxRestarts: 0).
60 seconds later the JobSet is correctly deleted by TTL.
Immediately after deletion, the TrainJob controller reconciles, logs "JobSet not found", and recreates an identical JobSet. The new JobSet instantly becomes Completed again → TTL deletes it → infinite loop.
This causes Pods to be repeatedly created/destroyed every ~60 seconds even though training finished long ago.
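
The churn is easy to observe from the CLI (the namespace and deployment names below are placeholders for my install):

# watch the JobSet and Pods get recreated and deleted every TTL period
kubectl get jobsets,pods -n <trainjob-namespace> -w

# tail the controller logs for the repeated "JobSet not found" reconcile message
kubectl logs -n <controller-namespace> deploy/<trainer-controller-manager> -f | grep "JobSet not found"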
Desired behavior (which worked perfectly in v2.0/v2.0.1):
1. Training finishes successfully.
2. After the configurable TTL, the JobSet and Pods are automatically cleaned up to release GPUs/resources.
3. The TrainJob controller does not recreate the JobSet after TTL cleanup.
4. The TrainJob object can remain for record-keeping.
Question:
Was this behavior changed intentionally after v2.1.0, or is it a regression in master? And is there any way to get the old (desired) behavior back while still using the latest master images?
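
A possible interim workaround I am considering (not yet verified against master): drop ttlSecondsAfterFinished from the runtime and instead delete the TrainJob itself once it finishes, so the JobSet and Pods are garbage-collected through the owner reference. The resource names below are placeholders, and the Complete condition type is an assumption; it should match whatever the TrainJob status actually reports:

# wait for the TrainJob to finish, then delete it to cascade-delete the JobSet and Pods
kubectl wait trainjob/my-trainjob -n <trainjob-namespace> --for=condition=Complete --timeout=24h
kubectl delete trainjob/my-trainjob -n <trainjob-namespace>

The downside is that this loses the TrainJob object I would like to keep for record-keeping, so I would still prefer the old TTL behavior.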
Thank you!
