Add more status conditions to TrainJob for better visibility of execution state, such as Running/Pending #2713

@sfeng1996

Description

What you would like to be added?

Feature Request: Enhanced TrainJob Status Conditions
Currently, TrainJob only exposes three status conditions:
Suspended - when the TrainJob is suspended
Complete - when the TrainJob has completed successfully
Failed - when the TrainJob has failed
Proposed additions:
Add Running status condition when the underlying JobSet/Jobs are actively executing
Add Pending status condition when the TrainJob is created but not yet running (e.g., waiting for resources, scheduling, etc.)
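A minimal sketch of what the extended set of condition types could look like. The identifier names are assumptions: the first three are assumed to follow a `TrainJob<State>` pattern for the existing conditions, and `TrainJobPending` / `TrainJobRunning` are hypothetical names for the proposed additions. `IsTerminal` is an illustrative helper, not part of the current API.

```go
package main

import "fmt"

// Condition types for TrainJob. The first three mirror the conditions this
// issue says exist today; the identifier names are assumed for illustration.
const (
	TrainJobSuspended = "Suspended"
	TrainJobComplete  = "Complete"
	TrainJobFailed    = "Failed"

	// Hypothetical additions proposed in this issue.
	TrainJobPending = "Pending"
	TrainJobRunning = "Running"
)

// IsTerminal reports whether a condition type marks the end of execution.
// Pending, Running, and Suspended are treated as non-terminal.
func IsTerminal(conditionType string) bool {
	return conditionType == TrainJobComplete || conditionType == TrainJobFailed
}

func main() {
	for _, c := range []string{TrainJobPending, TrainJobRunning, TrainJobComplete} {
		fmt.Printf("%s terminal=%t\n", c, IsTerminal(c))
	}
}
```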
Implementation details:
Extend the status condition constants in pkg/apis/trainer/v1alpha1/trainjob_types.go
Update the controller logic in pkg/controller/trainjob_controller.go to detect and set these intermediate states
Modify the TerminalCondition method in runtime plugins to also report non-terminal states
Update the +kubebuilder:printcolumn annotation to show these states in kubectl get trainjob output
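The controller-side steps above could be sketched roughly as follows. This is not the existing controller code: `ChildStatus` is a hypothetical, simplified view of the underlying JobSet/Jobs, `Condition` is a stand-in for `metav1.Condition` with only a few fields, and the precedence order in `DeriveState` (terminal states first, then Suspended, Running, Pending) is an assumption about how the intermediate states might be resolved.

```go
package main

import "fmt"

// ChildStatus is a hypothetical, simplified summary of the underlying
// JobSet/Jobs. A real controller would read this from the JobSet object.
type ChildStatus struct {
	Suspended bool
	Active    int // number of actively executing child jobs
	Succeeded bool
	Failed    bool
}

// Condition is a stand-in for metav1.Condition, reduced to the fields used here.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// DeriveState maps the child status to one TrainJob state.
// Assumed precedence: Failed, Complete, Suspended, Running, then Pending.
func DeriveState(s ChildStatus) string {
	switch {
	case s.Failed:
		return "Failed"
	case s.Succeeded:
		return "Complete"
	case s.Suspended:
		return "Suspended"
	case s.Active > 0:
		return "Running"
	default:
		// Created but nothing is executing yet (e.g. waiting for resources).
		return "Pending"
	}
}

// SetCondition replaces the condition with the same Type, or appends it
// if no condition of that Type exists yet.
func SetCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type == c.Type {
			conds[i] = c
			return conds
		}
	}
	return append(conds, c)
}

func main() {
	var conds []Condition
	// Simulate two reconcile passes: nothing active yet, then two active jobs.
	for _, s := range []ChildStatus{{}, {Active: 2}} {
		state := DeriveState(s)
		conds = SetCondition(conds, Condition{Type: state, Status: "True", Reason: "Observed" + state})
		fmt.Println(state, len(conds))
	}
}
```

A real implementation would also need to decide whether an earlier condition such as Pending is removed or flipped to "False" once Running is set; the sketch leaves it in place.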
Example of desired behavior:
$ kubectl get trainjob
NAME          STATE      AGE
my-trainjob   Pending    30s
my-trainjob   Running    2m
my-trainjob   Complete   10m
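To show one STATE value per TrainJob, the list of conditions has to be reduced to a single string. The sketch below assumes one possible rule: display the most recently appended condition whose status is "True". `Condition` is again a stand-in for `metav1.Condition`, and `DisplayState` is a hypothetical helper; a printcolumn JSONPath expression would have to encode an equivalent rule.

```go
package main

import "fmt"

// Condition is a stand-in for metav1.Condition, reduced to the fields used here.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

// DisplayState returns the Type of the last condition in the slice whose
// Status is "True", or an empty string if there is none.
func DisplayState(conds []Condition) string {
	for i := len(conds) - 1; i >= 0; i-- {
		if conds[i].Status == "True" {
			return conds[i].Type
		}
	}
	return ""
}

func main() {
	conds := []Condition{
		{Type: "Pending", Status: "False"},
		{Type: "Running", Status: "True"},
	}
	fmt.Println(DisplayState(conds))
}
```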

Why is this needed?

User Experience and Operational Visibility:
Poor UX for monitoring: Currently, when a TrainJob is created, users see no status condition until it completes or fails. This creates confusion about whether the job is actually running or stuck.
Debugging difficulties: Without intermediate states, it's hard to distinguish between:
A job that's waiting for resources (should be Pending)
A job that's actively training (should be Running)
A job that's stuck due to configuration issues
Inconsistent with Kubernetes patterns: Most Kubernetes resources (Pods, Jobs, Deployments) expose intermediate states. TrainJob's current design breaks user expectations.
This enhancement would align TrainJob with standard Kubernetes resource patterns and significantly improve the user experience for ML practitioners using Kubeflow Trainer.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
