Add utility for TrainJob progress reporting #367

@abhijeet-dhumal

What you would like to be added?

Add a utility function to the Kubeflow SDK that allows training scripts to report progress and metrics to the Kubeflow Trainer controller. This enables the trainerStatus field introduced in kubeflow/trainer#3227 and specified in KEP-2779.

Context

The Kubeflow Trainer controller (PR #3227) adds a Progress Plugin that:

  1. Injects environment variables into training pods (KUBEFLOW_TRAINER_STATUS_URL, KUBEFLOW_TRAINER_STATUS_TOKEN, KUBEFLOW_TRAINER_STATUS_CA_CERT)
  2. Runs a Status Server that accepts progress updates via HTTPS POST
  3. Updates TrainJob.status.trainerStatus with progress, ETA, and metrics

The SDK needs a simple, safe utility function that training scripts can call to report status.
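A minimal sketch of what the client side of steps 1 and 2 could look like, using only the standard library. The Bearer auth scheme, the JSON body, and treating KUBEFLOW_TRAINER_STATUS_CA_CERT as a path to a PEM file are assumptions here, not something PR #3227 is confirmed to specify; build_status_request and send_status are hypothetical names.

```python
import json
import os
import ssl
import urllib.request
from typing import Mapping, Optional


def build_status_request(
    payload: dict,
    env: Mapping[str, str] = os.environ,
) -> Optional[urllib.request.Request]:
    """Return a POST request for the Status Server, or None outside Kubeflow."""
    url = env.get("KUBEFLOW_TRAINER_STATUS_URL")
    token = env.get("KUBEFLOW_TRAINER_STATUS_TOKEN")
    if not url or not token:
        return None  # Variables not injected: reporting becomes a no-op.
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send_status(
    req: urllib.request.Request,
    env: Mapping[str, str] = os.environ,
    timeout: float = 5.0,
) -> bool:
    """POST the update; never raise, so a reporting failure cannot kill training."""
    try:
        # Assumption: the CA cert variable holds a file path, not PEM content.
        ctx = ssl.create_default_context(
            cafile=env.get("KUBEFLOW_TRAINER_STATUS_CA_CERT")
        )
        with urllib.request.urlopen(req, timeout=timeout, context=ctx) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False
```

Returning None when the variables are absent keeps the same training script usable outside Kubeflow, and swallowing errors in send_status is one way to make the utility "safe" in the sense above.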

Note: A KubeflowCallback implementation has been submitted to HuggingFace Transformers (PR #44487) and depends on this SDK utility. The callback auto-activates when running in Kubeflow and calls update_runtime_status() to report progress.

Why is this needed?

Problem: AI practitioners running training jobs on Kubernetes have no native way to monitor training progress. They must either:

  1. Parse container logs manually
  2. Set up external tracking systems (MLflow/W&B), which add infrastructure overhead
  3. Wait blindly for jobs to complete

Solution: The Kubeflow Trainer controller (PR #3227) adds a Status Server that accepts progress updates from training pods. The SDK needs a client-side utility to POST updates to this server.

User experience improvement:

Before (no visibility):

kubectl get trainjob my-job
# NAME     STATUS    AGE
# my-job   Running   2h   ← Is it 10% done? 90% done? No idea.

After (with progress tracking):

kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [...]}

Dependency: A KubeflowCallback has been submitted to HuggingFace Transformers (PR #44487, in review) and depends on this SDK utility. The callback auto-activates when running in Kubeflow and calls the SDK's update_runtime_status() utility to report progress.
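For illustration, here is one way a callback or a plain training loop could derive the two headline values from step counts. This is a sketch under the assumption that steps take roughly constant time; it is not the logic used in PR #44487, and estimate_progress is a hypothetical helper.

```python
from typing import Optional, Tuple


def estimate_progress(
    step: int, total_steps: int, elapsed_seconds: float
) -> Tuple[int, Optional[int]]:
    """Return (percent complete, estimated seconds remaining)."""
    if total_steps <= 0 or step <= 0:
        return 0, None  # No completed steps yet, so no basis for an ETA.
    step = min(step, total_steps)
    percent = int(100 * step / total_steps)
    # Average seconds per completed step, extrapolated over the remaining steps.
    eta = int(elapsed_seconds / step * (total_steps - step))
    return percent, eta
```

The resulting pair would then be passed as progress_percent and estimated_time_remaining to update_runtime_status().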

Proposed API

from kubeflow.trainer.progress import update_runtime_status

# Basic usage - SDK handles throttling (max 1 update/5s)
update_runtime_status(
    progress_percent=50,
    estimated_time_remaining=120,  # seconds or timedelta
    metrics={"loss": "0.234", "eval_accuracy": "0.89"}
)

# Force update (bypass throttling) - use for start/end
update_runtime_status(progress_percent=0, force=True)   # Training started
update_runtime_status(progress_percent=100, force=True) # Training complete
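The throttling and force behaviour described in the comments above could be implemented roughly as follows. This is a sketch, not the SDK's implementation: Throttle and to_seconds are hypothetical names, and the 5-second default simply mirrors the "max 1 update/5s" comment.

```python
import time
from datetime import timedelta
from typing import Callable, Optional, Union


class Throttle:
    """Allow at most one update per `min_interval` seconds unless forced."""

    def __init__(
        self,
        min_interval: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_sent: Optional[float] = None

    def should_send(self, force: bool = False) -> bool:
        now = self._clock()
        too_soon = (
            self._last_sent is not None
            and now - self._last_sent < self.min_interval
        )
        if too_soon and not force:
            return False  # Drop this update; a later one will carry newer values.
        self._last_sent = now
        return True


def to_seconds(value: Union[int, float, timedelta, None]) -> Optional[int]:
    """Normalize estimated_time_remaining, given as seconds or a timedelta."""
    if value is None:
        return None
    if isinstance(value, timedelta):
        return int(value.total_seconds())
    return int(value)
```

Using a monotonic clock avoids surprises if the wall clock is adjusted mid-run, and a forced update also resets the window so the next regular update is spaced from it.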

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
