Description
What you would like to be added?
Add a utility function to the Kubeflow SDK that allows training scripts to report progress and metrics to the Kubeflow Trainer controller. This enables the trainerStatus field introduced in kubeflow/trainer#3227 and specified in KEP-2779.
Context
The Kubeflow Trainer controller (PR #3227) adds a Progress Plugin that:
- Injects environment variables into training pods (KUBEFLOW_TRAINER_STATUS_URL, KUBEFLOW_TRAINER_STATUS_TOKEN, KUBEFLOW_TRAINER_STATUS_CA_CERT)
- Runs a Status Server that accepts progress updates via HTTPS POST
- Updates TrainJob.status.trainerStatus with progress, ETA, and metrics
The SDK needs a simple, safe utility function that training scripts can call to report status.
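As a rough illustration of what "simple and safe" could mean on the client side, here is a minimal sketch that reads the injected environment variables and builds the HTTPS POST. The helper names (build_status_request, send_status), the bearer-token header, the payload field names, and the idea that KUBEFLOW_TRAINER_STATUS_CA_CERT holds PEM text are all assumptions made for illustration; the actual contract is whatever KEP-2779 and PR #3227 define.

```python
import json
import os
import ssl
import urllib.request


def build_status_request(payload, environ=None):
    """Build (but do not send) a POST request for the Status Server.

    Returns None when the Kubeflow variables are absent, so a training
    script can call this unchanged outside a TrainJob.
    """
    env = os.environ if environ is None else environ
    url = env.get("KUBEFLOW_TRAINER_STATUS_URL")
    token = env.get("KUBEFLOW_TRAINER_STATUS_TOKEN")
    if not url or not token:
        return None  # not running under the Progress Plugin: no-op

    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: bearer-token auth; the real scheme may differ.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def send_status(request, ca_cert_pem=None, timeout=5.0):
    """Send the request; return True on a 2xx reply, False on network errors.

    Assumption: ca_cert_pem is the PEM text of the Status Server CA.
    Network failures are swallowed so reporting never crashes training.
    """
    context = ssl.create_default_context(cadata=ca_cert_pem)
    try:
        with urllib.request.urlopen(request, timeout=timeout, context=context) as resp:
            return 200 <= resp.status < 300
    except OSError:  # URLError and HTTPError are OSError subclasses
        return False
```

Returning None when the variables are missing keeps the same training script usable on a laptop or in CI, where no Status Server exists.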
Note: A KubeflowCallback implementation has been submitted to HuggingFace Transformers (PR #44487) which depends on this SDK utility. The callback auto-activates when running in Kubeflow and can call the update_runtime_status() utility to report progress.
Why is this needed?
Problem: AI practitioners running training jobs on Kubernetes have no native way to monitor training progress. They must either:
- Parse container logs manually
- Set up external tracking systems (MLflow/W&B), which adds infrastructure overhead
- Wait blindly for jobs to complete
Solution: The Kubeflow Trainer controller (PR #3227) adds a Status Server that accepts progress updates from training pods. The SDK needs a client-side utility to POST updates to this server.
User experience improvement:
Before (no visibility):
kubectl get trainjob my-job
# NAME STATUS AGE
# my-job Running 2h ← Is it 10% done? 90% done? No idea.
After (with progress tracking):
kubectl get trainjob my-job -o jsonpath='{.status.trainerStatus}'
# {"progressPercentage": 67, "estimatedRemainingSeconds": 3600, "metrics": [...]}
Dependency: A KubeflowCallback has been submitted to HuggingFace Transformers (PR #44487, in review) which depends on this SDK utility. The callback auto-activates when running in Kubeflow and can call the SDK's update_runtime_status() utility to report progress.
Proposed API
from kubeflow.trainer.progress import update_runtime_status
# Basic usage - SDK handles throttling (max 1 update/5s)
update_runtime_status(
    progress_percent=50,
    estimated_time_remaining=120, # seconds or timedelta
    metrics={"loss": "0.234", "eval_accuracy": "0.89"}
)
# Force update (bypass throttling) - use for start/end
update_runtime_status(progress_percent=0, force=True) # Training started
update_runtime_status(progress_percent=100, force=True) # Training complete
Love this feature?
Give it a 👍 We prioritize the features with the most 👍
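The throttling behaviour in the proposed API (at most one update every 5 seconds, with force=True bypassing the limit) and the "seconds or timedelta" argument could be handled along the following lines. This is only a sketch: the Throttle class and to_seconds helper are hypothetical names, not part of the SDK, and the real implementation may track state differently.

```python
import time
from datetime import timedelta


class Throttle:
    """Allow at most one update per min_interval seconds unless forced."""

    def __init__(self, min_interval=5.0, clock=time.monotonic):
        self._min_interval = min_interval
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_sent = None

    def should_send(self, force=False):
        now = self._clock()
        due = self._last_sent is None or now - self._last_sent >= self._min_interval
        if force or due:
            # A forced update also resets the window for later updates.
            self._last_sent = now
            return True
        return False


def to_seconds(value):
    """Normalize a 'seconds or timedelta' argument to whole seconds."""
    if isinstance(value, timedelta):
        return int(value.total_seconds())
    return int(value)
```

Using a monotonic clock rather than wall-clock time means the interval is unaffected by system clock adjustments inside the pod.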