Conversation

@abhijeet-dhumal
Member

@abhijeet-dhumal abhijeet-dhumal commented Sep 8, 2025

This PR implements real-time progression tracking for TrainJobs, enabling users to monitor training progress directly through the Kubernetes API without requiring additional RBAC permissions for training pods.

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):

Fixes #2779
Summary and test results

Note: The progressionStatus field and its related metrics, as shown below, are entirely optional and can be configured through the training progress tracker's custom JSONL file.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-training-job
spec:
  # ... your training job spec
status:
  conditions:
    - lastTransitionTime: '2025-09-08T16:48:40Z'
      message: TrainJob is resumed
      reason: Resumed
      status: 'False'
      type: Suspended
    - lastTransitionTime: '2025-09-08T16:51:32Z'
      message: jobset completed successfully
      reason: AllJobsCompleted
      status: 'True'
      type: Complete
  progressionStatus:
    metrics:
      total_batches: '469'
      total_samples: '30000'
    totalSteps: 3752
    currentStep: 3752
    message: Training completed
    trainingMetrics:
      accuracy: '0.8956666666666667'
      checkpointsStored: 8
      latestCheckpointPath: /workspace/checkpoints/epoch-8.pth
      loss: '0.2960562955405412'
    lastUpdateTime: '2025-09-08T16:50:59Z'
    estimatedTimeRemaining: 0
    totalEpochs: 8
    currentEpoch: 8
    percentageComplete: '100.00'
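
As a rough illustration of how a client might consume this status block, the sketch below reads the fields from an already-parsed TrainJob object and recomputes the completion percentage from currentStep and totalSteps. The helper name `summarize_progress` is hypothetical and not part of this PR; the field names are taken from the example output above only.

```python
from typing import Any, Dict, Optional


def summarize_progress(trainjob: Dict[str, Any]) -> Optional[str]:
    """Return a one-line progress summary, or None if no progression is reported.

    Hypothetical helper: field names mirror the example status in this PR.
    """
    progression = trainjob.get("status", {}).get("progressionStatus")
    if not progression:
        # progressionStatus is optional, so its absence is not an error.
        return None

    current = int(progression.get("currentStep", 0))
    total = int(progression.get("totalSteps", 0))
    # Guard against division by zero when totalSteps is unknown.
    percent = 100.0 * current / total if total > 0 else 0.0
    loss = progression.get("trainingMetrics", {}).get("loss")

    summary = f"step {current}/{total} ({percent:.2f}%)"
    if loss is not None:
        # Metric values are serialized as strings in the example, so convert.
        summary += f", loss={float(loss):.4f}"
    return summary


example = {
    "status": {
        "progressionStatus": {
            "currentStep": 3752,
            "totalSteps": 3752,
            "trainingMetrics": {"loss": "0.2960562955405412"},
        }
    }
}
print(summarize_progress(example))
```

Returning None for a missing block keeps callers from treating an unconfigured tracker as a failure, which matches the note that these fields are optional.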

Usage:

# HuggingFace Transformers
trainer = Trainer(callbacks=[TrainJobProgressionCallback()])

# PyTorch
tracker = ProgressionTracker(total_epochs=10, steps_per_epoch=100)
tracker.update_step(epoch=0, step=50, loss=0.5)
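
To make the JSONL idea concrete, here is a minimal sketch of a tracker that appends one JSON record per update. It is an assumption about the general shape only: the class `JsonlProgressWriter`, the file name `progress.jsonl`, and the record layout are all hypothetical and are not the implementation in this PR. The keys simply reuse the field names from the status example.

```python
import json
import os
import tempfile
import time


class JsonlProgressWriter:
    """Hypothetical stand-in for a progress tracker; not the class from this PR."""

    def __init__(self, path: str, total_epochs: int, steps_per_epoch: int) -> None:
        if total_epochs <= 0 or steps_per_epoch <= 0:
            raise ValueError("total_epochs and steps_per_epoch must be positive")
        self.path = path
        self.total_epochs = total_epochs
        self.steps_per_epoch = steps_per_epoch
        self.total_steps = total_epochs * steps_per_epoch

    def update_step(self, epoch: int, step: int, **metrics: float) -> dict:
        """Append one JSON record describing the current position in training."""
        # Assumes epoch is zero-based and step counts steps within that epoch.
        current_step = epoch * self.steps_per_epoch + step
        record = {
            "currentEpoch": epoch + 1,
            "totalEpochs": self.total_epochs,
            "currentStep": current_step,
            "totalSteps": self.total_steps,
            "percentageComplete": f"{100.0 * current_step / self.total_steps:.2f}",
            # Metrics are stored as strings, as in the status example.
            "trainingMetrics": {name: str(value) for name, value in metrics.items()},
            "lastUpdateTime": int(time.time()),
        }
        # One JSON object per line, so a reader can process the file incrementally.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        return record


path = os.path.join(tempfile.mkdtemp(), "progress.jsonl")
writer = JsonlProgressWriter(path, total_epochs=10, steps_per_epoch=100)
record = writer.update_step(epoch=0, step=50, loss=0.5)
print(record["percentageComplete"])
```

Appending whole lines means a reader on the controller side never sees a partially rewritten file, only a possibly incomplete last line.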

Example API request to get training progress:

# Get progression status from TrainJob
curl -H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json" \
-k "$API_SERVER/apis/trainer.kubeflow.org/v1alpha1/namespaces/abdhumal-test/trainjobs/trl-demo" \
| jq '.status.progressionStatus'
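
The same request can be sketched in Python using only the standard library. The function names `trainjob_url` and `get_progression_status` are hypothetical; the URL path is the one used in the curl command above, and the `verify_tls=False` option mirrors curl's `-k` flag, which should only be used against test clusters.

```python
import json
import ssl
import urllib.request
from typing import Any, Dict, Optional


def trainjob_url(api_server: str, namespace: str, name: str) -> str:
    """Build the TrainJob resource URL used in the curl example."""
    return (
        f"{api_server.rstrip('/')}/apis/trainer.kubeflow.org/v1alpha1"
        f"/namespaces/{namespace}/trainjobs/{name}"
    )


def get_progression_status(
    api_server: str,
    token: str,
    namespace: str,
    name: str,
    verify_tls: bool = True,
) -> Optional[Dict[str, Any]]:
    """Fetch a TrainJob and return its status.progressionStatus, if present."""
    request = urllib.request.Request(
        trainjob_url(api_server, namespace, name),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/json",
        },
    )
    context = ssl.create_default_context()
    if not verify_tls:
        # Equivalent to curl -k: skips certificate checks, unsafe outside testing.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(request, context=context, timeout=10) as response:
        trainjob = json.load(response)
    # The field is optional, so return None rather than raising when it is absent.
    return trainjob.get("status", {}).get("progressionStatus")
```

A call such as `get_progression_status(api_server, token, "abdhumal-test", "trl-demo")` would correspond to the curl and jq pipeline, assuming `api_server` and `token` hold the same values as `$API_SERVER` and `$TOKEN`.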

Checklist:

  • Docs included if any changes are user facing

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coveralls

coveralls commented Sep 8, 2025

Pull Request Test Coverage Report for Build 17559363064

Details

  • 73 of 202 (36.14%) changed or added relevant lines in 4 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-1.3%) to 50.833%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller/setup.go | 0 | 1 | 0.0% |
| pkg/controller/trainjob_controller.go | 0 | 53 | 0.0% |
| pkg/util/progression/reader.go | 68 | 143 | 47.55% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller/trainjob_controller.go | 1 | 0.0% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 17556155553 | -1.3% |
| Covered Lines | 1098 |
| Relevant Lines | 2160 |

💛 - Coveralls

@abhijeet-dhumal abhijeet-dhumal force-pushed the training-progression#2779 branch from 76dc772 to 9c3b465 Compare September 8, 2025 17:25
Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
@abhijeet-dhumal abhijeet-dhumal force-pushed the training-progression#2779 branch from 9c3b465 to 7b5fe99 Compare September 8, 2025 17:43
@abhijeet-dhumal abhijeet-dhumal marked this pull request as ready for review September 8, 2025 17:44
@abhijeet-dhumal abhijeet-dhumal changed the title [WIP] Add TrainJob progression tracking with real-time status updates Add TrainJob progression tracking with real-time status updates Sep 9, 2025
@abhijeet-dhumal abhijeet-dhumal changed the title Add TrainJob progression tracking with real-time status updates feat: Add TrainJob progression tracking with real-time status updates Sep 9, 2025
@abhijeet-dhumal
Member Author

@andreyvelich @astefanutti @kannon92 May I request your review here?

- Added RHOAI-specific manifests for OpenShift AI deployment
- Added Dockerfile.odh for ODH-specific container builds
- Includes training runtimes for CUDA 2.4.1, 2.5.1 and ROCm 2.4.1, 2.5.1
- Added monitoring, RBAC, and configuration patches for RHOAI
@kannon92
Contributor

This seems to be a pretty large change.

Should there be a KEP update or a design doc summarizing this?

Forgive me if I missed it.

@andreyvelich
Member

andreyvelich commented Oct 21, 2025

Yeah, we should prepare a KEP as part of #2779.
We had a few discussions with @astefanutti during Training WG calls on how to design an API that exposes metrics and other parameters from the application.
That might be useful for both TrainJobs and OptimizationJobs.

@astefanutti
Contributor

@kannon92 you're right. As @andreyvelich mentioned, we discussed this during the community call, and we are working on a draft of the KEP that we hope to share soon.

@abhijeet-dhumal
Member Author

Closing this PR, as the proposal has been created in a separate PR (linked below) and is now available for review, so that we can work out the ideal implementation approach.
Please follow: #2905

Development

Successfully merging this pull request may close these issues.

Track TrainJob progress and expose training metrics
