feat: Add TrainJob progression tracking with real-time status updates #2820
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the affected files.
Force-pushed from a64fa0d to 76dc772.
Pull Request Test Coverage Report for Build 17559363064 (Coveralls)
Force-pushed from 76dc772 to 9c3b465.
Signed-off-by: Abhijeet Dhumal <abdhumal@redhat.com>
Force-pushed from 9c3b465 to 7b5fe99.
@andreyvelich @astefanutti @kannon92 May I request your review here?
- Added RHOAI-specific manifests for OpenShift AI deployment
- Added Dockerfile.odh for ODH-specific container builds
- Includes training runtimes for CUDA 2.4.1, 2.5.1 and ROCm 2.4.1, 2.5.1
- Added monitoring, RBAC, and configuration patches for RHOAI
This seems to be a pretty large change. Should there be a KEP update or a design doc summarizing this? Forgive me if I missed it.

Yeah, we should prepare a KEP as part of #2779.

@kannon92 you're right. As @andreyvelich mentioned, we discussed this during the community call, and we are working on a draft of the KEP that we'll hopefully be able to share soon.

Closing this PR: the proposal has been created in a separate PR, mentioned below, and is now available for review so that we can settle on the ideal implementation approach.
This PR implements real-time progression tracking for TrainJobs, enabling users to monitor training progress directly through the Kubernetes API without requiring additional RBAC permissions for training pods.
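For illustration only, here is a minimal sketch of what reading a TrainJob through the Kubernetes API could look like from the client side. It assumes the TrainJob resource is served under the `trainer.kubeflow.org/v1alpha1` API group and that the object carries a standard `status.conditions` list. The helper names `trainjob_path` and `latest_condition` are hypothetical. The progress fields this PR adds are not shown in this conversation, so the sketch deliberately reads only the conditions.

```python
from urllib.parse import quote

# Assumed API coordinates for the TrainJob custom resource.
GROUP = "trainer.kubeflow.org"
VERSION = "v1alpha1"
PLURAL = "trainjobs"


def trainjob_path(namespace: str, name: str) -> str:
    """Build the REST path of a single namespaced TrainJob (hypothetical helper)."""
    return (
        f"/apis/{GROUP}/{VERSION}"
        f"/namespaces/{quote(namespace, safe='')}"
        f"/{PLURAL}/{quote(name, safe='')}"
    )


def latest_condition(trainjob: dict):
    """Return the most recently transitioned status condition, or None.

    RFC 3339 UTC timestamps sort lexicographically, so a plain string
    comparison on lastTransitionTime is enough for this sketch.
    """
    status = trainjob.get("status") or {}
    conditions = status.get("conditions") or []
    if not conditions:
        return None
    return max(conditions, key=lambda c: c.get("lastTransitionTime", ""))
```

The path returned by `trainjob_path("default", "my-job")` can be passed to `kubectl get --raw`, and the resulting JSON object can be handed to `latest_condition` after decoding it with `json.loads`.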
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #2779
Summary and test results
Usage:
Valid API request to get training progress:
Checklist: