forked from opendatahub-io/trainer
Merge changes from RHOAI main to 3.2 #59
Merged
* fix(docs): convert commits to list in changelog.py for compatibility Signed-off-by: kramaranya <[email protected]> * chore(docs): add Changelog for Trainer v2.0.0-rc.0 Signed-off-by: kramaranya <[email protected]> --------- Signed-off-by: kramaranya <[email protected]>
…nShift (kubeflow#2682) Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
…#2685) * chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 Signed-off-by: Andrey Velichkevich <[email protected]> * Update cuda to 12.8 Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
…w#2382) * Add the manifests overlay for Kubeflow Training V2 Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: adjust permissions, and format changes Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: rename overlay, adjust event permissions Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: make namespace configurable Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: move standalone, only-manager installation in namespace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: add overlay for Kubeflow Platform installation Signed-off-by: Xinmin Du <[email protected]> * add permission for pods log read & rm persistentvolumeclaims Signed-off-by: Xinmin Du <[email protected]> * create the runtimes before the webhooks Signed-off-by: Xinmin Du <[email protected]> * Specify sorting order: fifo Signed-off-by: Xinmin Du <[email protected]> * Deploy jobset first Signed-off-by: Xinmin Du <[email protected]> * remove edit permissions to runtimes; install runtimes after crds Signed-off-by: Xinmin Du <[email protected]> * remove pretraining directory Signed-off-by: Xinmin Du <[email protected]> * patch runtimes images Signed-off-by: Xinmin Du <[email protected]> * fix: correct image Signed-off-by: Xinmin Du <[email protected]> * add image patch for more runtimes Signed-off-by: Xinmin Du <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update 
manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * role_bind for notebook & profile Signed-off-by: Xinmin Du <[email protected]> * fix: reorder images Signed-off-by: Xinmin Du <[email protected]> * fix: reuse overlay/manager & runtimes Signed-off-by: Xinmin Du <[email protected]> * fix: remove namespace with patch Signed-off-by: Xinmin Du <[email protected]> --------- Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Du Xinmin <[email protected]> Co-authored-by: Xinmin Du <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
…ith CTR and TrainJob yaml files (kubeflow#2669) * chore(manifests): include torchtune runtimes. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update torchtune runtimes. Signed-off-by: Electronic-Waste <[email protected]> * chore(manifests): Update mounting path in CTRs. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update output_dir. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update numProcPerNode to auto. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
…w#2675) * fix(plugins): fix errors in trainer command mutation of torchtune. Signed-off-by: Electronic-Waste <[email protected]> * fix(plugins): remove config file format suffix. Signed-off-by: Electronic-Waste <[email protected]> * fix(test): update UTs. Signed-off-by: Electronic-Waste <[email protected]> * fix(initializer): Update the workspace of dataset/model initializer. Signed-off-by: Electronic-Waste <[email protected]> * fix(plugins): parse nproc_per_node from GPU resource. Signed-off-by: Electronic-Waste <[email protected]> * fix(torchtune): Add bitsandbytes dependency in requirements.txt Signed-off-by: Electronic-Waste <[email protected]> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <[email protected]> * fix(torchtune): Remove unnecessary num_proc_per_node calculation. Signed-off-by: Electronic-Waste <[email protected]> * test(torch): Update invalid parameters. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
…ubeflow#2695) Signed-off-by: Yuki Iwai <[email protected]>
* feat: Mutable PodSpecOverrides for suspended TrainJob Signed-off-by: Antonin Stefanutti <[email protected]> * Include @tenzen-y review Signed-off-by: Antonin Stefanutti <[email protected]> * Add unit tests Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]>
* feat(example): Add alpaca-trianjob-yaml.ipynb. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Update the overview of the torchtune llama3_2 example. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Update the pvc description. Signed-off-by: Electronic-Waste <[email protected]> * chore(example): Add the get the fine-tuned model section. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Fix some errors. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): fix some errors. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Fix debug tag. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Change PVC creation method to Python SDK. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Remove config load. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
* feat: Add schedulingGates to PodSpecOverrides Signed-off-by: Antonin Stefanutti <[email protected]> * Change desired job to target job in PodSpecOverrides comments Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]>
* fix(module): Change Go module name to v2 Signed-off-by: Andrey Velichkevich <[email protected]> * Bump x/net to v0.38.0 Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
* chore(docs): Add Changelog for v2.0.0-rc.1 Signed-off-by: Andrey Velichkevich <[email protected]> * Move example to misc Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* Add Red Hat to ADOPTERS.md Signed-off-by: Yuan Tang <[email protected]> * Update ADOPTERS.md Signed-off-by: Yuan Tang <[email protected]> --------- Signed-off-by: Yuan Tang <[email protected]>
…d to job (kubeflow#2719) Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…ow#2731) Signed-off-by: rudeigerc <[email protected]>
* chore(ci): Add GitHub action to verify PR titles Signed-off-by: Andrey Velichkevich <[email protected]> * Use operator scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add examples scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add scripts to scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add exporter Signed-off-by: Andrey Velichkevich <[email protected]> * add wip ignore label Signed-off-by: Andrey Velichkevich <[email protected]> * Add PR title to the contrib guide Signed-off-by: Andrey Velichkevich <[email protected]> * Ignore dependencies label Signed-off-by: Andrey Velichkevich <[email protected]> * Fix text Signed-off-by: Andrey Velichkevich <[email protected]> * Use action only on master branch Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
…ue template (kubeflow#2732) Signed-off-by: rudeigerc <[email protected]>
… jobset (kubeflow#2734) Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Koray Oksay <[email protected]>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 Signed-off-by: Andrey Velichkevich <[email protected]> * Add links for blog post and migration guide Signed-off-by: Andrey Velichkevich <[email protected]> * Add links for blog post and website Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(docs): Kubeflow Trainer ROADMAP 2025 Signed-off-by: Andrey Velichkevich <[email protected]> * Update roadmap Signed-off-by: Andrey Velichkevich <[email protected]> * Add issue for Trainer UI Signed-off-by: Andrey Velichkevich <[email protected]> * Add issues for MPI and plugin extension Signed-off-by: Andrey Velichkevich <[email protected]> * Add issues for builtin trainers Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
…ubeflow#2754) Signed-off-by: Andrey Velichkevich <[email protected]>
kubeflow#2911) * feat(initializer): add s3 model and dataset initializers Signed-off-by: rudeigerc <[email protected]> * chore: refactor with opendal Signed-off-by: rudeigerc <[email protected]> * chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs Signed-off-by: rudeigerc <[email protected]> --------- Signed-off-by: rudeigerc <[email protected]> Co-authored-by: rudeigerc <[email protected]>
…ubeflow#2912) * chore(operator): Use SSA throughout runtime framework Signed-off-by: Antonin Stefanutti <[email protected]> * Fix lint error Signed-off-by: Antonin Stefanutti <[email protected]> * Update go.mod file Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
…harts (kubeflow#2914) Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
…branch (kubeflow#2917) * feat(manifests): Publish Trainer Helm Charts (kubeflow#2906) * Solve remaining errors and bugs Signed-off-by: adity1raut <[email protected]> * Solve the config Signed-off-by: adity1raut <[email protected]> * Update the suggested change Signed-off-by: adity1raut <[email protected]> * Update after review Signed-off-by: adity1raut <[email protected]> * Update the Helm publish action Signed-off-by: Andrey Velichkevich <[email protected]> * Update release doc Signed-off-by: Andrey Velichkevich <[email protected]> * Use 0.0.0 version for master branch Signed-off-by: Andrey Velichkevich <[email protected]> * Update release doc Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: adity1raut <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]> * fix(manifests): Fix Helm charts image name (kubeflow#2915) * fix(manifests): Fix Helm charts image name Signed-off-by: Andrey Velichkevich <[email protected]> * Always insert appVersion to the Chart.yaml file Signed-off-by: Andrey Velichkevich <[email protected]> * Fix comment Signed-off-by: Andrey Velichkevich <[email protected]> * Simplify action Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> * fix(manifests): Remove the default tag from the controller image (kubeflow#2916) * fix(manifests): Remove the default tag from the controller image Signed-off-by: Andrey Velichkevich <[email protected]> * Fix README template Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: adity1raut <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Aditya Raut <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…cache nodes (kubeflow#2920) Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
…#2924) * add local docker training example Signed-off-by: Brian Gallagher <[email protected]> * feat: Adding local execution example notebook Co-authored-by Brian Gallagher <[email protected]> Signed-off-by: Fiona Waters <[email protected]> --------- Signed-off-by: Brian Gallagher <[email protected]> Signed-off-by: Fiona Waters <[email protected]> Co-authored-by: Brian Gallagher <[email protected]> Co-authored-by: Fiona Waters <[email protected]>
…ubeflow#2927) * fix(ci): Fix the Kubeflow SDK installation with Docker Signed-off-by: Andrey Velichkevich <[email protected]> * Uncomment delete job in local Notebooks Signed-off-by: Andrey Velichkevich <[email protected]> * Update .github/workflows/test-e2e.yaml Co-authored-by: Anya Kramar <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]> Co-authored-by: Anya Kramar <[email protected]>
…e and example (kubeflow#2928) Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…T example (kubeflow#2979) Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
* created github workflow for trainer * added workflow dispatcher * updating temp quay token in github * Remove odh-kfto-sdk-notebooks-sync workflow * updated build pipeline to use rhoai docker file * removed pre-build commands from build and publish * added multiarch docker file * fixed typo for multiarch * fixed multiarch file * temporary quay push * reverted local build image testing creds * Update Dockerfile.rhoai * update dockerfile.rhoai to dockerfile.odh * fixed nitpick comments * removed odh-release.yaml
- Add RHOAI specific Dockerfile for Trainer V2 controller image
- Add RHOAI overlay manifests for Trainer V2
- Add custom training runtimes in rhoai overlay
- Update component metadata and controller image to v2.1.0
- Add makefile automated command for trainer-rhoai manifests deployment and cleanup
Signed-off-by: abhijeet-dhumal <[email protected]>
Kubebuilder by default serves metrics on 8443 with tls. Signed-off-by: Rob Bell <[email protected]>
fix(manifests): fix prometheus monitoring config
Rebase latest ODH main into RHOAI main
fix monitoring config
…implementation (kubeflow#20) * feat: Add training progression tracking feature for experimental implementation Signed-off-by: abhijeet-dhumal <[email protected]> * fix: add only permission needed for controller - pod list/get Signed-off-by: abhijeet-dhumal <[email protected]> * fix: Add preStop webhook to handle progression capture gracefully during trainjob termination Signed-off-by: abhijeet-dhumal <[email protected]> * fix: add preStop hook for progression tracking final status capture Signed-off-by: abhijeet-dhumal <[email protected]> * fix: add preStop hook e2e test cases Signed-off-by: abhijeet-dhumal <[email protected]> * fix: remove framework annotation Signed-off-by: abhijeet-dhumal <[email protected]> * fix: Reduce metrics polling timeout from 8s to 2s Signed-off-by: abhijeet-dhumal <[email protected]> * fix: replace metrics validation with selective cleaning Signed-off-by: abhijeet-dhumal <[email protected]> * fix: Integrate progression tracking into TrainJob controller Signed-off-by: abhijeet-dhumal <[email protected]> * fix: handle edge cases of trainjob completion and failures Signed-off-by: abhijeet-dhumal <[email protected]> * fix: minor fixes Signed-off-by: abhijeet-dhumal <[email protected]> * fix: add client reader to avoid watch on pods Signed-off-by: abhijeet-dhumal <[email protected]> * fix: adjust progression e2e tests Signed-off-by: abhijeet-dhumal <[email protected]> --------- Signed-off-by: abhijeet-dhumal <[email protected]>
Signed-off-by: Brian Gallagher <[email protected]>
robert-bell approved these changes Nov 28, 2025
Merged 72bb8e6 into red-hat-data-services:rhoai-3.2
11 of 16 checks passed