forked from opendatahub-io/trainer
Rebase 3.2 #50
Merged
* fix(docs): convert commits to list in changelog.py for compatibility Signed-off-by: kramaranya <[email protected]> * chore(docs): add Changelog for Trainer v2.0.0-rc.0 Signed-off-by: kramaranya <[email protected]> --------- Signed-off-by: kramaranya <[email protected]>
…nShift (kubeflow#2682) Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
…#2685) * chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 Signed-off-by: Andrey Velichkevich <[email protected]> * Update cuda to 12.8 Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
…w#2382) * Add the manifests overlay for Kubeflow Training V2 Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: adjust permissions, and format changes Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: rename overlay, adjust event permissions Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: make namespace configurable Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: move standalone, only-manager installation in namespace: kubeflow-system Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> * Update manifest: add overlay for Kubeflow Platform installation Signed-off-by: Xinmin Du <[email protected]> * add permission for pods log read & rm persistentvolumeclaims Signed-off-by: Xinmin Du <[email protected]> * create the runtimes before the webhooks Signed-off-by: Xinmin Du <[email protected]> * Specify sorting order: fifo Signed-off-by: Xinmin Du <[email protected]> * Deploy jobset first Signed-off-by: Xinmin Du <[email protected]> * remove edit permissions to runtimes; install runtimes after crds Signed-off-by: Xinmin Du <[email protected]> * remove pretraining directory Signed-off-by: Xinmin Du <[email protected]> * patch runtimes images Signed-off-by: Xinmin Du <[email protected]> * fix: correct image Signed-off-by: Xinmin Du <[email protected]> * add image patch for more runtimes Signed-off-by: Xinmin Du <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update 
manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <[email protected]> Signed-off-by: Du Xinmin <[email protected]> * role_bind for notebook & profile Signed-off-by: Xinmin Du <[email protected]> * fix: reorder images Signed-off-by: Xinmin Du <[email protected]> * fix: reuse overlay/manager & runtimes Signed-off-by: Xinmin Du <[email protected]> * fix: remove namespace with patch Signed-off-by: Xinmin Du <[email protected]> --------- Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Xinmin Du <[email protected]> Signed-off-by: Du Xinmin <[email protected]> Co-authored-by: Xinmin Du <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
…ith CTR and TrainJob yaml files (kubeflow#2669) * chore(mainfests): include torchtune runtimes. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update torchtune runtimes.: Signed-off-by: Electronic-Waste <[email protected]> * chore(manifests): Update mounting path in CTRs. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update output_dir. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Update numProcPerNode to auto. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
…w#2675) * fix(plugins): fix errors in trainer command mutation of torchtune. Signed-off-by: Electronic-Waste <[email protected]> * fix(plugins): remove config file format suffix. Signed-off-by: Electronic-Waste <[email protected]> * fix(test): update UTs. Signed-off-by: Electronic-Waste <[email protected]> * fix(initializer): Update the workspace of dataset/model initializer. Signed-off-by: Electronic-Waste <[email protected]> * fix(plugins): parse nproc_per_node from GPU resource. Signed-off-by: Electronic-Waste <[email protected]> * fix(torchtune): Add bitsandbytes dependency in requirements.txt Signed-off-by: Electronic-Waste <[email protected]> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <[email protected]> * fix(torchtune): Remove unnecessary num_proc_per_node calculation. Signed-off-by: Electronic-Waste <[email protected]> * test(torch): Update invalid parameters. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
…ubeflow#2695) Signed-off-by: Yuki Iwai <[email protected]>
* feat: Mutable PodSpecOverrides for suspended TrainJob Signed-off-by: Antonin Stefanutti <[email protected]> * Include @tenzen-y review Signed-off-by: Antonin Stefanutti <[email protected]> * Add unit tests Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]>
* feat(example): Add alpaca-trianjob-yaml.ipynb. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Update the overview of the torchtune llama3_2 example. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Update the pvc description. Signed-off-by: Electronic-Waste <[email protected]> * chore(example): Add the get the fine-tuned model section. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Fix some errors. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): fix some errors. Signed-off-by: Electronic-Waste <[email protected]> * fix(manifests): Fix debug tag. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Change PVC creation method to Python SDK. Signed-off-by: Electronic-Waste <[email protected]> * fix(example): Remove config load. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
* feat: Add schedulingGates to PodSpecOverrides Signed-off-by: Antonin Stefanutti <[email protected]> * Change desired job to target job in PodSpecOverrides comments Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]>
* fix(module): Change Go module name to v2 Signed-off-by: Andrey Velichkevich <[email protected]> * Bump x/net to v0.38.0 Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
* chore(docs): Add Changelog for v2.0.0-rc.1 Signed-off-by: Andrey Velichkevich <[email protected]> * Move example to misc Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* Add Red Hat to ADOPTERS.md Signed-off-by: Yuan Tang <[email protected]> * Update ADOPTERS.md Signed-off-by: Yuan Tang <[email protected]> --------- Signed-off-by: Yuan Tang <[email protected]>
…d to job (kubeflow#2719) Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…ow#2731) Signed-off-by: rudeigerc <[email protected]>
* chore(ci): Add GitHub action to verify PR titles Signed-off-by: Andrey Velichkevich <[email protected]> * Use operator scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add examples scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add scripts to scope Signed-off-by: Andrey Velichkevich <[email protected]> * Add exporter Signed-off-by: Andrey Velichkevich <[email protected]> * add wip ignore label Signed-off-by: Andrey Velichkevich <[email protected]> * Add PR title to the contrib guide Signed-off-by: Andrey Velichkevich <[email protected]> * Ignore dependencies label Signed-off-by: Andrey Velichkevich <[email protected]> * Fix text Signed-off-by: Andrey Velichkevich <[email protected]> * Use action only on master branch Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
…ue template (kubeflow#2732) Signed-off-by: rudeigerc <[email protected]>
… jobset (kubeflow#2734) Signed-off-by: rudeigerc <[email protected]>
Signed-off-by: Koray Oksay <[email protected]>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 Signed-off-by: Andrey Velichkevich <[email protected]> * Add links for blog post and migration guide Signed-off-by: Andrey Velichkevich <[email protected]> * Add links for blog post and website Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(docs): Kubeflow Trainer ROADMAP 2025 Signed-off-by: Andrey Velichkevich <[email protected]> * Update roadmap Signed-off-by: Andrey Velichkevich <[email protected]> * Add issue for Trainer UI Signed-off-by: Andrey Velichkevich <[email protected]> * Add issues for MPI and plugin extension Signed-off-by: Andrey Velichkevich <[email protected]> * Add issues for builtin trainers Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
…s for data_cache (kubeflow#2890) Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
kubeflow#2898) Signed-off-by: Xinmin Du <[email protected]>
…obs (kubeflow#2653) * feat(runtime): add support for launcher resource allocation in MPI jobs Signed-off-by: Andrey Velichkevich <[email protected]> * Add unit tests Signed-off-by: Andrey Velichkevich <[email protected]> * Set numProcPerNode for MPI plugin Signed-off-by: Andrey Velichkevich <[email protected]> * Move util func to runtime package Signed-off-by: Andrey Velichkevich <[email protected]> * Fix torchtune plugin Signed-off-by: Andrey Velichkevich <[email protected]> * Inline if for GPU check Signed-off-by: Andrey Velichkevich <[email protected]> * Assign container resources once Signed-off-by: Andrey Velichkevich <[email protected]> * Add todo for test wrappers Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
…obs (kubeflow#2722) * feat(webhook): Add validation for required containers in replicatedJobs. Signed-off-by: Electronic-Waste <[email protected]> * test(webhook): Add UTs for validation in required containers. Signed-off-by: Electronic-Waste <[email protected]> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <[email protected]> * fix(webhook): add global map & remove launcher check. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
* feat(manager): add controller manager configuration and configmap support Signed-off-by: kapil27 <[email protected]> * refactor: update configmap naming and leader election configuration Signed-off-by: kapil27 <[email protected]> * chore: clean up unused lines in configmap and test files Signed-off-by: kapil27 <[email protected]> --------- Signed-off-by: kapil27 <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]>
kubeflow#2911) * feat(initializer): add s3 model and dataset initializers Signed-off-by: rudeigerc <[email protected]> * chore: refactor with opendal Signed-off-by: rudeigerc <[email protected]> * chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs Signed-off-by: rudeigerc <[email protected]> --------- Signed-off-by: rudeigerc <[email protected]> Co-authored-by: rudeigerc <[email protected]>
…ubeflow#2912) * chore(operator): Use SSA throughout runtime framework Signed-off-by: Antonin Stefanutti <[email protected]> * Fix lint error Signed-off-by: Antonin Stefanutti <[email protected]> * Update go.mod file Signed-off-by: Antonin Stefanutti <[email protected]> --------- Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
…harts (kubeflow#2914) Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
…branch (kubeflow#2917) * feat(manifests): Publish Trainer Helm Charts (kubeflow#2906) * Solve Remaining Error and bugs Signed-off-by: adity1raut <[email protected]> * Solve the confige Signed-off-by: adity1raut <[email protected]> * Update The Suggest Change Signed-off-by: adity1raut <[email protected]> * Update After REview Signed-off-by: adity1raut <[email protected]> * Update the Helm publish action Signed-off-by: Andrey Velichkevich <[email protected]> * Update release doc Signed-off-by: Andrey Velichkevich <[email protected]> * Use 0.0.0 version for master branch Signed-off-by: Andrey Velichkevich <[email protected]> * Update release doc Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: adity1raut <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]> * fix(manifests): Fix Helm charts image name (kubeflow#2915) * fix(manifests): Fix Helm charts image name Signed-off-by: Andrey Velichkevich <[email protected]> * Always insert appVersion to the Chart.yaml file Signed-off-by: Andrey Velichkevich <[email protected]> * Fix comment Signed-off-by: Andrey Velichkevich <[email protected]> * Simplify action Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> * fix(manifests): Remove the default tag from the controller image (kubeflow#2916) * fix(manifests): Remove the default tag from the controller image Signed-off-by: Andrey Velichkevich <[email protected]> * Fix README template Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: adity1raut <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Aditya Raut <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…cache nodes (kubeflow#2920) Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
…#2924) * add local docker training example Signed-off-by: Brian Gallagher <[email protected]> * feat: Adding local execution example notebook Co-authored-by Brian Gallagher <[email protected]> Signed-off-by: Fiona Waters <[email protected]> --------- Signed-off-by: Brian Gallagher <[email protected]> Signed-off-by: Fiona Waters <[email protected]> Co-authored-by: Brian Gallagher <[email protected]> Co-authored-by: Fiona Waters <[email protected]>
…ubeflow#2927) * fix(ci): Fix the Kubeflow SDK installation with Docker Signed-off-by: Andrey Velichkevich <[email protected]> * Uncomment delete job in local Notebooks Signed-off-by: Andrey Velichkevich <[email protected]> * Update .github/workflows/test-e2e.yaml Co-authored-by: Anya Kramar <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Andrey Velichkevich <[email protected]> Co-authored-by: Anya Kramar <[email protected]>
…e and example (kubeflow#2928) Signed-off-by: Akshay Chitneni <[email protected]> Co-authored-by: Akshay Chitneni <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
…T example (kubeflow#2979) Signed-off-by: Antonin Stefanutti <[email protected]> Co-authored-by: Antonin Stefanutti <[email protected]>
* created github workflow for trainer * added workflow dispatcher * updating temp quay token in github * Remove odh-kfto-sdk-notebooks-sync workflow * updated build pipeline to use rhoai docker file * removed pre-build commands from build and publish * added multiarch docker file * fixed typo for multiarch * fixed multiarch file * temporary quay push * reverted local build image testing creds * Update Dockerfile.rhoai * update dockerfile.rhoai to dockerfile.odh * fixed nitpick comments * removed odh-release.yaml
- Add RHOAI specific Dockerfile for Trainer V2 controller image
- Add RHOAI overlay manifests for Trainer V2
- Add custom training runtimes in rhoai overlay
- Update component metadata and controller image to v2.1.0
- Add makefile automated command for trainer-rhoai manifests deployment and cleanup
Signed-off-by: abhijeet-dhumal <[email protected]>
Kubebuilder by default serves metrics on 8443 with tls. Signed-off-by: Rob Bell <[email protected]>
@@ -0,0 +1,6 @@
# MLX libraries.
mlx[cuda]==0.28.0
Check warning
Code scanning / Trivy
mlx: MLX has heap-buffer-overflow in load() Medium
Package: mlx
Installed Version: 0.28.0
Vulnerability CVE-2025-62608
Severity: MEDIUM
Fixed Version: 0.29.4
Link: CVE-2025-62608
@@ -0,0 +1,6 @@
# MLX libraries.
mlx[cuda]==0.28.0
Check warning
Code scanning / Trivy
mlx: MLX has Wild Pointer Dereference in load_gguf() Medium
Package: mlx
Installed Version: 0.28.0
Vulnerability CVE-2025-62609
Severity: MEDIUM
Fixed Version: 0.29.4
Link: CVE-2025-62609
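Both Trivy warnings point at the same pin, `mlx[cuda]==0.28.0`, and both list 0.29.4 as the fixed version. As a rough sketch of how such a pin could be checked against a fixed version, the snippet below parses an exact `==` requirement line and compares numeric version tuples. The helper names (`parse_pin`, `version_tuple`, `is_below_fixed`) are hypothetical and not part of this repository or of Trivy, and the parser only handles simple `name[extras]==X.Y.Z` pins.

```python
import re

# Matches simple exact pins such as "mlx[cuda]==0.28.0".
# Assumption: purely numeric dotted versions, no markers or pre-release tags.
PIN_PATTERN = re.compile(
    r"\s*([A-Za-z0-9_.-]+)(?:\[[^\]]*\])?==([0-9]+(?:\.[0-9]+)*)\s*"
)


def parse_pin(line):
    """Return (package_name, version) for an exact pin, or None otherwise."""
    match = PIN_PATTERN.fullmatch(line)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2)


def version_tuple(version):
    """Convert '0.28.0' into (0, 28, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_below_fixed(line, package, fixed_version):
    """True if the line pins `package` to a version older than `fixed_version`."""
    pin = parse_pin(line)
    if pin is None or pin[0] != package.lower():
        return False
    return version_tuple(pin[1]) < version_tuple(fixed_version)


if __name__ == "__main__":
    # Values taken from the Trivy warnings in this PR.
    print(is_below_fixed("mlx[cuda]==0.28.0", "mlx", "0.29.4"))
```

A check like this only flags the pin; whether bumping to 0.29.4 is compatible with the rest of the runtime image is a separate question for the maintainers.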
sutaakar approved these changes on Nov 24, 2025
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Fixes #
Checklist: