@alexxfan

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

sutaakar and others added 30 commits November 24, 2025 09:39
* fix(docs): convert commits to list in changelog.py for compatibility

Signed-off-by: kramaranya <[email protected]>

* chore(docs): add Changelog for Trainer v2.0.0-rc.0

Signed-off-by: kramaranya <[email protected]>

---------

Signed-off-by: kramaranya <[email protected]>
…#2685)

* chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update cuda to 12.8

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
…w#2382)

* Add the manifests overlay for Kubeflow Training V2

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: adjust permissions, and format changes

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: rename overlay, adjust event permissions

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: make namespace configurable

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: move standalone, only-manager installation in namespace: kubeflow-system

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>

* Update manifest: add overlay for Kubeflow Platform installation

Signed-off-by: Xinmin Du <[email protected]>

* add permission for pods log read & rm persistentvolumeclaims

Signed-off-by: Xinmin Du <[email protected]>

* create the runtimes before the webhooks

Signed-off-by: Xinmin Du <[email protected]>

* Specify sorting order: fifo

Signed-off-by: Xinmin Du <[email protected]>

* Deploy jobset first

Signed-off-by: Xinmin Du <[email protected]>

* remove edit permissions to runtimes; install runtimes after crds

Signed-off-by: Xinmin Du <[email protected]>

* remove pretraining directory

Signed-off-by: Xinmin Du <[email protected]>

* patch runtimes images

Signed-off-by: Xinmin Du <[email protected]>

* fix: correct image

Signed-off-by: Xinmin Du <[email protected]>

* add image patch for more runtimes

Signed-off-by: Xinmin Du <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>

* role_bind for notebook & profile

Signed-off-by: Xinmin Du <[email protected]>

* fix: reorder images

Signed-off-by: Xinmin Du <[email protected]>

* fix: reuse overlay/manager & runtimes

Signed-off-by: Xinmin Du <[email protected]>

* fix: remove namespace with patch

Signed-off-by: Xinmin Du <[email protected]>

---------

Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Xinmin Du <[email protected]>
Signed-off-by: Du Xinmin <[email protected]>
Co-authored-by: Xinmin Du <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
…ith CTR and TrainJob yaml files (kubeflow#2669)

* chore(manifests): include torchtune runtimes.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifests): Update torchtune runtimes.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): Update mounting path in CTRs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifests): Update output_dir.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifests): Update numProcPerNode to auto.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
…w#2675)

* fix(plugins): fix errors in trainer command mutation of torchtune.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugins): remove config file format suffix.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(test): update UTs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(initializer): Update the workspace of dataset/model initializer.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(plugins): parse nproc_per_node from GPU resource.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(torchtune): Add bitsandbytes dependency in requirements.txt

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(torchtune): Remove unnecessary num_proc_per_node calculation.

Signed-off-by: Electronic-Waste <[email protected]>

* test(torch): Update invalid parameters.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* feat: Mutable PodSpecOverrides for suspended TrainJob

Signed-off-by: Antonin Stefanutti <[email protected]>

* Include @tenzen-y review

Signed-off-by: Antonin Stefanutti <[email protected]>

* Add unit tests

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
* feat(example): Add alpaca-trianjob-yaml.ipynb.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): Update the overview of the torchtune llama3_2 example.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): Update the pvc description.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(example): Add the get the fine-tuned model section.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): Fix some errors.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): fix some errors.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(manifests): Fix debug tag.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): Change PVC creation method to Python SDK.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(example): Remove config load.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* feat: Add schedulingGates to PodSpecOverrides

Signed-off-by: Antonin Stefanutti <[email protected]>

* Change desired job to target job in PodSpecOverrides comments

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
* fix(module): Change Go module name to v2

Signed-off-by: Andrey Velichkevich <[email protected]>

* Bump x/net to v0.38.0

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* chore(docs): Add Changelog for v2.0.0-rc.1

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move example to misc

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* Add Red Hat to ADOPTERS.md

Signed-off-by: Yuan Tang <[email protected]>

* Update ADOPTERS.md

Signed-off-by: Yuan Tang <[email protected]>

---------

Signed-off-by: Yuan Tang <[email protected]>
* chore(ci): Add GitHub action to verify PR titles

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use operator scope

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add examples scope

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add scripts to scope

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add exporter

Signed-off-by: Andrey Velichkevich <[email protected]>

* add wip ignore label

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add PR title to the contrib guide

Signed-off-by: Andrey Velichkevich <[email protected]>

* Ignore dependencies label

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix text

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use action only on master branch

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add links for blog post and migration guide

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add links for blog post and website

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(docs): Kubeflow Trainer ROADMAP 2025

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update roadmap

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add issue for Trainer UI

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add issues for MPI and plugin extension

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add issues for builtin trainers

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
akshaychitneni and others added 26 commits November 24, 2025 09:39
…s for data_cache (kubeflow#2890)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
…obs (kubeflow#2653)

* feat(runtime): add support for launcher resource allocation in MPI jobs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add unit tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Set numProcPerNode for MPI plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Move util func to runtime package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix torchtune plugin

Signed-off-by: Andrey Velichkevich <[email protected]>

* Inline if for GPU check

Signed-off-by: Andrey Velichkevich <[email protected]>

* Assign container resources once

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add todo for test wrappers

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
…obs (kubeflow#2722)

* feat(webhook): Add validation for required containers in replicatedJobs.

Signed-off-by: Electronic-Waste <[email protected]>

* test(webhook): Add UTs for validation in required containers.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(webhook): add global map & remove launcher check.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* feat(manager): add controller manager configuration and configmap support

Signed-off-by: kapil27 <[email protected]>

* refactor: update configmap naming and leader election configuration

Signed-off-by: kapil27 <[email protected]>

* chore: clean up unused lines in configmap and test files

Signed-off-by: kapil27 <[email protected]>

---------

Signed-off-by: kapil27 <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
kubeflow#2911)

* feat(initializer): add s3 model and dataset initializers

Signed-off-by: rudeigerc <[email protected]>

* chore: refactor with opendal

Signed-off-by: rudeigerc <[email protected]>

* chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs

Signed-off-by: rudeigerc <[email protected]>

---------

Signed-off-by: rudeigerc <[email protected]>
Co-authored-by: rudeigerc <[email protected]>
…ubeflow#2912)

* chore(operator): Use SSA throughout runtime framework

Signed-off-by: Antonin Stefanutti <[email protected]>

* Fix lint error

Signed-off-by: Antonin Stefanutti <[email protected]>

* Update go.mod file

Signed-off-by: Antonin Stefanutti <[email protected]>

---------

Signed-off-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
…harts (kubeflow#2914)

Signed-off-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
…branch (kubeflow#2917)

* feat(manifests): Publish Trainer Helm Charts (kubeflow#2906)

* Solve Remaining Error and bugs

Signed-off-by: adity1raut <[email protected]>

* Solve the config

Signed-off-by: adity1raut <[email protected]>

* Update The Suggest Change

Signed-off-by: adity1raut <[email protected]>

* Update after review

Signed-off-by: adity1raut <[email protected]>

* Update the Helm publish action

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update release doc

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use 0.0.0 version for master branch

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update release doc

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: adity1raut <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>

* fix(manifests): Fix Helm charts image name (kubeflow#2915)

* fix(manifests): Fix Helm charts image name

Signed-off-by: Andrey Velichkevich <[email protected]>

* Always insert appVersion to the Chart.yaml file

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix comment

Signed-off-by: Andrey Velichkevich <[email protected]>

* Simplify action

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>

* fix(manifests): Remove the default tag from the controller image (kubeflow#2916)

* fix(manifests): Remove the default tag from the controller image

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix README template

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: adity1raut <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Aditya Raut <[email protected]>
…cache nodes (kubeflow#2920)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
…#2924)

* add local docker training example

Signed-off-by: Brian Gallagher <[email protected]>

* feat: Adding local execution example notebook Co-authored-by Brian Gallagher <[email protected]>

Signed-off-by: Fiona Waters <[email protected]>

---------

Signed-off-by: Brian Gallagher <[email protected]>
Signed-off-by: Fiona Waters <[email protected]>
Co-authored-by: Brian Gallagher <[email protected]>
Co-authored-by: Fiona Waters <[email protected]>
…ubeflow#2927)

* fix(ci): Fix the Kubeflow SDK installation with Docker

Signed-off-by: Andrey Velichkevich <[email protected]>

* Uncomment delete job in local Notebooks

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update .github/workflows/test-e2e.yaml

Co-authored-by: Anya Kramar <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Anya Kramar <[email protected]>
…e and example (kubeflow#2928)

Signed-off-by: Akshay Chitneni <[email protected]>
Co-authored-by: Akshay Chitneni <[email protected]>
…T example (kubeflow#2979)

Signed-off-by: Antonin Stefanutti <[email protected]>
Co-authored-by: Antonin Stefanutti <[email protected]>
* created github workflow for trainer

* added workflow dispatcher

* updating temp quay token in github

* Remove odh-kfto-sdk-notebooks-sync workflow

* updated build pipeline to use rhoai docker file

* removed pre-build commands from build and publish

* added multiarch docker file

* fixed typo for multiarch

* fixed multiarch file

* temporary quay push

* reverted local build image testing creds

* Update Dockerfile.rhoai

* update dockerfile.rhoai to dockerfile.odh

* fixed nitpick comments

* removed odh-release.yaml
    - Add RHOAI specific Dockerfile for Trainer V2 controller image
    - Add RHOAI overlay manifests for Trainer V2
    - Add custom training runtimes in rhoai overlay
    - Update component metadata and controller image to v2.1.0
    - Add makefile automated command for trainer-rhoai manifests deployment and cleanup

Signed-off-by: abhijeet-dhumal <[email protected]>
By default, Kubebuilder serves metrics on port 8443 with TLS.

Signed-off-by: Rob Bell <[email protected]>
@@ -0,0 +1,6 @@
# MLX libraries.
mlx[cuda]==0.28.0

Check warning

Code scanning / Trivy

mlx: MLX has heap-buffer-overflow in load() Medium

Package: mlx
Installed Version: 0.28.0
Vulnerability CVE-2025-62608
Severity: MEDIUM
Fixed Version: 0.29.4
Link: CVE-2025-62608
@@ -0,0 +1,6 @@
# MLX libraries.
mlx[cuda]==0.28.0

Check warning

Code scanning / Trivy

mlx: MLX has Wild Pointer Dereference in load_gguf() Medium

Package: mlx
Installed Version: 0.28.0
Vulnerability CVE-2025-62609
Severity: MEDIUM
Fixed Version: 0.29.4
Link: CVE-2025-62609
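
Both Trivy findings above point at the same pin, `mlx[cuda]==0.28.0`, and report the same fixed version, 0.29.4. A minimal sketch of how such a pin could be checked against a fixed version is shown below. The helper names (`parse_version`, `pinned_version`, `is_below_fixed`) are hypothetical and are not part of Trivy or this repository, and the comparison assumes plain dotted numeric versions rather than the full PEP 440 grammar.

```python
import re


def parse_version(text):
    """Turn a dotted release string such as '0.28.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def pinned_version(requirement_line):
    """Return the version from an exact pin like 'mlx[cuda]==0.28.0', or None."""
    match = re.search(r"==\s*([0-9]+(?:\.[0-9]+)*)", requirement_line)
    return match.group(1) if match else None


def is_below_fixed(requirement_line, fixed_version):
    """Report whether an exact pin is older than the version that fixes a CVE.

    Lines without an exact '==' pin (comments, ranges) return False because
    this sketch cannot decide them.
    """
    pinned = pinned_version(requirement_line)
    if pinned is None:
        return False
    return parse_version(pinned) < parse_version(fixed_version)


if __name__ == "__main__":
    # Values taken from the Trivy annotations above.
    line = "mlx[cuda]==0.28.0"
    fixed = "0.29.4"
    print(line, "is below", fixed, ":", is_below_fixed(line, fixed))
```

Under these assumptions, raising the pin to 0.29.4 or later would clear both findings, since each annotation lists 0.29.4 as its fixed version.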
@alexxfan alexxfan merged commit a712678 into rhoai-3.2 Nov 24, 2025
12 of 18 checks passed
@alexxfan alexxfan deleted the rebase-3.2 branch November 24, 2025 10:36