
Provide training runtimes from Training operator v1#37

Merged
sutaakar merged 1 commit into opendatahub-io:main from sutaakar:runtimes
Dec 9, 2025

Conversation


@sutaakar sutaakar commented Dec 5, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • Chores
    • Consolidated multiple runtime entries into a single ROCm-based distributed runtime.
    • Removed several legacy runtime manifests.
    • Added image replacement rules and new image parameters for runtime deployments.
    • Updated container images across distributed and training-hub runtimes to newer PyTorch/ROCm builds.


coderabbitai bot commented Dec 5, 2025

Walkthrough

Consolidates runtime manifests and kustomization entries: removes several per-variant CUDA/ROCm manifests, adds image replacement rules and params, and updates several ClusterTrainingRuntime node container image references and one runtime name.

Changes

Cohort / File(s) — Summary

  • Kustomization & params (manifests/rhoai/kustomization.yaml, manifests/rhoai/params.env): Added kustomize image replacement rules for multiple runtimes and the controller image; added two new image parameters in params.env for the CUDA and ROCm runtime images.
  • Runtime resources list (manifests/rhoai/runtimes/kustomization.yaml): Replaced several per-variant resource entries with a single torch_distributed_rocm.yaml entry (removed the separate torch_cuda_241/251 and torch_rocm_241/251 entries).
  • Removed runtime manifests (manifests/rhoai/runtimes/torch_cuda_251.yaml, torch_rocm_241.yaml, torch_rocm_251.yaml): Deleted three ClusterTrainingRuntime manifest files for specific CUDA/ROCm variants.
  • Updated runtime images & metadata (manifests/rhoai/runtimes/torch_distributed.yaml, torch_distributed_rocm.yaml, torch_distributed_th03_cuda128_torch28_py312.yaml, training_hub.yaml, training_hub_th03_cuda128_torch28_py312.yaml): Replaced node/container image references with quay.io/modh/training:... variants (e.g., py312-cuda128-torch280, py312-rocm64-torch280); renamed metadata.name in the ROCm runtime to torch-distributed-rocm.
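The image replacement rules summarized above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual diff: the ConfigMap name (rhoai-config), the params.env key, and the exact field path into the ClusterTrainingRuntime are assumptions inferred from the review comments.

```yaml
# Hypothetical sketch of a kustomize image replacement sourced from params.env.
# Names, keys, and field paths are illustrative and may differ from the PR.
configMapGenerator:
  - name: rhoai-config
    envs:
      - params.env
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config
      fieldPath: data.odh-torch-distributed-rocm-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: torch-distributed-rocm
        fieldPaths:
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image
```

With this wiring, overriding a tag only requires editing params.env; kustomize propagates the value into every targeted runtime at build time.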

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the new image tags exist and match expected architectures (CUDA vs ROCm).
  • Confirm kustomize image replacement fieldPaths correctly target ClusterTrainingRuntime replicatedJobs node image fields.
  • Search repo for references to deleted manifest names to avoid dangling references.
  • Check the torch-distributed-rocm name for collisions with existing resources.

Poem

🐰 I hopped through manifests at morning light,
Swapped images, pruned the branches neat and bright,
ROCm and CUDA now tidy and lean,
One file fewer in the runtime scene,
Nibbles of change — a rabbit's tiny delight. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title 'Provide training runtimes from Training operator v1' directly and clearly describes the main changes: consolidating and updating Kubernetes ClusterTrainingRuntime manifests to use the v1 Training operator with updated container images and new runtime configurations.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


Comment on lines 10 to 11
- training_hub_th03_cuda128_torch28_py312.yaml
- training_hub.yaml

I think we can keep the training hub runtimes too. @briangallagher, the custom training hub trainer in the SDK should still work with the old v1 runtime image, right?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ddd0ab and d643433.

📒 Files selected for processing (9)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (0 hunks)
💤 Files with no reviewable changes (6)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: pre-commit
🔇 Additional comments (3)
manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2)

25-25: Verify the updated image tag exists in registry before merging.

The image reference was updated to quay.io/modh/training:py312-rocm64-torch280 with Python 3.12, ROCm 6.4, and Torch 2.8.0. Ensure this tag is available and functional in the Quay registry, as deployment will fail if it is not accessible.


4-4: Breaking change: Runtime name updated; verify existing references and migration needs.

The runtime name changed to torch-distributed-rocm. Confirm whether existing ClusterTrainingJob or deployment references to the old name exist in the codebase, and ensure migration guidance or backward-compatibility measures are documented.

manifests/rhoai/runtimes/torch_distributed.yaml (1)

25-25: Verify image availability and registry migration intent.

Registry changed from opendatahub to modh and image simplified from odh-training-th03-cuda128-torch28-py312-rhel9:latest to training:py312-cuda128-torch280. Confirm:

  • The modh registry is the correct/stable location
  • The new image tag exists and is accessible
  • The removal of rhel9 OS specificity is intentional
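The image tags discussed in this review encode the stack versions directly (e.g. py312-cuda128-torch280 ⇒ Python 3.12, CUDA 12.8, Torch 2.8.0). A small sketch of how such tags could be decoded when auditing manifests; the naming scheme is inferred from the tags seen in this PR and is not an official contract of the quay.io/modh/training repository:

```python
import re

# Tag pattern inferred from tags like py312-cuda128-torch280; NOT an official
# contract of the quay.io/modh/training repository.
TAG_RE = re.compile(r"py(\d)(\d{2})-(cuda|rocm)(\d+)(\d)-torch(\d)(\d)(\d)")

def parse_training_tag(tag: str) -> dict:
    """Decode a training-image tag into its Python/accelerator/Torch versions."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return {
        "python": f"{m.group(1)}.{m.group(2)}",
        "accelerator": f"{m.group(3)} {m.group(4)}.{m.group(5)}",
        "torch": f"{m.group(6)}.{m.group(7)}.{m.group(8)}",
    }
```

A check like this makes it easy to assert that a manifest's tag matches the accelerator family (CUDA vs ROCm) the runtime claims to target.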

Comment on lines 4 to 5
- torch_distributed.yaml
- training_hub_th03_cuda128_torch28_py312.yaml
- training_hub.yaml
- torch_distributed_rocm.yaml

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

🏁 Script executed:

cat -n manifests/rhoai/runtimes/kustomization.yaml

Repository: opendatahub-io/trainer

Length of output: 188


🏁 Script executed:

git log --oneline -20 -- manifests/rhoai/runtimes/kustomization.yaml

Repository: opendatahub-io/trainer

Length of output: 188


🏁 Script executed:

rg -n '(torch_cuda_251|torch_cuda_241|torch_rocm_241|torch_rocm_251|torch_distributed_th03_cuda128_torch28_py312|training_hub_th03_cuda128_torch28_py312|training_hub)' --type=yaml --type=md --type=txt --type=json -g '!manifests/rhoai/runtimes/kustomization.yaml'

Repository: opendatahub-io/trainer

Length of output: 188


🌐 Web query:

opendatahub-io/trainer kustomization torch_cuda_251 torch_rocm_241 runtimes documentation

💡 Result:

I found the runtime kustomization and related docs:

  • Kubeflow Trainer — manifests/overlays/runtimes (kustomization and runtime YAMLs). This is the upstream kustomize overlay that includes the training runtimes. [1]

  • Open Data Hub docs (Distributed workloads / Working with distributed workloads) showing ODH training image runtime tags (examples include training:py311-cuda124-torch251 and training:py311-rocm62-torch251 — i.e., PyTorch 2.5.1 CUDA/ROCm builds). See the table and examples in the “Working with distributed workloads” doc. [2]

If you want, I can:

  • Open the specific kustomization.yaml and the runtime YAMLs in the kubeflow/trainer repo and paste the torch CUDA/ROCm entries (or the exact image names and newTag values) here.
  • Or give step-by-step instructions to apply the kustomization (kubectl kustomize/apply) to install the runtimes in your cluster.

Sources:
[1] kubeflow/trainer — manifests/overlays/runtimes (kustomization). (github.com)
[2] Open Data Hub — Working with distributed workloads (training image runtime examples: training:py311-cuda124-torch251, training:py311-rocm62-torch251). (opendatahub.io)


Major breaking change: 7 runtime variants removed from kustomization.

The kustomization now references only torch_distributed.yaml and torch_distributed_rocm.yaml. This represents a significant consolidation from previous runtime options.

Please validate and clarify:

  1. Impact assessment: Are there existing deployments or users depending on the removed runtimes?
  2. Migration path: Should removal be phased with deprecation warnings, or is immediate removal acceptable?
  3. Documentation: Has a migration guide been added to help users transition to the consolidated options?
  4. Alternatives: Is consolidation to ROCm + generic distributed the intended future direction?
🤖 Prompt for AI Agents
In manifests/rhoai/runtimes/kustomization.yaml around lines 4-5, the
kustomization was reduced to only torch_distributed.yaml and
torch_distributed_rocm.yaml, removing seven runtime variants; verify impact by
scanning current deployments and CI/manifests for references to the removed
runtime files and list affected clusters/users, restore the removed entries if
any active consumers exist or add a phased deprecation: re-add the old variants
with a deprecated annotation, emit warnings in deployment tooling for one
release cycle, and remove them only after notice; add a migration guide in docs/
explaining mapping from each removed runtime to the new consolidated options and
update changelog/release notes; if consolidation is intentional and final,
ensure the README/kustomization header documents the rationale and update
automation/templates to use the new defaults.
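The scan step described in the prompt above (search current deployments and manifests for references to the removed runtime files) can be sketched in a few lines. The directory layout below is a throwaway demo, not the actual repository:

```python
import tempfile
from pathlib import Path

# Manifest names removed in this PR (per the review summary).
REMOVED = ["torch_cuda_251.yaml", "torch_rocm_241.yaml", "torch_rocm_251.yaml"]

def find_dangling_refs(root: Path) -> dict:
    """Map each removed manifest name to the YAML files that still mention it."""
    hits = {}
    for path in root.rglob("*.yaml"):
        text = path.read_text()
        for name in REMOVED:
            if name in text:
                hits.setdefault(name, []).append(path.name)
    return hits

# Throwaway demo tree: one kustomization still listing a deleted manifest.
demo = Path(tempfile.mkdtemp())
(demo / "kustomization.yaml").write_text("resources:\n- torch_rocm_251.yaml\n")
dangling = find_dangling_refs(demo)
```

Any non-empty result would flag files that need updating before the removal lands.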

kind: ClusterTrainingRuntime
metadata:
-  name: torch-cuda-241
+  name: torch-distributed-rocm

@briangallagher what's the right name for the rocm training runtime? Your refinement doc only has torch-distributed-rocm6.4-torch28-py312 but no torch-distributed-rocm. Was that deliberate?


@robert-bell We debated adding that and decided not to, just to reduce the runtimes by 1. I think it's fine to add; it should always use the latest image, similar to torch_distributed.

@sutaakar sutaakar marked this pull request as draft December 5, 2025 15:29
@sutaakar sutaakar marked this pull request as ready for review December 9, 2025 13:36

coveralls commented Dec 9, 2025

Pull Request Test Coverage Report for Build 20069199743

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 53.654%

Totals
  • Change from base Build 20063456671: 0.0%
  • Covered Lines: 1564
  • Relevant Lines: 2915
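The reported 53.654% follows directly from the line counts above; a quick arithmetic check:

```python
# Overall coverage = covered lines / relevant lines, as a percentage.
covered, relevant = 1564, 2915
coverage_pct = round(100 * covered / relevant, 3)
```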

💛 - Coveralls

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
manifests/rhoai/kustomization.yaml (1)

18-29: Image replacements are structurally correct and match params/env keys

The new replacements:

  • Correctly source from rhoai-config data.odh-* keys added in params.env.
  • Target the expected ClusterTrainingRuntime names and the controller Deployment.
  • Use field paths that match the current replicatedJobs → containers[name=node].image layout.

This cleanly centralizes controller and runtime images behind the config map; just be mindful that any future CRD shape changes will require updating these field paths.

Also applies to: 31-95

manifests/rhoai/runtimes/kustomization.yaml (1)

3-8: Runtime resource set aligns with replacements and earlier review feedback

The kustomization now includes both general and specific torch‑distributed and training‑hub runtimes plus the ROCm variant, matching the runtime YAML filenames and the image replacement rules in the top‑level rhoai kustomization. This looks like a coherent consolidated runtime set; just confirm that docs and sample configs are updated to point at these five runtimes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d643433 and ca17ace.

📒 Files selected for processing (11)
  • manifests/rhoai/kustomization.yaml (2 hunks)
  • manifests/rhoai/params.env (1 hunks)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_distributed.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Generate
  • GitHub Check: Test
🔇 Additional comments (3)
manifests/rhoai/params.env (1)

2-6: Params/env wiring for runtime images looks consistent

The new image keys line up with the kustomize replacements and the runtime manifests (ROCm vs CUDA) and give a single place to override tags. No issues from a kustomize/consistency perspective.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)

4-4: Runtime name and ROCm image align with kustomize replacements

metadata.name: torch-distributed-rocm matches the target selector in manifests/rhoai/kustomization.yaml, and the node image matches the default from odh-torch-distributed-rocm-image. This keeps the ROCm runtime self‑consistent across manifests; just ensure docs/SDK references are updated to this new name where relevant.

Also applies to: 25-25

manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1)

24-25: Training hub CUDA runtime image matches centralized config

The updated node image is consistent with the odh-training-hub-th03-cuda128-torch28-py312-image parameter and the kustomize replacement targeting this runtime, so the CUDA training hub variant is wired correctly for overrides.

- params.yaml

replacements:
# Replace controller image

suggestion: can we use kustomize's images transformer to do the transformations? Might be a bit more concise and readable.

I think you can also do some stuff with kustomizeconfig to set it up for CRs too. Check out the kustomize docs.

Collaborator Author

Unfortunately, the images transformation can't be used because we process the image references from the params.env file; we don't explicitly declare images. (The images currently declared in the training runtimes are placeholders only; should they instead be replaced by an explicit placeholder value?)
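The trade-off described here can be illustrated side by side. This is a hedged sketch; the ConfigMap name, params key, and field path are assumptions, not the PR's actual values:

```yaml
# (a) The suggested kustomize `images` transformer: works only when the
#     image name/tag is known statically in the kustomization itself.
images:
  - name: quay.io/modh/training
    newTag: py312-cuda128-torch280

# (b) What this PR needs: the tag lives in params.env (via a generated
#     ConfigMap), so only a `replacements` rule can read it at build time.
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config                            # name is an assumption
      fieldPath: data.odh-torch-distributed-image   # key is an assumption
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: torch-distributed
        fieldPaths:
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image  # path is an assumption
```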

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca17ace and 0dea830.

📒 Files selected for processing (11)
  • manifests/rhoai/kustomization.yaml (2 hunks)
  • manifests/rhoai/params.env (1 hunks)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • manifests/rhoai/params.env
  • manifests/rhoai/kustomization.yaml
  • manifests/rhoai/runtimes/torch_distributed.yaml
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (3)
manifests/rhoai/runtimes/kustomization.yaml (1)

1-8: Incomplete code snippet provided for comprehensive review.

The AI summary indicates that image replacement rules, controller image references, and v1 field path mappings were added to this kustomization, but the provided code snippet only shows lines 1-8 (the resources section). The image replacement rules mentioned in the summary are not visible in the provided excerpt.

The visible resources section correctly retains torch_distributed_th03_cuda128_torch28_py312.yaml and torch_distributed.yaml as requested in previous review feedback.

Please provide the complete kustomization.yaml file (including the image replacement rules, params, and imageReplacements sections mentioned in the AI summary) for full validation of:

  • Correct image replacement rules targeting the right resources
  • Valid v1 field path syntax for ClusterTrainingRuntime updates
  • Consistency of image naming across all replacement rules
manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1)

25-25: Image reference verified and correct.

The image quay.io/modh/training:py312-cuda128-torch280 exists in the quay.io registry and is officially supported by Red Hat OpenShift AI. The image contains Python 3.12, CUDA 12.8, and PyTorch 2.8.0, matching the version identifiers in the tag. This change from the floating :latest tag to a specific version tag is a best practice improvement.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)

25-25: Image tag py312-rocm64-torch280 does not exist at quay.io — use py312-rocm63-torch280 instead.

The image tag specified in this file could not be found in quay.io. Red Hat OpenShift AI documentation lists quay.io/modh/training:py312-rocm63-torch280 as the available KFTO training image with Python 3.12, PyTorch 2.8.0, and ROCm 6.3. Update the image reference to use ROCm 6.3 instead of the non-existent ROCm 6.4 tag.

Likely an incorrect or invalid review comment.

@briangallagher

/lgtm @sutaakar

@sutaakar sutaakar merged commit 6fea7ab into opendatahub-io:main Dec 9, 2025
9 checks passed
@sutaakar sutaakar deleted the runtimes branch December 9, 2025 16:26

Labels

None yet

Projects

None yet


4 participants