
Provide training runtimes from Training operator v1#37

Merged
sutaakar merged 1 commit into opendatahub-io:main from sutaakar:runtimes
Dec 9, 2025

Conversation


@sutaakar sutaakar commented Dec 5, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

  • Chores
    • Consolidated multiple runtime entries into a single ROCm-based distributed runtime.
    • Removed several legacy runtime manifests.
    • Added image replacement rules and new image parameters for runtime deployments.
    • Updated container images across distributed and training-hub runtimes to newer PyTorch/ROCm builds.


coderabbitai bot commented Dec 5, 2025

Walkthrough

Consolidates runtime manifests and kustomization entries: removes several per-variant CUDA/ROCm manifests, adds image replacement rules and params, and updates several ClusterTrainingRuntime node container image references and one runtime name.

Changes

Cohort / File(s) — Summary

  • Kustomization & params (manifests/rhoai/kustomization.yaml, manifests/rhoai/params.env): Added kustomize image replacement rules for multiple runtimes and the controller image; added two new image parameters in params.env for the CUDA and ROCm runtime images.
  • Runtime resources list (manifests/rhoai/runtimes/kustomization.yaml): Replaced several per-variant resource entries with a single torch_distributed_rocm.yaml entry (removed the separate torch_cuda_241/251 and torch_rocm_241/251 entries).
  • Removed runtime manifests (manifests/rhoai/runtimes/torch_cuda_251.yaml, torch_rocm_241.yaml, torch_rocm_251.yaml): Deleted three ClusterTrainingRuntime manifest files for specific CUDA/ROCm variants.
  • Updated runtime images & metadata (manifests/rhoai/runtimes/torch_distributed.yaml, torch_distributed_rocm.yaml, torch_distributed_th03_cuda128_torch28_py312.yaml, training_hub.yaml, training_hub_th03_cuda128_torch28_py312.yaml): Replaced node/container image references with quay.io/modh/training:... variants (e.g., py312-cuda128-torch280, py312-rocm64-torch280); renamed metadata.name in the ROCm runtime to torch-distributed-rocm.
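The image replacement rules summarized above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual diff: the ConfigMap name (rhoai-config), the params.env key, and the exact field path into the ClusterTrainingRuntime are assumptions inferred from the review comments.

```yaml
# Hypothetical sketch of a kustomize image replacement sourced from params.env.
# Names, keys, and field paths are illustrative and may differ from the PR.
configMapGenerator:
  - name: rhoai-config
    envs:
      - params.env
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config
      fieldPath: data.odh-torch-distributed-rocm-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: torch-distributed-rocm
        fieldPaths:
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image
```

With this wiring, overriding a tag only requires editing params.env; kustomize propagates the value into every targeted runtime at build time.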

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify the new image tags exist and match expected architectures (CUDA vs ROCm).
  • Confirm kustomize image replacement fieldPaths correctly target ClusterTrainingRuntime replicatedJobs node image fields.
  • Search repo for references to deleted manifest names to avoid dangling references.
  • Check the torch-distributed-rocm name for collisions with existing resources.

Poem

🐰 I hopped through manifests at morning light,
Swapped images, pruned the branches neat and bright,
ROCm and CUDA now tidy and lean,
One file fewer in the runtime scene,
Nibbles of change — a rabbit's tiny delight. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title 'Provide training runtimes from Training operator v1' directly and clearly describes the main changes: consolidating and updating Kubernetes ClusterTrainingRuntime manifests to use the v1 Training operator with updated container images and new runtime configurations.
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.


Comment on lines 10 to 11
- training_hub_th03_cuda128_torch28_py312.yaml
- training_hub.yaml

I think we can keep the training hub runtimes too. @briangallagher, the custom training hub trainer in the SDK should still work with the old v1 runtime image, right?

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ddd0ab and d643433.

📒 Files selected for processing (9)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (0 hunks)
💤 Files with no reviewable changes (6)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: pre-commit
🔇 Additional comments (3)
manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2)

25-25: Verify the updated image tag exists in registry before merging.

The image reference was updated to quay.io/modh/training:py312-rocm64-torch280 with Python 3.12, ROCm 6.4, and Torch 2.8.0. Ensure this tag is available and functional in the Quay registry, as deployment will fail if it is not accessible.


4-4: Breaking change: Runtime name updated; verify existing references and migration needs.

The runtime name changed to torch-distributed-rocm. Confirm whether existing ClusterTrainingJob or deployment references to the old name exist in the codebase, and ensure migration guidance or backward-compatibility measures are documented.

manifests/rhoai/runtimes/torch_distributed.yaml (1)

25-25: Verify image availability and registry migration intent.

Registry changed from opendatahub to modh and image simplified from odh-training-th03-cuda128-torch28-py312-rhel9:latest to training:py312-cuda128-torch280. Confirm:

  • The modh registry is the correct/stable location
  • The new image tag exists and is accessible
  • The removal of rhel9 OS specificity is intentional
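The image tags discussed in this review encode the stack versions directly (e.g. py312-cuda128-torch280 ⇒ Python 3.12, CUDA 12.8, Torch 2.8.0). A small sketch of how such tags could be decoded when auditing manifests; the naming scheme is inferred from the tags seen in this PR and is not an official contract of the quay.io/modh/training repository:

```python
import re

# Tag pattern inferred from tags like py312-cuda128-torch280; NOT an official
# contract of the quay.io/modh/training repository.
TAG_RE = re.compile(r"py(\d)(\d{2})-(cuda|rocm)(\d+)(\d)-torch(\d)(\d)(\d)")

def parse_training_tag(tag: str) -> dict:
    """Decode a training-image tag into its Python/accelerator/Torch versions."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return {
        "python": f"{m.group(1)}.{m.group(2)}",
        "accelerator": f"{m.group(3)} {m.group(4)}.{m.group(5)}",
        "torch": f"{m.group(6)}.{m.group(7)}.{m.group(8)}",
    }
```

A check like this makes it easy to assert that a manifest's tag matches the accelerator family (CUDA vs ROCm) the runtime claims to target.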

Comment on lines 4 to 5
- torch_distributed.yaml
- training_hub_th03_cuda128_torch28_py312.yaml
- training_hub.yaml
- torch_distributed_rocm.yaml

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

🏁 Script executed:

cat -n manifests/rhoai/runtimes/kustomization.yaml

Repository: opendatahub-io/trainer

Length of output: 188


🏁 Script executed:

git log --oneline -20 -- manifests/rhoai/runtimes/kustomization.yaml

Repository: opendatahub-io/trainer

Length of output: 188


🏁 Script executed:

rg -n '(torch_cuda_251|torch_cuda_241|torch_rocm_241|torch_rocm_251|torch_distributed_th03_cuda128_torch28_py312|training_hub_th03_cuda128_torch28_py312|training_hub)' --type=yaml --type=md --type=txt --type=json -g '!manifests/rhoai/runtimes/kustomization.yaml'

Repository: opendatahub-io/trainer

Length of output: 188


🌐 Web query:

opendatahub-io/trainer kustomization torch_cuda_251 torch_rocm_241 runtimes documentation

💡 Result:

I found the runtime kustomization and related docs:

  • Kubeflow Trainer — manifests/overlays/runtimes (kustomization and runtime YAMLs). This is the upstream kustomize overlay that includes the training runtimes. [1]

  • Open Data Hub docs (Distributed workloads / Working with distributed workloads) showing ODH training image runtime tags (examples include training:py311-cuda124-torch251 and training:py311-rocm62-torch251 — i.e., PyTorch 2.5.1 CUDA/ROCm builds). See the table and examples in the “Working with distributed workloads” doc. [2]

If you want, I can:

  • Open the specific kustomization.yaml and the runtime YAMLs in the kubeflow/trainer repo and paste the torch CUDA/ROCm entries (or the exact image names and newTag values) here.
  • Or give step-by-step instructions to apply the kustomization (kubectl kustomize/apply) to install the runtimes in your cluster.

Sources:
[1] kubeflow/trainer — manifests/overlays/runtimes (kustomization). (github.com)
[2] Open Data Hub — Working with distributed workloads (training image runtime examples: training:py311-cuda124-torch251, training:py311-rocm62-torch251). (opendatahub.io)


Major breaking change: 7 runtime variants removed from kustomization.

The kustomization now references only torch_distributed.yaml and torch_distributed_rocm.yaml. This represents a significant consolidation from previous runtime options.

Please validate and clarify:

  1. Impact assessment: Are there existing deployments or users depending on the removed runtimes?
  2. Migration path: Should removal be phased with deprecation warnings, or is immediate removal acceptable?
  3. Documentation: Has a migration guide been added to help users transition to the consolidated options?
  4. Alternatives: Is consolidation to ROCm + generic distributed the intended future direction?
🤖 Prompt for AI Agents
In manifests/rhoai/runtimes/kustomization.yaml around lines 4-5, the
kustomization was reduced to only torch_distributed.yaml and
torch_distributed_rocm.yaml, removing seven runtime variants; verify impact by
scanning current deployments and CI/manifests for references to the removed
runtime files and list affected clusters/users, restore the removed entries if
any active consumers exist or add a phased deprecation: re-add the old variants
with a deprecated annotation, emit warnings in deployment tooling for one
release cycle, and remove them only after notice; add a migration guide in docs/
explaining mapping from each removed runtime to the new consolidated options and
update changelog/release notes; if consolidation is intentional and final,
ensure the README/kustomization header documents the rationale and update
automation/templates to use the new defaults.
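The scan step described in the prompt above (search current deployments and manifests for references to the removed runtime files) can be sketched in a few lines. The directory layout below is a throwaway demo, not the actual repository:

```python
import tempfile
from pathlib import Path

# Manifest names removed in this PR (per the review summary).
REMOVED = ["torch_cuda_251.yaml", "torch_rocm_241.yaml", "torch_rocm_251.yaml"]

def find_dangling_refs(root: Path) -> dict:
    """Map each removed manifest name to the YAML files that still mention it."""
    hits = {}
    for path in root.rglob("*.yaml"):
        text = path.read_text()
        for name in REMOVED:
            if name in text:
                hits.setdefault(name, []).append(path.name)
    return hits

# Throwaway demo tree: one kustomization still listing a deleted manifest.
demo = Path(tempfile.mkdtemp())
(demo / "kustomization.yaml").write_text("resources:\n- torch_rocm_251.yaml\n")
dangling = find_dangling_refs(demo)
```

Any non-empty result would flag files that need updating before the removal lands.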

kind: ClusterTrainingRuntime
metadata:
-  name: torch-cuda-241
+  name: torch-distributed-rocm

@briangallagher what's the right name for the rocm training runtime? Your refinement doc only has torch-distributed-rocm6.4-torch28-py312 but no torch-distributed-rocm. Was that deliberate?


@robert-bell We debated adding that and decided not to, just to reduce the runtimes by 1. I think it's fine to add; it should always use the latest image, similar to torch_distributed.

@sutaakar sutaakar marked this pull request as draft December 5, 2025 15:29
@sutaakar sutaakar marked this pull request as ready for review December 9, 2025 13:36

coveralls commented Dec 9, 2025

Pull Request Test Coverage Report for Build 20069199743

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 53.654%

Totals
  • Change from base Build 20063456671: 0.0%
  • Covered Lines: 1564
  • Relevant Lines: 2915
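The reported 53.654% follows directly from the line counts above; a quick arithmetic check:

```python
# Overall coverage = covered lines / relevant lines, as a percentage.
covered, relevant = 1564, 2915
coverage_pct = round(100 * covered / relevant, 3)
```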

💛 - Coveralls

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 0

🧹 Nitpick comments (2)
manifests/rhoai/kustomization.yaml (1)

18-29: Image replacements are structurally correct and match params/env keys

The new replacements:

  • Correctly source from rhoai-config data.odh-* keys added in params.env.
  • Target the expected ClusterTrainingRuntime names and the controller Deployment.
  • Use field paths that match the current replicatedJobs → containers[name=node].image layout.

This cleanly centralizes controller and runtime images behind the config map; just be mindful that any future CRD shape changes will require updating these field paths.

Also applies to: 31-95

manifests/rhoai/runtimes/kustomization.yaml (1)

3-8: Runtime resource set aligns with replacements and earlier review feedback

The kustomization now includes both general and specific torch‑distributed and training‑hub runtimes plus the ROCm variant, matching the runtime YAML filenames and the image replacement rules in the top‑level rhoai kustomization. This looks like a coherent consolidated runtime set; just confirm that docs and sample configs are updated to point at these five runtimes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d643433 and ca17ace.

📒 Files selected for processing (11)
  • manifests/rhoai/kustomization.yaml (2 hunks)
  • manifests/rhoai/params.env (1 hunks)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
  • manifests/rhoai/runtimes/torch_distributed.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Generate
  • GitHub Check: Test
🔇 Additional comments (3)
manifests/rhoai/params.env (1)

2-6: Params/env wiring for runtime images looks consistent

The new image keys line up with the kustomize replacements and the runtime manifests (ROCm vs CUDA) and give a single place to override tags. No issues from a kustomize/consistency perspective.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)

4-4: Runtime name and ROCm image align with kustomize replacements

metadata.name: torch-distributed-rocm matches the target selector in manifests/rhoai/kustomization.yaml, and the node image matches the default from odh-torch-distributed-rocm-image. This keeps the ROCm runtime self‑consistent across manifests; just ensure docs/SDK references are updated to this new name where relevant.

Also applies to: 25-25

manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1)

24-25: Training hub CUDA runtime image matches centralized config

The updated node image is consistent with the odh-training-hub-th03-cuda128-torch28-py312-image parameter and the kustomize replacement targeting this runtime, so the CUDA training hub variant is wired correctly for overrides.

- params.yaml

replacements:
# Replace controller image

suggestion: can we use kustomize's images transformer to do the transformations? Might be a bit more concise and readable.

I think you can also do some stuff with kustomizeconfig to set it up for CRs too. Check out the kustomize docs.

Collaborator Author

Unfortunately, the images transformation can't be used because we process the image references from the params.env file; we don't explicitly declare images. (The images currently declared in the training runtimes are placeholders only; should they instead be replaced by an explicit placeholder value?)
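The trade-off described here can be illustrated side by side. This is a hedged sketch; the ConfigMap name, params key, and field path are assumptions, not the PR's actual values:

```yaml
# (a) The suggested kustomize `images` transformer: works only when the
#     image name/tag is known statically in the kustomization itself.
images:
  - name: quay.io/modh/training
    newTag: py312-cuda128-torch280

# (b) What this PR needs: the tag lives in params.env (via a generated
#     ConfigMap), so only a `replacements` rule can read it at build time.
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config                            # name is an assumption
      fieldPath: data.odh-torch-distributed-image   # key is an assumption
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: torch-distributed
        fieldPaths:
          - spec.template.spec.replicatedJobs.0.template.spec.template.spec.containers.0.image  # path is an assumption
```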

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ca17ace and 0dea830.

📒 Files selected for processing (11)
  • manifests/rhoai/kustomization.yaml (2 hunks)
  • manifests/rhoai/params.env (1 hunks)
  • manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
  • manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
  • manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
  • manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
  • manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
  • manifests/rhoai/runtimes/torch_cuda_251.yaml
  • manifests/rhoai/runtimes/torch_rocm_241.yaml
  • manifests/rhoai/runtimes/torch_rocm_251.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • manifests/rhoai/params.env
  • manifests/rhoai/kustomization.yaml
  • manifests/rhoai/runtimes/torch_distributed.yaml
  • manifests/rhoai/runtimes/training_hub.yaml
  • manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: pre-commit
  • GitHub Check: Test
  • GitHub Check: Generate
🔇 Additional comments (3)
manifests/rhoai/runtimes/kustomization.yaml (1)

1-8: Incomplete code snippet provided for comprehensive review.

The AI summary indicates that image replacement rules, controller image references, and v1 field path mappings were added to this kustomization, but the provided code snippet only shows lines 1-8 (the resources section). The image replacement rules mentioned in the summary are not visible in the provided excerpt.

The visible resources section correctly retains torch_distributed_th03_cuda128_torch28_py312.yaml and torch_distributed.yaml as requested in previous review feedback.

Please provide the complete kustomization.yaml file (including the image replacement rules, params, and imageReplacements sections mentioned in the AI summary) for full validation of:

  • Correct image replacement rules targeting the right resources
  • Valid v1 field path syntax for ClusterTrainingRuntime updates
  • Consistency of image naming across all replacement rules
manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1)

25-25: Image reference verified and correct.

The image quay.io/modh/training:py312-cuda128-torch280 exists in the quay.io registry and is officially supported by Red Hat OpenShift AI. The image contains Python 3.12, CUDA 12.8, and PyTorch 2.8.0, matching the version identifiers in the tag. This change from the floating :latest tag to a specific version tag is a best practice improvement.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)

25-25: Image tag py312-rocm64-torch280 does not exist at quay.io — use py312-rocm63-torch280 instead.

The image tag specified in this file could not be found in quay.io. Red Hat OpenShift AI documentation lists quay.io/modh/training:py312-rocm63-torch280 as the available KFTO training image with Python 3.12, PyTorch 2.8.0, and ROCm 6.3. Update the image reference to use ROCm 6.3 instead of the non-existent ROCm 6.4 tag.

Likely an incorrect or invalid review comment.

@briangallagher

/lgtm @sutaakar

@sutaakar sutaakar merged commit 6fea7ab into opendatahub-io:main Dec 9, 2025
9 checks passed
@sutaakar sutaakar deleted the runtimes branch December 9, 2025 16:26

Labels

None yet

Projects

None yet


4 participants