Provide training runtimes from Training operator v1 #37

sutaakar merged 1 commit into opendatahub-io:main
Conversation
Walkthrough

Consolidates runtime manifests and kustomization entries: removes several per-variant CUDA/ROCm manifests, adds image replacement rules and params, and updates several ClusterTrainingRuntime node container image references and one runtime name.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
```yaml
  - training_hub_th03_cuda128_torch28_py312.yaml
  - training_hub.yaml
```
I think we can keep the training hub runtimes too. @briangallagher the custom training hub trainer in the sdk should work with the old v1 runtime image still right?
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
- manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
- manifests/rhoai/runtimes/training_hub.yaml (0 hunks)
- manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (0 hunks)
💤 Files with no reviewable changes (6)
- manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
- manifests/rhoai/runtimes/torch_cuda_251.yaml
- manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
- manifests/rhoai/runtimes/torch_rocm_241.yaml
- manifests/rhoai/runtimes/training_hub.yaml
- manifests/rhoai/runtimes/torch_rocm_251.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: pre-commit
🔇 Additional comments (3)
manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2)
25-25: Verify the updated image tag exists in registry before merging.

The image reference was updated to `quay.io/modh/training:py312-rocm64-torch280` with Python 3.12, ROCm 6.4, and Torch 2.8.0. Ensure this tag is available and functional in the Quay registry, as deployment will fail if it is not accessible.

4-4: Breaking change: Runtime name updated; verify existing references and migration needs.

The runtime name changed to `torch-distributed-rocm`. Confirm whether existing ClusterTrainingJob or deployment references to the old name exist in the codebase, and ensure migration guidance or backward-compatibility measures are documented.

manifests/rhoai/runtimes/torch_distributed.yaml (1)

25-25: Verify image availability and registry migration intent.

The registry changed from `opendatahub` to `modh`, and the image was simplified from `odh-training-th03-cuda128-torch28-py312-rhel9:latest` to `training:py312-cuda128-torch280`. Confirm:

- The `modh` registry is the correct/stable location
- The new image tag exists and is accessible
- The removal of `rhel9` OS specificity is intentional
```yaml
  - torch_distributed.yaml
  - training_hub_th03_cuda128_torch28_py312.yaml
  - training_hub.yaml
  - torch_distributed_rocm.yaml
```
There was a problem hiding this comment.
❓ Verification inconclusive

🏁 Script executed:

```shell
cat -n manifests/rhoai/runtimes/kustomization.yaml
```

Repository: opendatahub-io/trainer
Length of output: 188

🏁 Script executed:

```shell
git log --oneline -20 -- manifests/rhoai/runtimes/kustomization.yaml
```

Repository: opendatahub-io/trainer
Length of output: 188

🏁 Script executed:

```shell
rg -n '(torch_cuda_251|torch_cuda_241|torch_rocm_241|torch_rocm_251|torch_distributed_th03_cuda128_torch28_py312|training_hub_th03_cuda128_torch28_py312|training_hub)' --type=yaml --type=md --type=txt --type=json -g '!manifests/rhoai/runtimes/kustomization.yaml'
```

Repository: opendatahub-io/trainer
Length of output: 188
🌐 Web query:
opendatahub-io/trainer kustomization torch_cuda_251 torch_rocm_241 runtimes documentation
💡 Result:
I found the runtime kustomization and related docs:

- Kubeflow Trainer — manifests/overlays/runtimes (kustomization and runtime YAMLs). This is the upstream kustomize overlay that includes the training runtimes. [1]
- Open Data Hub docs (Distributed workloads / Working with distributed workloads) showing ODH training image runtime tags (examples include training:py311-cuda124-torch251 and training:py311-rocm62-torch251, i.e., PyTorch 2.5.1 CUDA/ROCm builds). See the table and examples in the "Working with distributed workloads" doc. [2]
If you want, I can:
- Open the specific kustomization.yaml and the runtime YAMLs in the kubeflow/trainer repo and paste the torch CUDA/ROCm entries (or the exact image names and newTag values) here.
- Or give step-by-step instructions to apply the kustomization (kubectl kustomize/apply) to install the runtimes in your cluster.
Sources:
[1] kubeflow/trainer — manifests/overlays/runtimes (kustomization). (github.com)
[2] Open Data Hub — Working with distributed workloads (training image runtime examples: training:py311-cuda124-torch251, training:py311-rocm62-torch251). (opendatahub.io)
Citations:
- 1: https://github.com/kubeflow/trainer/tree/master/manifests/overlays/runtimes
- 2: https://opendatahub.io/docs/working-with-distributed-workloads/?utm_source=openai
Major breaking change: 7 runtime variants removed from kustomization.
The kustomization now references only torch_distributed.yaml and torch_distributed_rocm.yaml. This represents a significant consolidation from previous runtime options.
Please validate and clarify:
- Impact assessment: Are there existing deployments or users depending on the removed runtimes?
- Migration path: Should removal be phased with deprecation warnings, or is immediate removal acceptable?
- Documentation: Has a migration guide been added to help users transition to the consolidated options?
- Alternatives: Is consolidation to ROCm + generic distributed the intended future direction?
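For reference, the consolidated state this comment describes would look roughly like the following kustomization. This is a sketch inferred from the review, not the literal file contents:

```yaml
# Sketch of the consolidated runtimes kustomization described above;
# only the two remaining runtime manifests are referenced.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - torch_distributed.yaml
  - torch_distributed_rocm.yaml
```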
🤖 Prompt for AI Agents
In manifests/rhoai/runtimes/kustomization.yaml around lines 4-5, the
kustomization was reduced to only torch_distributed.yaml and
torch_distributed_rocm.yaml, removing seven runtime variants; verify impact by
scanning current deployments and CI/manifests for references to the removed
runtime files and list affected clusters/users, restore the removed entries if
any active consumers exist or add a phased deprecation: re-add the old variants
with a deprecated annotation, emit warnings in deployment tooling for one
release cycle, and remove them only after notice; add a migration guide in docs/
explaining mapping from each removed runtime to the new consolidated options and
update changelog/release notes; if consolidation is intentional and final,
ensure the README/kustomization header documents the rationale and update
automation/templates to use the new defaults.
```diff
 kind: ClusterTrainingRuntime
 metadata:
-  name: torch-cuda-241
+  name: torch-distributed-rocm
```
@briangallagher what's the right name for the rocm training runtime? Your refinement doc only has torch-distributed-rocm6.4-torch28-py312 but no torch-distributed-rocm. Was that deliberate?
@robert-bell We debated adding that and decided not to, just to reduce the runtimes by 1. I think it's fine to add; it should always use the latest image, similar to torch_distributed.
Pull Request Test Coverage Report for Build 20069199743

💛 - Coveralls
Actionable comments posted: 0
🧹 Nitpick comments (2)
manifests/rhoai/kustomization.yaml (1)
18-29: Image replacements are structurally correct and match params/env keys

The new `replacements`:

- Correctly source from `rhoai-config` `data.odh-*` keys added in `params.env`.
- Target the expected `ClusterTrainingRuntime` names and the controller Deployment.
- Use field paths that match the current `replicatedJobs` → `containers[name=node].image` layout.

This cleanly centralizes controller and runtime images behind the config map; just be mindful that any future CRD shape changes will require updating these field paths.
Also applies to: 31-95
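As a rough illustration of the pattern under review, a kustomize `replacements` entry that sources an image from a ConfigMap key and writes it into a runtime's node container might look like this. The exact field paths and key names are assumptions based on the comments above, not the literal file contents:

```yaml
# Hypothetical sketch: replace a runtime's node image from a ConfigMap key.
replacements:
  - source:
      kind: ConfigMap
      name: rhoai-config
      fieldPath: data.odh-torch-distributed-rocm-image
    targets:
      - select:
          kind: ClusterTrainingRuntime
          name: torch-distributed-rocm
        fieldPaths:
          # Assumed nesting; adjust if the CRD shape changes.
          - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image
```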
manifests/rhoai/runtimes/kustomization.yaml (1)

3-8: Runtime resource set aligns with replacements and earlier review feedback

The kustomization now includes both general and specific torch-distributed and training-hub runtimes plus the ROCm variant, matching the runtime YAML filenames and the image replacement rules in the top-level rhoai kustomization. This looks like a coherent consolidated runtime set; just confirm that docs and sample configs are updated to point at these five runtimes.
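Based on the filenames mentioned throughout the review, the five-runtime resource set might look like the following. This is a sketch inferred from the comments, not the literal file:

```yaml
# Sketch of the consolidated five-runtime resource set inferred from the review.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - torch_distributed.yaml
  - torch_distributed_th03_cuda128_torch28_py312.yaml
  - torch_distributed_rocm.yaml
  - training_hub.yaml
  - training_hub_th03_cuda128_torch28_py312.yaml
```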
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- manifests/rhoai/kustomization.yaml (2 hunks)
- manifests/rhoai/params.env (1 hunks)
- manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
- manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
- manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
- manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
- manifests/rhoai/runtimes/torch_rocm_241.yaml
- manifests/rhoai/runtimes/torch_rocm_251.yaml
- manifests/rhoai/runtimes/torch_cuda_251.yaml
🚧 Files skipped from review as they are similar to previous changes (3)
- manifests/rhoai/runtimes/training_hub.yaml
- manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml
- manifests/rhoai/runtimes/torch_distributed.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Generate
- GitHub Check: Test
🔇 Additional comments (3)
manifests/rhoai/params.env (1)

2-6: Params/env wiring for runtime images looks consistent

The new image keys line up with the kustomize `replacements` and the runtime manifests (ROCm vs CUDA) and give a single place to override tags. No issues from a kustomize/consistency perspective.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)

4-4: Runtime name and ROCm image align with kustomize replacements

`metadata.name: torch-distributed-rocm` matches the target selector in `manifests/rhoai/kustomization.yaml`, and the node image matches the default from `odh-torch-distributed-rocm-image`. This keeps the ROCm runtime self-consistent across manifests; just ensure docs/SDK references are updated to this new name where relevant.

Also applies to: 25-25
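For context, a minimal ROCm runtime manifest consistent with this comment might be sketched as follows. The apiVersion and nesting are assumptions based on the Kubeflow Trainer v2 API, and the image is a placeholder that the kustomize replacement would override:

```yaml
# Hypothetical minimal ClusterTrainingRuntime; the image is a placeholder
# that the kustomize replacement overrides from params.env.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-rocm
spec:
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: quay.io/modh/training:py312-rocm64-torch280  # placeholder
```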
manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1)

24-25: Training hub CUDA runtime image matches centralized config

The updated node image is consistent with the `odh-training-hub-th03-cuda128-torch28-py312-image` parameter and the kustomize replacement targeting this runtime, so the CUDA training hub variant is wired correctly for overrides.
```yaml
  - params.yaml

replacements:
  # Replace controller image
```
suggestion: can we use kustomize's `images` to do the transformations? Might be a bit more concise and readable.

I think you can do some stuff with kustomizeconfig to set it up for CRs too. Check out the kustomize docs.
Unfortunately the `images` transformation can't be used, as we process the image references from the params.env file; we don't explicitly declare images (the images currently declared in the training runtimes are placeholders only; should they rather be replaced by an explicit placeholder value?)
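To illustrate the constraint described here: the `images` transformer rewrites literal image names declared in manifests, whereas this setup loads values from `params.env` into a ConfigMap and copies them out with `replacements`. A sketch of that wiring, with names assumed from the review comments:

```yaml
# Hypothetical: generate the config map from params.env so that
# `replacements` entries can source image values from its data keys.
configMapGenerator:
  - name: rhoai-config
    envs:
      - params.env
```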
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- manifests/rhoai/kustomization.yaml (2 hunks)
- manifests/rhoai/params.env (1 hunks)
- manifests/rhoai/runtimes/kustomization.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_cuda_251.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_distributed.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_distributed_rocm.yaml (2 hunks)
- manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1 hunks)
- manifests/rhoai/runtimes/torch_rocm_241.yaml (0 hunks)
- manifests/rhoai/runtimes/torch_rocm_251.yaml (0 hunks)
- manifests/rhoai/runtimes/training_hub.yaml (1 hunks)
- manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml (1 hunks)
💤 Files with no reviewable changes (3)
- manifests/rhoai/runtimes/torch_cuda_251.yaml
- manifests/rhoai/runtimes/torch_rocm_241.yaml
- manifests/rhoai/runtimes/torch_rocm_251.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- manifests/rhoai/params.env
- manifests/rhoai/kustomization.yaml
- manifests/rhoai/runtimes/torch_distributed.yaml
- manifests/rhoai/runtimes/training_hub.yaml
- manifests/rhoai/runtimes/training_hub_th03_cuda128_torch28_py312.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: pre-commit
- GitHub Check: Test
- GitHub Check: Generate
🔇 Additional comments (3)
manifests/rhoai/runtimes/kustomization.yaml (1)
1-8: Incomplete code snippet provided for comprehensive review.

The AI summary indicates that image replacement rules, controller image references, and v1 field path mappings were added to this kustomization, but the provided code snippet only shows lines 1-8 (the resources section). The image replacement rules mentioned in the summary are not visible in the provided excerpt.

The visible resources section correctly retains `torch_distributed_th03_cuda128_torch28_py312.yaml` and `torch_distributed.yaml` as requested in previous review feedback.

Please provide the complete kustomization.yaml file (including the image replacement rules, params, and imageReplacements sections mentioned in the AI summary) for full validation of:
- Correct image replacement rules targeting the right resources
- Valid v1 field path syntax for ClusterTrainingRuntime updates
- Consistency of image naming across all replacement rules
manifests/rhoai/runtimes/torch_distributed_th03_cuda128_torch28_py312.yaml (1)

25-25: Image reference verified and correct.

The image `quay.io/modh/training:py312-cuda128-torch280` exists in the quay.io registry and is officially supported by Red Hat OpenShift AI. The image contains Python 3.12, CUDA 12.8, and PyTorch 2.8.0, matching the version identifiers in the tag. This change from the floating `:latest` tag to a specific version tag is a best practice improvement.

manifests/rhoai/runtimes/torch_distributed_rocm.yaml (1)
25-25: Image tag `py312-rocm64-torch280` does not exist at quay.io; use `py312-rocm63-torch280` instead.

The image tag specified in this file could not be found in quay.io. Red Hat OpenShift AI documentation lists `quay.io/modh/training:py312-rocm63-torch280` as the available KFTO training image with Python 3.12, PyTorch 2.8.0, and ROCm 6.3. Update the image reference to use ROCm 6.3 instead of the non-existent ROCm 6.4 tag.

Likely an incorrect or invalid review comment.
/lgtm @sutaakar