Provide training runtimes from Training operator v1 #37
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1 +1,3 @@ | ||
| odh-kubeflow-trainer-controller-image=quay.io/opendatahub/trainer:v2.1.0 | ||
| odh-training-cuda128-torch28-py312-image=quay.io/modh/training:py312-cuda128-torch280 | ||
| odh-training-rocm64-torch28-py312-image=quay.io/modh/training:py312-rocm64-torch280 | ||
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,7 @@ | ||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: ClusterTrainingRuntime | ||
| metadata: | ||
| name: torch-cuda-241 | ||
| name: torch-distributed-rocm | ||
|
Collaborator
@briangallagher what's the right name for the ROCm training runtime? Your refinement doc only has […]

@robert-bell We debated adding that and decided not to, just to reduce the number of runtimes by one. I think it's fine to add […]; it should always use the latest image, similar to torch_distributed.
coderabbitai[bot] marked this conversation as resolved.
| labels: | ||
| trainer.kubeflow.org/framework: torch | ||
| spec: | ||
| @@ -22,4 +22,4 @@ spec: | ||
| spec: | ||
| containers: | ||
| - name: node | ||
| image: quay.io/modh/training:py311-cuda121-torch241 | ||
| image: quay.io/modh/training:py312-rocm64-torch280 | ||
This file was deleted.
This file was deleted.
suggestion: can we use kustomize's `images` to do the transformations? Might be a bit more concise and readable. I think you can do some stuff with kustomizeconfig to set it up for CRs too. Check out the kustomize docs.
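For illustration, a rough sketch of what that suggestion could look like. The file name `kustomizeconfig.yaml`, the resource file name, and the field path into `ClusterTrainingRuntime` are assumptions and have not been checked against this repo's manifests. Note that `images` matches by image name, so every runtime using `quay.io/modh/training` would get the same tag.

```yaml
# kustomization.yaml (sketch; file names are hypothetical)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- torch_distributed_rocm.yaml   # hypothetical file name
configurations:
- kustomizeconfig.yaml          # teaches the images transformer about the CR
images:
- name: quay.io/modh/training
  newTag: py312-rocm64-torch280
---
# kustomizeconfig.yaml (sketch)
# Assumed path: the runtime wraps a JobSet template, so containers are
# expected under replicatedJobs[] -> template -> spec -> template -> spec.
images:
- kind: ClusterTrainingRuntime
  path: spec/template/spec/replicatedJobs[]/template/spec/template/spec/containers[]/image
```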
Unfortunately, the `images` transformation can't be used because we process the image references from the `params.env` file; we don't explicitly declare images. (The images currently declared in the training runtimes are placeholders only. Should they be replaced by an explicit placeholder value instead?)
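One way a `params.env`-driven substitution can be expressed in plain kustomize is with `configMapGenerator` plus `replacements`. This is only a sketch under assumptions and may not match how this repo actually processes `params.env`: the ConfigMap name `trainer-params` and the resource file name are hypothetical, and the target field path assumes the container named `node` sits under a JobSet-style `replicatedJobs` list.

```yaml
# kustomization.yaml (sketch, not this repo's actual wiring)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- torch_distributed_rocm.yaml   # hypothetical file name
configMapGenerator:
- name: trainer-params          # hypothetical name
  envs:
  - params.env                  # key=value image references, as in this PR
replacements:
- source:
    kind: ConfigMap
    name: trainer-params
    fieldPath: data.odh-training-rocm64-torch28-py312-image
  targets:
  - select:
      kind: ClusterTrainingRuntime
      name: torch-distributed-rocm
    fieldPaths:
    # Assumed path; '*' matches every replicated job, [name=node] the container.
    - spec.template.spec.replicatedJobs.*.template.spec.template.spec.containers.[name=node].image
```

With this shape the image written in the runtime manifest is overwritten at build time, which fits the idea of keeping an explicit placeholder value there.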