Merged
66 changes: 66 additions & 0 deletions manifests/rhoai/kustomization.yaml
@@ -15,6 +15,7 @@ configurations:
 - params.yaml

 replacements:
+# Replace controller image
Collaborator

suggestion: can we use kustomize's images transformer for these replacements? It might be a bit more concise and readable.

I think you can also do some setup with kustomizeconfig to make it work for CRs too. Check out the kustomize docs.

Collaborator Author

Unfortunately, the images transformer can't be used here because we read the image references from the params.env file rather than declaring images explicitly. (The images currently declared in the training runtimes are placeholders only; should they be replaced by an explicit placeholder value instead?)
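
For context, the alternative raised in this thread might look roughly like the sketch below. It is not part of this PR and is untested; the file name images-fieldspec.yaml and the field-spec path for ClusterTrainingRuntime are assumptions made for illustration.

```yaml
# kustomization.yaml (hypothetical alternative, NOT what this PR does)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

configurations:
- images-fieldspec.yaml   # hypothetical file name

images:
# The images transformer matches on the image name only, ignoring the tag,
# and the new tag is written here rather than read from params.env.
- name: quay.io/modh/training
  newTag: py312-cuda128-torch280
---
# images-fieldspec.yaml (hypothetical): tells the images transformer where a
# ClusterTrainingRuntime keeps its container image. The path is an assumption
# derived from the replacement fieldPaths used in this PR.
images:
- kind: ClusterTrainingRuntime
  path: spec/template/spec/replicatedJobs[]/template/spec/template/spec/containers[]/image
```

Because the CUDA and ROCm images in params.env share the repository quay.io/modh/training, a name-based match like this would not tell them apart. Each runtime would need a distinct placeholder image name, which is the question raised in the reply above.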

 - source:
     kind: ConfigMap
     name: rhoai-config
@@ -27,6 +28,71 @@ replacements:
     fieldPaths:
     - spec.template.spec.containers.0.image

+# Replace image for torch-distributed-rocm
+- source:
+    kind: ConfigMap
+    name: rhoai-config
+    version: v1
+    fieldPath: data.odh-training-rocm64-torch28-py312-image
+  targets:
+  - select:
+      kind: ClusterTrainingRuntime
+      name: torch-distributed-rocm
+    fieldPaths:
+    - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image
+
+# Replace image for torch-distributed-th03-cuda128-torch28-py312
+- source:
+    kind: ConfigMap
+    name: rhoai-config
+    version: v1
+    fieldPath: data.odh-training-cuda128-torch28-py312-image
+  targets:
+  - select:
+      kind: ClusterTrainingRuntime
+      name: torch-distributed-th03-cuda128-torch28-py312
+    fieldPaths:
+    - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image
+
+# Replace image for torch-distributed
+- source:
+    kind: ConfigMap
+    name: rhoai-config
+    version: v1
+    fieldPath: data.odh-training-cuda128-torch28-py312-image
+  targets:
+  - select:
+      kind: ClusterTrainingRuntime
+      name: torch-distributed
+    fieldPaths:
+    - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image
+
+# Replace image for training-hub-th03-cuda128-torch28-py312
+- source:
+    kind: ConfigMap
+    name: rhoai-config
+    version: v1
+    fieldPath: data.odh-training-cuda128-torch28-py312-image
+  targets:
+  - select:
+      kind: ClusterTrainingRuntime
+      name: training-hub03-cuda128-torch28-py312
+    fieldPaths:
+    - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image
+
+# Replace image for training-hub
+- source:
+    kind: ConfigMap
+    name: rhoai-config
+    version: v1
+    fieldPath: data.odh-training-cuda128-torch28-py312-image
+  targets:
+  - select:
+      kind: ClusterTrainingRuntime
+      name: training-hub
+    fieldPaths:
+    - spec.template.spec.replicatedJobs.[name=node].template.spec.template.spec.containers.[name=node].image

# Labels to add to all resources and selectors.
labels:
- includeSelectors: true
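
To make the long fieldPaths entries above easier to follow, here is a reduced sketch of the manifest shape they walk. The nesting is derived from the path itself and the values from the runtime diffs in this PR; fields not on the path are omitted, and the indentation is assumed.

```yaml
# Reduced ClusterTrainingRuntime, showing only the fields on the replacement path.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-rocm        # matched by targets[].select.name
spec:
  template:
    spec:
      replicatedJobs:
        - name: node                  # selected by replicatedJobs.[name=node]
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node      # selected by containers.[name=node]
                      # Overwritten at build time with the value of
                      # data.odh-training-rocm64-torch28-py312-image
                      image: quay.io/modh/training:py312-rocm64-torch280
```

The `[name=node]` segments select a list element by its `name` field rather than by index, so the replacement should keep working if another replicated job or container is added before `node`.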
2 changes: 2 additions & 0 deletions manifests/rhoai/params.env
@@ -1 +1,3 @@
 odh-kubeflow-trainer-controller-image=quay.io/opendatahub/trainer:v2.1.0
+odh-training-cuda128-torch28-py312-image=quay.io/modh/training:py312-cuda128-torch280
+odh-training-rocm64-torch28-py312-image=quay.io/modh/training:py312-rocm64-torch280
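
The replacements in kustomization.yaml read these keys from a ConfigMap named rhoai-config. How that ConfigMap is built is not shown in this diff; a minimal sketch, assuming it is generated from params.env with a configMapGenerator, could look like this:

```yaml
# Assumed wiring, not shown in this diff: each KEY=VALUE line in params.env
# becomes an entry under the generated ConfigMap's `data`.
configMapGenerator:
- name: rhoai-config
  envs:
  - params.env
```

Under that assumption, a replacement source with `fieldPath: data.odh-training-rocm64-torch28-py312-image` resolves to `quay.io/modh/training:py312-rocm64-torch280`, so changing an image only requires editing params.env.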
5 changes: 1 addition & 4 deletions manifests/rhoai/runtimes/kustomization.yaml
@@ -1,10 +1,7 @@
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 resources:
-  - torch_cuda_241.yaml
-  - torch_cuda_251.yaml
-  - torch_rocm_241.yaml
-  - torch_rocm_251.yaml
+  - torch_distributed_rocm.yaml
   - torch_distributed_th03_cuda128_torch28_py312.yaml
   - torch_distributed.yaml
   - training_hub_th03_cuda128_torch28_py312.yaml
25 changes: 0 additions & 25 deletions manifests/rhoai/runtimes/torch_cuda_251.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion manifests/rhoai/runtimes/torch_distributed.yaml
@@ -22,4 +22,4 @@ spec:
                 spec:
                   containers:
                     - name: node
-                      image: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:latest
+                      image: quay.io/modh/training:py312-cuda128-torch280
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
 apiVersion: trainer.kubeflow.org/v1alpha1
 kind: ClusterTrainingRuntime
 metadata:
-  name: torch-cuda-241
+  name: torch-distributed-rocm
Collaborator

@briangallagher what's the right name for the rocm training runtime? Your refinement doc only has torch-distributed-rocm6.4-torch28-py312 but no torch-distributed-rocm. Was that deliberate?

@robert-bell We debated adding that and decided not to, just to reduce the number of runtimes by one. I think it's fine to add. It should always use the latest image, similar to torch_distributed.

   labels:
     trainer.kubeflow.org/framework: torch
 spec:
@@ -22,4 +22,4 @@ spec:
                 spec:
                   containers:
                     - name: node
-                      image: quay.io/modh/training:py311-cuda121-torch241
+                      image: quay.io/modh/training:py312-rocm64-torch280
Original file line number Diff line number Diff line change
@@ -22,4 +22,4 @@ spec:
                 spec:
                   containers:
                     - name: node
-                      image: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:latest
+                      image: quay.io/modh/training:py312-cuda128-torch280
25 changes: 0 additions & 25 deletions manifests/rhoai/runtimes/torch_rocm_241.yaml

This file was deleted.

25 changes: 0 additions & 25 deletions manifests/rhoai/runtimes/torch_rocm_251.yaml

This file was deleted.

2 changes: 1 addition & 1 deletion manifests/rhoai/runtimes/training_hub.yaml
@@ -22,4 +22,4 @@ spec:
                 spec:
                   containers:
                     - name: node
-                      image: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:latest
+                      image: quay.io/modh/training:py312-cuda128-torch280
Original file line number Diff line number Diff line change
@@ -22,4 +22,4 @@ spec:
                 spec:
                   containers:
                     - name: node
-                      image: quay.io/opendatahub/odh-training-th03-cuda128-torch28-py312-rhel9:latest
+                      image: quay.io/modh/training:py312-cuda128-torch280