feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart #3124
base: master
Changes from all commits
5879258
0890f7e
2bb3fc6
f6aecc1
348b015
b626e31
91f69d7
4337eeb
1f4d1fe
298bd60
959ff3a
ebd8832
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # Example values.yaml configuration for deploying ClusterTrainingRuntimes | ||
|
|
||
| # Deploy torch-distributed runtime (no custom images needed) | ||
| runtimes: | ||
| torchDistributed: | ||
| enabled: true | ||
|
|
||
| # Deploy torch-distributed with data cache support | ||
| torchDistributedWithCache: | ||
| enabled: true | ||
| # cacheImage will use chart version by default | ||
| # To override, specify custom tag: | ||
| # cacheImage: | ||
| # tag: "v2.0.0" | ||
|
|
||
| # Deploy DeepSpeed runtime | ||
|
||
| deepspeedDistributed: | ||
| enabled: true | ||
| # Override image tag if needed: | ||
| # image: | ||
| # tag: "custom-v1.0.0" | ||
|
|
||
| # Deploy MLX runtime | ||
| mlxDistributed: | ||
| enabled: false | ||
| # Can enable and customize: | ||
| # enabled: true | ||
| # image: | ||
| # registry: my-registry.io | ||
| # repository: custom/mlx-runtime | ||
| # tag: "v1.0.0" | ||
|
|
||
| # Note: torch-distributed-with-cache requires data cache support | ||
| dataCache: | ||
| enabled: true | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -64,24 +64,61 @@ app.kubernetes.io/name: {{ include "trainer.name" . }} | |
| app.kubernetes.io/instance: {{ .Release.Name }} | ||
| {{- end }} | ||
|
|
||
| {{/* | ||
| Resolve the effective image tag, using a provided tag if present or | ||
| falling back to the default image tag derived from the chart version. | ||
| Usage: include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .) | ||
| */}} | ||
| {{- define "trainer.resolveImageTag" -}} | ||
| {{- if .tag }} | ||
| {{- .tag -}} | ||
| {{- else -}} | ||
| {{- include "trainer.defaultImageTag" .context -}} | ||
| {{- end -}} | ||
| {{- end }} | ||
|
|
||
| {{- define "trainer.image" -}} | ||
| {{- $imageRegistry := .Values.image.registry | default "docker.io" }} | ||
| {{- $imageRepository := .Values.image.repository }} | ||
| {{- $imageTag := .Values.image.tag -}} | ||
| {{- if not $imageTag -}} | ||
| {{- if hasPrefix "0.0.0-" .Chart.Version -}} | ||
| {{- $imageTag = trimPrefix "0.0.0-" .Chart.Version -}} | ||
| {{- else -}} | ||
| {{- $imageTag = printf "v%s" .Chart.Version -}} | ||
| {{- end -}} | ||
| {{- end -}} | ||
|
Comment on lines -70 to -77
Member
Do you need to make changes to it?
Contributor
Author
code was refactored to use a more modular approach - the inline logic was split into
Member
I see, that makes sense. Thanks for clarifying! |
||
| {{- $imageTag := include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .) -}} | ||
| {{- if eq $imageRegistry "docker.io" }} | ||
| {{- printf "%s:%s" $imageRepository $imageTag }} | ||
| {{- else }} | ||
| {{- printf "%s/%s:%s" $imageRegistry $imageRepository $imageTag }} | ||
| {{- end }} | ||
| {{- end }} | ||
|
|
||
| {{/* | ||
| Generate the default image tag for runtimes based on chart version | ||
| */}} | ||
| {{- define "trainer.defaultImageTag" -}} | ||
| {{- if hasPrefix "0.0.0-" .Chart.Version -}} | ||
| {{- trimPrefix "0.0.0-" .Chart.Version -}} | ||
| {{- else -}} | ||
| {{- printf "v%s" .Chart.Version -}} | ||
| {{- end -}} | ||
| {{- end }} | ||
|
|
||
| {{/* | ||
| Generate runtime image with registry, repository, and tag from values | ||
| Usage: include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .) | ||
| */}} | ||
| {{- define "trainer.runtimeImage" -}} | ||
| {{- $imageConfig := index . 0 }} | ||
| {{- $root := index . 1 }} | ||
| {{- $registry := $imageConfig.registry | default "ghcr.io" }} | ||
| {{- $repository := $imageConfig.repository }} | ||
| {{- $tag := include "trainer.resolveImageTag" (dict "tag" ($imageConfig.tag) "context" $root) -}} | ||
| {{- if eq $registry "docker.io" }} | ||
| {{- printf "%s:%s" $repository $tag }} | ||
| {{- else }} | ||
| {{- printf "%s/%s:%s" $registry $repository $tag }} | ||
| {{- end }} | ||
| {{- end }} | ||
| {{/* | ||
| Return the version of the trainer. | ||
| If the version is 0.0.0, we assume it is a development version. | ||
| */}} | ||
| {{- define "trainer.version" -}} | ||
|
Member
Please add a comment explaining how the trainer.version variable is used.
Contributor
Author
added the comment. thanks. |
||
| {{- if hasPrefix "0.0.0-" .Chart.Version -}} | ||
| dev | ||
|
|
||
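To make the tag and registry resolution in the helpers above concrete, here is a small values sketch annotated with the image reference each helper call would render. The `repository` names and the chart versions in the comments are assumptions chosen for illustration; the chart's real defaults are not shown in this diff.

```yaml
# Hypothetical values fragment; repository names are placeholders,
# not the chart's actual defaults.
runtimes:
  deepspeedDistributed:
    enabled: true
    image:
      # registry omitted -> trainer.runtimeImage falls back to "ghcr.io"
      repository: example-org/deepspeed-runtime  # assumed name
      # tag omitted -> trainer.resolveImageTag delegates to trainer.defaultImageTag:
      #   Chart.Version "2.1.0"         -> tag "v2.1.0"
      #   Chart.Version "0.0.0-abc1234" -> tag "abc1234"
      # rendered with Chart.Version "2.1.0":
      #   ghcr.io/example-org/deepspeed-runtime:v2.1.0
  mlxDistributed:
    enabled: true
    image:
      registry: docker.io
      repository: example-org/mlx-runtime        # assumed name
      tag: "v1.0.0"
      # an explicit tag takes precedence over the chart version, and the
      # "docker.io" registry is omitted from the rendered reference:
      #   example-org/mlx-runtime:v1.0.0
```

Routing both `trainer.image` and `trainer.runtimeImage` through `trainer.resolveImageTag` keeps the chart-version fallback in one place, so the manager image and the runtime images resolve their default tags the same way.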
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,75 @@ | ||
| {{- /* | ||
| Copyright 2025 The Kubeflow authors. | ||
|
|
||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| you may not use this file except in compliance with the License. | ||
| You may obtain a copy of the License at | ||
|
|
||
| https://www.apache.org/licenses/LICENSE-2.0 | ||
|
|
||
| Unless required by applicable law or agreed to in writing, software | ||
| distributed under the License is distributed on an "AS IS" BASIS, | ||
| WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| See the License for the specific language governing permissions and | ||
| limitations under the License. | ||
| */ -}} | ||
|
|
||
| {{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }} | ||
|
Member
You need to check for defaultEnabled in these runtimes too: mlx, torch, and torchtune.
Contributor
Author
Thanks for the review. I've updated the others (mlx, torch, and torchtune) to respect the defaultEnabled flag as well. |
||
| apiVersion: trainer.kubeflow.org/v1alpha1 | ||
| kind: ClusterTrainingRuntime | ||
| metadata: | ||
| name: deepspeed-distributed | ||
| labels: | ||
| trainer.kubeflow.org/framework: deepspeed | ||
| {{- include "trainer.labels" . | nindent 4 }} | ||
| spec: | ||
| mlPolicy: | ||
| numNodes: 1 | ||
| mpi: | ||
| numProcPerNode: 1 | ||
| mpiImplementation: OpenMPI | ||
| sshAuthMountPath: /home/mpiuser/.ssh | ||
| runLauncherAsNode: true | ||
| template: | ||
| spec: | ||
| network: | ||
| publishNotReadyAddresses: true | ||
| successPolicy: | ||
| operator: All | ||
| targetReplicatedJobs: | ||
| - launcher | ||
| replicatedJobs: | ||
| - name: launcher | ||
| template: | ||
| metadata: | ||
| labels: | ||
| trainer.kubeflow.org/trainjob-ancestor-step: trainer | ||
| spec: | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: node | ||
| image: {{ include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .) }} | ||
| securityContext: | ||
| runAsUser: 1000 | ||
| - name: node | ||
| template: | ||
| spec: | ||
| template: | ||
| spec: | ||
| containers: | ||
| - name: node | ||
| image: {{ include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .) }} | ||
| securityContext: | ||
| runAsUser: 1000 | ||
| command: | ||
| - /usr/sbin/sshd | ||
| args: | ||
| - -De | ||
| - -f | ||
| - /home/mpiuser/.sshd_config | ||
| readinessProbe: | ||
| tcpSocket: | ||
| port: 2222 | ||
| initialDelaySeconds: 5 | ||
| {{- end }} | ||
Add an example where defaultEnabled is true.
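One possible shape for such an example is sketched below. It assumes `defaultEnabled` sits directly under `runtimes`, as the template condition `.Values.runtimes.defaultEnabled` implies; it is not taken from the chart's own values.yaml.

```yaml
# Sketch of a values.yaml that relies on the defaultEnabled flag.
# Assumption: key placement is inferred from the template condition
#   {{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }}
runtimes:
  defaultEnabled: true
  deepspeedDistributed:
    # Still rendered: the condition is an `or`, so defaultEnabled: true
    # takes effect even when the per-runtime flag is false.
    enabled: false
```

Because the condition is an `or`, a per-runtime `enabled: false` does not opt that runtime out once `defaultEnabled` is true. Per the review thread, the mlx, torch, and torchtune templates were updated to follow the same check.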