feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart #3124
@@ -31,6 +31,57 @@ Alternatively, you can install the latest version from the master branch (e.g. `

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 0.0.0-sha-bfccb7b
```
### Install with ClusterTrainingRuntimes

You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.

To enable specific runtimes:
```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  --set runtimes.torchDistributed.enabled=true \
  --set runtimes.deepspeedDistributed.enabled=true
```

Or use a custom values file:
```yaml
# values.yaml
runtimes:
  torchDistributed:
    enabled: true
  torchDistributedWithCache:
    enabled: true
    dataCache:
      enabled: true
      cacheImage:
        tag: "v2.0.0"
  deepspeedDistributed:
    enabled: true
  mlxDistributed:
    enabled: true

# Required for torch-distributed-with-cache
dataCache:
  enabled: true
```

Then install with:
```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  -f values.yaml
```
### Available Runtimes

> **Member:** Please can you also:
- **torch-distributed**: PyTorch distributed training (no custom images)
- **torch-distributed-with-cache**: PyTorch with distributed data cache support (requires `dataCache.enabled=true`)
- **deepspeed-distributed**: DeepSpeed distributed training with MPI
- **mlx-distributed**: MLX distributed training with MPI
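As a usage sketch, a TrainJob selects one of these runtimes by name through its `runtimeRef`. The job name and `numNodes` below are illustrative, and the exact TrainJob fields should be checked against the `trainer.kubeflow.org/v1alpha1` API:

```yaml
# Hypothetical TrainJob referencing the torch-distributed runtime
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-torch-job
spec:
  runtimeRef:
    apiGroup: trainer.kubeflow.org
    kind: ClusterTrainingRuntime
    name: torch-distributed
  trainer:
    numNodes: 2
```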
### Uninstall the chart

```shell
helm uninstall kubeflow-trainer
```
@@ -72,6 +123,30 @@ See [helm uninstall](https://helm.sh/docs/helm/helm_uninstall) for command documentation
| Key | Type | Default | Description |
|---|---|---|---|
| dataCache.enabled | bool | `false` | Enable/disable data cache support (LWS dependency, ClusterRole). Set to `true` to install data cache components. |
| dataCache.lws.install | bool | `true` | Whether to install LeaderWorkerSet as a dependency. Set to `false` if LeaderWorkerSet is already installed in the cluster. |
| dataCache.lws.fullnameOverride | string | `"lws"` | String to fully override LeaderWorkerSet release name. |
| runtimes | object | `{"deepspeedDistributed":{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/deepspeed-runtime","tag":""}},"defaultEnabled":false,"mlxDistributed":{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/mlx-runtime","tag":""}},"torchDistributed":{"enabled":false},"torchDistributedWithCache":{"dataCache":{"cacheImage":{"registry":"ghcr.io","repository":"kubeflow/trainer/data-cache","tag":""},"enabled":true},"enabled":false},"torchtuneDistributed":{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/torchtune-runtime","tag":""}}}` | ClusterTrainingRuntimes configuration. These are optional runtime templates that can be deployed with the Helm chart. Each runtime provides a blueprint for different ML frameworks and configurations. |
| runtimes.defaultEnabled | bool | `false` | Enable all default runtimes (torch, deepspeed, mlx, torchtune) when set to true. Individual runtime settings will be ignored if this is enabled. |
| runtimes.torchDistributed | object | `{"enabled":false}` | PyTorch distributed training runtime (no custom images required) |
| runtimes.torchDistributed.enabled | bool | `false` | Enable deployment of torch-distributed runtime |
| runtimes.torchDistributedWithCache | object | `{"dataCache":{"cacheImage":{"registry":"ghcr.io","repository":"kubeflow/trainer/data-cache","tag":""},"enabled":true},"enabled":false}` | PyTorch distributed training with data cache support |
| runtimes.torchDistributedWithCache.enabled | bool | `false` | Enable deployment of torch-distributed-with-cache runtime |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.registry | string | `"ghcr.io"` | Data cache image registry |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.repository | string | `"kubeflow/trainer/data-cache"` | Data cache image repository |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.tag | string | `""` | Data cache image tag. Defaults to chart version if empty. |
| runtimes.deepspeedDistributed | object | `{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/deepspeed-runtime","tag":""}}` | DeepSpeed distributed training runtime |
| runtimes.deepspeedDistributed.enabled | bool | `false` | Enable deployment of deepspeed-distributed runtime |
| runtimes.deepspeedDistributed.image.registry | string | `"ghcr.io"` | DeepSpeed runtime image registry |
| runtimes.deepspeedDistributed.image.repository | string | `"kubeflow/trainer/deepspeed-runtime"` | DeepSpeed runtime image repository |
| runtimes.deepspeedDistributed.image.tag | string | `""` | DeepSpeed runtime image tag. Defaults to chart version if empty. |
| runtimes.mlxDistributed | object | `{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/mlx-runtime","tag":""}}` | MLX distributed training runtime |
| runtimes.mlxDistributed.enabled | bool | `false` | Enable deployment of mlx-distributed runtime |
| runtimes.mlxDistributed.image.registry | string | `"ghcr.io"` | MLX runtime image registry |
| runtimes.mlxDistributed.image.repository | string | `"kubeflow/trainer/mlx-runtime"` | MLX runtime image repository |
| runtimes.mlxDistributed.image.tag | string | `""` | MLX runtime image tag. Defaults to chart version if empty. |
| runtimes.torchtuneDistributed | object | `{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/torchtune-runtime","tag":""}}` | TorchTune distributed training runtime |
| runtimes.torchtuneDistributed.enabled | bool | `false` | Enable deployment of torchtune-distributed runtime |
| runtimes.torchtuneDistributed.image.registry | string | `"ghcr.io"` | TorchTune runtime image registry |
| runtimes.torchtuneDistributed.image.repository | string | `"kubeflow/trainer/torchtune-runtime"` | TorchTune runtime image repository |
| runtimes.torchtuneDistributed.image.tag | string | `""` | TorchTune runtime image tag. Defaults to chart version if empty. |
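The `runtimes.defaultEnabled` flag from the table above can be exercised with a minimal values file. A hedged sketch (the filename is illustrative; per the table, individual runtime `enabled` flags are ignored while this is set):

```yaml
# values-all-runtimes.yaml (hypothetical filename)
# Enables the default runtimes (torch, deepspeed, mlx, torchtune) in one switch.
runtimes:
  defaultEnabled: true
```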
## Maintainers
@@ -0,0 +1,35 @@

> **Member:** Let's remove this example from this PR, I think it is sufficient to explain what is needed in the README.

```yaml
# Example values.yaml configuration for deploying ClusterTrainingRuntimes

# Deploy torch-distributed runtime (no custom images needed)
runtimes:
  torchDistributed:
    enabled: true

  # Deploy torch-distributed with data cache support
  torchDistributedWithCache:
    enabled: true
    # cacheImage will use chart version by default
    # To override, specify custom tag:
    # cacheImage:
    #   tag: "v2.0.0"

  # Deploy DeepSpeed runtime
  deepspeedDistributed:
    enabled: true
    # Override image tag if needed:
    # image:
    #   tag: "custom-v1.0.0"

  # Deploy MLX runtime
  mlxDistributed:
    enabled: false
    # Can enable and customize:
    # enabled: true
    # image:
    #   registry: my-registry.io
    #   repository: custom/mlx-runtime
    #   tag: "v1.0.0"

# Note: torch-distributed-with-cache requires data cache support
dataCache:
  enabled: true
```
@@ -64,24 +64,57 @@ app.kubernetes.io/name: {{ include "trainer.name" . }}

```yaml
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Resolve the effective image tag, using a provided tag if present or
falling back to the default image tag derived from the chart version.
Usage: include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .)
*/}}
{{- define "trainer.resolveImageTag" -}}
{{- if .tag }}
{{- .tag -}}
{{- else -}}
{{- include "trainer.defaultImageTag" .context -}}
{{- end -}}
{{- end }}
```
> **Member** (on removed lines 70 to 77, the old inline tag logic in `trainer.image`): Do you need to make changes to it?
>
> **Contributor (Author):** The code was refactored to use a more modular approach - the inline logic was split into `trainer.resolveImageTag` and `trainer.defaultImageTag`.
>
> **Member:** I see, that makes sense, thanks for clarifying!

```yaml
{{- define "trainer.image" -}}
{{- $imageRegistry := .Values.image.registry | default "docker.io" }}
{{- $imageRepository := .Values.image.repository }}
{{- $imageTag := include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .) -}}
{{- if eq $imageRegistry "docker.io" }}
{{- printf "%s:%s" $imageRepository $imageTag }}
{{- else }}
{{- printf "%s/%s:%s" $imageRegistry $imageRepository $imageTag }}
{{- end }}
{{- end }}
```
```yaml
{{/*
Generate the default image tag for runtimes based on chart version
*/}}
{{- define "trainer.defaultImageTag" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end }}
```
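The tag fallback implemented by `trainer.resolveImageTag`/`trainer.defaultImageTag` can be sketched outside Helm as a small shell function (an illustrative re-implementation, not part of the chart):

```shell
# Sketch of the tag-resolution rule: an explicit tag wins; otherwise a
# "0.0.0-" dev chart version has its prefix stripped, and a release chart
# version gets a "v" prefix.
resolve_tag() {
  tag="$1"; chart_version="$2"
  if [ -n "$tag" ]; then
    echo "$tag"                                   # explicit tag wins
  else
    case "$chart_version" in
      0.0.0-*) echo "${chart_version#0.0.0-}" ;;  # dev chart: strip prefix
      *)       echo "v${chart_version}" ;;        # release chart: prepend "v"
    esac
  fi
}

resolve_tag ""       "2.1.0"              # v2.1.0
resolve_tag ""       "0.0.0-sha-bfccb7b"  # sha-bfccb7b
resolve_tag "v2.0.0" "2.1.0"              # v2.0.0
```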
```yaml
{{/*
Generate runtime image with registry, repository, and tag from values
Usage: include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .)
*/}}
{{- define "trainer.runtimeImage" -}}
{{- $imageConfig := index . 0 }}
{{- $root := index . 1 }}
{{- $registry := $imageConfig.registry | default "ghcr.io" }}
{{- $repository := $imageConfig.repository }}
{{- $tag := include "trainer.resolveImageTag" (dict "tag" ($imageConfig.tag) "context" $root) -}}
{{- if eq $registry "docker.io" }}
{{- printf "%s:%s" $repository $tag }}
{{- else }}
{{- printf "%s/%s:%s" $registry $repository $tag }}
{{- end }}
{{- end }}
```
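The registry handling in `trainer.runtimeImage` can likewise be sketched in shell (again an illustrative re-implementation): `docker.io` is omitted from the rendered reference, while any other registry is prefixed.

```shell
# Sketch of the registry rule in "trainer.runtimeImage".
runtime_image() {
  registry="${1:-ghcr.io}"; repository="$2"; tag="$3"
  if [ "$registry" = "docker.io" ]; then
    printf '%s:%s\n' "$repository" "$tag"            # Docker Hub: no registry prefix
  else
    printf '%s/%s:%s\n' "$registry" "$repository" "$tag"
  fi
}

runtime_image "ghcr.io" "kubeflow/trainer/deepspeed-runtime" "v2.1.0"
# ghcr.io/kubeflow/trainer/deepspeed-runtime:v2.1.0
runtime_image "docker.io" "library/python" "3.12"
# library/python:3.12
```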
```yaml
{{- define "trainer.version" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
dev
```

> **Member:** Please add the comment to explain how trainer.version variable is used.
>
> **Contributor (Author):** Added the comment, thanks.
@@ -0,0 +1,75 @@

```yaml
{{- /*
Copyright 2025 The Kubeflow authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/ -}}

{{- if .Values.runtimes.deepspeedDistributed.enabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: deepspeed-distributed
  labels:
    trainer.kubeflow.org/framework: deepspeed
    {{- include "trainer.labels" . | nindent 4 }}
spec:
  mlPolicy:
    numNodes: 1
    mpi:
      numProcPerNode: 1
      mpiImplementation: OpenMPI
      sshAuthMountPath: /home/mpiuser/.ssh
      runLauncherAsNode: true
  template:
    spec:
      network:
        publishNotReadyAddresses: true
      successPolicy:
        operator: All
        targetReplicatedJobs:
          - launcher
      replicatedJobs:
        - name: launcher
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .) }}
                      securityContext:
                        runAsUser: 1000
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (list .Values.runtimes.deepspeedDistributed.image .) }}
                      securityContext:
                        runAsUser: 1000
                      command:
                        - /usr/sbin/sshd
                      args:
                        - -De
                        - -f
                        - /home/mpiuser/.sshd_config
                      readinessProbe:
                        tcpSocket:
                          port: 2222
                        initialDelaySeconds: 5
{{- end }}
```
> **Member:** Add example where enabledDefault is true