Commits (19; the diff below shows changes from 7 of them)
- 5879258 feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart (khushiiagrawal, Jan 24, 2026)
- 0890f7e fix: remove initializerImage from user-configurable values (khushiiagrawal, Jan 24, 2026)
- 2bb3fc6 chore: regenerate README with helm-docs (khushiiagrawal, Jan 24, 2026)
- f6aecc1 fix: address Copilot review suggestions (khushiiagrawal, Jan 24, 2026)
- 348b015 feat: Introduce helper to centralize image tag resolution (khushiiagrawal, Jan 25, 2026)
- b626e31 refactor: nest cache image configuration and update copyright year. (khushiiagrawal, Jan 26, 2026)
- 91f69d7 chore: run make generate to sync (khushiiagrawal, Jan 26, 2026)
- 4337eeb feat: add TorchTune distributed runtime, image helper usage, and defa… (khushiiagrawal, Jan 27, 2026)
- 1f4d1fe Merge branch 'master' into feature/support-for-ClusterTrainingRuntimes (khushiiagrawal, Jan 27, 2026)
- 298bd60 feat: enable runtime via a new default flag and add a comment (khushiiagrawal, Jan 28, 2026)
- 959ff3a refactor: Torchtune runtimes to use model specific configurations, ad… (khushiiagrawal, Jan 29, 2026)
- ebd8832 fix: update README and fix trailing whitespace for CI (khushiiagrawal, Jan 29, 2026)
- b733249 refactor: relocate runtime template and update its configuration (khushiiagrawal, Feb 7, 2026)
- 7d4a26f Merge branch 'master' of https://github.com/kubeflow/trainer into fea… (khushiiagrawal, Feb 13, 2026)
- 38afadf feat: add JAX distributed training support and update runtime configu… (khushiiagrawal, Feb 13, 2026)
- f3ee525 fix(docs): remove JAX runtime from default runtimes in README (khushiiagrawal, Feb 13, 2026)
- 12db38c fix(docs): update default runtimes in README and values.yaml to inclu… (khushiiagrawal, Feb 13, 2026)
- 299d5ca fix(workflows): increase Papermill timeout for e2e tests in GPU cluster (khushiiagrawal, Feb 13, 2026)
- 1cd03b9 fix(notebooks): add missing newline at end of qwen2.5-1.5B-with-alpac… (khushiiagrawal, Feb 13, 2026)
69 changes: 69 additions & 0 deletions charts/kubeflow-trainer/README.md
@@ -31,6 +31,57 @@ Alternatively, you can install the latest version from the master branch (e.g. `
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 0.0.0-sha-bfccb7b
```

### Install with ClusterTrainingRuntimes

You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.

To enable specific runtimes:
Member: Add example where enabledDefault is true.
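A hedged sketch of what the requested example might look like; the flag name (it appears as both enabledDefault and defaultEnabled in this review) and its placement under `runtimes` are assumptions from this thread, not confirmed chart values:

```yaml
# Hypothetical values snippet; flag name and scope assumed from the review thread
runtimes:
  defaultEnabled: true   # deploy the default runtime set without per-runtime flags
```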

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  --set runtimes.torchDistributed.enabled=true \
  --set runtimes.deepspeedDistributed.enabled=true
```

Or use a custom values file:

```yaml
# values.yaml
runtimes:
  torchDistributed:
    enabled: true
  torchDistributedWithCache:
    enabled: true
    dataCache:
      enabled: true
      cacheImage:
        tag: "v2.0.0"
  deepspeedDistributed:
    enabled: true
  mlxDistributed:
    enabled: true

# Required for torch-distributed-with-cache
dataCache:
  enabled: true
```

Then install with:

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  -f values.yaml
```

### Available Runtimes
Member: Please can you also:

1. Add the JAX runtime, since we merged feat(runtimes): Add JAX training runtime #3151.
2. Update the VolumeClaimPolicies for the torchtune runtime, like I did in feat(runtimes): Use JobSet VolumeClaimPolicies APIs for LLM Runtimes #3150.

- **torch-distributed**: PyTorch distributed training (no custom images)
- **torch-distributed-with-cache**: PyTorch with distributed data cache support (requires `dataCache.enabled=true`)
- **deepspeed-distributed**: DeepSpeed distributed training with MPI
- **mlx-distributed**: MLX distributed training with MPI
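After installation, a plain kubectl query (not chart-specific) confirms which runtimes were actually deployed:

```shell
kubectl get clustertrainingruntimes
```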

### Uninstall the chart

```shell
@@ -72,6 +123,24 @@ See [helm uninstall](https://helm.sh/docs/helm/helm_uninstall) for command docum
| dataCache.enabled | bool | `false` | Enable/disable data cache support (LWS dependency, ClusterRole). Set to `true` to install data cache components. |
| dataCache.lws.install | bool | `true` | Whether to install LeaderWorkerSet as a dependency. Set to `false` if LeaderWorkerSet is already installed in the cluster. |
| dataCache.lws.fullnameOverride | string | `"lws"` | String to fully override LeaderWorkerSet release name. |
| runtimes | object | `{"deepspeedDistributed":{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/deepspeed-runtime","tag":""}},"mlxDistributed":{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/mlx-runtime","tag":""}},"torchDistributed":{"enabled":false},"torchDistributedWithCache":{"dataCache":{"cacheImage":{"registry":"ghcr.io","repository":"kubeflow/trainer/data-cache","tag":""},"enabled":true},"enabled":false}}` | ClusterTrainingRuntimes configuration These are optional runtime templates that can be deployed with the Helm chart. Each runtime provides a blueprint for different ML frameworks and configurations. |
| runtimes.torchDistributed | object | `{"enabled":false}` | PyTorch distributed training runtime (no custom images required) |
| runtimes.torchDistributed.enabled | bool | `false` | Enable deployment of torch-distributed runtime |
| runtimes.torchDistributedWithCache | object | `{"dataCache":{"cacheImage":{"registry":"ghcr.io","repository":"kubeflow/trainer/data-cache","tag":""},"enabled":true},"enabled":false}` | PyTorch distributed training with data cache support |
| runtimes.torchDistributedWithCache.enabled | bool | `false` | Enable deployment of torch-distributed-with-cache runtime |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.registry | string | `"ghcr.io"` | Data cache image registry |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.repository | string | `"kubeflow/trainer/data-cache"` | Data cache image repository |
| runtimes.torchDistributedWithCache.dataCache.cacheImage.tag | string | `""` | Data cache image tag. Defaults to chart version if empty. |
| runtimes.deepspeedDistributed | object | `{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/deepspeed-runtime","tag":""}}` | DeepSpeed distributed training runtime |
| runtimes.deepspeedDistributed.enabled | bool | `false` | Enable deployment of deepspeed-distributed runtime |
| runtimes.deepspeedDistributed.image.registry | string | `"ghcr.io"` | DeepSpeed runtime image registry |
| runtimes.deepspeedDistributed.image.repository | string | `"kubeflow/trainer/deepspeed-runtime"` | DeepSpeed runtime image repository |
| runtimes.deepspeedDistributed.image.tag | string | `""` | DeepSpeed runtime image tag. Defaults to chart version if empty. |
| runtimes.mlxDistributed | object | `{"enabled":false,"image":{"registry":"ghcr.io","repository":"kubeflow/trainer/mlx-runtime","tag":""}}` | MLX distributed training runtime |
| runtimes.mlxDistributed.enabled | bool | `false` | Enable deployment of mlx-distributed runtime |
| runtimes.mlxDistributed.image.registry | string | `"ghcr.io"` | MLX runtime image registry |
| runtimes.mlxDistributed.image.repository | string | `"kubeflow/trainer/mlx-runtime"` | MLX runtime image repository |
| runtimes.mlxDistributed.image.tag | string | `""` | MLX runtime image tag. Defaults to chart version if empty. |

## Maintainers

53 changes: 52 additions & 1 deletion charts/kubeflow-trainer/README.md.gotmpl
@@ -1,5 +1,5 @@
{{- /*
Copyright 2025 The Kubeflow authors.
Copyright 2026 The Kubeflow authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -49,6 +49,57 @@ Alternatively, you can install the latest version from the master branch (e.g. `
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer --version 0.0.0-sha-bfccb7b
```

### Install with ClusterTrainingRuntimes

You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.

To enable specific runtimes:

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  --set runtimes.torchDistributed.enabled=true \
  --set runtimes.deepspeedDistributed.enabled=true
```

Or use a custom values file:

```yaml
# values.yaml
runtimes:
  torchDistributed:
    enabled: true
  torchDistributedWithCache:
    enabled: true
    dataCache:
      enabled: true
      cacheImage:
        tag: "v2.0.0"
  deepspeedDistributed:
    enabled: true
  mlxDistributed:
    enabled: true

# Required for torch-distributed-with-cache
dataCache:
  enabled: true
```

Then install with:

```bash
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --version 2.1.0 \
  -f values.yaml
```

### Available Runtimes

- **torch-distributed**: PyTorch distributed training (no custom images)
- **torch-distributed-with-cache**: PyTorch with distributed data cache support (requires `dataCache.enabled=true`)
- **deepspeed-distributed**: DeepSpeed distributed training with MPI
- **mlx-distributed**: MLX distributed training with MPI

### Uninstall the chart

```shell
35 changes: 35 additions & 0 deletions charts/kubeflow-trainer/examples/runtimes-values.yaml
@@ -0,0 +1,35 @@
# Example values.yaml configuration for deploying ClusterTrainingRuntimes
Member: Let's remove this example from this PR, I think it is sufficient to explain what is needed in the README.
# Deploy torch-distributed runtime (no custom images needed)
runtimes:
  torchDistributed:
    enabled: true

  # Deploy torch-distributed with data cache support
  torchDistributedWithCache:
    enabled: true
    # cacheImage will use chart version by default
    # To override, specify custom tag:
    # cacheImage:
    #   tag: "v2.0.0"

  # Deploy DeepSpeed runtime
  deepspeedDistributed:
    enabled: true
    # Override image tag if needed:
    # image:
    #   tag: "custom-v1.0.0"

  # Deploy MLX runtime
  mlxDistributed:
    enabled: false
    # Can enable and customize:
    # enabled: true
    # image:
    #   registry: my-registry.io
    #   repository: custom/mlx-runtime
    #   tag: "v1.0.0"

# Note: torch-distributed-with-cache requires data cache support
dataCache:
  enabled: true
48 changes: 40 additions & 8 deletions charts/kubeflow-trainer/templates/_helpers.tpl
@@ -64,20 +64,52 @@ app.kubernetes.io/name: {{ include "trainer.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

{{/*
Resolve the effective image tag, using a provided tag if present or
falling back to the default image tag derived from the chart version.
Usage: include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .)
*/}}
{{- define "trainer.resolveImageTag" -}}
{{- if .tag }}
{{- .tag -}}
{{- else -}}
{{- include "trainer.defaultImageTag" .context -}}
{{- end -}}
{{- end }}
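For illustration, the two branches behave as follows; the rendered values are a sketch assuming a 2.1.0 chart, not generated output:

```
{{ include "trainer.resolveImageTag" (dict "tag" "v2.0.0" "context" .) }}  # explicit tag wins -> "v2.0.0"
{{ include "trainer.resolveImageTag" (dict "tag" "" "context" .) }}        # falls back to trainer.defaultImageTag -> "v2.1.0"
```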

{{- define "trainer.image" -}}
{{- $imageRegistry := .Values.image.registry | default "docker.io" }}
{{- $imageRepository := .Values.image.repository }}
{{- $imageTag := .Values.image.tag -}}
{{- if not $imageTag -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- $imageTag = trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- $imageTag = printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}
Comment on lines -70 to -77:

Member: Do you need to make changes to it?

Contributor Author: The code was refactored to use a more modular approach: the inline logic was split into trainer.resolveImageTag and trainer.defaultImageTag helpers for better maintainability. No functional changes, just better organization that lets both the manager and runtime images share the same tag resolution logic.

Member: I see, that makes sense, thanks for clarifying!

{{- $imageTag := include "trainer.resolveImageTag" (dict "tag" .Values.image.tag "context" .) -}}
{{- if eq $imageRegistry "docker.io" }}
{{- printf "%s:%s" $imageRepository $imageTag }}
{{- else }}
{{- printf "%s/%s:%s" $imageRegistry $imageRepository $imageTag }}
{{- end }}
{{- end }}

{{/*
Generate the default image tag for runtimes based on chart version
*/}}
{{- define "trainer.defaultImageTag" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end }}
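For illustration, the hasPrefix/trimPrefix branches above map chart versions to tags like this (a sketch, not generated output):

```
# .Chart.Version "2.1.0"             -> "v2.1.0"
# .Chart.Version "0.0.0-sha-bfccb7b" -> "sha-bfccb7b"
```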

{{/*
Generate runtime image with registry, repository, and tag
Usage: include "trainer.runtimeImage" (dict "registry" .Values.runtimes.deepspeedDistributed.image.registry "repository" .Values.runtimes.deepspeedDistributed.image.repository "tag" .Values.runtimes.deepspeedDistributed.image.tag "context" .)
*/}}
{{- define "trainer.runtimeImage" -}}
{{- $registry := .registry | default "docker.io" }}
{{- $repository := .repository }}
{{- $tag := include "trainer.resolveImageTag" (dict "tag" .tag "context" .context) -}}
{{- if eq $registry "docker.io" }}
{{- printf "%s:%s" $repository $tag }}
{{- else }}
{{- printf "%s/%s:%s" $registry $repository $tag }}
{{- end }}
{{- end }}
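As a usage sketch (the rendered string assumes chart version 2.1.0 and the default registry/repository values from the parameters table above):

```
{{ include "trainer.runtimeImage" (dict "registry" "ghcr.io" "repository" "kubeflow/trainer/deepspeed-runtime" "tag" "" "context" .) }}
# renders: ghcr.io/kubeflow/trainer/deepspeed-runtime:v2.1.0
```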
75 changes: 75 additions & 0 deletions charts/kubeflow-trainer/templates/runtimes/deepspeed-distributed.yaml
@@ -0,0 +1,75 @@
{{- /*
Copyright 2025 The Kubeflow authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/ -}}

{{- if .Values.runtimes.deepspeedDistributed.enabled }}
Member: Do you need to check for defaultEnabled too?

Contributor Author: Updated the deepspeed runtime check to respect defaultEnabled as well.
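A rough sketch of the combined guard described in that reply; the flag path `.Values.runtimes.defaultEnabled` is an assumption from this thread, not the merged implementation:

```
{{- /* Hypothetical: deploy when explicitly enabled or when the chart-wide
   default flag is set; the flag path is assumed, not confirmed. */ -}}
{{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }}
```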

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: deepspeed-distributed
  labels:
    trainer.kubeflow.org/framework: deepspeed
    {{- include "trainer.labels" . | nindent 4 }}
spec:
  mlPolicy:
    numNodes: 1
    mpi:
      numProcPerNode: 1
      mpiImplementation: OpenMPI
      sshAuthMountPath: /home/mpiuser/.ssh
      runLauncherAsNode: true
  template:
    spec:
      network:
        publishNotReadyAddresses: true
      successPolicy:
        operator: All
        targetReplicatedJobs:
          - launcher
      replicatedJobs:
        - name: launcher
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.deepspeedDistributed.image.registry "repository" .Values.runtimes.deepspeedDistributed.image.repository "tag" .Values.runtimes.deepspeedDistributed.image.tag "context" .) }}
                      securityContext:
                        runAsUser: 1000
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.deepspeedDistributed.image.registry "repository" .Values.runtimes.deepspeedDistributed.image.repository "tag" .Values.runtimes.deepspeedDistributed.image.tag "context" .) }}
                      securityContext:
                        runAsUser: 1000
                      command:
                        - /usr/sbin/sshd
                      args:
                        - -De
                        - -f
                        - /home/mpiuser/.sshd_config
                      readinessProbe:
                        tcpSocket:
                          port: 2222
                        initialDelaySeconds: 5
{{- end }}
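As a consumption sketch, a TrainJob can reference the deployed runtime by name; the values below are illustrative, using the same trainer.kubeflow.org/v1alpha1 API as the template above:

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: deepspeed-example       # hypothetical job name
spec:
  runtimeRef:
    name: deepspeed-distributed # the ClusterTrainingRuntime defined above
  trainer:
    numNodes: 2                 # scale out past the runtime default of 1
```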
75 changes: 75 additions & 0 deletions charts/kubeflow-trainer/templates/runtimes/mlx-distributed.yaml
@@ -0,0 +1,75 @@
{{- /*
Copyright 2026 The Kubeflow authors.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/ -}}

{{- if .Values.runtimes.mlxDistributed.enabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: mlx-distributed
  labels:
    trainer.kubeflow.org/framework: mlx
    {{- include "trainer.labels" . | nindent 4 }}
spec:
  mlPolicy:
    numNodes: 1
    mpi:
      numProcPerNode: 1
      mpiImplementation: OpenMPI
      sshAuthMountPath: /home/mpiuser/.ssh
      runLauncherAsNode: true
  template:
    spec:
      network:
        publishNotReadyAddresses: true
      successPolicy:
        operator: All
        targetReplicatedJobs:
          - launcher
      replicatedJobs:
        - name: launcher
          template:
            metadata:
              labels:
                trainer.kubeflow.org/trainjob-ancestor-step: trainer
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.mlxDistributed.image.registry "repository" .Values.runtimes.mlxDistributed.image.repository "tag" .Values.runtimes.mlxDistributed.image.tag "context" .) }}
Member: Can we process it in _helpers.tpl instead of using dict here, so we can define the image here similarly to the manager: https://github.com/kubeflow/trainer/blob/91f69d72628208aa72cb36d8b3dc8fbee9395d7b/charts/kubeflow-trainer/templates/manager/deployment.yaml#L36C36-L36C41

Contributor Author: Sure, refactored the trainer.runtimeImage helper to accept values more cleanly: it now uses {{ include "trainer.runtimeImage" (list .Values.runtimes.mlxDistributed.image .) }} instead of the verbose dict syntax. Updated all runtime templates (mlx, deepspeed, torch-with-cache, and the new torchtune).
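A rough sketch of the list-based form that reply describes (signature inferred from the quoted include; the merged helper may differ):

```
{{- define "trainer.runtimeImage" -}}
{{- $image := index . 0 -}}   {{- /* image map with registry/repository/tag */ -}}
{{- $context := index . 1 -}} {{- /* chart root context for tag fallback */ -}}
{{- $registry := $image.registry | default "docker.io" -}}
{{- $tag := include "trainer.resolveImageTag" (dict "tag" $image.tag "context" $context) -}}
{{- if eq $registry "docker.io" -}}
{{- printf "%s:%s" $image.repository $tag -}}
{{- else -}}
{{- printf "%s/%s:%s" $registry $image.repository $tag -}}
{{- end -}}
{{- end }}
```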

                      securityContext:
                        runAsUser: 1000
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.mlxDistributed.image.registry "repository" .Values.runtimes.mlxDistributed.image.repository "tag" .Values.runtimes.mlxDistributed.image.tag "context" .) }}
                      securityContext:
                        runAsUser: 1000
                      command:
                        - /usr/sbin/sshd
                      args:
                        - -De
                        - -f
                        - /home/mpiuser/.sshd_config
                      readinessProbe:
                        tcpSocket:
                          port: 2222
                        initialDelaySeconds: 5
{{- end }}