
feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart#3124

Open
khushiiagrawal wants to merge 12 commits into kubeflow:master from khushiiagrawal:feature/support-for-ClusterTrainingRuntimes

Conversation

@khushiiagrawal
Contributor

PR Description

What this PR does / why we need it:

This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation.

Fixes #3115

Changes

  1. New runtimes section in values.yaml with configurable options for (a values sketch follows this list):
  • torchDistributed - PyTorch distributed training (no custom images required)
  • torchDistributedWithCache - PyTorch with data cache support (configurable cacheImage)
  • deepspeedDistributed - DeepSpeed distributed training (configurable image)
  • mlxDistributed - MLX distributed training (configurable image)
  2. Helm templates for ClusterTrainingRuntimes that are conditionally deployed when enabled:
  • torch-distributed.yaml
  • torch-distributed-with-cache.yaml
  • deepspeed-distributed.yaml
  • mlx-distributed.yaml
  3. Helper template trainer.runtimeImage in _helpers.tpl for generating runtime image references, with the tag automatically defaulting to the chart version.
  4. Example configuration in runtimes-values.yaml
  5. Updated documentation in README.md with usage instructions
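For orientation, the runtimes block in values.yaml looks roughly like the following (a sketch; only a subset of runtimes is shown and comments are paraphrased):

runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
  torchDistributed:
    enabled: false
  # -- MLX distributed training runtime
  mlxDistributed:
    enabled: false
    image:
      registry: ghcr.io
      repository: kubeflow/trainer/mlx-runtime
      # -- Image tag. Defaults to the chart version if empty.
      tag: ""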

Design Decisions

Following the feedback:

  • Each runtime uses a single enabled flag (see the template sketch after this list)
  • Image tags default to "", which automatically uses the same imageTag as the controller (via the trainer.defaultImageTag helper)
  • For runtimes without custom images (e.g. torch-distributed), only the enabled flag is exposed
  • The initializerImage is NOT user-configurable - it uses a hardcoded image path. This may become configurable via the Trainer config in the future as part of #2886 (Support user-specified initializers in TrainJob when runtime has none)
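As an illustration of the first two points, each runtime template is gated on its enabled flag and renders its image through the trainer.runtimeImage helper. A rough sketch (the real templates carry the full runtime spec):

{{- if .Values.runtimes.mlxDistributed.enabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: mlx-distributed
spec:
  # ...full spec elided; the trainer container image is rendered by the
  # trainer.runtimeImage helper, which falls back to the chart version
  # when the configured tag is "".
{{- end }}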

Usage

# Enable specific runtimes
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.torchDistributed.enabled=true \
  --set runtimes.deepspeedDistributed.enabled=true

# For torch-distributed-with-cache, also enable dataCache
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.torchDistributedWithCache.enabled=true \
  --set dataCache.enabled=true

Checklist:

  • Docs included if any changes are user facing (README.md updated with usage instructions)

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Copilot AI review requested due to automatic review settings January 24, 2026 18:27
@google-oss-prow google-oss-prow bot requested a review from jinchihe January 24, 2026 18:27
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@khushiiagrawal
Contributor Author

@akshaychitneni @andreyvelich I have made all the changes that were discussed and needed to add support for ClusterTrainingRuntimes in the Helm chart.
Please take a look.
Thanks!

@coveralls

coveralls commented Jan 24, 2026

Pull Request Test Coverage Report for Build 21483070842

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 40 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.09%) to 51.217%

Files with Coverage Reduction (new missed lines, %):
  • pkg/webhooks/clustertrainingruntime_webhook.go: 5 (48.0%)
  • pkg/webhooks/trainjob_webhook.go: 8 (30.0%)
  • pkg/webhooks/trainingruntime_webhook.go: 10 (62.79%)
  • pkg/controller/trainjob_controller.go: 17 (0.0%)
Totals Coverage Status
Change from base Build 21369554754: 0.09%
Covered Lines: 1241
Relevant Lines: 2423

💛 - Coveralls

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Jan 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation, addressing issue #3115. Previously, these runtimes were only available via kustomize manifests, requiring users to manually apply them separately after Helm installation.

Changes:

  • Added runtimes section in values.yaml with configurable options for four runtime types: torchDistributed, torchDistributedWithCache, deepspeedDistributed, and mlxDistributed
  • Created four Helm template files for conditionally deploying ClusterTrainingRuntimes when enabled
  • Added helper templates in _helpers.tpl for generating runtime image references with automatic tag defaulting

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:
  • charts/kubeflow-trainer/values.yaml: adds the runtimes configuration section with enable flags and image settings for each runtime type
  • charts/kubeflow-trainer/templates/runtimes/torch-distributed.yaml: template for the basic PyTorch distributed training runtime
  • charts/kubeflow-trainer/templates/runtimes/torch-distributed-with-cache.yaml: template for the PyTorch runtime with data cache support and dataset initializer
  • charts/kubeflow-trainer/templates/runtimes/deepspeed-distributed.yaml: template for the DeepSpeed distributed training runtime with MPI support
  • charts/kubeflow-trainer/templates/runtimes/mlx-distributed.yaml: template for the MLX distributed training runtime with MPI support
  • charts/kubeflow-trainer/templates/_helpers.tpl: adds helper functions for default image tag generation and runtime image construction
  • charts/kubeflow-trainer/examples/runtimes-values.yaml: example configuration file demonstrating runtime enablement and customization
  • charts/kubeflow-trainer/README.md.gotmpl: documentation updates with installation instructions for ClusterTrainingRuntimes
  • charts/kubeflow-trainer/README.md: generated documentation with runtime configuration examples and values table


Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 24, 2026
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
torchDistributedWithCache:
  enabled: true
  # Custom image tags (optional, defaults to chart version)
  cacheImage:
Contributor

Should it be

torchDistributedWithCache:
  enabled: true
  dataCache:
    enabled: true
    cacheImage:
      tag: "v2.0.0"

Contributor Author

Thanks for the review. I've updated the configuration to nest cacheImage under dataCache.
The change has been applied across all relevant files: values.yaml, README.md.gotmpl, and the runtime template.

Member

Can we place the runtimes related to dataCache under dataCache for now?
I think that will be clearer for users:

dataCache:
  enabled: true
  runtimes:
    torchDistributedWithCache:
      enabled: true

@@ -0,0 +1,75 @@
{{- /*
Copyright 2025 The Kubeflow authors.
Contributor

2026?

Contributor Author

Missed it again, updated now. Thanks.

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

@akshaychitneni @andreyvelich Please take a look; I've addressed all the comments.
Thanks!

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Comment on lines 155 to 156
runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
Member

Maybe we can also have a flag to deploy all default runtimes (e.g. torch, deepspeed, mlx, torchtune runtimes) for simplicity?

Suggested change
runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
runtimes:
  defaultEnabled: false
  # -- PyTorch distributed training runtime (no custom images required)

Contributor Author

Yeah, right. I've added a defaultEnabled flag at the top of the runtimes section.
When set to true, it deploys all default runtimes (torch, deepspeed, mlx, torchtune). Individual runtime settings remain available for granular control when needed; see the sketch below.
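A minimal illustration of the resulting layout (a sketch; the actual values.yaml also carries per-runtime image settings):

runtimes:
  # -- Deploy all default runtimes (torch, deepspeed, mlx, torchtune) when true
  defaultEnabled: false
  torchDistributed:
    enabled: false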

# -- MLX runtime image repository
repository: kubeflow/trainer/mlx-runtime
# -- MLX runtime image tag. Defaults to chart version if empty.
tag: ""
Member

You should also add TorchTune runtimes.

Contributor Author

Added TorchTune runtime support.
Thanks!

Comment on lines -70 to -77
{{- $imageTag := .Values.image.tag -}}
{{- if not $imageTag -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- $imageTag = trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- $imageTag = printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}
Member

Do you need to make changes to it?

Contributor Author

The code was refactored to use a more modular approach: the inline logic was split into trainer.resolveImageTag and trainer.defaultImageTag helpers for better maintainability. There are no functional changes, just better organization that lets both the manager and runtime images share the same tag resolution logic.
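For reference, the chart-version fallback that was previously inline could be expressed as a named helper roughly like this (a sketch mirroring the removed logic; the exact helper wiring in the PR may differ):

{{- /* Sketch: derive the default image tag from the chart version. */ -}}
{{- define "trainer.defaultImageTag" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}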

Member

I see, that makes sense, thanks for clarifying!

spec:
  containers:
    - name: node
      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.mlxDistributed.image.registry "repository" .Values.runtimes.mlxDistributed.image.repository "tag" .Values.runtimes.mlxDistributed.image.tag "context" .) }}
Member

Can we process it in _helpers.tpl instead of using dict here, so we can define the image here similarly to the manager: https://github.com/kubeflow/trainer/blob/91f69d72628208aa72cb36d8b3dc8fbee9395d7b/charts/kubeflow-trainer/templates/manager/deployment.yaml#L36C36-L36C41

Contributor Author

Sure, refactored the trainer.runtimeImage helper to accept values more cleanly: it now uses {{ include "trainer.runtimeImage" (list .Values.runtimes.mlxDistributed.image .) }} instead of the verbose dict syntax.
Updated all runtime templates (mlx, deepspeed, torch-with-cache, and the new torchtune).
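A helper that takes such a two-element list could look roughly like this (a sketch, not necessarily the exact implementation in the PR; it assumes a trainer.defaultImageTag helper along the lines sketched earlier):

{{- /* Sketch: build "registry/repository:tag" from an image values block
       and the root context, falling back to the chart-derived tag. */ -}}
{{- define "trainer.runtimeImage" -}}
{{- $image := index . 0 -}}
{{- $root := index . 1 -}}
{{- $tag := $image.tag | default (include "trainer.defaultImageTag" $root) -}}
{{- printf "%s/%s:%s" $image.registry $image.repository $tag -}}
{{- end -}}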

…ult runtimes enabled flag

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: Khushi Agrawal <149886195+khushiiagrawal@users.noreply.github.com>
@khushiiagrawal
Contributor Author

@andreyvelich Please take a look; addressed all the comments.
Thanks!

{{- printf "%s/%s:%s" $registry $repository $tag }}
{{- end }}
{{- end }}
{{- define "trainer.version" -}}
Member

Please add a comment to explain how the trainer.version variable is used.

Contributor Author

Added the comment. Thanks.


limitations under the License.
*/ -}}

{{- if .Values.runtimes.deepspeedDistributed.enabled }}
Member

Do you need to check for defaultEnabled too?

Contributor Author

Updated the deepspeed runtime check to respect defaultEnabled as well.

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

Addressed the comments. @andreyvelich PTAL.
Thanks!

limitations under the License.
*/ -}}

{{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }}
Member

You need to check for defaultEnabled in the mlx, torch, and torchtune runtimes too.

Contributor Author

Thanks for the review. I've updated the others (mlx, torch, and torchtune) to respect the defaultEnabled flag as well.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-distributed
Member

This is incorrect.
Please create subdirectories for the torchtune runtimes like we do here: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/torchtune

Contributor Author

Thanks, that makes sense. I've removed the generic template and created a templates/runtimes/torchtune/ directory with specific templates for llama3-2-1b, llama3-2-3b, and qwen2.5-1.5b to match the manifests (see the layout below).
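The resulting chart layout would look roughly like this (illustrative file names; the exact names follow the upstream manifests):

charts/kubeflow-trainer/templates/runtimes/torchtune/
  torchtune-llama3.2-1b.yaml
  torchtune-llama3.2-3b.yaml
  torchtune-qwen2.5-1.5b.yaml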

…d default enabled option

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

@andreyvelich Please take a look. Addressed all the changes.
Thanks!

@khushiiagrawal
Contributor Author

@andreyvelich @akshaychitneni do let me know if any changes are required.
Thanks!

limitations under the License.
*/ -}}

{{- if or .Values.runtimes.torchDistributedWithCache.enabled .Values.runtimes.defaultEnabled }}
Member

torch-distributed-with-cache should not be enabled by default.
As here, can you place it in a separate folder: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/data-cache
And install it only when dataCache.enabled and torchDistributedWithCache.enabled are both set.
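In template terms, the guard being asked for would be roughly (a sketch; the value paths assume the structure discussed above):

{{- if and .Values.dataCache.enabled .Values.runtimes.torchDistributedWithCache.enabled }}
# ... torch-distributed-with-cache ClusterTrainingRuntime ...
{{- end }}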

Comment on lines +165 to +177
# -- PyTorch distributed training with data cache support
torchDistributedWithCache:
  # -- Enable deployment of torch-distributed-with-cache runtime
  enabled: false
  dataCache:
    enabled: true
    cacheImage:
      # -- Data cache image registry
      registry: ghcr.io
      # -- Data cache image repository
      repository: kubeflow/trainer/data-cache
      # -- Data cache image tag. Defaults to chart version if empty.
      tag: ""
Member

Can you separate data cache into another block in the Helm chart?
Something like this:

dataCache:
  enabled: true
  cacheImage:
    registry: ....
  runtimes:
    torchDistributed:
      enabled: false


You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.

To enable specific runtimes:
Member

Add an example where defaultEnabled is true.
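For instance, such an example could look like this (a sketch following the install commands used earlier in the PR description):

# Deploy all default runtimes
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.defaultEnabled=true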



Development

Successfully merging this pull request may close these issues.

feat(helm): Add support for deploying ClusterTrainingRuntimes via Helm chart
