
feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart#3124

Open
khushiiagrawal wants to merge 12 commits into kubeflow:master from khushiiagrawal:feature/support-for-ClusterTrainingRuntimes

Conversation

@khushiiagrawal
Contributor

PR Description

What this PR does / why we need it:

This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation.

Fixes #3115

Changes

  1. New runtimes section in values.yaml with configurable options for (a values sketch follows this list):
  • torchDistributed - PyTorch distributed training (no custom images required)
  • torchDistributedWithCache - PyTorch with data cache support (configurable cacheImage)
  • deepspeedDistributed - DeepSpeed distributed training (configurable image)
  • mlxDistributed - MLX distributed training (configurable image)
  2. Helm templates for ClusterTrainingRuntimes that are conditionally deployed when enabled:
  • torch-distributed.yaml
  • torch-distributed-with-cache.yaml
  • deepspeed-distributed.yaml
  • mlx-distributed.yaml
  3. Helper template trainer.runtimeImage in _helpers.tpl for generating runtime image references, with the tag automatically defaulting to the chart version.
  4. Example configuration in runtimes-values.yaml
  5. Updated documentation in README.md with usage instructions
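For orientation, the runtimes block in values.yaml looks roughly like the following (a sketch; only a subset of runtimes is shown and comments are paraphrased):

runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
  torchDistributed:
    enabled: false
  # -- MLX distributed training runtime
  mlxDistributed:
    enabled: false
    image:
      registry: ghcr.io
      repository: kubeflow/trainer/mlx-runtime
      # -- Image tag. Defaults to the chart version if empty.
      tag: ""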

Design Decisions

Following the feedback:

  • Each runtime uses a single enabled flag (see the template sketch after this list)
  • Image tags default to "", which automatically uses the same imageTag as the controller (via the trainer.defaultImageTag helper)
  • For runtimes without custom images (e.g. torch-distributed), only the enabled flag is exposed
  • The initializerImage is NOT user-configurable - it uses a hardcoded image path. This may become configurable via the Trainer config in the future as part of #2886 (Support user-specified initializers in TrainJob when runtime has none)
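As an illustration of the first two points, each runtime template is gated on its enabled flag and renders its image through the trainer.runtimeImage helper. A rough sketch (the real templates carry the full runtime spec):

{{- if .Values.runtimes.mlxDistributed.enabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: mlx-distributed
spec:
  # ...full spec elided; the trainer container image is rendered by the
  # trainer.runtimeImage helper, which falls back to the chart version
  # when the configured tag is "".
{{- end }}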

Usage

# Enable specific runtimes
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.torchDistributed.enabled=true \
  --set runtimes.deepspeedDistributed.enabled=true

# For torch-distributed-with-cache, also enable dataCache
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.torchDistributedWithCache.enabled=true \
  --set dataCache.enabled=true

Checklist:

  • Docs included if any changes are user facing (README.md updated with usage instructions)

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Copilot AI review requested due to automatic review settings January 24, 2026 18:27
@google-oss-prow google-oss-prow bot requested a review from jinchihe January 24, 2026 18:27
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@khushiiagrawal
Contributor Author

@akshaychitneni @andreyvelich I have made all the changes that were discussed and needed to add support for ClusterTrainingRuntimes in the Helm chart.
Please take a look.
Thanks!

@coveralls

coveralls commented Jan 24, 2026

Pull Request Test Coverage Report for Build 21483070842

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 40 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.09%) to 51.217%

Files with Coverage Reduction (new missed lines, %):
  • pkg/webhooks/clustertrainingruntime_webhook.go: 5 (48.0%)
  • pkg/webhooks/trainjob_webhook.go: 8 (30.0%)
  • pkg/webhooks/trainingruntime_webhook.go: 10 (62.79%)
  • pkg/controller/trainjob_controller.go: 17 (0.0%)
Totals Coverage Status
Change from base Build 21369554754: 0.09%
Covered Lines: 1241
Relevant Lines: 2423

💛 - Coveralls

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@google-oss-prow google-oss-prow bot added size/L and removed size/XL labels Jan 24, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation, addressing issue #3115. Previously, these runtimes were only available via kustomize manifests, requiring users to manually apply them separately after Helm installation.

Changes:

  • Added runtimes section in values.yaml with configurable options for four runtime types: torchDistributed, torchDistributedWithCache, deepspeedDistributed, and mlxDistributed
  • Created four Helm template files for conditionally deploying ClusterTrainingRuntimes when enabled
  • Added helper templates in _helpers.tpl for generating runtime image references with automatic tag defaulting

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.

Summary per file:
  • charts/kubeflow-trainer/values.yaml: adds the runtimes configuration section with enable flags and image settings for each runtime type
  • charts/kubeflow-trainer/templates/runtimes/torch-distributed.yaml: template for the basic PyTorch distributed training runtime
  • charts/kubeflow-trainer/templates/runtimes/torch-distributed-with-cache.yaml: template for the PyTorch runtime with data cache support and dataset initializer
  • charts/kubeflow-trainer/templates/runtimes/deepspeed-distributed.yaml: template for the DeepSpeed distributed training runtime with MPI support
  • charts/kubeflow-trainer/templates/runtimes/mlx-distributed.yaml: template for the MLX distributed training runtime with MPI support
  • charts/kubeflow-trainer/templates/_helpers.tpl: adds helper functions for default image tag generation and runtime image construction
  • charts/kubeflow-trainer/examples/runtimes-values.yaml: example configuration file demonstrating runtime enablement and customization
  • charts/kubeflow-trainer/README.md.gotmpl: documentation updates with installation instructions for ClusterTrainingRuntimes
  • charts/kubeflow-trainer/README.md: generated documentation with runtime configuration examples and values table


Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@google-oss-prow google-oss-prow bot added size/XL and removed size/L labels Jan 24, 2026
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
torchDistributedWithCache:
  enabled: true
  # Custom image tags (optional, defaults to chart version)
  cacheImage:
Contributor

Should it be

torchDistributedWithCache:
  enabled: true
  dataCache:
    enabled: true
    cacheImage:
      tag: "v2.0.0"

Contributor Author

Thanks for the review. I've updated the configuration to nest cacheImage under dataCache.
The change has been applied across all relevant files: values.yaml, README.md.gotmpl, and the runtime template.

Member

Can we place the runtimes related to dataCache under dataCache for now?
I think that will be clearer for users:

dataCache:
  enabled: true
  runtimes:
    torchDistributedWithCache:
      enabled: true

@@ -0,0 +1,75 @@
{{- /*
Copyright 2025 The Kubeflow authors.
Contributor

2026?

Contributor Author

Missed it again, updated now. Thanks.

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

@akshaychitneni @andreyvelich Please take a look; I've addressed all the comments.
Thanks!

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Comment on lines 155 to 156
runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
Member

Maybe we can also have a flag to deploy all default runtimes (e.g. torch, deepspeed, mlx, torchtune runtimes) for simplicity?

Suggested change
runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
runtimes:
  defaultEnabled: false
  # -- PyTorch distributed training runtime (no custom images required)

Contributor Author

Yeah, right. I've added a defaultEnabled flag at the top of the runtimes section.
When set to true, it deploys all default runtimes (torch, deepspeed, mlx, torchtune). Individual runtime settings remain available for granular control when needed; see the sketch below.
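A minimal illustration of the resulting layout (a sketch; the actual values.yaml also carries per-runtime image settings):

runtimes:
  # -- Deploy all default runtimes (torch, deepspeed, mlx, torchtune) when true
  defaultEnabled: false
  torchDistributed:
    enabled: false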

# -- MLX runtime image repository
repository: kubeflow/trainer/mlx-runtime
# -- MLX runtime image tag. Defaults to chart version if empty.
tag: ""
Member

You should also add TorchTune runtimes.

Contributor Author

Added TorchTune runtime support.
Thanks!

Comment on lines -70 to -77
{{- $imageTag := .Values.image.tag -}}
{{- if not $imageTag -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- $imageTag = trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- $imageTag = printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}
Member

Do you need to make changes to it?

Contributor Author

The code was refactored to use a more modular approach: the inline logic was split into trainer.resolveImageTag and trainer.defaultImageTag helpers for better maintainability. There are no functional changes, just better organization that lets both the manager and runtime images share the same tag resolution logic.
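For reference, the chart-version fallback that was previously inline could be expressed as a named helper roughly like this (a sketch mirroring the removed logic; the exact helper wiring in the PR may differ):

{{- /* Sketch: derive the default image tag from the chart version. */ -}}
{{- define "trainer.defaultImageTag" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}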

Member

I see, that makes sense, thanks for clarifying!

spec:
  containers:
    - name: node
      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.mlxDistributed.image.registry "repository" .Values.runtimes.mlxDistributed.image.repository "tag" .Values.runtimes.mlxDistributed.image.tag "context" .) }}
Member

Can we process it in _helpers.tpl instead of using dict here, so we can define the image here similarly to the manager: https://github.com/kubeflow/trainer/blob/91f69d72628208aa72cb36d8b3dc8fbee9395d7b/charts/kubeflow-trainer/templates/manager/deployment.yaml#L36C36-L36C41

Contributor Author

Sure, refactored the trainer.runtimeImage helper to accept values more cleanly: it now uses {{ include "trainer.runtimeImage" (list .Values.runtimes.mlxDistributed.image .) }} instead of the verbose dict syntax.
Updated all runtime templates (mlx, deepspeed, torch-with-cache, and the new torchtune).
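A helper that takes such a two-element list could look roughly like this (a sketch, not necessarily the exact implementation in the PR; it assumes a trainer.defaultImageTag helper along the lines sketched earlier):

{{- /* Sketch: build "registry/repository:tag" from an image values block
       and the root context, falling back to the chart-derived tag. */ -}}
{{- define "trainer.runtimeImage" -}}
{{- $image := index . 0 -}}
{{- $root := index . 1 -}}
{{- $tag := $image.tag | default (include "trainer.defaultImageTag" $root) -}}
{{- printf "%s/%s:%s" $image.registry $image.repository $tag -}}
{{- end -}}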

…ult runtimes enabled flag

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: Khushi Agrawal <149886195+khushiiagrawal@users.noreply.github.com>
@khushiiagrawal
Contributor Author

@andreyvelich Please take a look; addressed all the comments.
Thanks!

{{- printf "%s/%s:%s" $registry $repository $tag }}
{{- end }}
{{- end }}
{{- define "trainer.version" -}}
Member

Please add a comment to explain how the trainer.version variable is used.

Contributor Author

Added the comment. Thanks.


limitations under the License.
*/ -}}

{{- if .Values.runtimes.deepspeedDistributed.enabled }}
Member

Do you need to check for defaultEnabled too?

Contributor Author

Updated the deepspeed runtime check to respect defaultEnabled as well.

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

Addressed the comments. @andreyvelich PTAL.
Thanks!

limitations under the License.
*/ -}}

{{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }}
Member

You need to check for defaultEnabled in the mlx, torch, and torchtune runtimes too.

Contributor Author

Thanks for the review. I've updated the others (mlx, torch, and torchtune) to respect the defaultEnabled flag as well.

apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-distributed
Member

This is incorrect.
Please create subdirectories for the torchtune runtimes like we do here: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/torchtune

Contributor Author

Thanks, that makes sense. I've removed the generic template and created a templates/runtimes/torchtune/ directory with specific templates for llama3-2-1b, llama3-2-3b, and qwen2.5-1.5b to match the manifests (see the layout below).
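The resulting chart layout would look roughly like this (illustrative file names; the exact names follow the upstream manifests):

charts/kubeflow-trainer/templates/runtimes/torchtune/
  torchtune-llama3.2-1b.yaml
  torchtune-llama3.2-3b.yaml
  torchtune-qwen2.5-1.5b.yaml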

…d default enabled option

Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@khushiiagrawal
Contributor Author

@andreyvelich Please take a look. Addressed all the changes.
Thanks!

@khushiiagrawal
Contributor Author

@andreyvelich @akshaychitneni do let me know if any changes are required.
Thanks!

limitations under the License.
*/ -}}

{{- if or .Values.runtimes.torchDistributedWithCache.enabled .Values.runtimes.defaultEnabled }}
Member

torch-distributed-with-cache should not be enabled by default.
As here, can you place it in a separate folder: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/data-cache
And install it only when dataCache.enabled and torchDistributedWithCache.enabled are both set.
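In template terms, the guard being asked for would be roughly (a sketch; the value paths assume the structure discussed above):

{{- if and .Values.dataCache.enabled .Values.runtimes.torchDistributedWithCache.enabled }}
# ... torch-distributed-with-cache ClusterTrainingRuntime ...
{{- end }}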

Comment on lines +165 to +177
# -- PyTorch distributed training with data cache support
torchDistributedWithCache:
  # -- Enable deployment of torch-distributed-with-cache runtime
  enabled: false
  dataCache:
    enabled: true
    cacheImage:
      # -- Data cache image registry
      registry: ghcr.io
      # -- Data cache image repository
      repository: kubeflow/trainer/data-cache
      # -- Data cache image tag. Defaults to chart version if empty.
      tag: ""
Member

Can you separate data cache into another block in the Helm chart?
Something like this:

dataCache:
  enabled: true
  cacheImage:
    registry: ....
  runtimes:
    torchDistributed:
      enabled: false


You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.

To enable specific runtimes:
Member

Add an example where defaultEnabled is true.
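For instance, such an example could look like this (a sketch following the install commands used earlier in the PR description):

# Deploy all default runtimes
helm install kubeflow-trainer oci://ghcr.io/kubeflow/charts/kubeflow-trainer \
  --set runtimes.defaultEnabled=true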



Development

Successfully merging this pull request may close these issues.

feat(helm): Add support for deploying ClusterTrainingRuntimes via Helm chart
