feat(runtimes): add support for ClusterTrainingRuntimes in Helm chart #3124
khushiiagrawal wants to merge 12 commits into kubeflow:master from
Conversation
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
@akshaychitneni @andreyvelich I have made all the changes that were discussed and needed to add support for ClusterTrainingRuntimes in the Helm chart.
Pull Request Test Coverage Report for Build 21483070842

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Pull request overview
This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation, addressing issue #3115. Previously, these runtimes were only available via kustomize manifests, requiring users to manually apply them separately after Helm installation.
Changes:
- Added `runtimes` section in `values.yaml` with configurable options for four runtime types: `torchDistributed`, `torchDistributedWithCache`, `deepspeedDistributed`, and `mlxDistributed` (a condensed sketch of the resulting values shape follows this list)
- Created four Helm template files for conditionally deploying ClusterTrainingRuntimes when enabled
- Added helper templates in `_helpers.tpl` for generating runtime image references with automatic tag defaulting
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Summary per file:
| File | Description |
|---|---|
| charts/kubeflow-trainer/values.yaml | Adds runtimes configuration section with enable flags and image settings for each runtime type |
| charts/kubeflow-trainer/templates/runtimes/torch-distributed.yaml | Template for basic PyTorch distributed training runtime |
| charts/kubeflow-trainer/templates/runtimes/torch-distributed-with-cache.yaml | Template for PyTorch runtime with data cache support and dataset initializer |
| charts/kubeflow-trainer/templates/runtimes/deepspeed-distributed.yaml | Template for DeepSpeed distributed training runtime with MPI support |
| charts/kubeflow-trainer/templates/runtimes/mlx-distributed.yaml | Template for MLX distributed training runtime with MPI support |
| charts/kubeflow-trainer/templates/_helpers.tpl | Adds helper functions for default image tag generation and runtime image construction |
| charts/kubeflow-trainer/examples/runtimes-values.yaml | Example configuration file demonstrating runtime enablement and customization |
| charts/kubeflow-trainer/README.md.gotmpl | Documentation updates with installation instructions for ClusterTrainingRuntimes |
| charts/kubeflow-trainer/README.md | Generated documentation with runtime configuration examples and values table |
Resolved review threads:
- charts/kubeflow-trainer/templates/runtimes/torch-distributed-with-cache.yaml (outdated)
- charts/kubeflow-trainer/templates/runtimes/torch-distributed-with-cache.yaml
- charts/kubeflow-trainer/templates/runtimes/torch-distributed.yaml (outdated)
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
```yaml
torchDistributedWithCache:
  enabled: true
  # Custom image tags (optional, defaults to chart version)
  cacheImage:
```
Should it be:

```yaml
torchDistributedWithCache:
  enabled: true
  dataCache:
    enabled: true
    cacheImage:
      tag: "v2.0.0"
```
Thanks for the review. I've updated the configuration to nest cacheImage under dataCache.
The change has been applied across all relevant files: values.yaml, README.md.gotmpl and the runtime template.
Can we place the runtimes related to dataCache under dataCache for now?
I think that will be clearer for users:
```yaml
dataCache:
  enabled: true
  runtimes:
    torchDistributedWithCache:
      enabled: true
```

```
{{- /*
Copyright 2025 The Kubeflow authors.
```
missed it again, updated now. thanks.
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@akshaychitneni @andreyvelich Please take a look, addressed all the comments.
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
```yaml
runtimes:
  # -- PyTorch distributed training runtime (no custom images required)
```

Maybe we can also have the flag to deploy all default runtimes (e.g. torch, deepspeed, mlx, torchtune runtimes) for simplicity?

Suggested change:

```yaml
runtimes:
  defaultEnabled: false
  # -- PyTorch distributed training runtime (no custom images required)
```
Yeah, right. I've added a defaultEnabled flag at the top of the runtimes section.
When set to true, it deploys all default runtimes (torch, deepspeed, mlx, torchtune). Individual runtime settings remain available for granular control when needed; see the sketch below.
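For illustration, a minimal sketch of how such a gate might look at the top of a runtime template, assuming the flag names discussed above; the exact condition and manifest body in the PR may differ:

```yaml
{{- /* Render the runtime when it is enabled explicitly or via the global defaultEnabled flag */ -}}
{{- if or .Values.runtimes.torchDistributed.enabled .Values.runtimes.defaultEnabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  # ... runtime spec omitted in this sketch ...
{{- end }}
```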
```yaml
# -- MLX runtime image repository
repository: kubeflow/trainer/mlx-runtime
# -- MLX runtime image tag. Defaults to chart version if empty.
tag: ""
```
You should also add TorchTune runtimes.
added TorchTune runtime support
thanks!
```
{{- $imageTag := .Values.image.tag -}}
{{- if not $imageTag -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- $imageTag = trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- $imageTag = printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}
```
Do you need to make changes to it?
The code was refactored to use a more modular approach - the inline logic was split into trainer.resolveImageTag and trainer.defaultImageTag helpers for better maintainability. No functional changes, just better organization that lets both manager and runtime images share the same tag resolution logic; a sketch is included below.
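A minimal sketch of what the split helpers might look like, assuming the helper names mentioned above and a (tag, context) list argument for trainer.resolveImageTag; the actual definitions in the PR may differ:

```yaml
{{- /* Hypothetical reconstruction: derive the default image tag from the chart version */ -}}
{{- define "trainer.defaultImageTag" -}}
{{- if hasPrefix "0.0.0-" .Chart.Version -}}
{{- trimPrefix "0.0.0-" .Chart.Version -}}
{{- else -}}
{{- printf "v%s" .Chart.Version -}}
{{- end -}}
{{- end -}}

{{- /* Hypothetical reconstruction: use an explicit tag when set, otherwise fall back to the default */ -}}
{{- define "trainer.resolveImageTag" -}}
{{- $tag := index . 0 -}}
{{- $context := index . 1 -}}
{{- if $tag -}}{{- $tag -}}{{- else -}}{{- include "trainer.defaultImageTag" $context -}}{{- end -}}
{{- end -}}
```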
I see, that makes sense, thanks for clarifying!
```yaml
spec:
  containers:
    - name: node
      image: {{ include "trainer.runtimeImage" (dict "registry" .Values.runtimes.mlxDistributed.image.registry "repository" .Values.runtimes.mlxDistributed.image.repository "tag" .Values.runtimes.mlxDistributed.image.tag "context" .) }}
```
Can we process it in the _helper.tpl instead of using dict here, so we can define image here similarly as for manager: https://github.com/kubeflow/trainer/blob/91f69d72628208aa72cb36d8b3dc8fbee9395d7b/charts/kubeflow-trainer/templates/manager/deployment.yaml#L36C36-L36C41
Sure, refactored the trainer.runtimeImage helper to accept values more cleanly - it now uses {{ include "trainer.runtimeImage" (list .Values.runtimes.mlxDistributed.image .) }} instead of the verbose dict syntax.
Updated all runtime templates (mlx, deepspeed, torch-with-cache and the new torchtune); a sketch of the helper follows.
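A minimal sketch of the list-based helper, reusing the hypothetical trainer.resolveImageTag from the earlier sketch; the argument order (image values first, chart context second) is assumed from the include call quoted above:

```yaml
{{- /* Build "registry/repository:tag" for a runtime container image */ -}}
{{- define "trainer.runtimeImage" -}}
{{- $image := index . 0 -}}
{{- $context := index . 1 -}}
{{- $tag := include "trainer.resolveImageTag" (list $image.tag $context) -}}
{{- printf "%s/%s:%s" $image.registry $image.repository $tag -}}
{{- end -}}
```

With a definition along these lines, a template line such as image: {{ include "trainer.runtimeImage" (list .Values.runtimes.mlxDistributed.image .) }} would render something like ghcr.io/kubeflow/trainer/mlx-runtime:v2.0.0 (tag shown for illustration only).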
…ult runtimes enabled flag Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: Khushi Agrawal <149886195+khushiiagrawal@users.noreply.github.com>
@andreyvelich Please take a look. Addressed all the comments.
```
{{- printf "%s/%s:%s" $registry $repository $tag }}
{{- end }}
{{- end }}
{{- define "trainer.version" -}}
```
Please add a comment to explain how the trainer.version variable is used.
added the comment. thanks.
```
limitations under the License.
*/ -}}

{{- if .Values.runtimes.deepspeedDistributed.enabled }}
```
Do you need to check for defaultEnabled too?
updated the deepspeed runtime check to respect defaultEnabled as well.
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Addressed the comments. @andreyvelich PTAL.
```
limitations under the License.
*/ -}}

{{- if or .Values.runtimes.deepspeedDistributed.enabled .Values.runtimes.defaultEnabled }}
```
You need to check for defaultEnabled in these runtimes too: mlx, torch, torchtune runtimes.
Thanks for the review. I've updated the others (mlx, torch and torchtune) to respect the defaultEnabled flag as well.
```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torchtune-distributed
```
this is incorrect.
Please create subdirectories for torchtune runtimes like we do here: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/torchtune
Thanks, that makes sense. I've removed the generic template and created a templates/runtimes/torchtune/ directory with specific templates for llama3-2-1b, llama3-2-3b and qwen2.5-1.5b to match the manifests; the layout is sketched below.
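For reference, the resulting layout might look like this, assuming the file names mirror the runtime names mentioned above:

```
charts/kubeflow-trainer/templates/runtimes/torchtune/
├── llama3-2-1b.yaml
├── llama3-2-3b.yaml
└── qwen2.5-1.5b.yaml
```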
…d default enabled option Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
Signed-off-by: khushiiagrawal <khushisaritaagrawal@gmail.com>
@andreyvelich Please take a look. Addressed all the changes.
@andreyvelich @akshaychitneni do let me know if any changes are required.
```
limitations under the License.
*/ -}}

{{- if or .Values.runtimes.torchDistributedWithCache.enabled .Values.runtimes.defaultEnabled }}
```
torch distributed with cache should not be enabled by default.
As here, can you place it in a separate folder: https://github.com/kubeflow/trainer/tree/master/manifests/base/runtimes/data-cache
And install it only when dataCache.enabled and torchDistributedWithCache.enabled (see the sketch below).
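A minimal sketch of the combined gate being requested, assuming the value names used elsewhere in this thread; the final template in the PR may differ:

```yaml
{{- /* Install the cache-backed runtime only when both data cache and this runtime are enabled */ -}}
{{- if and .Values.dataCache.enabled .Values.runtimes.torchDistributedWithCache.enabled }}
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-with-cache
spec:
  # ... runtime spec omitted in this sketch ...
{{- end }}
```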
```yaml
# -- PyTorch distributed training with data cache support
torchDistributedWithCache:
  # -- Enable deployment of torch-distributed-with-cache runtime
  enabled: false
  dataCache:
    enabled: true
    cacheImage:
      # -- Data cache image registry
      registry: ghcr.io
      # -- Data cache image repository
      repository: kubeflow/trainer/data-cache
      # -- Data cache image tag. Defaults to chart version if empty.
      tag: ""
```
Can you separate data cache to another block in Helm Charts?
Something like this:
```yaml
dataCache:
  enabled: true
  cacheImage:
    registry: ....
runtimes:
  torchDistributed:
    enable: false
```
> You can optionally deploy ClusterTrainingRuntimes as part of the Helm installation. Runtimes are disabled by default to keep the chart lightweight.
>
> To enable specific runtimes:
Add an example where defaultEnabled is true.
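A minimal sketch of what such a README example might look like, assuming the defaultEnabled flag added earlier in this thread:

```yaml
# values override that deploys all default runtimes (torch, deepspeed, mlx, torchtune)
runtimes:
  defaultEnabled: true
```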
PR Description
What this PR does / why we need it:
This PR adds optional support for deploying ClusterTrainingRuntimes as part of the Kubeflow Trainer Helm chart installation.
Fixes #3115
Changes
- `runtimes` section in `values.yaml` with configurable options for:
  - `torchDistributed` - PyTorch distributed training (no custom images required)
  - `torchDistributedWithCache` - PyTorch with data cache support (configurable `cacheImage`)
  - `deepspeedDistributed` - DeepSpeed distributed training (configurable `image`)
  - `mlxDistributed` - MLX distributed training (configurable `image`)
- Helm templates for conditionally deploying the runtimes:
  - `torch-distributed.yaml`
  - `torch-distributed-with-cache.yaml`
  - `deepspeed-distributed.yaml`
  - `mlx-distributed.yaml`
- `trainer.runtimeImage` helper in `_helpers.tpl` for generating runtime image references with automatic tag defaulting to chart version
- Example configuration `runtimes-values.yaml`
- `README.md` updated with usage instructions

Design Decisions
Following the feedback:
- Each runtime is gated behind an `enabled` flag
- Image tags default to `""`, which automatically uses the same `imageTag` as the controller (via the `trainer.defaultImageTag` helper)
- For runtimes without custom images (e.g. `torch-distributed`), only the `enabled` flag is exposed
- `initializerImage` is NOT user-configurable - it uses a hardcoded image path. This may be set via Trainer config in the future as part of "Support user-specified initializers in TrainJob when runtime has none" #2886

Usage
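A minimal sketch of enabling specific runtimes through a values override, assuming the keys shown in the values excerpts above; the exact snippet in the PR description may differ:

```yaml
# custom-values.yaml (hypothetical file name)
runtimes:
  defaultEnabled: false
  torchDistributed:
    enabled: true
  deepspeedDistributed:
    enabled: true
    image:
      tag: ""  # empty tag falls back to the chart version
```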
Checklist:
- Docs included if any changes are user facing (README.md updated with usage instructions)