docs: KEP-2839 Dynamic LLM Trainer Framework — LLMTrainer ABC + FrameworkStrategy#3263
NarayanaSabari wants to merge 11 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: *(none yet)*

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
- Strip all Low-Level Design content (code interfaces, strategies, Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (`trl sft`, not `python -m trl`)
  - Multi-node env vars (standard + PET variants)
  - Correct `enforceTorchTunePolicy` inline location
  - `dependsOn` YAML format, volume handling pattern
  - `TRLTrainerType` enum values (SFT/DPO/KTO/GRPO)
  - Container name `node`, not `trainer`
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
…P-285 alignment

- Remove `@register_backend` decorator and backend registry (YAGNI with 2 backends)
- Change `to_command()` method to `command: ClassVar[tuple[str, ...]]`
- Move `num_nodes`/`resources_per_node` to `LLMBackend` base class
- Add "Relationship to KEP-285" section for config-driven vs function-based trainers
- Simplify `KubernetesBackend` integration (no `hasattr` checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
Replace the standalone `LLMBackend` ABC with `ConfigTrainer`, which integrates into KEP-285's `BaseTrainer` type hierarchy, directly answering maintainers' open questions about how config-driven trainers fit alongside function-based trainers.

Key changes:
- `LLMBackend` → `ConfigTrainer(BaseTrainer)` for a unified type hierarchy
- `LLMBackendStrategy` → `FrameworkStrategy` (matches the framework label convention)
- `TorchTuneConfig` → `TorchTuneTrainer` with backward-compatible alias
- `TRLConfig` → `TRLTrainer` with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
Based on mentor feedback, `ConfigTrainer` is now a standalone ABC rather than a subclass of KEP-285's `BaseTrainer`. This avoids LSP violations (dead `get_train_func()` methods) and lets both hierarchies evolve independently.

Key architectural change:
- `ConfigTrainer` and `BaseTrainer` are separate ABCs for separate patterns (config-driven vs function-based)
- Both are accepted through the same `TrainerClient.train(trainer=...)` parameter for a flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds an Alternatives Considered section documenting the unified-hierarchy option and why it was rejected.
Add three diagrams:
- SDK type hierarchy showing `LLMTrainer` and `BaseTrainer` as separate ABCs
- End-to-end system architecture from Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
Add three visual diagrams to the High-Level Design section:
- Before vs After: side-by-side comparison of SDK and Go coupling points
- SDK Type Hierarchy: shows `LLMTrainer` and `BaseTrainer` as parallel ABCs feeding into the unified `TrainerClient.train(trainer=...)`
- End-to-End Flow: full stack trace of a TRL SFT job from data scientist through SDK, K8s API, and Go Torch plugin, down to pods
Make the KEP more compelling for maintainer review:
- Add User Stories section with 3 concrete data-scientist pain points
- Add "Why TRL" section with a feature comparison table
- Add Refactor Scope section showing exact file/line counts (~400 lines total to unlock every future LLM backend)
- Trim verbose implementation code (`to_args` body, `TRLStrategy` body, dispatch code) — keep interfaces and concepts, defer details to LLD
- Replace code-heavy Go sections with tables and concise explanations
Pull request overview
Adds/updates the KEP-2839 proposal doc describing a pluggable, label-dispatched, config-driven LLM trainer framework spanning the Python SDK and the Go Torch plugin, with TorchTune kept as a compatible backend and TRL introduced as a new backend.
Changes:
- Adds a new KEP document detailing the motivation, goals, architecture, and implementation plan for a dynamic LLM trainer framework.
- Documents the proposed Python-side `LLMTrainer` abstraction and Go-side `FrameworkStrategy` dispatch mechanism.
- Provides example TRL runtime manifests and user-facing SDK usage examples.
- [Go Control Plane: Command-Sniffing](#go-control-plane-command-sniffing)
- [High-Level Design](#high-level-design)
- [Architecture Overview](#architecture-overview)
- [Component Interaction Flow](#component-interaction-flow)
The Table of Contents includes an entry for “Component Interaction Flow” but there is no corresponding section heading in the document, so the link is broken and makes navigation harder; either add the missing section or remove/rename the TOC entry to match an existing heading.
- [Component Interaction Flow](#component-interaction-flow)
- [What Changes vs What Stays](#what-changes-vs-what-stays)
- [Design Details](#design-details)
- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)
The TOC link for “Python SDK: LLMTrainer Base Class” points to #python-sdk-configtrainer-base-class, but the actual section heading is “### Python SDK: LLMTrainer Base Class” (anchor #python-sdk-llmtrainer-base-class), so this TOC link won’t work; update the anchor (or the heading) to match.
Suggested change:
- Old: `- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)`
- New: `- [Python SDK: LLMTrainer Base Class](#python-sdk-llmtrainer-base-class)`
TRL uses Hugging Face's `accelerate` for distributed training, which reads standard environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) rather than the `PET_*` variants used by torchrun. The strategy injects both sets for maximum compatibility.
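The dual-injection idea can be sketched as follows. This is an illustrative Python sketch, not the PR's code (the real `TRLStrategy` lives in the Go control plane), and the naive `PET_` prefixing is an assumption for illustration — torchrun's actual `PET_*` variable set is not a 1:1 prefix of the standard names:

```python
def render_dist_env(master_addr: str, master_port: int,
                    world_size: int, rank: int) -> dict[str, str]:
    """Emit the same rendezvous facts under both naming conventions,
    so accelerate (standard names) and torchrun-style consumers
    (PET_-prefixed names) can each find what they read."""
    standard = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }
    # Naive PET_ prefixing, for illustration only.
    pet = {f"PET_{key}": value for key, value in standard.items()}
    return {**standard, **pet}
```

Injecting both sets is redundant but harmless: each launcher simply ignores the variables it does not read.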
The TRLStrategy rationale paragraph is duplicated back-to-back (“TRL uses … accelerate …” appears twice), which is likely accidental and adds noise; remove one of the two paragraphs to keep the section concise.
Suggested change: delete the second, back-to-back copy of the "TRL uses Hugging Face's `accelerate` …" paragraph.
1. A `LLMTrainer` ABC in the Python SDK — a **separate abstraction** from KEP-285's `BaseTrainer`, purpose-built for **config-driven trainers** where the framework's own CLI is the entrypoint (e.g., `trl sft ...`, `tune run ...`). Both ABCs are accepted through the same `TrainerClient.train(trainer=...)` parameter, giving data scientists a flat, unified API.
The PR title/description emphasize “ConfigTrainer for BaseTrainer hierarchy”, but the KEP text proposes a separate LLMTrainer ABC that explicitly does not extend BaseTrainer; align the KEP terminology/architecture with the PR metadata (or update the PR metadata) to avoid confusion for reviewers and implementers.
Summary
KEP-2839: Dynamic LLM Trainer Framework proposal.
Kubeflow Trainer is locked to TorchTune for LLM fine-tuning. TorchTune stopped adding features in July 2025 (pytorch/torchtune#2883), supports only 4 models, and only SFT. This KEP introduces a pluggable framework so we can add TRL and future backends with ~400 lines of new/changed code.
The Problem (User Stories)
… `BuiltinTrainer` internals. After: one-line change, `TorchTuneTrainer(...)` → `TRLTrainer(...)`.

What This KEP Proposes
SDK (Python) — `LLMTrainer` ABC:
- Separate from `BaseTrainer` (config-driven vs function-based are architecturally distinct)
- Both accepted through the same `TrainerClient.train(trainer=...)` — flat, unified UX
- `TorchTuneTrainer(LLMTrainer)` — refactored from `TorchTuneConfig` with backward-compatible alias
- `TRLTrainer(LLMTrainer)` — SFT/DPO/KTO/GRPO support
- Dispatch via the `trainer.kubeflow.org/framework` label
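The flat-API claim can be illustrated with stubs. Every name below is a stand-in for the real SDK, sketched only to show the one-line backend swap at the call site:

```python
from dataclasses import dataclass


@dataclass
class TorchTuneTrainer:
    """Stub for the config-driven TorchTune backend."""
    model: str
    num_nodes: int = 1


@dataclass
class TRLTrainer:
    """Stub for the config-driven TRL backend."""
    model: str
    num_nodes: int = 1


class TrainerClient:
    """Stub client. The real one builds a TrainJob; the framework label
    (derived from the trainer type) drives Go-side strategy dispatch."""

    def train(self, trainer) -> dict:
        return {
            "trainer.kubeflow.org/framework": type(trainer).__name__.lower(),
            "num_nodes": trainer.num_nodes,
        }


client = TrainerClient()
# Switching backends is a one-line change at the call site:
job = client.train(trainer=TRLTrainer(model="Qwen/Qwen2-0.5B", num_nodes=2))
```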
Go Control Plane — `FrameworkStrategy` interface:
- `TorchTuneStrategy` — wraps existing logic (moved, not rewritten)
- `TRLStrategy` — accelerate-compatible env var injection
- New backends plug in via the strategy map, e.g. `"unsloth": &UnslothStrategy{}`

Infrastructure:
- `ClusterTrainingRuntime` manifests + Helm chart additions

Refactor Scope (Minimal)
- `types.py` — `LLMTrainer` ABC + `TRLTrainer` + rename
- `utils.py` — `config.command` / `config.to_args()`
- `torch/constants.go` — `FrameworkLabel`

~400 lines total. TorchTune code is moved, not rewritten. Zero breaking changes.
Type Hierarchy
Relationship to KEP-285
This KEP answers the open question from @andreyvelich about how config-driven trainers fit into the trainer hierarchy.
`LLMTrainer` and `BaseTrainer` are separate ABCs — no LSP violations, independent evolution, unified API entry point. See discussion on KEP-285.

Non-Goals
- No changes to `BuiltinTrainer` or `CustomTrainer`

Builds on KEP-2401 and community consensus on "Plan 3" from #2752.
Tracking issue: #2839
/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher