docs: KEP-2839 Dynamic LLM Trainer Framework — LLMTrainer ABC + FrameworkStrategy#3263
NarayanaSabari wants to merge 11 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: *(none yet)*

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
- Strip all Low-Level Design content (code interfaces, strategies, Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (`trl sft`, not `python -m trl`)
  - Multi-node env vars (standard + PET variants)
  - Correct `enforceTorchTunePolicy` inline location
  - `dependsOn` YAML format, volume handling pattern
  - `TRLTrainerType` enum values (SFT/DPO/KTO/GRPO)
  - Container name `node`, not `trainer`
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
…P-285 alignment

- Remove `@register_backend` decorator and backend registry (YAGNI with 2 backends)
- Change `to_command()` method to `command: ClassVar[tuple[str, ...]]`
- Move `num_nodes`/`resources_per_node` to `LLMBackend` base class
- Add "Relationship to KEP-285" section for config-driven vs function-based trainers
- Simplify `KubernetesBackend` integration (no `hasattr` checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
Replace the standalone `LLMBackend` ABC with `ConfigTrainer`, which integrates into KEP-285's `BaseTrainer` type hierarchy, directly answering maintainers' open questions about how config-driven trainers fit alongside function-based trainers.

Key changes:
- `LLMBackend` → `ConfigTrainer(BaseTrainer)` for a unified type hierarchy
- `LLMBackendStrategy` → `FrameworkStrategy` (matches the framework label convention)
- `TorchTuneConfig` → `TorchTuneTrainer` with backward-compatible alias
- `TRLConfig` → `TRLTrainer` with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
Based on mentor feedback, `ConfigTrainer` is now a standalone ABC rather than a subclass of KEP-285's `BaseTrainer`. This avoids LSP violations (dead `get_train_func()` methods) and lets both hierarchies evolve independently.

Key architectural change:
- `ConfigTrainer` and `BaseTrainer` are separate ABCs for separate patterns (config-driven vs function-based)
- Both are accepted through the same `TrainerClient.train(trainer=...)` parameter for a flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds an Alternatives Considered section documenting the unified-hierarchy option and why it was rejected.
Add three diagrams:
- SDK type hierarchy showing `LLMTrainer` and `BaseTrainer` as separate ABCs
- End-to-end system architecture from Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
Add three visual diagrams to the High-Level Design section:
- Before vs After: side-by-side comparison of SDK and Go coupling points
- SDK Type Hierarchy: shows `LLMTrainer` and `BaseTrainer` as parallel ABCs feeding into the unified `TrainerClient.train(trainer=...)`
- End-to-End Flow: full stack trace of a TRL SFT job from data scientist through SDK, K8s API, and Go Torch plugin, down to pods
Make the KEP more compelling for maintainer review:
- Add User Stories section with 3 concrete data-scientist pain points
- Add "Why TRL" section with a feature comparison table
- Add Refactor Scope section showing exact file/line counts (~400 lines total to unlock every future LLM backend)
- Trim verbose implementation code (`to_args` body, `TRLStrategy` body, dispatch code) — keep interfaces and concepts, defer details to LLD
- Replace code-heavy Go sections with tables and concise explanations
Pull request overview
Adds/updates the KEP-2839 proposal doc describing a pluggable, label-dispatched, config-driven LLM trainer framework spanning the Python SDK and the Go Torch plugin, with TorchTune kept as a compatible backend and TRL introduced as a new backend.
Changes:
- Adds a new KEP document detailing the motivation, goals, architecture, and implementation plan for a dynamic LLM trainer framework.
- Documents the proposed Python-side `LLMTrainer` abstraction and Go-side `FrameworkStrategy` dispatch mechanism.
- Provides example TRL runtime manifests and user-facing SDK usage examples.
- [Go Control Plane: Command-Sniffing](#go-control-plane-command-sniffing)
- [High-Level Design](#high-level-design)
- [Architecture Overview](#architecture-overview)
- [Component Interaction Flow](#component-interaction-flow)
The Table of Contents includes an entry for “Component Interaction Flow” but there is no corresponding section heading in the document, so the link is broken and makes navigation harder; either add the missing section or remove/rename the TOC entry to match an existing heading.
- [Component Interaction Flow](#component-interaction-flow)
- [What Changes vs What Stays](#what-changes-vs-what-stays)
- [Design Details](#design-details)
- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)
The TOC link for “Python SDK: LLMTrainer Base Class” points to #python-sdk-configtrainer-base-class, but the actual section heading is “### Python SDK: LLMTrainer Base Class” (anchor #python-sdk-llmtrainer-base-class), so this TOC link won’t work; update the anchor (or the heading) to match.
Suggested change:
- Old: `- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)`
- New: `- [Python SDK: LLMTrainer Base Class](#python-sdk-llmtrainer-base-class)`
TRL uses Hugging Face's `accelerate` for distributed training, which reads standard environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) rather than the `PET_*` variants used by torchrun. The strategy injects both sets for maximum compatibility.
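The dual-injection idea can be sketched as follows. This is an illustrative Python sketch, not the PR's code (the real `TRLStrategy` lives in the Go control plane), and the naive `PET_` prefixing is an assumption for illustration — torchrun's actual `PET_*` variable set is not a 1:1 prefix of the standard names:

```python
def render_dist_env(master_addr: str, master_port: int,
                    world_size: int, rank: int) -> dict[str, str]:
    """Emit the same rendezvous facts under both naming conventions,
    so accelerate (standard names) and torchrun-style consumers
    (PET_-prefixed names) can each find what they read."""
    standard = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "WORLD_SIZE": str(world_size),
        "RANK": str(rank),
    }
    # Naive PET_ prefixing, for illustration only.
    pet = {f"PET_{key}": value for key, value in standard.items()}
    return {**standard, **pet}
```

Injecting both sets is redundant but harmless: each launcher simply ignores the variables it does not read.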
The TRLStrategy rationale paragraph is duplicated back-to-back (“TRL uses … accelerate …” appears twice), which is likely accidental and adds noise; remove one of the two paragraphs to keep the section concise.
Suggested change: delete the second, back-to-back copy of the "TRL uses Hugging Face's `accelerate` …" paragraph.
1. A `LLMTrainer` ABC in the Python SDK — a **separate abstraction** from KEP-285's `BaseTrainer`, purpose-built for **config-driven trainers** where the framework's own CLI is the entrypoint (e.g., `trl sft ...`, `tune run ...`). Both ABCs are accepted through the same `TrainerClient.train(trainer=...)` parameter, giving data scientists a flat, unified API.
The PR title/description emphasize “ConfigTrainer for BaseTrainer hierarchy”, but the KEP text proposes a separate LLMTrainer ABC that explicitly does not extend BaseTrainer; align the KEP terminology/architecture with the PR metadata (or update the PR metadata) to avoid confusion for reviewers and implementers.
Summary
KEP-2839: Dynamic LLM Trainer Framework proposal.
Kubeflow Trainer is locked to TorchTune for LLM fine-tuning. TorchTune stopped adding features in July 2025 (pytorch/torchtune#2883), supports only 4 models, and only SFT. This KEP introduces a pluggable framework so we can add TRL and future backends with ~400 lines of new/changed code.
The Problem (User Stories)
… `BuiltinTrainer` internals. After: one-line change, `TorchTuneTrainer(...)` → `TRLTrainer(...)`.

What This KEP Proposes
SDK (Python) — `LLMTrainer` ABC:
- Separate from `BaseTrainer` (config-driven vs function-based are architecturally distinct)
- Both accepted through the same `TrainerClient.train(trainer=...)` — flat, unified UX
- `TorchTuneTrainer(LLMTrainer)` — refactored from `TorchTuneConfig` with backward-compatible alias
- `TRLTrainer(LLMTrainer)` — SFT/DPO/KTO/GRPO support
- Dispatch via the `trainer.kubeflow.org/framework` label
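The flat-API claim can be illustrated with stubs. Every name below is a stand-in for the real SDK, sketched only to show the one-line backend swap at the call site:

```python
from dataclasses import dataclass


@dataclass
class TorchTuneTrainer:
    """Stub for the config-driven TorchTune backend."""
    model: str
    num_nodes: int = 1


@dataclass
class TRLTrainer:
    """Stub for the config-driven TRL backend."""
    model: str
    num_nodes: int = 1


class TrainerClient:
    """Stub client. The real one builds a TrainJob; the framework label
    (derived from the trainer type) drives Go-side strategy dispatch."""

    def train(self, trainer) -> dict:
        return {
            "trainer.kubeflow.org/framework": type(trainer).__name__.lower(),
            "num_nodes": trainer.num_nodes,
        }


client = TrainerClient()
# Switching backends is a one-line change at the call site:
job = client.train(trainer=TRLTrainer(model="Qwen/Qwen2-0.5B", num_nodes=2))
```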
Go Control Plane — `FrameworkStrategy` interface:
- `TorchTuneStrategy` — wraps existing logic (moved, not rewritten)
- `TRLStrategy` — accelerate-compatible env var injection
- New backends plug in via the strategy map, e.g. `"unsloth": &UnslothStrategy{}`

Infrastructure:
- `ClusterTrainingRuntime` manifests + Helm chart additions

Refactor Scope (Minimal)
- `types.py` — `LLMTrainer` ABC + `TRLTrainer` + rename
- `utils.py` — `config.command` / `config.to_args()`
- `torch/constants.go` — `FrameworkLabel`

~400 lines total. TorchTune code is moved, not rewritten. Zero breaking changes.
Type Hierarchy
Relationship to KEP-285
This KEP answers the open question from @andreyvelich about how config-driven trainers fit into the trainer hierarchy.
`LLMTrainer` and `BaseTrainer` are separate ABCs — no LSP violations, independent evolution, unified API entry point. See discussion on KEP-285.

Non-Goals
- No changes to `BuiltinTrainer` or `CustomTrainer`

Builds on KEP-2401 and community consensus on "Plan 3" from #2752.
Tracking issue: #2839
/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher