
docs: KEP-2839 Dynamic LLM Trainer Framework — LLMTrainer ABC + FrameworkStrategy #3263

Open
NarayanaSabari wants to merge 11 commits into kubeflow:master from NarayanaSabari:kep-2839-dynamic-llm-trainer

Conversation


@NarayanaSabari NarayanaSabari commented Feb 27, 2026

Summary

KEP-2839: Dynamic LLM Trainer Framework proposal.

Kubeflow Trainer is locked to TorchTune for LLM fine-tuning. TorchTune stopped adding features in July 2025 (pytorch/torchtune#2883), supports only 4 models, and only SFT. This KEP introduces a pluggable framework so we can add TRL and future backends with ~400 lines of new/changed code.

The Problem (User Stories)

  • "I want DPO alignment, but Kubeflow only supports SFT via TorchTune." — TorchTune doesn't support DPO/KTO/GRPO. Users fall back to raw YAML or leave Kubeflow.
  • "I want to use a newer model that TorchTune doesn't have recipes for." — TorchTune supports 4 hardcoded models. TRL works with any Hugging Face model.
  • "I want to switch frameworks without relearning the SDK." — Today: modify BuiltinTrainer internals. After: one-line change TorchTuneTrainer(...) → TRLTrainer(...).

What This KEP Proposes

SDK (Python) — LLMTrainer ABC:

  • Separate ABC from KEP-285's BaseTrainer (config-driven vs function-based are architecturally distinct)
  • Both accepted through the same TrainerClient.train(trainer=...) — flat, unified UX
  • TorchTuneTrainer(LLMTrainer) — refactored from TorchTuneConfig with backward-compatible alias
  • TRLTrainer(LLMTrainer) — SFT/DPO/KTO/GRPO support
  • Runtime auto-discovery via trainer.kubeflow.org/framework label
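The SDK side described above could be sketched roughly as follows. The `LLMTrainer`/`TRLTrainer` names, the `command: ClassVar[tuple[str, ...]]` shape, and the `trainer.kubeflow.org/framework` label come from this KEP; the field names, `to_args()` signature, and defaults are illustrative assumptions, not the actual SDK API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import ClassVar

# Label key used for runtime auto-discovery (from the KEP).
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


class LLMTrainer(ABC):
    """Config-driven trainer base: the framework's own CLI is the entrypoint."""

    # Value matched against the runtime's framework label.
    framework: ClassVar[str]
    # CLI command prefix, e.g. ("tune", "run") or ("trl", "sft").
    command: ClassVar[tuple[str, ...]]

    @abstractmethod
    def to_args(self) -> list[str]:
        """Serialize this config into CLI arguments for `command`."""


@dataclass
class TRLTrainer(LLMTrainer):
    """Hypothetical TRL config; fields are illustrative, not the real SDK."""

    framework: ClassVar[str] = "trl"
    command: ClassVar[tuple[str, ...]] = ("trl", "sft")

    model: str = "Qwen/Qwen2-0.5B"
    num_train_epochs: int = 1

    def to_args(self) -> list[str]:
        return [
            f"--model_name_or_path={self.model}",
            f"--num_train_epochs={self.num_train_epochs}",
        ]
```

A `TorchTuneTrainer` would follow the same pattern with `command = ("tune", "run")`, which is what lets `utils.py` stay generic over `config.command` / `config.to_args()`.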

Go Control Plane — FrameworkStrategy interface:

  • Replaces hardcoded command-sniffing with label-based dispatch
  • TorchTuneStrategy — wraps existing logic (moved, not rewritten)
  • TRLStrategy — accelerate-compatible env var injection
  • Adding a future backend = one line: "unsloth": &UnslothStrategy{}
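The label-based dispatch above can be sketched as a map from framework label value to strategy. The `FrameworkStrategy` name and the one-line-per-backend map come from the KEP; the method set and env-var details here are illustrative guesses at the shape, not the actual plugin code.

```go
package main

import "fmt"

// FrameworkStrategy customizes the trainer pod for one LLM backend.
// The single-method interface here is an illustrative assumption.
type FrameworkStrategy interface {
	// EnvVars returns environment variables to inject into the trainer
	// container for a job spanning n nodes.
	EnvVars(n int) map[string]string
}

type TorchTuneStrategy struct{}

func (TorchTuneStrategy) EnvVars(n int) map[string]string {
	// torchrun-style PET_* variables (illustrative subset).
	return map[string]string{"PET_NNODES": fmt.Sprint(n)}
}

type TRLStrategy struct{}

func (TRLStrategy) EnvVars(n int) map[string]string {
	// accelerate reads standard torch.distributed variables
	// (illustrative subset; the KEP injects both sets).
	return map[string]string{"WORLD_SIZE": fmt.Sprint(n)}
}

// Dispatch on the trainer.kubeflow.org/framework label value.
var strategies = map[string]FrameworkStrategy{
	"torchtune": TorchTuneStrategy{},
	"trl":       TRLStrategy{},
	// A future backend really is one line, e.g.:
	// "unsloth": UnslothStrategy{},
}

func main() {
	if s, ok := strategies["trl"]; ok {
		fmt.Println(s.EnvVars(4)["WORLD_SIZE"])
	}
}
```

This replaces command-sniffing: the plugin looks up the label once instead of inspecting the container command to guess the framework.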

Infrastructure:

  • TRL container image + ClusterTrainingRuntime manifests + Helm chart additions

Refactor Scope (Minimal)

| Layer | Change | Lines |
| --- | --- | --- |
| SDK `types.py` | Add `LLMTrainer` ABC + `TRLTrainer` + rename | ~95 new |
| SDK `utils.py` | Generic `config.command` / `config.to_args()` | ~20 changed |
| Go `torch/` | Extract TorchTune → strategy, add TRL strategy | ~80 moved, ~65 new |
| Go `constants.go` | Add `FrameworkLabel` | 1 new |
| Infra | Dockerfile + YAML manifest + Helm | ~100 new |

~400 lines total. TorchTune code is moved, not rewritten. Zero breaking changes.

Type Hierarchy

   TrainerClient.train(trainer=...)
   ─────────────────────────────────
   BaseTrainer (ABC)          LLMTrainer (ABC)
   (KEP-285, func-based)     (This KEP, config-driven)
      │                         │
      ├── TorchTrainer          ├── TorchTuneTrainer
      ├── JAXTrainer            └── TRLTrainer
      └── XGBoostTrainer
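The "flat, unified UX" claim above means both ABC families flow through one entry point. A minimal stub sketch of that idea, with all class bodies and the `train` signature invented for illustration (the real `TrainerClient.train` does far more):

```python
from abc import ABC


class BaseTrainer(ABC):
    """KEP-285 function-based trainer family (stub)."""


class LLMTrainer(ABC):
    """Config-driven trainer family from this KEP (stub)."""


class TorchTrainer(BaseTrainer):
    def __init__(self, func):
        self.func = func  # user-supplied training function


class TRLTrainer(LLMTrainer):
    def __init__(self, model):
        self.model = model  # framework CLI is the entrypoint


def train(trainer):
    """Stub of TrainerClient.train: one parameter accepts either family."""
    if not isinstance(trainer, (BaseTrainer, LLMTrainer)):
        raise TypeError(f"unsupported trainer: {type(trainer).__name__}")
    return type(trainer).__name__
```

The point is that no inheritance link between the two ABCs is needed for a unified API: acceptance at the `train(trainer=...)` boundary is enough, which is what avoids dead `get_train_func()` methods on config-driven trainers.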

Relationship to KEP-285

This KEP answers the open question from @andreyvelich about how config-driven trainers fit into the trainer hierarchy. LLMTrainer and BaseTrainer are separate ABCs — no LSP violations, independent evolution, unified API entry point. See discussion on KEP-285.

Non-Goals

  • Unsloth or LlamaFactory backends (future work)
  • CRD schema changes
  • Deprecating BuiltinTrainer or CustomTrainer

Builds on KEP-2401 and community consensus on "Plan 3" from #2752.

Tracking issue: #2839

/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

- Strip all Low-Level Design content (code interfaces, strategies,
  Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (trl sft, not python -m trl)
  - Multi-node env vars (standard + PET variants)
  - Correct enforceTorchTunePolicy inline location
  - dependsOn YAML format, volume handling pattern
  - TRLTrainerType enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node' not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State
  Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
…P-285 alignment

- Remove @register_backend decorator and backend registry (YAGNI with 2 backends)
- Change to_command() method to command: ClassVar[tuple[str, ...]]
- Move num_nodes/resources_per_node to LLMBackend base class
- Add Relationship to KEP-285 section for config-driven vs function-based trainers
- Simplify KubernetesBackend integration (no hasattr checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
Replace standalone LLMBackend ABC with ConfigTrainer that integrates into
KEP-285's BaseTrainer type hierarchy, directly answering open questions from
maintainers about how config-driven trainers fit alongside function-based
trainers.

Key changes:
- LLMBackend → ConfigTrainer(BaseTrainer) for unified type hierarchy
- LLMBackendStrategy → FrameworkStrategy (matches framework label convention)
- TorchTuneConfig → TorchTuneTrainer with backward-compatible alias
- TRLConfig → TRLTrainer with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
@NarayanaSabari NarayanaSabari changed the title from "docs: add KEP-2839 Dynamic LLM Trainer Framework proposal" to "docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy" on Mar 28, 2026
@NarayanaSabari NarayanaSabari marked this pull request as ready for review March 28, 2026 16:11
Sabari added 5 commits March 31, 2026 16:05
Based on mentor feedback, ConfigTrainer is now a standalone ABC rather than
a subclass of KEP-285's BaseTrainer. This avoids LSP violations (dead
get_train_func() methods) and allows both hierarchies to evolve independently.

Key architectural change:
- ConfigTrainer and BaseTrainer are separate ABCs for separate patterns
  (config-driven vs function-based)
- Both accepted through same TrainerClient.train(trainer=...) parameter
  for flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds Alternatives Considered section documenting the unified hierarchy
option and why it was rejected.
Add three diagrams:
- SDK type hierarchy showing LLMTrainer and BaseTrainer as separate ABCs
- End-to-end system architecture from Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
Add three visual diagrams to the High-Level Design section:
- Before vs After: side-by-side comparison of SDK and Go coupling points
- SDK Type Hierarchy: shows LLMTrainer and BaseTrainer as parallel ABCs
  feeding into unified TrainerClient.train(trainer=...)
- End-to-End Flow: full stack trace of a TRL SFT job from data scientist
  through SDK, K8s API, Go Torch plugin, down to pods
Make the KEP more compelling for maintainer review:
- Add User Stories section with 3 concrete data scientist pain points
- Add Why TRL section with feature comparison table
- Add Refactor Scope section showing exact file/line counts (~400 lines
  total to unlock every future LLM backend)
- Trim verbose implementation code (to_args body, TRLStrategy body,
  dispatch code) — keep interfaces and concepts, defer details to LLD
- Replace code-heavy Go sections with tables and concise explanations
Copilot AI review requested due to automatic review settings April 1, 2026 05:13
@NarayanaSabari NarayanaSabari changed the title from "docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy" to "docs: KEP-2839 Dynamic LLM Trainer Framework — LLMTrainer ABC + FrameworkStrategy" on Apr 1, 2026

Copilot AI left a comment


Pull request overview

Adds/updates the KEP-2839 proposal doc describing a pluggable, label-dispatched, config-driven LLM trainer framework spanning the Python SDK and the Go Torch plugin, with TorchTune kept as a compatible backend and TRL introduced as a new backend.

Changes:

  • Adds a new KEP document detailing the motivation, goals, architecture, and implementation plan for a dynamic LLM trainer framework.
  • Documents the proposed Python-side LLMTrainer abstraction and Go-side FrameworkStrategy dispatch mechanism.
  • Provides example TRL runtime manifests and user-facing SDK usage examples.

- [Go Control Plane: Command-Sniffing](#go-control-plane-command-sniffing)
- [High-Level Design](#high-level-design)
- [Architecture Overview](#architecture-overview)
- [Component Interaction Flow](#component-interaction-flow)

Copilot AI Apr 1, 2026


The Table of Contents includes an entry for “Component Interaction Flow” but there is no corresponding section heading in the document, so the link is broken and makes navigation harder; either add the missing section or remove/rename the TOC entry to match an existing heading.

Suggested change
- [Component Interaction Flow](#component-interaction-flow)

- [Component Interaction Flow](#component-interaction-flow)
- [What Changes vs What Stays](#what-changes-vs-what-stays)
- [Design Details](#design-details)
- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)

Copilot AI Apr 1, 2026


The TOC link for “Python SDK: LLMTrainer Base Class” points to #python-sdk-configtrainer-base-class, but the actual section heading is “### Python SDK: LLMTrainer Base Class” (anchor #python-sdk-llmtrainer-base-class), so this TOC link won’t work; update the anchor (or the heading) to match.

Suggested change
- [Python SDK: LLMTrainer Base Class](#python-sdk-configtrainer-base-class)
- [Python SDK: LLMTrainer Base Class](#python-sdk-llmtrainer-base-class)

Comment on lines +789 to +793
TRL uses Hugging Face's `accelerate` for distributed training, which reads standard
environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) rather
than the `PET_*` variants used by torchrun. The strategy injects both sets for
maximum compatibility.


Copilot AI Apr 1, 2026


The TRLStrategy rationale paragraph is duplicated back-to-back (“TRL uses … accelerate …” appears twice), which is likely accidental and adds noise; remove one of the two paragraphs to keep the section concise.

Suggested change
TRL uses Hugging Face's `accelerate` for distributed training, which reads standard
environment variables (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`) rather
than the `PET_*` variants used by torchrun. The strategy injects both sets for
maximum compatibility.

Comment on lines +66 to +70
1. A `LLMTrainer` ABC in the Python SDK — a **separate abstraction** from KEP-285's
`BaseTrainer`, purpose-built for **config-driven trainers** where the framework's
own CLI is the entrypoint (e.g., `trl sft ...`, `tune run ...`). Both ABCs are
accepted through the same `TrainerClient.train(trainer=...)` parameter, giving
data scientists a flat, unified API.

Copilot AI Apr 1, 2026


The PR title/description emphasize “ConfigTrainer for BaseTrainer hierarchy”, but the KEP text proposes a separate LLMTrainer ABC that explicitly does not extend BaseTrainer; align the KEP terminology/architecture with the PR metadata (or update the PR metadata) to avoid confusion for reviewers and implementers.
