GSoC 2026 Prototype — KEP-2839 · Contributor: @krishdef7
Mentors: @tariq-hasan, @andreyvelich
Kubeflow Trainer's SDK currently couples LLM fine-tuning to a single backend through a hardcoded dispatch in BuiltinTrainer:
# Current SDK: hardcoded isinstance check
if isinstance(trainer.config, TorchTuneConfig):
    # ... TorchTune-specific logic

TorchTune is no longer actively adding new features, which means Kubeflow users cannot access emerging post-training methods (DPO, PPO, ORPO, KTO) or faster backends (Unsloth, TRL) without modifying the SDK source code.
This prototype implements a pluggable backend architecture that decouples the SDK from any single fine-tuning framework. Users switch backends by changing one field:
# TorchTune (existing behavior, fully backward compatible)
LLMConfig(model="llama-3.2-1B", dataset="alpaca", backend_name="torchtune", ...)
# TRL — same model, same dataset, different engine
LLMConfig(model="llama-3.2-1B", dataset="alpaca", backend_name="trl", ...)
# Unsloth — ~2× faster, ~70% less memory
LLMConfig(model="llama-3.2-1B", dataset="alpaca", backend_name="unsloth", ...)

┌──────────────────────────────────────────────────────────┐
│  User Code                                               │
│                                                          │
│  trainer = LLMTrainer(                                   │
│    config=LLMConfig(backend_name="trl", method="dpo"),   │
│    resources_per_node={"gpu": 2},                        │
│  )                                                       │
│  client.train(name="my-job", trainer=trainer)            │
└────────────────────────┬─────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────┐
│  LLMTrainer.resolve()                                    │
│                                                          │
│  1. Look up backend from BackendRegistry                 │
│  2. backend.validate(config)                             │
│  3. backend.to_container_spec(config) → ContainerSpec    │
│  4. Return ResolvedLLMTrainer                            │
└────────────────────────┬─────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────┐
│  BackendRegistry                                         │
│                                                          │
│   ┌────────────┐ ┌────────────┐ ┌────────────┐           │
│   │ TorchTune  │ │    TRL     │ │  Unsloth   │  ...      │
│   │  Backend   │ │  Backend   │ │  Backend   │           │
│   └─────┬──────┘ └─────┬──────┘ └─────┬──────┘           │
│         │              │              │                  │
│         └──────────────┼──────────────┘                  │
│                        │                                 │
│                LLMBackend (ABC)                          │
│       name | validate | to_container_spec                │
└──────────────────────────────────────────────────────────┘
                         │
                         ▼
┌──────────────────────────────────────────────────────────┐
│  Kubernetes (unchanged controller)                       │
│                                                          │
│  TrainJob.spec.trainer.image   ← container_spec.image    │
│  TrainJob.spec.trainer.command ← container_spec.command  │
│  TrainJob.spec.trainer.args    ← container_spec.args     │
│  TrainJob.spec.runtimeRef      ← auto-discovered by      │
│                                  framework label         │
└──────────────────────────────────────────────────────────┘
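The resolve flow in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not the prototype's actual code: the field names on `LLMConfig` and `ContainerSpec`, the `ToyTRLBackend` class, the placeholder image, the module-style command, and the CLI flags are all assumptions made for the example. Only the three backend members (`name`, `validate`, `to_container_spec`) come from the diagram.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class ContainerSpec:
    # Assumed shape: the diagram only names image, command and args.
    image: str
    command: list[str]
    args: list[str] = field(default_factory=list)


@dataclass
class LLMConfig:
    # Assumed subset of fields, taken from the usage snippets above.
    model: str
    dataset: str
    backend_name: str = "torchtune"
    method: str = "sft"


class LLMBackend(ABC):
    name: str

    @abstractmethod
    def validate(self, config: LLMConfig) -> None:
        """Raise ValueError if this backend cannot run the config."""

    @abstractmethod
    def to_container_spec(self, config: LLMConfig) -> ContainerSpec:
        """Translate the config into image, command and args."""


class ToyTRLBackend(LLMBackend):
    # Hypothetical backend used only for this sketch.
    name = "trl"
    methods = {"sft", "dpo", "ppo", "orpo", "kto"}

    def validate(self, config: LLMConfig) -> None:
        if config.method not in self.methods:
            raise ValueError(f"{self.name} does not support {config.method!r}")

    def to_container_spec(self, config: LLMConfig) -> ContainerSpec:
        return ContainerSpec(
            image="example.invalid/trl-trainer:latest",  # placeholder image
            command=["python", "-m", "kubeflow_llm_trainer.entrypoints.trl_runner"],
            args=["--model", config.model, "--dataset", config.dataset,
                  "--method", config.method],  # hypothetical flags
        )


# A plain dict stands in for BackendRegistry here.
BACKENDS: dict[str, LLMBackend] = {"trl": ToyTRLBackend()}


def resolve(config: LLMConfig) -> ContainerSpec:
    """Steps 1-3 of LLMTrainer.resolve(): look up, validate, translate."""
    try:
        backend = BACKENDS[config.backend_name]
    except KeyError:
        raise ValueError(f"unknown backend {config.backend_name!r}") from None
    backend.validate(config)
    return backend.to_container_spec(config)
```

Calling `resolve(LLMConfig(model="llama-3.2-1B", dataset="alpaca", backend_name="trl", method="dpo"))` returns a spec whose last argument is `"dpo"`, while an unregistered `backend_name` fails before any Kubernetes object is built.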
This framework is config-driven: users specify what to train (model, dataset, method, hyperparameters) and the backend decides how (entrypoint, image, CLI args).
This complements KEP-285's function-driven specialized trainers (TorchTrainer, MPITrainer), where users provide their own training code. The two coexist:
| | KEP-285 Specialized Trainers | KEP-2839 Dynamic LLM Trainer |
|---|---|---|
| Pattern | Function-driven | Config-driven |
| User provides | Training function | Model + dataset + hyperparams |
| Use case | "Run my code on N nodes" | "Fine-tune this model with these params" |
| SDK type | TorchTrainer, MPITrainer | LLMTrainer |
This distinction directly addresses the question raised by @tariq-hasan on PR #308 about where config-driven trainers fit in the BaseTrainer hierarchy.
A new backend only needs to produce a valid ContainerSpec. The existing torch plugin in pkg/runtime/framework/plugins/torch/ handles the rest. This means adding TRL, Unsloth, or LlamaFactory requires zero Go code changes.
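To make the "zero Go code changes" point concrete, here is a rough sketch of how a `ContainerSpec` could be copied onto the TrainJob fields listed in the diagram. The `ContainerSpec` shape, the helper names, and the runtime name `trl-distributed` are assumptions for illustration. The manifest is reduced to the fields the diagram mentions, and the `apiVersion` shown should be checked against the installed CRD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerSpec:
    # Assumed shape: only the three fields named in the diagram.
    image: str
    command: tuple[str, ...]
    args: tuple[str, ...] = ()


def trainer_fields(spec: ContainerSpec) -> dict:
    """Copy the spec onto TrainJob.spec.trainer.{image,command,args}."""
    return {
        "image": spec.image,
        "command": list(spec.command),
        "args": list(spec.args),
    }


def trainjob_manifest(name: str, runtime_name: str, spec: ContainerSpec) -> dict:
    """Build a minimal TrainJob body; everything backend-specific lives in spec."""
    return {
        "apiVersion": "trainer.kubeflow.org/v1alpha1",  # verify against your cluster
        "kind": "TrainJob",
        "metadata": {"name": name},
        "spec": {
            "runtimeRef": {"name": runtime_name},
            "trainer": trainer_fields(spec),
        },
    }
```

Because the backend-specific details are confined to the three `ContainerSpec` fields, swapping TRL for another backend changes only the values in this dictionary, not its structure.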
Backends are registered via (in order of precedence):
1. Explicit registration: `BackendRegistry.register(MyBackend())`
2. Class decorator: `@BackendRegistry.register`
3. Entry-point discovery: `[project.entry-points."kubeflow.llm_backends"]`
Path 3 enables third-party packages to ship backends:
# Third-party package's pyproject.toml
[project.entry-points."kubeflow.llm_backends"]
my_backend = "my_package:MyBackend"

Existing BuiltinTrainer code keeps working. In TrainerClient.train():
if isinstance(trainer, BuiltinTrainer):
    trainer = adapt_builtin_trainer(trainer)  # → LLMTrainer
if isinstance(trainer, LLMTrainer):
    resolved = trainer.resolve()
    # build TrainJob spec from resolved.container_spec

| Method | TorchTune | TRL | Unsloth |
|---|---|---|---|
| SFT | ✅ | ✅ | ✅ |
| LoRA | ✅ | ✅* | ✅ |
| QLoRA | ✅ | ✅* | ✅ |
| DPO | ✅ | ✅ | ✅ |
| PPO | ❌ | ✅ | ❌ |
| ORPO | ❌ | ✅ | ✅ |
| KTO | ❌ | ✅ | ❌ |
| Multi-node | ✅ | ✅ | ❌ |
* TRL applies LoRA/QLoRA as a modifier on top of SFT/DPO/etc, not as a separate method.
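The support matrix above lends itself to a table-driven check inside each backend's `validate` step. The sketch below encodes the table as plain data. The names `SUPPORTED_METHODS`, `MULTI_NODE`, `check_support`, and `backends_for` are hypothetical, and LoRA/QLoRA are treated as ordinary rows even though, per the footnote, TRL applies them as modifiers.

```python
# Rows of the support matrix, keyed by backend name (lowercase method names).
SUPPORTED_METHODS = {
    "torchtune": {"sft", "lora", "qlora", "dpo"},
    "trl": {"sft", "lora", "qlora", "dpo", "ppo", "orpo", "kto"},
    "unsloth": {"sft", "lora", "qlora", "dpo", "orpo"},
}
MULTI_NODE = {"torchtune": True, "trl": True, "unsloth": False}


def check_support(backend: str, method: str, num_nodes: int = 1) -> None:
    """Raise ValueError if the matrix rules out this combination."""
    if backend not in SUPPORTED_METHODS:
        raise ValueError(f"unknown backend {backend!r}")
    if method not in SUPPORTED_METHODS[backend]:
        raise ValueError(f"{backend} does not support method {method!r}")
    if num_nodes > 1 and not MULTI_NODE[backend]:
        raise ValueError(f"{backend} does not support multi-node training")


def backends_for(method: str, num_nodes: int = 1) -> list[str]:
    """List the backends that can run a method on the given node count."""
    return sorted(
        name
        for name, methods in SUPPORTED_METHODS.items()
        if method in methods and (num_nodes == 1 or MULTI_NODE[name])
    )
```

For example, `backends_for("orpo")` yields `["trl", "unsloth"]`, and `check_support("unsloth", "ppo")` raises a `ValueError` before a job is submitted.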
src/kubeflow_llm_trainer/
├── interface.py # LLMBackend ABC, LLMConfig, ContainerSpec, enums
├── registry.py # BackendRegistry with 3 registration paths
├── trainer.py # LLMTrainer → ResolvedLLMTrainer
├── integration.py # TrainJob spec generation (TrainerClient integration)
├── progress.py # KEP-2779: Status server client + HF TrainerCallback
├── _compat.py # BuiltinTrainer → LLMTrainer adapter
├── backends/
│ ├── torchtune.py # TorchTune backend (backward compat)
│ ├── trl.py # TRL backend (SFT/DPO/PPO/ORPO/KTO)
│ └── unsloth.py # Unsloth backend (~2× faster)
└── entrypoints/
├── trl_runner.py # Container entrypoint for TRL training
└── unsloth_runner.py # Container entrypoint for Unsloth training
manifests/base/runtimes/
├── trl_distributed.yaml # ClusterTrainingRuntime for TRL
└── unsloth_single_device.yaml # ClusterTrainingRuntime for Unsloth
tests/ # 136 tests covering all paths
examples/ # 4 runnable examples
# Install
pip install -e ".[dev]"
# Run tests (136 passing)
pytest tests/ -v
# Run examples
python examples/01_trl_sft.py
python examples/02_cross_backend_switching.py
python examples/03_custom_backend.py
python examples/04_migration_from_builtin.py

The changes required in the existing codebase are minimal (~40 lines of diff):
kubeflow/sdk/kubeflow/trainer/trainer_client.py:

+ from kubeflow_llm_trainer import LLMTrainer
+ from kubeflow_llm_trainer._compat import adapt_builtin_trainer

  def train(
      self,
      trainer: Optional[Union[
          "CustomTrainer",
          "BuiltinTrainer",
+         "LLMTrainer",
      ]] = None,
      ...
  ):
+     if isinstance(trainer, BuiltinTrainer):
+         trainer = adapt_builtin_trainer(trainer)
+
+     if isinstance(trainer, LLMTrainer):
+         resolved = trainer.resolve()
+         return self._submit_llm_trainjob(name, runtime, resolved, ...)
+
      # existing CustomTrainer path unchanged

kubeflow/trainer/pkg/runtime/...: No changes needed.
- Upstream the `LLMBackend` interface and `BackendRegistry`
- Refactor `BuiltinTrainer` to use `TorchTuneBackend` internally
- Integration tests against real ClusterTrainingRuntime
- Implement `TRLBackend` with SFT, DPO, PPO, ORPO, KTO
- Create `ClusterTrainingRuntime` manifests for TRL
- Build `trl-trainer` container image
- E2E tests on Kubernetes
- Implement `UnslothBackend` with optimized single-GPU training
- Entry-point based external backend discovery
- Documentation and migration guide
- LlamaFactory backend proof-of-concept
- Performance benchmarks (TorchTune vs TRL vs Unsloth)
- Final documentation and blog post
- KEP-2839: Kubeflow Dynamic LLM Trainer Framework (tracking issue)
- KEP-285: Specialized Trainers (complementary, function-driven)
- KEP-2401: Kubeflow LLM Trainer V2 (original TorchTune integration)
- PR #3227: TrainJob progress tracking — this prototype includes SDK-side integration
- PR #308 discussion: Config-driven vs function-driven trainers — directly addressed by this prototype
This prototype includes SDK-side integration for the TrainJob progress tracking feature being implemented in kubeflow/trainer#3227 by @robert-bell.
The progress.py module provides:
- `KubeflowProgressReporter`: a standalone HTTP client that POSTs progress updates (progress %, ETA, custom metrics) to the Kubeflow Trainer status server.
- `KubeflowTrainerCallback`: a HuggingFace Transformers `TrainerCallback` that automatically reports progress on each logging step. Works with TRL's `SFTTrainer`, `DPOTrainer`, `PPOTrainer`, and Unsloth (which patches TRL).
The entrypoints (trl_runner.py, unsloth_runner.py) automatically inject the callback when the status server environment variables are detected, so no user configuration is needed.
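The reporting logic described above can be approximated with the standard library alone. The sketch below is not the code in `progress.py`. The environment variable name `KUBEFLOW_STATUS_SERVER_URL` and the payload keys are invented for illustration, and the real contract is whatever kubeflow/trainer#3227 defines. It shows the two pieces a callback would need: turning step counts into a progress percentage and ETA, and building a POST request only when the status server variable is set.

```python
import json
import os
import time
import urllib.request
from typing import Mapping, Optional

# Hypothetical variable name; the real one is defined by the status server feature.
STATUS_URL_ENV = "KUBEFLOW_STATUS_SERVER_URL"


def build_progress_payload(step: int, max_steps: int, started_at: float,
                           now: Optional[float] = None) -> dict:
    """Derive progress % and a linear ETA estimate from training step counts."""
    now = time.time() if now is None else now
    fraction = min(max(step / max_steps, 0.0), 1.0) if max_steps > 0 else 0.0
    elapsed = now - started_at
    # Linear extrapolation: remaining time scales with remaining work.
    eta = elapsed * (1.0 - fraction) / fraction if fraction > 0 else None
    return {"progress_percent": round(fraction * 100, 2), "eta_seconds": eta}


def build_request(payload: dict,
                  environ: Optional[Mapping[str, str]] = None
                  ) -> Optional[urllib.request.Request]:
    """Return a POST request, or None when no status server is configured."""
    environ = os.environ if environ is None else environ
    url = environ.get(STATUS_URL_ENV)
    if not url:
        return None  # outside a TrainJob: reporting is silently skipped
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A callback's logging hook would call `build_progress_payload` with the trainer's current and maximum step, then pass any non-`None` request to `urllib.request.urlopen`. Returning `None` when the variable is absent is what lets the same training script run unchanged outside Kubeflow.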
# The callback is also usable standalone:
from trl import SFTTrainer

from kubeflow_llm_trainer.progress import KubeflowTrainerCallback

# model and training_args are assumed to be defined earlier in the script
trainer = SFTTrainer(
    model=model,
    args=training_args,
    callbacks=[KubeflowTrainerCallback()],  # auto-reports to Kubeflow
)

Apache License 2.0