Note: This issue serves as the design document for QAT support in llm-compressor. The goal is to align on approach before implementation begins. Feedback from maintainers is welcome before any code is written.
## Motivation
LLM Compressor currently supports PTQ via `oneshot()`, which recovers accuracy well for most schemes. However, for aggressive quantization targets (INT4 weights, INT8 activations, NVFP4) or small/sensitive models, PTQ accuracy degradation can be significant. QAT, which trains the model with simulated quantization in the forward pass, is the standard solution.

This issue proposes adding training support in a minimal, well-designed form focused specifically on QAT. I propose two complementary approaches targeting different use cases.
## Part 1: `train()` API (SFT + QAT)
### Overview
A new `train()` entrypoint that mirrors `oneshot()` but wraps the HuggingFace `Trainer`, with fake quantization injected via the existing `QuantizationModifier`. llm-compressor owns only the fake-quant injection and finalization; everything else (distributed training, FSDP, data loading, checkpointing) is delegated to `Trainer`.
### Proposed API
```python
from llmcompressor import train
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",
    scheme="W8A8",
    ignore=["lm_head"],
)

train(
    model="meta-llama/Llama-3.2-1B-Instruct",
    dataset="open_platypus",
    recipe=recipe,
    output_dir="Llama-3.2-1B-W8A8-QAT",
    num_train_epochs=1,
    learning_rate=2e-5,
    max_seq_length=512,
)
```

Output is saved in the same compressed-tensors format as `oneshot()`, directly loadable in vLLM.
How it works
- Model loaded in full precision (BF16/FP16)
QuantizationModifier.initialize()inserts fake quantization ops into the
forward pass viatorch.ao.quantization.FakeQuantize- HuggingFace Trainer runs the training loop; gradients flow through
straight-through estimator (STE) past the quantization ops - After training,
QuantizationModifier.finalize()converts fake quant ops
to real compressed-tensors quantization - Model saved with
save_pretrained()in compressed-tensors format
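To make the fake-quant behavior concrete, here is a minimal pure-Python sketch of the quantize-dequantize round trip that `torch.ao.quantization.FakeQuantize` applies in the forward pass. The function and constants below are illustrative only, not llm-compressor API; in the backward pass, the STE treats `round()` as the identity so gradients can flow.

```python
def fake_quantize(x: float, scale: float, qmin: int = -128, qmax: int = 127) -> float:
    """Quantize-dequantize ("fake quant"): the value stays in float,
    but only representable INT8 grid points survive the round trip."""
    q = round(x / scale)          # map to the integer grid
    q = max(qmin, min(qmax, q))   # clamp to the quantized range
    return q * scale              # dequantize back to float

# Values inside the range snap to the nearest grid point ...
print(fake_quantize(0.3, scale=0.25))     # -> 0.25
# ... while values outside the range saturate at the clamp boundary.
print(fake_quantize(100.0, scale=0.25))   # -> 31.75 (127 * 0.25)
```

Training against this forward pass lets the weights adapt to the rounding and clamping error they will see after real quantization.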
train() vs oneshot()
oneshot() |
train() |
|
|---|---|---|
| Forward passes | Calibration only (no grad) | Full training loop |
| Dataset size | ~512 calibration samples | Full training dataset |
| Modifier lifecycle | initialize → calibrate → finalize | initialize → train N epochs → finalize |
| Output format | PTQ compressed checkpoint | QAT compressed checkpoint |
| Accuracy recovery | Good for most schemes | Better for aggressive quant / small models |
### Implementation plan

- New `src/llmcompressor/entrypoints/train.py` mirroring the `oneshot.py` structure
- Reuse `SessionManagerMixIn` already in `src/llmcompressor/transformers/finetune/`
- `QuantizationModifier` gains a `qat_mode: bool = False` flag switching from PTQ calibration to fake-quant insertion on `initialize()`
- Fake quantization via `torch.ao.quantization.FakeQuantize` (no new dependencies)
- ~200 lines of new code; reuses all existing infrastructure
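A sketch of how the proposed `qat_mode` flag could switch the modifier's behavior on `initialize()`. This stand-in class only models the control flow; the real `QuantizationModifier` in `llmcompressor.modifiers.quantization` is far more involved, and the mode names here are invented for illustration.

```python
class QuantizationModifierSketch:
    """Hypothetical stand-in for the proposed qat_mode switch."""

    def __init__(self, targets: str, scheme: str, qat_mode: bool = False):
        self.targets = targets
        self.scheme = scheme
        self.qat_mode = qat_mode

    def initialize(self) -> str:
        # qat_mode=False: existing PTQ path (attach observers, calibrate).
        # qat_mode=True: proposed QAT path (insert trainable fake-quant ops).
        return "fake_quant_insertion" if self.qat_mode else "ptq_calibration"

print(QuantizationModifierSketch("Linear", "W8A8").initialize())
# -> ptq_calibration
print(QuantizationModifierSketch("Linear", "W8A8", qat_mode=True).initialize())
# -> fake_quant_insertion
```

Keeping both paths behind one modifier means a recipe written for `oneshot()` works unchanged with `train()`.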
### Scope boundary
- SFT-based QAT only — not pretraining from scratch
- No custom training loop — delegate entirely to HuggingFace Trainer
- No RL training loops — see Part 2
## Part 2: verl Integration (RL + QAT)
### Overview
verl is a distributed RL training framework used for RLHF, PPO, and reasoning model training. Integrating llm-compressor's quantization recipe lifecycle into verl's training loop enables QAT in large-scale RL settings — particularly useful for post-RLHF quantization where PTQ accuracy loss is most pronounced on instruction-following and reasoning tasks.
### Proposed API
```python
# verl training config with llm-compressor QAT recipe
from llmcompressor.modifiers.quantization import QuantizationModifier

trainer = RayPPOTrainer(
    ...,
    compression_recipe=QuantizationModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
    ),
)
trainer.fit()  # QAT runs inside verl's PPO/GRPO loop
```

### How it works

1. llm-compressor exposes a `CompressionSession` lifecycle hook interface.
2. verl calls `initialize()` at training start to inject fake-quant ops.
3. verl's PPO/GRPO loop runs normally; fake quant is transparent to the RL objective.
4. verl calls `finalize()` at training end to convert to real quantization.
5. Checkpoint is saved in compressed-tensors format.
### Implementation plan

- llm-compressor exposes a standalone `QATLifecycleHook` interface that verl can call at `on_train_begin` / `on_train_end`
- verl adds an optional `compression_recipe` argument to `RayPPOTrainer`
- Integration layer is thin; verl owns all distributed execution
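A sketch of what the proposed `QATLifecycleHook` interface could look like. The class name, method names, and signatures come from this proposal and do not exist in llm-compressor or verl today; the model objects are stand-ins.

```python
class QATLifecycleHook:
    """Hypothetical thin adapter verl could call around its PPO/GRPO loop."""

    def __init__(self, recipe):
        self.recipe = recipe           # e.g. a QuantizationModifier instance
        self.fake_quant_active = False

    def on_train_begin(self, model):
        # Inject fake-quant ops before RL training starts.
        self.fake_quant_active = True
        return model

    def on_train_end(self, model):
        # Convert fake-quant ops to real compressed-tensors quantization.
        self.fake_quant_active = False
        return model


# Usage sketch from the verl side:
hook = QATLifecycleHook(recipe={"targets": "Linear", "scheme": "W8A8"})
model = {"name": "policy"}             # stand-in for the actual policy model
model = hook.on_train_begin(model)
assert hook.fake_quant_active          # fake quant live during the RL loop
model = hook.on_train_end(model)
assert not hook.fake_quant_active      # converted to real quantization
```

A two-method surface like this keeps verl agnostic of llm-compressor internals: verl only needs to know when training begins and ends.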
### Scope boundary

- Requires coordination with verl maintainers
- Higher complexity and longer timeline than Part 1
- Best implemented after Part 1 is proven
## Comparison

| | Part 1: `train()` | Part 2: verl |
|---|---|---|
| Use case | SFT + QAT | RL training + QAT |
| Complexity | Low (~200 lines) | High (cross-repo) |
| Dependencies | HuggingFace Trainer | verl + Ray |
| Timeline | Weeks | Months |
| Target user | General quantization | Large-scale RL workflows |
Recommended sequencing: Start with Part 1 as the minimal viable QAT path.
Part 2 follows once the core QAT infrastructure is proven and stable.
## Prior Art

- SparseML (suggested by @kylesayrs)
- NVIDIA ModelOpt QAT: similar approach using `torch.ao.quantization`
- HuggingFace `optimum` QAT: wraps Trainer with fake-quant hooks
- llm-compressor `trl_mixin`: existing `SessionManagerMixIn` integration with `SFTTrainer`, which Part 1 builds directly on top of
## Open Questions for Maintainers

- Should `train()` live in `llmcompressor.entrypoints.train`, or should `oneshot` be reused with a `mode="qat"` parameter?
- Is `SessionManagerMixIn` the right integration point, or should we build a dedicated `QATTrainer` subclass?
- Are there plans to bring back `DistillationModifier` that would affect this design?
- For Part 2, is there an existing relationship with verl maintainers, or would this be a cold outreach?
Happy to start implementing Part 1 once the design direction is confirmed. cc @kylesayrs