Compression API Addition #1722

markurtz · 2025-03-20T16:34:56Z

markurtz
Mar 20, 2025

Llama Stack Compression Proposal

This proposal outlines the introduction of a Compression API within the Llama Stack ecosystem, designed to compress LLMs through quantization, pruning, low-rank decomposition, speculative decoding, and other model compression/optimization techniques. The goal is to provide structured, scalable, and efficient compression pathways that reduce memory requirements, increase inference performance, and increase overall deployment feasibility without significantly degrading model accuracy.

LLM Compression has become increasingly core to standard deployment flows, particularly as better algorithms and techniques have enabled full accuracy recovery. Within vLLM, nearly half of all deployed models utilize some form of quantization. This proposal formalizes a pathway to enable compression within Llama Stack, complementing the existing APIs such as other model editing pathways like the existing post-training API. The proposal includes:

A dedicated high-level Compression API definition enabling model optimization pathways.
Three primary API routes to enable the common compression application types
- Data-free compression - Direct model modifications without dataset dependencies.
- Calibration-based compression - Using small datasets and forward passes through the model.
- Training-aware compression - Using training / fine-tuning datasets with forward and backward passes through the model.

A sample implementation utilizing LLM Compressor as an initial inline provider is illustrated and will continually be maintained by RedHat and the general community. Potential other future providers examples are included in the Background section as well.

Background

Post-Training in the LLM Context

Within Llama Stack and the broader LLM ecosystem, post-training has been predominantly associated with tuning pretrained models to new distributions rather than pathways such as model compression, which are expected to preserve the output distribution as much as possible. So, given a target distribution represented as training data or human preferences, these pathways are:

Supervised fine-tuning – Adapting a model to specific tasks or domains by adjusting the weights with example data, such as instruction tuning.
Preference optimization – Aligning outputs to human preferences generally through techniques such as reinforcement learning.

Model Compression for LLMs

Model compression/optimization, while additionally applied after pretraining, is generally expected to be used as the last step after post-training pathways through supervised fine-tuning and/or preference optimization have converged to enhance efficiency and reduce costs and energy consumption for deployed models while preserving the output distribution as much as possible. Given these definitions, post-training and compression for LLMs can be viewed separately with differing goals, algorithms, pathways, and, therefore, generally differing implementations.

Compression algorithms vary in implementation but fall under set buckets for optimization. A non-exhaustive list is provided below:

Quantization – reducing the precision of weights, activations, KV cache, and/or attention for more efficient memory movement and compute with algorithmic implementations, including GPTQ, AWQ, SpinQuant, Quarot, QUIP, QAT, among many others.
Pruning – removal by zeroing out individual weights (unstructured) or entire channels/heads (structured) for more efficient weight movement and compute with algorithmic implementations, including SparseGPT, Wanda, Sparse Finetuning, among others.
Speculative decoding – adapting a base model with a smaller student (draft) or multiple heads (draft-free) to predict the subsequent N tokens more efficiently without running the larger base model for every token, including EAGLE, Medusa, HASS, among others.
Distillation – training a smaller, student model based on the information and distributions on outputs and layers from a larger teacher model, including output distillation, SquareHead, among others.
Low-rank decomposition/adaptation – Minimize the overall size of matrix operations through techniques such as SVD or by adding adapters on top of minimized backbones, including QLoRA and others.
Attention adaptations – create more efficient attention variants within a model, such as subbing in multi-head latent attention (MLA) for grouped-query attention (GQA), sparse attention, and others.

The above compression algorithms can be applied in one of three ways, with many supporting multiple:

Data-free – The model is the only input and requires no forward or backward passes with example data. This includes weight-only quantization and FP8 quantization utilizing round-to-nearest for weights and dynamic activation quantization.
Calibration-based – The model and a small calibration dataset (generally sampled from the training data) are used to run forward passes for the model for algorithmic convergence. This includes techniques such as GPTQ for quantization and SparseGPT for pruning.
Training-aware – The model and the entire or a subset of the training dataset are used to run forward and backward passes for the model for algorithmic convergence. This includes techniques such as QAT for quantization, sparse finetuning for pruning, and EAGLE for speculative decoding.

The above distinctions for applying compression algorithms to a model are notable from the input parameters, the amount of compute required to run each, and the size of the required compute nodes to run. Data-free has minimal requirements and generally doesn’t require a GPU instance. Calibration-based algorithms require a GPU to execute a model with limited forward passes and typically minimal GPU resources. Training-aware algorithms require many forward and backward passes, increasing the memory and compute requirements and leading to a much larger requirement for GPU resources to run.

Ecosystem Tooling and Libraries

Given the different targets, the various ongoing research implementations to integrate, and the scope of work required to support any one pathway, most libraries do not support model compression and fine-tuning/preference optimization pathways. If libraries and what would be Llama Stack provider implementations support both, then they are limited on one side or the other and do not fully support every path. In general, separate and focused compression and post-training APIs make sense here to ensure full support of the most popular pathways as defined by the APIs.

Some existing post-training libraries:

Axolotl – full support for fine-tuning and preference optimization, support for distillation, and limited support for QLoRA (can train the models in the existing pipelines, but cannot create them).
Hugging Face TRL – full support for fine-tuning and preference optimization with no support for model compression.
LLM Foundry – full support for fine-tuning with no support for model compression.
InstructLab – full support for fine-tuning especially with synthetic data generation, no support for model compression.

Some existing compression libraries:

LLM Compressor – full support for data-free, calibration-based, and training-aware algorithms and pipelines.
Torch AO – full support for data-free, calibration-based, and training-aware algorithms and pipelines.
HF Optimum – full support for data-free, calibration-based, and training-aware algorithms and pipelines depending on the underlying provider.
TensorRT LLM – full support for data-free and calibration-based, no support for training-aware.

User Stories

For the stories below, the general use cases are outlined rather than assuming a specific provider or pathway to illustrate the flow's generality better. Ultimately, the goals for the two categories remain the same concerning compression – mimic the original output distribution as much as possible while minimizing the number of resources required or increasing the performance of a given LLM. Pieces marked with “data free,” “calibration-based,” or “training aware” are the LLM compression steps and correspond to the API spec listed in the later section.

Create a Compressed Version for Deployment

Take an instruction-tuned and preference-optimized Llama 3.1 8B (through llama-stack, external, InstructLab, etc.)
- → compress with data free utilizing FP8 quantization
- → eval and benchmark the model to validate accuracy and performance
- → deploy the compressed model through usual pathways.
Take Llama 3.3 70B instruct from HuggingFace
- → apply calibration-based GPTQ quantization utilizing a calibration dataset
- → eval and benchmark the model to validate accuracy and performance
- → deploy the compressed model through usual pathways.
Take Llama 3.1 8B instruct from HuggingFace
- → apply a series of compression steps: calibration-based pruning with SparseGPT and calibration dataset, training aware sparse fine tuning with layerwise distillation, calibration-based quantization with GPTQ and calibration dataset
- → eval and benchmark the model to validate accuracy and performance
- → deploy the compressed model through usual pathways.
Take an instruction tuned and preference optimized Llama 3.3 70B
- → apply training aware EAGLE based speculative decoding with small training dataset
- → eval and benchmark the model to validate accuracy and performance
- → deploy the compressed model through usual pathways.

Create a Compressed Version for Efficient Training

Take a pretrained Llama 3.1 70B
- → apply calibration-based GPTQ quantization utilizing a calibration dataset
- → eval and benchmark the model to validate accuracy and performance
- → fine-tune the model utilizing QLoRA or similar techniques for faster training on desired dataset
- → eval to validate accuracy
- → deploy the compressed model through usual pathways.
Take a pretrained Llama 3.1 8B
- → run calibration based SparseGPT and then training aware sparse fine-tuning
- → eval and benchmark the model to validate accuracy and performance
- → push to registry as base model
- → fine-tune the sparse model on a training dataset for faster training (if hardware/software supports)
- → calibration base GPTQ quantization utilizing a calibration dataset
- → eval and benchmark the model to validate accuracy and performance
- → deploy the compressed model through usual pathways.

Proposal

API Spec

GET /v1/compression/jobs
- Request: None
- Response: List[job_uuid]
GET /v1/compression/job/status
- Request: job_uuid (query param)
- Response: job_uuid, status, scheduled_at, started_at, completed_at, resources_allocated, checkpoints
GET /v1/compression/job/artifacts
- Request: job_uuid (query param)
- Response: job_uuid, checkpoints
POST /v1/compression/job/cancel
- Request: job_uuid (body)
- Response: HTTP status
POST /v1/compression/data-free
- Request: model_id, recipe
- Response: job_uuid
POST /v1/compression/calibrated
- Request: model_id, data_config, recipe
- Response: job_uuid
POST /v1/compression/training-aware
- Request: model_id, data_config, recipe, training_config
- Response: job_uuid

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Compression API Addition #1722

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

Compression API Addition #1722

Uh oh!

Uh oh!

markurtz Mar 20, 2025

Llama Stack Compression Proposal

Background

Post-Training in the LLM Context

Model Compression for LLMs

Ecosystem Tooling and Libraries

User Stories

Create a Compressed Version for Deployment

Create a Compressed Version for Efficient Training

Proposal

API Spec

Replies: 0 comments

markurtz
Mar 20, 2025