Draft-based Approximate Inference for LLMs

Kevin Galim1*, Ethan Ewer2*, Wonjun Kang1,3,
Minjae Lee1, Hyung Il Koo1,4, Kangwook Lee2,5,6

1FuriosaAI, 2UW-Madison, 3Seoul National University,
4Ajou University, 5KRAFTON, 6Ludo Robotics

* Equal Contribution



Diagram of Draft-based Approximate Inference Framework

🚀 Overview

Draft-based Approximate Inference for LLMs leverages small draft models to more sharply distinguish important tokens and key-value (KV) pairs in long-context large language models. Our core contributions (SpecKV, SpecPC, and the combined SpecKV-PC) enable smarter KV cache eviction and prompt compression, delivering more precise, efficient approximate inference than existing techniques.
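To see why KV cache eviction pays off at long context, here is a back-of-the-envelope calculation (our own illustration, not from the paper; the dimensions are assumed Llama-3-8B-style values: 32 layers, 8 KV heads, head dimension 128, fp16):

```python
# Rough KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * num_tokens

# At a 65,536-token context the cache alone occupies ~8 GiB per sequence,
# before counting model weights or activations.
print(f"{kv_cache_bytes(65536) / 2**30:.1f} GiB")  # -> 8.0 GiB
```

Evicting all but a few hundred high-importance KV pairs shrinks this cost by orders of magnitude, provided the importance estimates are accurate.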

๐Ÿ“ Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Within this framework, we present:

  1. SpecKV: The first method to use lookahead with a small draft model to enable precise KV cache dropping.
  2. SpecPC: Uses the draft model's attention activations to identify and discard less important prompt tokens.
  3. SpecKV-PC: A cascaded compression strategy combining both techniques for superior results.

We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput.
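The core idea can be illustrated with a toy sketch (our own simplification, not the repository's implementation): a draft model generates a few lookahead tokens, the attention mass those lookahead queries place on each prompt position is reduced across steps, and only the top-scoring positions are kept (cf. the `reduction_type="max"` and `max_capacity_prompt` settings below):

```python
# Toy draft-guided importance scoring: keep the top-k prompt positions
# by the maximum attention any lookahead step assigns them.
def select_kv(attn, capacity):
    """attn: per-lookahead-step attention rows over prompt positions."""
    n = len(attn[0])
    # Reduce scores across lookahead steps with max.
    scores = [max(row[i] for row in attn) for i in range(n)]
    # Keep the indices of the top-`capacity` positions, in original order.
    keep = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:capacity])
    return keep

attn = [
    [0.05, 0.40, 0.05, 0.30, 0.20],  # lookahead step 1
    [0.10, 0.10, 0.50, 0.25, 0.05],  # lookahead step 2
]
print(select_kv(attn, 3))  # -> [1, 2, 3]
```

The real methods additionally pool scores over neighboring positions and protect a recent window, but the selection principle is the same.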

🌟 Features

  • Plug & Play: Add to any HuggingFace-compatible LLM with just a few lines.
  • Higher Retained Accuracy: SpecKV, SpecPC, and SpecKV-PC retain more of the target model's accuracy than previous approximation methods.
  • Cascaded Compression: SpecKV-PC combines prompt compression with KV cache eviction for maximum efficiency.
  • Flexible: Supports Qwen2.5, Llama-3, and more.

๐Ÿ› ๏ธ Installation

1. Clone the repository:

git clone https://github.com/furiosa-ai/draft-based-approx-llm

2. Install PyTorch (example for CUDA 12.4):

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

3. Install other dependencies:

pip install -r requirements.txt --no-build-isolation

4. Install FlashAttention:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

5. Prepare the RULER benchmark:

python scripts/create_data.py \
    --data ruler \
    --seq_len 4096 8192 16384 32768 65536 \
    --model \
        meta-llama/Llama-3.2-1B-Instruct \
        Qwen/Qwen2.5-0.5B-Instruct

🧩 Example Usage

SpecKV
from draft_approx_llm import SpecKVConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecKV
speckv_config = SpecKVConfig(
    max_capacity_prompt=256,
    window_size=32,
    pool_type="max",
    kernel_size=7,
    reduction_type="max",
    lookahead_tokens=None,
    prefill_window_size=2048,
    prefill_vertical_size=2048
)

# Patch target model with the draft model to use SpecKV
model = patch_model(model, draft_model, speckv_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_speckv.ipynb.

SpecPC
from draft_approx_llm import SpecPCConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecPC
specpc_config = SpecPCConfig(
    max_capacity_prompt=1024,
    window_size=64,
    pool_type="max",
    kernel_size=64,
    reduction_type="max",
    lookahead_tokens=1,
    neighbor_tokens=64,
    starting_layer_index=8,
    weighted_query=True
)

# Patch target model with the draft model to use SpecPC
model = patch_model(model, draft_model, specpc_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_specpc.ipynb.

SpecKV-PC
from draft_approx_llm import SpecKVConfig, SpecPCConfig, SpecKVPCConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecKV-PC (cascaded strategy: prompt compression, then KV eviction)
speckvpc_config = SpecKVPCConfig(
    specpc_config=SpecPCConfig(
        max_capacity_prompt=2048,
        window_size=64,
        pool_type="max",
        kernel_size=64,
        reduction_type="max",
        lookahead_tokens=8,
        neighbor_tokens=64,
        starting_layer_index=8,
        weighted_query=True
    ),
    speckv_config=SpecKVConfig(
        max_capacity_prompt=256,
        window_size=32,
        pool_type="max",
        kernel_size=7,
        reduction_type="max",
        prefill_window_size=2048,
        prefill_vertical_size=2048
    )
)

# Patch target model with the draft model to use SpecKV-PC
model = patch_model(model, draft_model, speckvpc_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_speckvpc.ipynb.

Reproducing Paper Results

Run evaluation (results logged to Weights & Biases):

SpecKV

python eval.py --cfg cfg/paper/speckv/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

SpecPC

python eval.py --cfg cfg/paper/specpc/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

SpecKV-PC

python eval.py --cfg cfg/paper/speckvpc2048/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

โณ Roadmap

  • Release codebase for SpecKV and SpecPC
  • Release codebase for SpecKV-PC
  • Enable vLLM compatibility (SpecKV draft, SpecPC target)
  • Release Ada-SpecKV
  • Release Qwen2.5-VL support

📖 Citation

If you find this work useful, please cite:

@article{galim2025draft,
  title={Draft-based Approximate Inference for LLMs},
  author={Galim, Kevin and Ewer, Ethan and Kang, Wonjun and Lee, Minjae and Koo, Hyung Il and Lee, Kangwook},
  journal={arXiv preprint arXiv:2506.08373},
  year={2025}
}

๐Ÿค Contributions

Pull requests, issues, and feedback are welcome!

About

[ICLR 2026] Draft-based Approximate Inference for LLMs
