Kevin Galim1*,
Ethan Ewer2*,
Wonjun Kang1,3,
Minjae Lee1,
Hyung Il Koo1,4,
Kangwook Lee2,5,6
1FuriosaAI, 2UW-Madison, 3Seoul National University,
4Ajou University, 5KRAFTON, 6Ludo Robotics
* Equal Contribution
Draft-based Approximate Inference for LLMs leverages small draft models to more accurately identify important tokens and key-value (KV) pairs in long-context large language models (LLMs). Our core contributions, SpecKV, SpecPC, and the combined SpecKV-PC, enable smarter KV cache eviction and prompt compression, delivering more precise and efficient approximate inference than existing techniques.
Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Within this framework, we present:
- SpecKV: The first method to use lookahead with a small draft model to enable precise KV cache dropping.
- SpecPC: Uses the draft model's attention activations to identify and discard less important prompt tokens.
- SpecKV-PC: A cascaded compression strategy combining both techniques for superior results.
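The shared idea behind these methods is to score positions by draft-model attention, smooth the scores, and keep only the top-scoring positions. A stdlib-only toy sketch of that selection step (hypothetical helper, not the library's API):

```python
def select_important(scores, kernel_size=3, k=4):
    """Toy draft-guided selection: max-pool importance scores over a
    sliding window, then keep the indices of the k highest pooled scores."""
    n = len(scores)
    half = kernel_size // 2
    pooled = [max(scores[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]
    # Keep the k positions with the highest pooled importance, in original order.
    top = sorted(sorted(range(n), key=lambda i: pooled[i], reverse=True)[:k])
    return top

# Example: high draft-attention scores cluster around positions 2 and 7,
# so their neighborhoods survive the pooled top-k selection.
scores = [0.1, 0.2, 0.9, 0.1, 0.0, 0.1, 0.2, 0.8, 0.1, 0.0]
print(select_important(scores, kernel_size=3, k=4))  # → [1, 2, 3, 6]
```

The max-pooling step mirrors the `pool_type="max"` / `kernel_size` options in the configs below: it keeps tokens adjacent to important ones, rather than isolated high-scoring positions.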
We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput.
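The draft-target agreement can be quantified with a simple correlation between attention-score vectors. A stdlib-only illustration with made-up numbers (not results from the paper):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy attention-score vectors for a draft and a target model: both assign
# their highest weight to the same positions, so correlation is near 1.
draft = [0.05, 0.40, 0.10, 0.30, 0.15]
target = [0.08, 0.35, 0.12, 0.28, 0.17]
print(round(pearson(draft, target), 3))
```

High correlation is what makes the draft model's cheap attention a usable proxy for the target's expensive attention.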
- Plug & Play: Add to any HuggingFace-compatible LLM with just a few lines.
- Higher Retained Accuracy: SpecKV, SpecPC, and SpecKV-PC preserve more model accuracy than previous methods.
- Cascaded Compression: SpecKV-PC combines prompt compression with KV cache eviction for maximum efficiency.
- Flexible: Supports Qwen2.5, Llama-3, and more.
1. Clone repository:
git clone https://github.com/furiosa-ai/draft-based-approx-llm
2. Install PyTorch (example for CUDA 12.4):
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
3. Install other dependencies:
pip install -r requirements.txt --no-build-isolation
4. Install FlashAttention:
pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
5. Prepare the RULER benchmark:
python scripts/create_data.py \
--data ruler \
--seq_len 4096 8192 16384 32768 65536 \
--model \
meta-llama/Llama-3.2-1B-Instruct \
Qwen/Qwen2.5-0.5B-Instruct
SpecKV
from draft_approx_llm import SpecKVConfig, patch_model
from transformers import AutoModelForCausalLM
# Load base and draft models
model_kwargs = {
"torch_dtype": "auto",
"attn_implementation": "flash_attention_2",
"device_map": "auto"
}
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
# Configure SpecKV
speckv_config = SpecKVConfig(
max_capacity_prompt=256,
window_size=32,
pool_type="max",
kernel_size=7,
reduction_type="max",
lookahead_tokens=None,
prefill_window_size=2048,
prefill_vertical_size=2048
)
# Patch target model with the draft model to use SpecKV
model = patch_model(model, draft_model, speckv_config)
# Generate output
model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
See more in notebooks/example_usage_speckv.ipynb.
SpecPC
from draft_approx_llm import SpecPCConfig, patch_model
from transformers import AutoModelForCausalLM
# Load base and draft models
model_kwargs = {
"torch_dtype": "auto",
"attn_implementation": "flash_attention_2",
"device_map": "auto"
}
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
# Configure SpecPC
specpc_config = SpecPCConfig(
max_capacity_prompt=1024,
window_size=64,
pool_type="max",
kernel_size=64,
reduction_type="max",
lookahead_tokens=1,
neighbor_tokens=64,
starting_layer_index=8,
weighted_query=True
)
# Patch target model with the draft model to use SpecPC
model = patch_model(model, draft_model, specpc_config)
# Generate output
model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
See more in notebooks/example_usage_specpc.ipynb.
SpecKV-PC
from draft_approx_llm import SpecKVConfig, SpecPCConfig, SpecKVPCConfig, patch_model
from transformers import AutoModelForCausalLM
# Load base and draft models
model_kwargs = {
"torch_dtype": "auto",
"attn_implementation": "flash_attention_2",
"device_map": "auto"
}
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
# Configure SpecKV-PC (Cascaded Strategy)
speckvpc_config = SpecKVPCConfig(
specpc_config=SpecPCConfig(
max_capacity_prompt=2048,
window_size=64,
pool_type="max",
kernel_size=64,
reduction_type="max",
lookahead_tokens=8,
neighbor_tokens=64,
starting_layer_index=8,
weighted_query=True
),
speckv_config=SpecKVConfig(
max_capacity_prompt=256,
window_size=32,
pool_type="max",
kernel_size=7,
reduction_type="max",
prefill_window_size=2048,
prefill_vertical_size=2048
)
)
# Patch target model with the draft model to use SpecKV-PC
model = patch_model(model, draft_model, speckvpc_config)
# Generate output
model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
See more in notebooks/example_usage_speckvpc.ipynb.
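Conceptually, the cascade first compresses the prompt to a budget, then evicts KV pairs from what remains. A stdlib-only toy of the two stages (hypothetical helper, not the library's internals; in the real method the stages use different signals, while here one score vector stands in for both):

```python
def cascade(scores, prompt_budget, kv_budget):
    """Toy cascade: keep the prompt_budget highest-scoring token positions
    (prompt compression), then keep the kv_budget best among those
    (KV cache eviction). Returns surviving original positions, in order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept_prompt = sorted(order[:prompt_budget])  # stage 1: SpecPC-style compression
    kept_kv = sorted(kept_prompt, key=lambda i: scores[i], reverse=True)[:kv_budget]
    return sorted(kept_kv)                       # stage 2: SpecKV-style eviction

scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
print(cascade(scores, prompt_budget=4, kv_budget=2))  # → [0, 2]
```

This mirrors the config above: the SpecPC budget (`max_capacity_prompt=2048`) is larger than the SpecKV budget (`max_capacity_prompt=256`), so the second stage refines an already-compressed prompt.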
Run evaluation (results logged to Weights & Biases):
python eval.py --cfg cfg/paper/speckv/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/qwen25_05b_14b/cmax_*/*.yaml
- Release codebase for SpecKV and SpecPC
- Release codebase for SpecKV-PC
- Enable vLLM compatibility (SpecKV draft, SpecPC target)
- Release Ada-SpecKV
- Release Qwen2.5-VL support
If you find this useful, please cite:
@article{galim2025draft,
title={Draft-based Approximate Inference for LLMs},
author={Galim, Kevin and Ewer, Ethan and Kang, Wonjun and Lee, Minjae and Koo, Hyung Il and Lee, Kangwook},
journal={arXiv preprint arXiv:2506.08373},
year={2025}
}
Pull requests, issues, and feedback are welcome!