Draft-based Approximate Inference for LLMs

Kevin Galim1*, Ethan Ewer2*, Wonjun Kang1,3,
Minjae Lee1, Hyung Il Koo1,4, Kangwook Lee2,5,6

1FuriosaAI, 2UW-Madison, 3Seoul National University,
4Ajou University, 5KRAFTON, 6Ludo Robotics

* Equal Contribution



Diagram of Draft-based Approximate Inference Framework

🚀 Overview

Draft-based Approximate Inference for LLMs leverages small draft models to more sharply distinguish important tokens and key-value (KV) pairs in long-context large language models. Our core contributions (SpecKV, SpecPC, and the combined SpecKV-PC) enable smarter KV cache eviction and prompt compression, delivering more precise, efficient approximate inference than existing techniques.
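To see why KV cache eviction pays off at long context, here is a back-of-the-envelope calculation (our own illustration, not from the paper; the dimensions are assumed Llama-3-8B-style values: 32 layers, 8 KV heads, head dimension 128, fp16):

```python
# Rough KV cache size: 2 (K and V) x layers x kv_heads x head_dim x bytes, per token.
def kv_cache_bytes(num_tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes * num_tokens

# At a 65,536-token context the cache alone occupies ~8 GiB per sequence,
# before counting model weights or activations.
print(f"{kv_cache_bytes(65536) / 2**30:.1f} GiB")  # -> 8.0 GiB
```

Evicting all but a few hundred high-importance KV pairs shrinks this cost by orders of magnitude, provided the importance estimates are accurate.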

๐Ÿ“ Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Within this framework, we present:

  1. SpecKV: The first method to use lookahead with a small draft model to enable precise KV cache dropping.
  2. SpecPC: Uses the draft model's attention activations to identify and discard less important prompt tokens.
  3. SpecKV-PC: A cascaded compression strategy combining both techniques for superior results.

We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput.
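The core idea can be illustrated with a toy sketch (our own simplification, not the repository's implementation): a draft model generates a few lookahead tokens, the attention mass those lookahead queries place on each prompt position is reduced across steps, and only the top-scoring positions are kept (cf. the `reduction_type="max"` and `max_capacity_prompt` settings below):

```python
# Toy draft-guided importance scoring: keep the top-k prompt positions
# by the maximum attention any lookahead step assigns them.
def select_kv(attn, capacity):
    """attn: per-lookahead-step attention rows over prompt positions."""
    n = len(attn[0])
    # Reduce scores across lookahead steps with max.
    scores = [max(row[i] for row in attn) for i in range(n)]
    # Keep the indices of the top-`capacity` positions, in original order.
    keep = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:capacity])
    return keep

attn = [
    [0.05, 0.40, 0.05, 0.30, 0.20],  # lookahead step 1
    [0.10, 0.10, 0.50, 0.25, 0.05],  # lookahead step 2
]
print(select_kv(attn, 3))  # -> [1, 2, 3]
```

The real methods additionally pool scores over neighboring positions and protect a recent window, but the selection principle is the same.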

🌟 Features

  • Plug & Play: Add to any HuggingFace-compatible LLM with just a few lines.
  • Higher Retained Accuracy: SpecKV, SpecPC, and SpecKV-PC retain more of the target model's accuracy than previous approximation methods.
  • Cascaded Compression: SpecKV-PC combines prompt compression with KV cache eviction for maximum efficiency.
  • Flexible: Supports Qwen2.5, Llama-3, and more.

๐Ÿ› ๏ธ Installation

1. Clone the repository:

git clone https://github.com/furiosa-ai/draft-based-approx-llm

2. Install PyTorch (example for CUDA 12.4):

pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

3. Install other dependencies:

pip install -r requirements.txt --no-build-isolation

4. Install FlashAttention:

pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

5. Prepare the RULER benchmark:

python scripts/create_data.py \
    --data ruler \
    --seq_len 4096 8192 16384 32768 65536 \
    --model \
        meta-llama/Llama-3.2-1B-Instruct \
        Qwen/Qwen2.5-0.5B-Instruct

🧩 Example Usage

SpecKV
from draft_approx_llm import SpecKVConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecKV
speckv_config = SpecKVConfig(
    max_capacity_prompt=256,
    window_size=32,
    pool_type="max",
    kernel_size=7,
    reduction_type="max",
    lookahead_tokens=None,
    prefill_window_size=2048,
    prefill_vertical_size=2048
)

# Patch target model with the draft model to use SpecKV
model = patch_model(model, draft_model, speckv_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_speckv.ipynb.

SpecPC
from draft_approx_llm import SpecPCConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecPC
specpc_config = SpecPCConfig(
    max_capacity_prompt=1024,
    window_size=64,
    pool_type="max",
    kernel_size=64,
    reduction_type="max",
    lookahead_tokens=1,
    neighbor_tokens=64,
    starting_layer_index=8,
    weighted_query=True
)

# Patch target model with the draft model to use SpecPC
model = patch_model(model, draft_model, specpc_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_specpc.ipynb.

SpecKV-PC
from draft_approx_llm import SpecKVConfig, SpecPCConfig, SpecKVPCConfig, patch_model
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base and draft models
model_kwargs = {
    "torch_dtype": "auto",
    "attn_implementation": "flash_attention_2",
    "device_map": "auto"
}

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B-Instruct", **model_kwargs)
draft_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Configure SpecKV-PC (cascaded strategy: prompt compression, then KV eviction)
speckvpc_config = SpecKVPCConfig(
    specpc_config=SpecPCConfig(
        max_capacity_prompt=2048,
        window_size=64,
        pool_type="max",
        kernel_size=64,
        reduction_type="max",
        lookahead_tokens=8,
        neighbor_tokens=64,
        starting_layer_index=8,
        weighted_query=True
    ),
    speckv_config=SpecKVConfig(
        max_capacity_prompt=256,
        window_size=32,
        pool_type="max",
        kernel_size=7,
        reduction_type="max",
        prefill_window_size=2048,
        prefill_vertical_size=2048
    )
)

# Patch target model with the draft model to use SpecKV-PC
model = patch_model(model, draft_model, speckvpc_config)

# Tokenize a prompt and generate output
inputs = tokenizer("Your long-context prompt here", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(inputs, max_new_tokens=32, return_dict_in_generate=True)
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))

See more in notebooks/example_usage_speckvpc.ipynb.

Reproducing Paper Results

Run evaluation (results logged to Weights & Biases):

SpecKV

python eval.py --cfg cfg/paper/speckv/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckv/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

SpecPC

python eval.py --cfg cfg/paper/specpc/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/specpc/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

SpecKV-PC

python eval.py --cfg cfg/paper/speckvpc2048/longbench/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/longbench/qwen25_05b_14b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/llama3_1b_8b/cmax_*/*.yaml
python eval.py --cfg cfg/paper/speckvpc2048/ruler/*/qwen25_05b_14b/cmax_*/*.yaml

โณ Roadmap

  • Release codebase for SpecKV and SpecPC
  • Release codebase for SpecKV-PC
  • Enable vLLM compatibility (SpecKV draft, SpecPC target)
  • Release Ada-SpecKV
  • Release Qwen2.5-VL support

📖 Citation

If you find this work useful, please cite:

@article{galim2025draft,
  title={Draft-based Approximate Inference for LLMs},
  author={Galim, Kevin and Ewer, Ethan and Kang, Wonjun and Lee, Minjae and Koo, Hyung Il and Lee, Kangwook},
  journal={arXiv preprint arXiv:2506.08373},
  year={2025}
}

๐Ÿค Contributions

Pull requests, issues, and feedback are welcome!

About

[ICLR 2026] Draft-based Approximate Inference for LLMs
