Using Speculative Decoding with the vLLM Backend
See also: Speculative Decoding Overview for cross-backend documentation.
This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.

Prerequisites:

- vLLM container with Eagle3 support
- GPU with at least 16 GB VRAM
- Hugging Face access token (for gated models)
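Before launching, you can confirm the GPU meets the 16 GB VRAM prerequisite. A minimal sketch using `nvidia-smi` (the guard is only there so the snippet degrades gracefully on machines without NVIDIA drivers):

```bash
# Report total GPU memory to check the 16 GB prerequisite.
# Requires the NVIDIA driver; skips cleanly when nvidia-smi is absent.
if command -v nvidia-smi > /dev/null; then
  nvidia-smi --query-gpu=memory.total --format=csv,noheader
else
  echo "nvidia-smi not found; skipping VRAM check"
fi
```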
First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.
```bash
# Launch infrastructure services
docker compose -f deploy/docker-compose.yml up -d

# Build the container
./container/build.sh --framework VLLM

# Run the container
./container/run.sh -it --framework VLLM --mount-workspace
```

The Meta-Llama-3.1-8B-Instruct model is gated. Request access on Hugging Face: Meta-Llama-3.1-8B-Instruct repository.
Approval time varies depending on Hugging Face review traffic.
Once approved, set your access token inside the container:
```bash
export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
```

Then launch the aggregated deployment:

```bash
# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh
```

Once the weights finish downloading, the server will be ready for inference requests.
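Readiness can be detected by polling the server's OpenAI-compatible `/v1/models` endpoint until it answers. A minimal sketch, assuming the default port 8000; the helper name, endpoint choice, and retry budget are illustrative, not part of the launch script:

```bash
# Poll a URL until it responds, or give up after a number of attempts.
# Usage: wait_for_server <url> [attempts]
wait_for_server() {
  local url="$1" attempts="${2:-30}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if curl -sf "$url" > /dev/null; then
      echo "server ready"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "server not ready after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_for_server http://localhost:8000/v1/models` blocks until the deployment above starts answering.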
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
    ],
    "max_tokens": 250
  }'
```

Example response:

```json
{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
```

Speculative decoding in vLLM uses Eagle3 as the draft model. The launch script configures:
- Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
- Draft model: Eagle3 variant
- Aggregated serving mode
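As a rough illustration only, recent vLLM releases express this kind of setup as a JSON blob passed to `vllm serve` via `--speculative-config`. This is a hypothetical config fragment, not the launch script's actual contents, and the draft-model path is a placeholder:

```bash
# Hypothetical sketch; the authoritative flags live in the launch script.
# "<eagle3-draft-model>" is a placeholder, not a real repository.
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "<eagle3-draft-model>", "num_speculative_tokens": 3}'
```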
See `examples/backends/vllm/launch/agg_spec_decoding.sh` for the full configuration.
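For scripting against the endpoint, individual fields can be pulled out of a chat-completion response with jq (an assumption: jq is installed; the JSON below abbreviates the example response shown earlier):

```bash
# Extract the assistant's reply and the token usage from a saved response.
# The JSON abbreviates the example response above.
response='{"choices":[{"message":{"role":"assistant","content":"A delicate balance of life and death ..."},"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"completion_tokens":250,"total_tokens":266}}'

echo "$response" | jq -r '.choices[0].message.content'   # prints the assistant text
echo "$response" | jq -r '.usage.total_tokens'           # prints 266
```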
Limitations:

- Currently supports only Eagle3 as the draft model
- Requires compatible model architectures between target and draft
| Document | Path |
|---|---|
| Speculative Decoding Overview | README.md |
| vLLM Backend Guide | vLLM README |
| Meta-Llama-3.1-8B-Instruct | Hugging Face |