
Speculative Decoding with vLLM

This guide shows how to use speculative decoding with the vLLM backend.

See also: Speculative Decoding Overview for cross-backend documentation.

Prerequisites

  • vLLM container with Eagle3 support
  • GPU with at least 16GB VRAM (see the quick check after this list)
  • Hugging Face access token (for gated models)
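
To run the quick check referenced above, query each GPU's total memory. nvidia-smi ships with the NVIDIA driver, so it should be available on any machine that meets the GPU prerequisite:

# List each GPU and its total memory; expect at least ~16384 MiB
nvidia-smi --query-gpu=name,memory.total --format=csv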

Quick Start: Meta-Llama-3.1-8B-Instruct + Eagle3

This guide walks through deploying Meta-Llama-3.1-8B-Instruct with Eagle3 speculative decoding on a single node.

Step 1: Set Up Your Docker Environment

First, initialize a Docker container using the vLLM backend. See the vLLM Quickstart Guide for details.

# Launch infrastructure services
docker compose -f deploy/docker-compose.yml up -d

# Build the container
./container/build.sh --framework VLLM

# Run the container
./container/run.sh -it --framework VLLM --mount-workspace
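
Before moving on, you can optionally confirm that the infrastructure services started cleanly, using the same compose file:

# All services should show a running (or healthy) status
docker compose -f deploy/docker-compose.yml ps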

Step 2: Get Access to the Llama-3 Model

The Meta-Llama-3.1-8B-Instruct model is gated. Request access via the Meta-Llama-3.1-8B-Instruct repository on Hugging Face.

Approval time varies depending on Hugging Face review traffic.

Once approved, set your access token inside the container:

export HUGGING_FACE_HUB_TOKEN="insert_your_token_here"
export HF_TOKEN=$HUGGING_FACE_HUB_TOKEN
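
To confirm the token is picked up correctly, one option is the huggingface_hub CLI (assuming it is installed in the container, as it typically is alongside vLLM):

# Prints your Hugging Face username if the token is valid
huggingface-cli whoami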

Step 3: Run Aggregated Speculative Decoding

# Requires only one GPU
cd examples/backends/vllm
bash launch/agg_spec_decoding.sh

Once the weights finish downloading, the server will be ready for inference requests.
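
If you are scripting the deployment, a minimal readiness poll looks like the following (this assumes the standard OpenAI-compatible /v1/models endpoint on port 8000, the same port used in the test below):

# Block until the endpoint responds, then continue
until curl -sf http://localhost:8000/v1/models > /dev/null; do
  sleep 5
done
echo "Server is ready"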

Step 4: Test the Deployment

curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
     "messages": [
       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
     ],
     "max_tokens": 250
   }'

Example Output

{
  "id": "cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8",
  "choices": [
    {
      "message": {
        "role": "assistant",
        "content": "In cherry blossom's gentle breeze ... A delicate balance of life and death, as petals fade, and new life breathes."
      },
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "usage": {
    "prompt_tokens": 16,
    "completion_tokens": 250,
    "total_tokens": 266
  }
}
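
Speculative decoding primarily improves per-token latency, which is easiest to observe with streaming. Since the server exposes the OpenAI-compatible chat completions API, the same request can be streamed by adding "stream": true:

curl http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
   -d '{
     "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
     "messages": [
       {"role": "user", "content": "Write a poem about why Sakura trees are beautiful."}
     ],
     "max_tokens": 250,
     "stream": true
   }'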

Configuration

This example uses Eagle3 as the draft model for speculative decoding. The launch script configures:

  • Target model: meta-llama/Meta-Llama-3.1-8B-Instruct
  • Draft model: Eagle3 variant
  • Aggregated serving mode

See examples/backends/vllm/launch/agg_spec_decoding.sh for the full configuration.
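
For orientation, the sketch below shows how Eagle3 is typically enabled when serving with vLLM directly. Treat it as illustrative only: the --speculative-config flag shape, the draft checkpoint name, and the num_speculative_tokens value are assumptions here, and the launch script above remains the authoritative configuration.

# Illustrative sketch only: the draft checkpoint and token count are assumed values
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}'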

Limitations

  • Currently supports only Eagle3 as the draft model
  • The target and draft model architectures must be compatible

See Also

Document                        Path
Speculative Decoding Overview   README.md
vLLM Backend Guide              vLLM README
Meta-Llama-3.1-8B-Instruct      Hugging Face