Commit 4c2ffb2

[Speculative decoding] Initial spec decode docs (#5400)
1 parent 246598a commit 4c2ffb2

2 files changed: +76 -0 lines changed

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -90,6 +90,7 @@ Documentation
    models/engine_args
    models/lora
    models/vlm
+   models/spec_decode
    models/performance

 .. toctree::

docs/source/models/spec_decode.rst

Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@

.. _spec_decode:

Speculative decoding in vLLM
============================

.. warning::

    Please note that speculative decoding in vLLM is not yet optimized and does not usually yield
    inter-token latency reductions for all prompt datasets or sampling parameters. The work to
    optimize it is ongoing and can be followed in `this issue <https://github.com/vllm-project/vllm/issues/4630>`_.

This document shows how to use `Speculative Decoding <https://x.com/karpathy/status/1697318534555336961>`_ with vLLM.
Speculative decoding is a technique that improves inter-token latency in memory-bound LLM inference.
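
To make the idea concrete, here is a minimal, self-contained sketch of the propose-then-verify loop
that speculative decoding builds on. It is illustrative only: the helper functions are hypothetical
stand-ins for the draft and target models, not vLLM's implementation, and vLLM verifies all proposed
tokens in a single target-model forward pass rather than one at a time.

.. code-block:: python

    # Toy sketch of speculative decoding's propose-then-verify loop.
    # ``cheap_draft`` and ``expensive_verify`` are hypothetical stand-ins for a
    # small draft model and the large target model.

    def cheap_draft(context: list[str], k: int) -> list[str]:
        # Propose k candidate tokens quickly (here: a fixed guess).
        return ["the", "best", "of", "times", "!"][:k]

    def expensive_verify(context: list[str], token: str) -> bool:
        # Decide whether the target model agrees with the proposed token.
        return token in {"the", "best"}

    def speculative_step(context: list[str], k: int = 5) -> list[str]:
        accepted = []
        for token in cheap_draft(context, k):
            if expensive_verify(context + accepted, token):
                accepted.append(token)  # keep the token the target model agrees with
            else:
                # The first rejection ends this speculation window; a real
                # implementation would also sample a replacement token here.
                break
        return accepted

    print(speculative_step("The future of AI is".split()))
    # ['the', 'best'] -- two tokens accepted from one speculation window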

Speculating with a draft model
------------------------------

The following code configures vLLM to use speculative decoding with a draft model, speculating 5 tokens at a time.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="facebook/opt-125m",
        num_speculative_tokens=5,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Speculating by matching n-grams in the prompt
---------------------------------------------

The following code configures vLLM to use speculative decoding where proposals are generated by
matching n-grams in the prompt. For more information, read `this thread <https://x.com/joao_gante/status/1747322413006643259>`_.

.. code-block:: python

    from vllm import LLM, SamplingParams

    prompts = [
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    llm = LLM(
        model="facebook/opt-6.7b",
        tensor_parallel_size=1,
        speculative_model="[ngram]",
        num_speculative_tokens=5,
        ngram_prompt_lookup_max=4,
        use_v2_block_manager=True,
    )
    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
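
The proposal step can be pictured with a simplified sketch (not vLLM's implementation): the most
recent generated tokens are matched against the prompt, and the tokens that followed the match are
proposed as speculative tokens. This mirrors the roles of ``ngram_prompt_lookup_max`` and
``num_speculative_tokens`` above; the function below is hypothetical.

.. code-block:: python

    # Simplified sketch of n-gram prompt lookup (not vLLM's implementation):
    # match the most recent tokens against the prompt and propose whatever
    # followed that match.

    def ngram_proposal(prompt: list[str], generated: list[str],
                       max_ngram: int = 4, num_speculative_tokens: int = 5) -> list[str]:
        for n in range(max_ngram, 0, -1):  # prefer the longest matching n-gram
            if len(generated) < n:
                continue
            suffix = generated[-n:]
            for i in range(len(prompt) - n):
                if prompt[i:i + n] == suffix:
                    # Propose the tokens that followed this n-gram in the prompt.
                    return prompt[i + n:i + n + num_speculative_tokens]
        return []  # no match found: nothing to speculate this step

    prompt = "the future of AI is the future of software".split()
    generated = "the future".split()
    print(ngram_proposal(prompt, generated))
    # ['of', 'AI', 'is', 'the', 'future']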

Resources for vLLM contributors
-------------------------------

* `A Hacker's Guide to Speculative Decoding in vLLM <https://www.youtube.com/watch?v=9wNAgpX6z_4>`_
* `What is Lookahead Scheduling in vLLM? <https://docs.google.com/document/d/1Z9TvqzzBPnh5WHcRwjvK2UEeFeq5zMZb5mFE8jR0HCs/edit#heading=h.1fjfb0donq5a>`_
* `Information on batch expansion <https://docs.google.com/document/d/1T-JaS2T1NRfdP51qzqpyakoCXxSXTtORppiwaj5asxA/edit#heading=h.kk7dq05lc6q8>`_
* `Dynamic speculative decoding <https://github.com/vllm-project/vllm/issues/4565>`_
