
Commit 0286875

Merge branch 'main' into amd/gfx950_skinny_gemm

Signed-off-by: charlifu <[email protected]>

2 parents 630ed84 + c290340


51 files changed: +1735 −1498 lines

.buildkite/nightly-benchmarks/README.md

Lines changed: 1 addition & 1 deletion
@@ -113,7 +113,7 @@ WARNING: The benchmarking script will save json results by itself, so please do
 ### Visualizing the results
 
-The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
+The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](performance-benchmarks-descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.

benchmarks/backend_request_func.py

Lines changed: 2 additions & 1 deletion
@@ -324,7 +324,7 @@ async def async_request_openai_completions(
             most_recent_timestamp = timestamp
             generated_text += text or ""
-        elif usage := data.get("usage"):
+        if usage := data.get("usage"):
             output.output_tokens = usage.get("completion_tokens")
 if first_chunk_received:
     output.success = True
@@ -611,6 +611,7 @@ def get_tokenizer(
     "tensorrt-llm": async_request_trt_llm,
     "scalellm": async_request_openai_completions,
     "sglang": async_request_openai_completions,
+    "llama.cpp": async_request_openai_completions,
 }
 
 OPENAI_COMPATIBLE_BACKENDS = [
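
The switch from `elif` to `if` matters because an OpenAI-compatible server that reports usage may attach the statistics to a text chunk or send them in a trailing chunk whose `choices` list is empty; an independent `if` catches both cases. A minimal sketch of that parsing pattern (hypothetical chunk payloads, not the vLLM benchmark code itself):

```python
import json

# Hypothetical SSE chunks from an OpenAI-compatible completions stream;
# the last chunk carries only usage statistics and no text.
chunks = [
    '{"choices": [{"text": "Hello"}]}',
    '{"choices": [{"text": " world"}]}',
    '{"choices": [], "usage": {"completion_tokens": 2}}',
]

generated_text = ""
output_tokens = None
for raw in chunks:
    data = json.loads(raw)
    if choices := data.get("choices"):
        generated_text += choices[0].get("text") or ""
    # A standalone `if` (not `elif`) so a usage-only chunk, or one that
    # carries both text and usage, is always accounted for.
    if usage := data.get("usage"):
        output_tokens = usage.get("completion_tokens")

print(generated_text, output_tokens)  # -> Hello world 2
```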

benchmarks/benchmark_serving.py

Lines changed: 4 additions & 0 deletions
@@ -762,6 +762,10 @@ def main(args: argparse.Namespace):
     if "temperature" not in sampling_params:
         sampling_params["temperature"] = 0.0  # Default to greedy decoding.
 
+    if args.backend == "llama.cpp":
+        # Disable prompt caching in llama.cpp backend
+        sampling_params["cache_prompt"] = False
+
     # Avoid GC processing "static" data - reduce pause times.
     gc.collect()
     gc.freeze()
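
For context, a small sketch of how these per-backend knobs end up in the request body; `build_payload` is a hypothetical helper rather than code from `benchmark_serving.py`, and it assumes the llama.cpp server honors a `cache_prompt` field on the request:

```python
# Hypothetical helper illustrating how extra sampling params are merged
# into the request body sent by the benchmark client.
def build_payload(prompt: str, backend: str, **sampling_params) -> dict:
    sampling_params.setdefault("temperature", 0.0)  # default to greedy decoding
    if backend == "llama.cpp":
        # llama.cpp's server caches prompt prefixes by default; turning it off
        # keeps repeated benchmark prompts from being served out of the cache.
        # (Assumption: the target endpoint accepts a "cache_prompt" field.)
        sampling_params["cache_prompt"] = False
    return {"prompt": prompt, "stream": True, **sampling_params}


print(build_payload("The future of AI is", backend="llama.cpp", max_tokens=16))
```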

docs/.nav.yml

Lines changed: 3 additions & 0 deletions
@@ -12,6 +12,7 @@ nav:
 - User Guide: usage/README.md
 - Developer Guide: contributing/README.md
 - API Reference: api/README.md
+- CLI Reference: cli/README.md
 - Timeline:
 - Roadmap: https://roadmap.vllm.ai
 - Releases: https://github.com/vllm-project/vllm/releases
@@ -56,6 +57,8 @@ nav:
 - Contents:
 - glob: api/vllm/*
 preserve_directory_names: true
+- CLI Reference:
+- Summary: cli/README.md
 - Community:
 - community/*
 - Blog: https://blog.vllm.ai

docs/cli/README.md

Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@

# vLLM CLI Guide

The `vllm` command-line tool is used to run and manage vLLM models. You can start by viewing the help message with:

```
vllm --help
```

Available Commands:

```
vllm {chat,complete,serve,bench,collect-env,run-batch}
```

## Table of Contents

- [serve](#serve)
- [chat](#chat)
- [complete](#complete)
- [bench](#bench)
    - [latency](#latency)
    - [serve](#serve-1)
    - [throughput](#throughput)
- [collect-env](#collect-env)
- [run-batch](#run-batch)
- [More Help](#more-help)

## serve

Start the vLLM OpenAI Compatible API server.

Examples:

```bash
# Start with a model
vllm serve meta-llama/Llama-2-7b-hf

# Specify the port
vllm serve meta-llama/Llama-2-7b-hf --port 8100

# Check with --help for more options
# To list all groups
vllm serve --help=listgroup

# To view an argument group
vllm serve --help=ModelConfig

# To view a single argument
vllm serve --help=max-num-seqs

# To search by keyword
vllm serve --help=max
```

## chat

Generate chat completions via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm chat

# Specify API url
vllm chat --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick chat with a single prompt
vllm chat --quick "hi"
```

## complete

Generate text completions based on the given prompt via the running API server.

Examples:

```bash
# Directly connect to localhost API without arguments
vllm complete

# Specify API url
vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1

# Quick complete with a single prompt
vllm complete --quick "The future of AI is"
```

## bench

Run benchmark tests for latency, online serving throughput, and offline inference throughput.

Available Commands:

```bash
vllm bench {latency, serve, throughput}
```

### latency

Benchmark the latency of a single batch of requests.

Example:

```bash
vllm bench latency \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

### serve

Benchmark the online serving throughput.

Example:

```bash
vllm bench serve \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --host server-host \
    --port server-port \
    --random-input-len 32 \
    --random-output-len 4 \
    --num-prompts 5
```

### throughput

Benchmark offline inference throughput.

Example:

```bash
vllm bench throughput \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --input-len 32 \
    --output-len 1 \
    --enforce-eager \
    --load-format dummy
```

## collect-env

Start collecting environment information.

```bash
vllm collect-env
```

## run-batch

Run batch prompts and write results to a file.

Examples:

```bash
# Running with a local file
vllm run-batch \
    -i offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct

# Using a remote file
vllm run-batch \
    -i https://raw.githubusercontent.com/vllm-project/vllm/main/examples/offline_inference/openai_batch/openai_example_batch.jsonl \
    -o results.jsonl \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```

## More Help

For detailed options of any subcommand, use:

```bash
vllm <subcommand> --help
```

docs/features/compatibility_matrix.md

Lines changed: 21 additions & 17 deletions
@@ -10,6 +10,7 @@ The symbols used have the following meanings:
 - ✅ = Full compatibility
 - 🟠 = Partial compatibility
 - ❌ = No compatibility
+- ❔ = Unknown or TBD
 
 !!! note
     Check the ❌ or 🟠 with links to see tracking issue for unsupported feature/hardware combination.
@@ -36,23 +37,23 @@ th:not(:first-child) {
 }
 </style>
 
-| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
-|-----------------------------------------------------------|-------------------------|-----------------------------------|------------------------|---------------------------------------------------|---------------------|--------------|-----------------------------------------------|-------------------------------------------------------|--------------------------------------|---------------------------------------------------|-------------------------------------------------------------|--------------------|---------------------------------------------|-----------|---------------|
-| [CP][chunked-prefill] | | | | | | | | | | | | | | | |
-| [APC][automatic-prefix-caching] | | | | | | | | | | | | | | | |
-| [LoRA][lora-adapter] | | | | | | | | | | | | | | | |
-| <abbr title="Prompt Adapter">prmpt adptr</abbr> | | | | | | | | | | | | | | | |
-| [SD][spec-decode] | | | | | | | | | | | | | | | |
-| CUDA graph | | | | | | | | | | | | | | | |
-| <abbr title="Pooling Models">pooling</abbr> | | | | | | | | | | | | | | | |
-| <abbr title="Encoder-Decoder Models">enc-dec</abbr> | | [](gh-issue:7366) | | | [](gh-issue:7366) | | | | | | | | | | |
-| <abbr title="Logprobs">logP</abbr> | | | | | | | | | | | | | | | |
-| <abbr title="Prompt Logprobs">prmpt logP</abbr> | | | | | | | | | | | | | | | |
-| <abbr title="Async Output Processing">async output</abbr> | | | | | | | | | | | | | | | |
-| multi-step | | | | | | | | | | | | | | | |
-| <abbr title="Multimodal Inputs">mm</abbr> | | [🟠](gh-pr:8348) | [🟠](gh-pr:4194) | | | | | | | | | | | | |
-| best-of | | | | | [](gh-issue:6137) | | | | | | | [](gh-issue:7968) | | | |
-| beam-search | | | | | [](gh-issue:6137) | | | | | | | [](gh-issue:7968) | | | |
+| Feature | [CP][chunked-prefill] | [APC][automatic-prefix-caching] | [LoRA][lora-adapter] | <abbr title="Prompt Adapter">prmpt adptr</abbr> | [SD][spec-decode] | CUDA graph | <abbr title="Pooling Models">pooling</abbr> | <abbr title="Encoder-Decoder Models">enc-dec</abbr> | <abbr title="Logprobs">logP</abbr> | <abbr title="Prompt Logprobs">prmpt logP</abbr> | <abbr title="Async Output Processing">async output</abbr> | multi-step | <abbr title="Multimodal Inputs">mm</abbr> | best-of | beam-search |
+|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+| [CP][chunked-prefill] || | | | | | | | | | | | | | |
+| [APC][automatic-prefix-caching] ||| | | | | | | | | | | | | |
+| [LoRA][lora-adapter] |||| | | | | | | | | | | | |
+| <abbr title="Prompt Adapter">prmpt adptr</abbr> ||||| | | | | | | | | | | |
+| [SD][spec-decode] |||||| | | | | | | | | | |
+| CUDA graph ||||||| | | | | | | | | |
+| <abbr title="Pooling Models">pooling</abbr> |||||||| | | | | | | | |
+| <abbr title="Encoder-Decoder Models">enc-dec</abbr> || [](gh-issue:7366) ||| [](gh-issue:7366) |||| | | | | | | |
+| <abbr title="Logprobs">logP</abbr> |||||||||| | | | | | |
+| <abbr title="Prompt Logprobs">prmpt logP</abbr> ||||||||||| | | | | |
+| <abbr title="Async Output Processing">async output</abbr> |||||||||||| | | | |
+| multi-step ||||||||||||| | | |
+| <abbr title="Multimodal Inputs">mm</abbr> || [🟠](gh-pr:8348) | [🟠](gh-pr:4194) ||||||||||| | |
+| best-of ||||| [](gh-issue:6137) ||||||| [](gh-issue:7968) ||| |
+| beam-search ||||| [](gh-issue:6137) ||||||| [](gh-issue:7968) ||||
 
 [](){ #feature-x-hardware }

@@ -75,3 +76,6 @@ th:not(:first-child) {
 | multi-step |||||| [](gh-issue:8477) ||
 | best-of ||||||||
 | beam-search ||||||||
+
+!!! note
+    Please refer to [Feature support through NxD Inference backend][feature-support-through-nxd-inference-backend] for features supported on AWS Neuron hardware

docs/features/lora.md

Lines changed: 3 additions & 2 deletions
@@ -165,6 +165,7 @@ it will first look in the local directory for a directory `foobar`, and attempt
 that adapter will then be available for normal use on the server.
 
 Alternatively, follow these example steps to implement your own plugin:
+
 1. Implement the LoRAResolver interface.
 
 Example of a simple S3 LoRAResolver implementation:
@@ -198,9 +199,9 @@ Alternatively, follow these example steps to implement your own plugin:
 return lora_request
 ```
 
-2. Register LoRAResolver plugin.
+2. Register `LoRAResolver` plugin.
 
-```python
+```python
 from vllm.lora.resolver import LoRAResolverRegistry
 
 s3_resolver = S3LoRAResolver()
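
The diff truncates the registration snippet; a minimal sketch of how the complete step could look, assuming the `S3LoRAResolver` from step 1 is in scope and that `LoRAResolverRegistry.register_resolver` is the registration entry point (the resolver name string is an illustrative choice):

```python
from vllm.lora.resolver import LoRAResolverRegistry

# Assumes S3LoRAResolver from step 1 is importable in the current module.
s3_resolver = S3LoRAResolver()

# Register the resolver under a chosen name so vLLM can look it up at runtime.
LoRAResolverRegistry.register_resolver("s3_resolver", s3_resolver)
```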
