Commit b382964

Merge remote-tracking branch 'upstream/main' into upstream_merge_2025_06_20

2 parents 9d5b854 + 53243e5

104 files changed: +5744 / -3968 lines

.github/mergify.yml

Lines changed: 14 additions & 1 deletion

@@ -45,6 +45,7 @@ pull_request_rules:
       - files~=^vllm/entrypoints/openai/tool_parsers/llama.*\.py
       - files~=^vllm/model_executor/models/.*llama.*\.py
       - files~=^vllm/transformers_utils/configs/.*llama.*\.py
+      - title~=(?i)llama
   actions:
     label:
       add:
@@ -65,6 +66,19 @@ pull_request_rules:
       add:
         - multi-modality
 
+- name: label-performance
+  description: Automatically apply performance label
+  conditions:
+    - or:
+      - files~=^benchmarks/
+      - files~=^vllm/benchmarks/
+      - files~=^tests/benchmarks/
+      - files~=^\.buildkite/nightly-benchmarks/
+  actions:
+    label:
+      add:
+        - performance
+
 - name: label-qwen
   description: Automatically apply qwen label
   conditions:
@@ -74,7 +88,6 @@ pull_request_rules:
       - files~=^vllm/model_executor/models/.*qwen.*\.py
       - files~=^vllm/reasoning/.*qwen.*\.py
       - title~=(?i)Qwen
-      - body~=(?i)Qwen
   actions:
     label:
       add:
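
The added `label-performance` rule and the new `title~=(?i)llama` condition both use Mergify's regex operators, where `(?i)` makes the match case-insensitive and `files~=` is evaluated against each changed path. As a rough illustration of those matching semantics only (not Mergify's actual evaluation code; the PR title and file list below are invented):

```python
import re

# Invented PR metadata, for illustration only.
pr_title = "Speed up Llama prefill kernels"
pr_files = [
    "benchmarks/kernels/bench_attention.py",
    "vllm/model_executor/models/llama.py",
]

# title~=(?i)llama: case-insensitive regex search on the PR title.
matches_llama_title = bool(re.search(r"(?i)llama", pr_title))

# files~=^benchmarks/ etc.: regex search against each changed file path.
performance_patterns = [
    r"^benchmarks/",
    r"^vllm/benchmarks/",
    r"^tests/benchmarks/",
    r"^\.buildkite/nightly-benchmarks/",
]
matches_performance_files = any(
    re.search(pattern, path)
    for pattern in performance_patterns
    for path in pr_files
)

print(matches_llama_title, matches_performance_files)  # True True for the data above
```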

.pre-commit-config.yaml

Lines changed: 5 additions & 0 deletions

@@ -115,6 +115,11 @@ repos:
     entry: python tools/check_spdx_header.py
     language: python
     types: [python]
+  - id: check-root-lazy-imports
+    name: Check root lazy imports
+    entry: python tools/check_init_lazy_imports.py
+    language: python
+    types: [python]
   - id: check-filenames
     name: Check for spaces in all filenames
     entry: bash
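
The new `check-root-lazy-imports` hook invokes `tools/check_init_lazy_imports.py`, whose implementation is not part of this diff. Purely as a hypothetical sketch of the general idea (keeping heavy imports out of the package root `__init__.py` so `import vllm` stays cheap), such a check might look roughly like this; the allowlist, file path, and messages are assumptions, not the real tool:

```python
# Hypothetical sketch only -- NOT the actual tools/check_init_lazy_imports.py.
import ast
import sys
from pathlib import Path

# Assumed allowlist of modules considered cheap enough to import eagerly.
ALLOWED_EAGER = {"typing", "os", "importlib"}

def eager_imports(init_file: Path) -> list[str]:
    """Return module-level imports in __init__.py that are not allowlisted."""
    tree = ast.parse(init_file.read_text())
    offenders: list[str] = []
    for node in tree.body:  # top-level statements only; imports inside functions stay lazy
        if isinstance(node, ast.Import):
            offenders += [a.name for a in node.names if a.name not in ALLOWED_EAGER]
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""
            if module.split(".")[0] not in ALLOWED_EAGER:
                offenders.append(module or ".")
    return offenders

if __name__ == "__main__":
    bad = eager_imports(Path("vllm/__init__.py"))
    if bad:
        print("Non-lazy top-level imports found:", ", ".join(bad))
        sys.exit(1)
```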

README.md

Lines changed: 2 additions & 0 deletions

@@ -154,11 +154,13 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs
 
 ## Contact Us
 
+<!-- --8<-- [start:contact-us] -->
 - For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
 - For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
 - For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
 - For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
 - For collaborations and partnerships, please contact us at [[email protected]](mailto:[email protected])
+<!-- --8<-- [end:contact-us] -->
 
 ## Media Kit

benchmarks/backend_request_func.py

Lines changed: 7 additions & 1 deletion

@@ -404,8 +404,14 @@ async def async_request_openai_chat_completions(
                     chunk_bytes = chunk_bytes.strip()
                     if not chunk_bytes:
                         continue
+                    chunk_bytes = chunk_bytes.decode("utf-8")
+                    # NOTE: SSE comments (often used as pings) start with a colon.
+                    # These are not JSON data payload and should be skipped.
+                    if chunk_bytes.startswith(":"):
+                        continue
+
+                    chunk = chunk_bytes.removeprefix("data: ")
 
-                    chunk = chunk_bytes.decode("utf-8").removeprefix("data: ")
                     if chunk != "[DONE]":
                         timestamp = time.perf_counter()
                         data = json.loads(chunk)
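
For context on the change above: Server-Sent Events streams may interleave comment lines, which begin with a colon and are commonly used as keep-alive pings, with the `data:` payload lines. A minimal self-contained sketch of the same filtering logic, using made-up sample chunks:

```python
import json

# Made-up raw SSE chunks, roughly as they would arrive from the server.
raw_chunks = [
    b": keep-alive ping, not a JSON payload",
    b'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    b"data: [DONE]",
]

for chunk_bytes in raw_chunks:
    chunk_bytes = chunk_bytes.strip()
    if not chunk_bytes:
        continue
    text = chunk_bytes.decode("utf-8")
    # SSE comments start with a colon and must be skipped before JSON parsing.
    if text.startswith(":"):
        continue
    chunk = text.removeprefix("data: ")
    if chunk != "[DONE]":
        data = json.loads(chunk)
        print(data["choices"][0]["delta"]["content"])  # -> Hello
```

Without the colon check, `json.loads` would raise on the ping line, which is what the patch guards against.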

docs/ci/update_pytorch_version.md

Lines changed: 3 additions & 3 deletions

@@ -91,7 +91,7 @@ source to unblock the update process.
 ### FlashInfer
 Here is how to build and install it from source with torch2.7.0+cu128 in vLLM [Dockerfile](https://github.com/vllm-project/vllm/blob/27bebcd89792d5c4b08af7a65095759526f2f9e1/docker/Dockerfile#L259-L271):
 
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX'
 export FLASHINFER_ENABLE_SM90=1
 uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/[email protected]"
@@ -105,14 +105,14 @@ team if you want to get the package published there.
 ### xFormers
 Similar to FlashInfer, here is how to build and install xFormers from source:
 
-```
+```bash
 export TORCH_CUDA_ARCH_LIST='7.0 7.5 8.0 8.9 9.0 10.0+PTX'
 MAX_JOBS=16 uv pip install --system --no-build-isolation "git+https://github.com/facebookresearch/[email protected]"
 ```
 
 ### Mamba
 
-```
+```bash
 uv pip install --system --no-build-isolation "git+https://github.com/state-spaces/[email protected]"
 ```

docs/cli/README.md

Lines changed: 22 additions & 27 deletions

@@ -16,35 +16,33 @@ vllm {chat,complete,serve,bench,collect-env,run-batch}
 
 Start the vLLM OpenAI Compatible API server.
 
-Examples:
+??? Examples
 
-```bash
-# Start with a model
-vllm serve meta-llama/Llama-2-7b-hf
+    ```bash
+    # Start with a model
+    vllm serve meta-llama/Llama-2-7b-hf
 
-# Specify the port
-vllm serve meta-llama/Llama-2-7b-hf --port 8100
+    # Specify the port
+    vllm serve meta-llama/Llama-2-7b-hf --port 8100
 
-# Check with --help for more options
-# To list all groups
-vllm serve --help=listgroup
+    # Check with --help for more options
+    # To list all groups
+    vllm serve --help=listgroup
 
-# To view a argument group
-vllm serve --help=ModelConfig
+    # To view a argument group
+    vllm serve --help=ModelConfig
 
-# To view a single argument
-vllm serve --help=max-num-seqs
+    # To view a single argument
+    vllm serve --help=max-num-seqs
 
-# To search by keyword
-vllm serve --help=max
-```
+    # To search by keyword
+    vllm serve --help=max
+    ```
 
 ## chat
 
 Generate chat completions via the running API server.
 
-Examples:
-
 ```bash
 # Directly connect to localhost API without arguments
 vllm chat
@@ -60,8 +58,6 @@ vllm chat --quick "hi"
 
 Generate text completions based on the given prompt via the running API server.
 
-Examples:
-
 ```bash
 # Directly connect to localhost API without arguments
 vllm complete
@@ -73,6 +69,8 @@ vllm complete --url http://{vllm-serve-host}:{vllm-serve-port}/v1
 vllm complete --quick "The future of AI is"
 ```
 
+</details>
+
 ## bench
 
 Run benchmark tests for latency online serving throughput and offline inference throughput.
@@ -89,8 +87,6 @@ vllm bench {latency, serve, throughput}
 
 Benchmark the latency of a single batch of requests.
 
-Example:
-
 ```bash
 vllm bench latency \
     --model meta-llama/Llama-3.2-1B-Instruct \
@@ -104,8 +100,6 @@ vllm bench latency \
 
 Benchmark the online serving throughput.
 
-Example:
-
 ```bash
 vllm bench serve \
     --model meta-llama/Llama-3.2-1B-Instruct \
@@ -120,8 +114,6 @@ vllm bench serve \
 
 Benchmark offline inference throughput.
 
-Example:
-
 ```bash
 vllm bench throughput \
     --model meta-llama/Llama-3.2-1B-Instruct \
@@ -143,7 +135,8 @@ vllm collect-env
 
 Run batch prompts and write results to file.
 
-Examples:
+<details>
+<summary>Examples</summary>
 
 ```bash
 # Running with a local file
@@ -159,6 +152,8 @@ vllm run-batch \
     --model meta-llama/Meta-Llama-3-8B-Instruct
 ```
 
+</details>
+
 ## More Help
 
 For detailed options of any subcommand, use:

docs/community/contact_us.md

Lines changed: 6 additions & 0 deletions

@@ -0,0 +1,6 @@
+---
+title: Contact Us
+---
+[](){ #contactus }
+
+--8<-- "README.md:contact-us"

docs/configuration/conserving_memory.md

Lines changed: 31 additions & 27 deletions

@@ -57,19 +57,21 @@ By default, we optimize model inference using CUDA graphs which take up extra me
 
 You can adjust `compilation_config` to achieve a better balance between inference speed and memory usage:
 
-```python
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
-
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
-        level=CompilationLevel.PIECEWISE,
-        # By default, it goes up to max_num_seqs
-        cudagraph_capture_sizes=[1, 2, 4, 8, 16],
-    ),
-)
-```
+??? Code
+
+    ```python
+    from vllm import LLM
+    from vllm.config import CompilationConfig, CompilationLevel
+
+    llm = LLM(
+        model="meta-llama/Llama-3.1-8B-Instruct",
+        compilation_config=CompilationConfig(
+            level=CompilationLevel.PIECEWISE,
+            # By default, it goes up to max_num_seqs
+            cudagraph_capture_sizes=[1, 2, 4, 8, 16],
+        ),
+    )
+    ```
 
 You can disable graph capturing completely via the `enforce_eager` flag:
 
@@ -127,18 +129,20 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
 
 Here are some examples:
 
-```python
-from vllm import LLM
+??? Code
 
-# Available for Qwen2-VL series models
-llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-          mm_processor_kwargs={
-              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-          })
-
-# Available for InternVL series models
-llm = LLM(model="OpenGVLab/InternVL2-2B",
-          mm_processor_kwargs={
-              "max_dynamic_patch": 4,  # Default is 12
-          })
-```
+    ```python
+    from vllm import LLM
+
+    # Available for Qwen2-VL series models
+    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+              mm_processor_kwargs={
+                  "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+              })
+
+    # Available for InternVL series models
+    llm = LLM(model="OpenGVLab/InternVL2-2B",
+              mm_processor_kwargs={
+                  "max_dynamic_patch": 4,  # Default is 12
+              })
+    ```
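
The first hunk's trailing context mentions disabling CUDA graph capture entirely via `enforce_eager`; the corresponding code block sits just outside this hunk. For reference, a minimal sketch of that option (the model name is chosen only as an example):

```python
from vllm import LLM

# Skip CUDA graph capture entirely: slower inference, but no extra memory for graphs.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)
```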

docs/configuration/env_vars.md

Lines changed: 5 additions & 3 deletions

@@ -7,6 +7,8 @@ vLLM uses the following environment variables to configure the system:
 
 All environment variables used by vLLM are prefixed with `VLLM_`. **Special care should be taken for Kubernetes users**: please do not name the service as `vllm`, otherwise environment variables set by Kubernetes might conflict with vLLM's environment variables, because [Kubernetes sets environment variables for each service with the capitalized service name as the prefix](https://kubernetes.io/docs/concepts/services-networking/service/#environment-variables).
 
-```python
---8<-- "vllm/envs.py:env-vars-definition"
-```
+??? Code
+
+    ```python
+    --8<-- "vllm/envs.py:env-vars-definition"
+    ```

docs/contributing/README.md

Lines changed: 16 additions & 14 deletions

@@ -93,25 +93,27 @@ For additional features and advanced configurations, refer to the official [MkDo
 
 ## Testing
 
-```bash
-pip install -r requirements/dev.txt
+??? note "Commands"
 
-# Linting, formatting and static type checking
-pre-commit install --hook-type pre-commit --hook-type commit-msg
+    ```bash
+    pip install -r requirements/dev.txt
 
-# You can manually run pre-commit with
-pre-commit run --all-files
+    # Linting, formatting and static type checking
+    pre-commit install --hook-type pre-commit --hook-type commit-msg
 
-# To manually run something from CI that does not run
-# locally by default, you can run:
-pre-commit run mypy-3.9 --hook-stage manual --all-files
+    # You can manually run pre-commit with
+    pre-commit run --all-files
 
-# Unit tests
-pytest tests/
+    # To manually run something from CI that does not run
+    # locally by default, you can run:
+    pre-commit run mypy-3.9 --hook-stage manual --all-files
 
-# Run tests for a single test file with detailed output
-pytest -s -v tests/test_logger.py
-```
+    # Unit tests
+    pytest tests/
+
+    # Run tests for a single test file with detailed output
+    pytest -s -v tests/test_logger.py
+    ```
 
 !!! tip
     Since the <gh-file:docker/Dockerfile> ships with Python 3.12, all tests in CI (except `mypy`) are run with Python 3.12.
