Commit 7e9c0bb

Merge branch 'vllm-project:main' into main

2 parents 05ddcb9 + 766bc81
138 files changed (+6372 / -1358 lines)

.buildkite/nightly-benchmarks/README.md

Lines changed: 4 additions & 4 deletions
@@ -168,9 +168,9 @@ See [nightly-descriptions.md](nightly-descriptions.md) for the detailed descript
 ### Workflow
 
 - The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
-- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
-- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
-- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
+- Inside each container, we run [scripts/run-nightly-benchmarks.sh](scripts/run-nightly-benchmarks.sh), which will probe the serving engine of the current container.
+- The `scripts/run-nightly-benchmarks.sh` will parse the workload described in [nightly-tests.json](tests/nightly-tests.json) and launch the right benchmark for the specified serving engine via `scripts/launch-server.sh`.
+- At last, we run [scripts/summary-nightly-results.py](scripts/summary-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
 
 ### Nightly tests
 
@@ -180,6 +180,6 @@ In [nightly-tests.json](tests/nightly-tests.json), we include the command line a
 
 The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
 
-WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
+WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `scripts/run-nightly-benchmarks.sh` and `scripts/launch-server.sh`.
 
 WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).

benchmarks/kernels/benchmark_bitblas.py

Lines changed: 3 additions & 1 deletion
@@ -3,14 +3,16 @@
 # Copyright (c) Microsoft Corporation.
 # Licensed under the MIT License.
 
+from packaging import version
+
 from vllm.model_executor.layers.quantization.utils.bitblas_utils import (
     MINIMUM_BITBLAS_VERSION,
 )
 
 try:
     import bitblas
 
-    if bitblas.__version__ < MINIMUM_BITBLAS_VERSION:
+    if version.parse(bitblas.__version__) < version.parse(MINIMUM_BITBLAS_VERSION):
         raise ImportError(
             "bitblas version is wrong. Please "
             f"install bitblas>={MINIMUM_BITBLAS_VERSION}"

cmake/external_projects/vllm_flash_attn.cmake

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ else()
   FetchContent_Declare(
         vllm-flash-attn
         GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
-        GIT_TAG b99f8c821771fd11feb66d5c89661e9858fde359
+        GIT_TAG 6dbc6e011a3ebe9349eeb74578940dd7095436ba
         GIT_PROGRESS TRUE
         # Don't share the vllm-flash-attn build between build types
         BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn

csrc/mamba/mamba_ssm/selective_scan.h

Lines changed: 3 additions & 0 deletions
@@ -45,6 +45,9 @@ struct SSMParamsBase {
     index_t out_d_stride;
     index_t out_z_batch_stride;
     index_t out_z_d_stride;
+    index_t ssm_states_batch_stride;
+    index_t ssm_states_dim_stride;
+    index_t ssm_states_dstate_stride;
 
     // Common data pointers.
     void *__restrict__ A_ptr;

csrc/mamba/mamba_ssm/selective_scan_fwd.cu

Lines changed: 14 additions & 4 deletions
@@ -132,8 +132,10 @@ void selective_scan_fwd_kernel(SSMParamsBase params) {
     input_t *Bvar = reinterpret_cast<input_t *>(params.B_ptr) + sequence_start_index * params.B_batch_stride + group_id * params.B_group_stride;
     weight_t *C = reinterpret_cast<weight_t *>(params.C_ptr) + dim_id * kNRows * params.C_d_stride;
     input_t *Cvar = reinterpret_cast<input_t *>(params.C_ptr) + sequence_start_index * params.C_batch_stride + group_id * params.C_group_stride;
-    input_t *ssm_states = reinterpret_cast<input_t *>(params.ssm_states_ptr) + (cache_index * params.dim + dim_id * kNRows) * params.dstate;
-
+    input_t *ssm_states = reinterpret_cast<input_t *>(params.ssm_states_ptr) +
+        cache_index * params.ssm_states_batch_stride +
+        dim_id * kNRows * params.ssm_states_dim_stride;
+
     float D_val[kNRows] = {0};
     if (params.D_ptr != nullptr) {
         #pragma unroll
@@ -248,7 +250,7 @@ void selective_scan_fwd_kernel(SSMParamsBase params) {
                 }
                 // Initialize running total
 
-                scan_t running_prefix = chunk > 0 ? smem_running_prefix[state_idx + r * MAX_DSTATE] : make_float2(1.0, has_initial_state ? float(ssm_states[state_idx]): 0.0);
+                scan_t running_prefix = chunk > 0 ? smem_running_prefix[state_idx + r * MAX_DSTATE] : make_float2(1.0, has_initial_state ? float(ssm_states[state_idx * params.ssm_states_dstate_stride]): 0.0);
 
                 SSMScanPrefixCallbackOp<weight_t> prefix_op(running_prefix);
                 typename Ktraits::BlockScanT(smem_scan).InclusiveScan(
@@ -259,7 +261,7 @@ void selective_scan_fwd_kernel(SSMParamsBase params) {
                 if (threadIdx.x == 0) {
                     smem_running_prefix[state_idx] = prefix_op.running_prefix;
                     if (chunk == n_chunks - 1) {
-                        ssm_states[state_idx] = input_t(prefix_op.running_prefix.y);
+                        ssm_states[state_idx * params.ssm_states_dstate_stride] = input_t(prefix_op.running_prefix.y);
                     }
                 }
                 #pragma unroll
@@ -481,6 +483,10 @@ void set_ssm_params_fwd(SSMParamsBase &params,
        params.out_batch_stride = out.stride(1);
        params.out_d_stride = out.stride(0);
 
+       params.ssm_states_batch_stride = ssm_states.stride(0);
+       params.ssm_states_dim_stride = ssm_states.stride(1);
+       params.ssm_states_dstate_stride = ssm_states.stride(2);
+
     }
     else{
         if (!is_variable_B) {
@@ -509,6 +515,10 @@ void set_ssm_params_fwd(SSMParamsBase &params,
         }
         params.out_batch_stride = out.stride(0);
         params.out_d_stride = out.stride(1);
+
+        params.ssm_states_batch_stride = ssm_states.stride(0);
+        params.ssm_states_dim_stride = ssm_states.stride(1);
+        params.ssm_states_dstate_stride = ssm_states.stride(2);
     }
 }
 
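For context, the removed line computed a flat offset into `ssm_states` assuming a densely packed `(batch, dim, dstate)` layout; the new `ssm_states_*_stride` fields, filled from `ssm_states.stride(...)` in `set_ssm_params_fwd`, let the kernel index correctly even when the cache tensor is a non-contiguous view. A small PyTorch sketch of the same idea, with hypothetical sizes chosen only for illustration:

```python
import torch

# Hypothetical sizes, purely for illustration.
batch, dim, dstate = 2, 3, 4

# Build a non-contiguous (batch, dim, dstate) view by slicing a wider buffer,
# mimicking a cache tensor whose strides do not match a packed layout.
buffer = torch.arange(batch * dim * 2 * dstate).reshape(batch, dim, 2 * dstate)
ssm_states = buffer[..., :dstate]
assert not ssm_states.is_contiguous()

flat = buffer.reshape(-1)  # the underlying storage; element value == storage index
b, d, s = 1, 2, 3

# Offset assuming a packed (batch, dim, dstate) layout: wrong for this view.
packed_offset = (b * dim + d) * dstate + s

# Stride-based offset, analogous to what the kernel now computes: correct.
sb, sd, ss = ssm_states.stride()
stride_offset = b * sb + d * sd + s * ss

print(flat[packed_offset].item() == ssm_states[b, d, s].item())   # False
print(flat[stride_offset].item() == ssm_states[b, d, s].item())   # True
```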

docker/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -392,7 +392,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
 ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
 # Keep this in sync with https://github.com/vllm-project/vllm/blob/main/requirements/cuda.txt
 # We use `--force-reinstall --no-deps` to avoid issues with the existing FlashInfer wheel.
-ARG FLASHINFER_GIT_REF="v0.2.9"
+ARG FLASHINFER_GIT_REF="v0.2.10"
 RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
     . /etc/environment
     git clone --depth 1 --recursive --shallow-submodules \
Two additional changed files (90.6 KB and 87.9 KB) are not rendered in this view.

docs/configuration/conserving_memory.md

Lines changed: 15 additions & 17 deletions
@@ -86,7 +86,7 @@ llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct",
 
 If you run out of CPU RAM, try the following options:
 
-- (Multi-modal models only) you can set the size of multi-modal input cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB).
+- (Multi-modal models only) you can set the size of multi-modal processor cache using `VLLM_MM_INPUT_CACHE_GIB` environment variable (default 4 GiB per API process + 4 GiB per engine core process)
 - (CPU backend only) you can set the size of KV cache using `VLLM_CPU_KVCACHE_SPACE` environment variable (default 4 GiB).
 
 ## Multi-modal input limits
@@ -129,20 +129,18 @@ reduce the size of the processed multi-modal inputs, which in turn saves memory.
 
 Here are some examples:
 
-??? code
-
-    ```python
-    from vllm import LLM
+```python
+from vllm import LLM
 
-    # Available for Qwen2-VL series models
-    llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
-              mm_processor_kwargs={
-                  "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
-              })
-
-    # Available for InternVL series models
-    llm = LLM(model="OpenGVLab/InternVL2-2B",
-              mm_processor_kwargs={
-                  "max_dynamic_patch": 4,  # Default is 12
-              })
-    ```
+# Available for Qwen2-VL series models
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          mm_processor_kwargs={
+              "max_pixels": 768 * 768,  # Default is 1280 * 28 * 28
+          })
+
+# Available for InternVL series models
+llm = LLM(model="OpenGVLab/InternVL2-2B",
+          mm_processor_kwargs={
+              "max_dynamic_patch": 4,  # Default is 12
+          })
+```

docs/configuration/optimization.md

Lines changed: 29 additions & 44 deletions
@@ -2,6 +2,9 @@
 
 This guide covers optimization strategies and performance tuning for vLLM V1.
 
+!!! tip
+    Running out of memory? Consult [this guide](./conserving_memory.md) on how to conserve memory.
+
 ## Preemption
 
 Due to the auto-regressive nature of transformer architecture, there are times when KV cache space is insufficient to handle all batched requests.
@@ -126,62 +129,44 @@ Data parallelism replicates the entire model across multiple GPU sets and proces
 Data parallelism can be combined with the other parallelism strategies and is set by `data_parallel_size=N`.
 Note that MoE layers will be sharded according to the product of the tensor parallel size and data parallel size.
 
-## Reducing Memory Usage
-
-If you encounter out-of-memory issues, consider these strategies:
+## Input Processing
 
-### Context Length and Batch Size
+### Parallel Processing
 
-You can reduce memory usage by limiting the context length and batch size:
+You can run input processing in parallel via [API server scale-out](../serving/data_parallel_deployment.md#internal-load-balancing).
+This is useful when input processing (which is run inside the API server)
+becomes a bottleneck compared to model execution (which is run inside engine core)
+and you have excess CPU capacity.
 
-```python
-from vllm import LLM
+```console
+# Run 4 API processes and 1 engine core process
+vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4
 
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    max_model_len=2048,  # Limit context window
-    max_num_seqs=4  # Limit batch size
-)
+# Run 4 API processes and 2 engine core processes
+vllm serve Qwen/Qwen2.5-VL-3B-Instruct --api-server-count 4 -dp 2
 ```
 
-### Adjust CUDA Graph Compilation
+!!! note
+    API server scale-out is only available for online inference.
 
-CUDA graph compilation in V1 uses more memory than in V0. You can reduce memory usage by adjusting the compilation level:
-
-```python
-from vllm import LLM
-from vllm.config import CompilationConfig, CompilationLevel
-
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    compilation_config=CompilationConfig(
-        level=CompilationLevel.PIECEWISE,
-        cudagraph_capture_sizes=[1, 2, 4, 8]  # Capture fewer batch sizes
-    )
-)
-```
+!!! note
+    [Multi-modal processor cache](#processor-cache) is disabled when API server scale-out is enabled
+    because it requires a one-to-one correspondance between API and engine core processes.
 
-Or, if you are not concerned about latency or overall performance, disable CUDA graph compilation entirely with `enforce_eager=True`:
+## Multi-Modal Caching
 
-```python
-from vllm import LLM
+### Processor Cache
 
-llm = LLM(
-    model="meta-llama/Llama-3.1-8B-Instruct",
-    enforce_eager=True  # Disable CUDA graph compilation
-)
-```
+By default, the multi-modal processor cache is enabled to avoid repeatedly processing
+the same multi-modal inputs via Hugging Face `AutoProcessor`,
+which commonly occurs in multi-turn conversations.
 
-### Multimodal Models
+You can adjust the size of the cache via `VLLM_MM_INPUT_CACHE_GIB` environment variable
+(default 4 GiB per API process + 4 GiB per engine core process).
 
-For multi-modal models, you can reduce memory usage by limiting the number of images/videos per request:
+If you do not benefit much from the cache, you can disable it completely via `disable_mm_preprocessor_cache`:
 
 ```python
-from vllm import LLM
-
-# Accept up to 2 images per prompt
-llm = LLM(
-    model="Qwen/Qwen2.5-VL-3B-Instruct",
-    limit_mm_per_prompt={"image": 2}
-)
+llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
+          disable_mm_preprocessor_cache=True)
 ```
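As a quick recap of the two knobs documented above, here is a hedged sketch that combines them; the 2 GiB value is arbitrary, and setting the environment variable from Python assumes it is read when the engine starts rather than at import time:

```python
import os

# Illustrative only: shrink the multi-modal processor cache to 2 GiB per process.
os.environ["VLLM_MM_INPUT_CACHE_GIB"] = "2"

from vllm import LLM

# Or opt out of the cache entirely, e.g. when prompts rarely repeat the same images.
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct",
          disable_mm_preprocessor_cache=True)
```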
