
vLLM Worker v0.15.0 — Upgrade from v0.11.x to v0.15.0 #259

Merged
TimPietruskyRunPod merged 15 commits into runpod-workers:main from MadiatorLabs:main on Feb 12, 2026
Conversation

@MadiatorLabs (Contributor)

vLLM Worker v0.15.0 — Upgrade from v0.11.x to v0.15.0

This PR upgrades the vLLM serverless worker from vLLM v0.11.x to v0.15.0, spanning four major releases (v0.12.0, v0.13.0, v0.14.0, v0.15.0) and roughly 1,900 upstream commits.


Worker-Specific Fixes

  • LoRA Serverless Fix: Deferred OpenAI serving engine initialization to the first request when LoRA adapters are configured. This prevents event loop mismatch in Runpod Serverless environments that caused EngineDeadError on the first request. Without LoRA, startup behavior is unchanged.
  • Updated all import paths for vLLM 0.15.0's reorganized vllm.entrypoints.openai.* module structure (chat_completion, completion, engine, models submodules).
  • Cleaned up hub.json: Removed 28 deprecated/removed env vars, updated CUDA version requirements (12.8 minimum), replaced DISABLE_LOG_REQUESTS with ENABLE_LOG_REQUESTS.
  • engine_args.py: Added compatibility shims for deprecated env vars (VLLM_ATTENTION_BACKEND → ATTENTION_BACKEND, DISABLE_LOG_REQUESTS → ENABLE_LOG_REQUESTS), removed use_v2_block_manager (V2 is now the only option), and added new args (attention_backend, async_scheduling, stream_interval, cpu_offload_gb).
  • Fixed FlashInfer version: Pinned correct FlashInfer version that was broken/incompatible in the older worker build, resolving attention backend failures.
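The deferred-initialization fix can be sketched roughly as below. `ServingEngine`, `Worker`, and `handler` are illustrative stand-ins for the worker's actual classes, not the PR's exact code; the point is that the engine binds to whichever event loop is running at construction time, so building it lazily inside the first request keeps it on the serverless request loop.

```python
import asyncio

class ServingEngine:
    """Stand-in for vLLM's OpenAI serving engine (hypothetical class)."""

    def __init__(self) -> None:
        # Bind to the loop running *now*; constructing this on a different
        # loop than the one serving requests is the mismatch that caused
        # EngineDeadError on the first serverless request.
        self.loop = asyncio.get_running_loop()

    async def handle(self, prompt: str) -> str:
        if asyncio.get_running_loop() is not self.loop:
            raise RuntimeError("EngineDeadError: event loop mismatch")
        return f"echo: {prompt}"

class Worker:
    """Defers engine construction to the first request (LoRA path)."""

    def __init__(self, lora_enabled: bool) -> None:
        self.lora_enabled = lora_enabled
        self._engine: "ServingEngine | None" = None

    async def handler(self, prompt: str) -> str:
        if self._engine is None:
            # First request: the request loop is already running, so the
            # engine binds to the correct loop.
            self._engine = ServingEngine()
        return await self._engine.handle(prompt)

print(asyncio.run(Worker(lora_enabled=True).handler("ping")))  # echo: ping
```

Without LoRA the worker can keep constructing the engine eagerly at startup, which is why the PR notes startup behavior is unchanged in that case.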

Major vLLM Changes (v0.12.0 → v0.15.0)

Performance

  • Async scheduling enabled by default — overlaps engine scheduling with GPU execution for improved throughput (v0.14.0)
  • Batch invariant BMM: 18.1% throughput improvement on DeepSeek-V3.1 (v0.12.0)
  • Shared experts overlap: 2.2% throughput improvement with FlashInfer DeepGEMM (v0.12.0)
  • MoE optimizations: grouped topk kernel fusion, NVFP4 improvements, faster cold start with torch.compile (v0.15.0)
  • FP4 kernel optimization: up to 65% faster FP4 quantization on Blackwell (v0.15.0)
  • Mamba prefix caching: ~2x speedup by caching Mamba states directly (v0.15.0)
  • Whisper: ~3x speedup vs v0.11.x with encoder batching and CUDA graph support (v0.13.0)

Model Support

  • New architectures: Kimi-K2.5, Molmo2, Grok-2, PLaMo-3, Step3vl, GLM-Lite, BAGEL, AudioFlamingo3, and many more
  • LoRA expansion: Nemotron-H, InternVL2, MiniMax M2, multimodal tower/connector LoRA, inplace loading for memory efficiency
  • Speculative decoding: EAGLE3, multi-step CUDA graph, DP>1 support, multimodal support
  • Embeddings: BGE-M3 sparse/ColBERT embeddings

Engine

  • Model Runner V2 (experimental): Refactored execution pipeline with GPU-persistent block tables, Triton-native sampler
  • Async scheduling + Pipeline Parallelism now work together (v0.15.0)
  • Session-based streaming input for interactive workloads like ASR (v0.15.0)
  • --max-model-len auto: Automatically fits context length to available GPU memory (v0.14.0)
  • Prefix caching: xxHash high-performance hash option (v0.13.0)
  • Conditional compilation via compile_ranges (v0.13.0)
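As a usage sketch of the automatic context-length fit, an invocation might look like the following; the `auto` value follows the release note above, and the model name is only a placeholder, so verify the flag against `vllm serve --help` for your installed version.

```shell
# Let vLLM fit the context window to available GPU memory (v0.14.0+).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len auto \
  --gpu-memory-utilization 0.90
```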

Hardware

  • NVIDIA Blackwell: FlashInfer MLA default, SM103 (GB300) support, TRTLLM prefill default
  • TPU: Pipeline parallelism support
  • CPU: NUMA-aware acceleration, PyTorch 2.10 support, ARM optimizations

Quantization

  • New: W4A8 Marlin, MXFP4 W4A16, NVFP4 MoE CUTLASS, FP8 KV cache per-tensor/per-head
  • Removed: DeepSpeedFp8, RTN; HQQ deprecated

API & Frontend

  • Responses API with partial message generation and MCP support
  • gRPC server entrypoint (v0.14.0)
  • Reasoning parsers: DeepSeek R1, Qwen3, Granite, Hunyuan A13B
  • Tool calling: New parsers for FunctionGemma, GLM-4.7, DeepSeek-V3.2
  • Security: FIPS 140-3 compliant hashing, --ssl-ciphers, token leak prevention in crash logs

Breaking Changes

  • PyTorch 2.9.1+ required (CUDA 12.8+)
  • V2 block manager is now the only option (use_v2_block_manager removed)
  • Removed deprecated env vars: num_lookahead_slots, best_of, lora_extra_vocab, quantization_param_path, guided_decoding_backend, worker_use_ray, rope_scaling, rope_theta, tokenizer pool args, preemption args, and all individual speculative decoding args (now speculative_config JSON)
  • disable_log_requests deprecated in favor of enable_log_requests
  • Attention config: VLLM_ATTENTION_BACKEND env var replaced with --attention-backend CLI arg
  • Deprecated quantization schemes removed (DeepSpeedFp8, RTN)
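Since the individual speculative-decoding flags were consolidated into a single speculative_config JSON value, a migration sketch looks like this. The keys (`method`, `model`, `num_speculative_tokens`) and the draft-model name are illustrative assumptions; check them against the vLLM 0.15 documentation before use.

```python
import json

# Old style (removed): separate flags such as --num-speculative-tokens.
# New style: one JSON blob passed as speculative_config.
speculative_config = {
    "method": "eagle",                    # example method; verify supported values
    "model": "example-org/draft-model",   # hypothetical draft model
    "num_speculative_tokens": 5,
}

# Serialized form suitable for a CLI arg or env var:
arg = json.dumps(speculative_config)
print(arg)
```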

```diff
   "gpuCount": 1,
-  "allowedCudaVersions": ["12.9", "12.8", "12.7", "12.6", "12.5", "12.4"],
+  "allowedCudaVersions": ["12.9", "12.8"],
   "presets": [
```
Contributor

Is it possible for hub listings to use minimumCudaVersion? This is more sustainable than updating this list. For example, CUDA 13 is already an available target for allowedCudaVersions.
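If hub listings do support such a field, the entry could shrink to something like this hypothetical fragment (the field name `minimumCudaVersion` is taken from the comment above and is not verified against the hub schema):

```json
{
  "gpuCount": 1,
  "minimumCudaVersion": "12.8",
  "presets": []
}
```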

Contributor

This does exist, but since we need to get this version out, I will test it in a different repo first; if it actually works, I will also do this for vLLM.

Contributor

A lot of settings were removed from hub.json but not from this file; are the changes to this file accurate?

Contributor (Author)

The documentation files have not been cleaned up yet.

MadiatorLabs and others added 4 commits February 5, 2026 14:52
Co-authored-by: Dj Isaac <contact@dejaydev.com>
@velaraptor-runpod self-requested a review on February 12, 2026 16:34
@DeJayDev (Contributor) left a comment

Better, thanks!

@TimPietruskyRunPod merged commit c45ac42 into runpod-workers:main on Feb 12, 2026