
vLLM Worker v0.15.0 — Upgrade from v0.11.x to v0.15.0 #259

Merged
TimPietruskyRunPod merged 15 commits into runpod-workers:main from MadiatorLabs:main on Feb 12, 2026
Conversation

@MadiatorLabs (Contributor)

vLLM Worker v0.15.0 — Upgrade from v0.11.x to v0.15.0

This PR upgrades the vLLM serverless worker from vLLM v0.11.x to v0.15.0, spanning four major releases (v0.12.0, v0.13.0, v0.14.0, v0.15.0) and roughly 1,900 upstream commits.


Worker-Specific Fixes

  • LoRA Serverless Fix: Deferred OpenAI serving engine initialization to the first request when LoRA adapters are configured. This prevents event loop mismatch in Runpod Serverless environments that caused EngineDeadError on the first request. Without LoRA, startup behavior is unchanged.
  • Updated all import paths for vLLM 0.15.0's reorganized vllm.entrypoints.openai.* module structure (chat_completion, completion, engine, models submodules).
  • Cleaned up hub.json: Removed 28 deprecated/removed env vars, updated CUDA version requirements (12.8 minimum), replaced DISABLE_LOG_REQUESTS with ENABLE_LOG_REQUESTS.
  • engine_args.py: Added compatibility shims for deprecated env vars (VLLM_ATTENTION_BACKEND → ATTENTION_BACKEND, DISABLE_LOG_REQUESTS → ENABLE_LOG_REQUESTS), removed use_v2_block_manager (V2 is now the only option), and added new args (attention_backend, async_scheduling, stream_interval, cpu_offload_gb).
  • Fixed FlashInfer version: Pinned correct FlashInfer version that was broken/incompatible in the older worker build, resolving attention backend failures.
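The deferred-initialization fix can be sketched roughly as below. `ServingEngine`, `Worker`, and `handler` are illustrative stand-ins for the worker's actual classes, not the PR's exact code; the point is that the engine binds to whichever event loop is running at construction time, so building it lazily inside the first request keeps it on the serverless request loop.

```python
import asyncio

class ServingEngine:
    """Stand-in for vLLM's OpenAI serving engine (hypothetical class)."""

    def __init__(self) -> None:
        # Bind to the loop running *now*; constructing this on a different
        # loop than the one serving requests is the mismatch that caused
        # EngineDeadError on the first serverless request.
        self.loop = asyncio.get_running_loop()

    async def handle(self, prompt: str) -> str:
        if asyncio.get_running_loop() is not self.loop:
            raise RuntimeError("EngineDeadError: event loop mismatch")
        return f"echo: {prompt}"

class Worker:
    """Defers engine construction to the first request (LoRA path)."""

    def __init__(self, lora_enabled: bool) -> None:
        self.lora_enabled = lora_enabled
        self._engine: "ServingEngine | None" = None

    async def handler(self, prompt: str) -> str:
        if self._engine is None:
            # First request: the request loop is already running, so the
            # engine binds to the correct loop.
            self._engine = ServingEngine()
        return await self._engine.handle(prompt)

print(asyncio.run(Worker(lora_enabled=True).handler("ping")))  # echo: ping
```

Without LoRA the worker can keep constructing the engine eagerly at startup, which is why the PR notes startup behavior is unchanged in that case.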

Major vLLM Changes (v0.12.0 → v0.15.0)

Performance

  • Async scheduling enabled by default — overlaps engine scheduling with GPU execution for improved throughput (v0.14.0)
  • Batch invariant BMM: 18.1% throughput improvement on DeepSeek-V3.1 (v0.12.0)
  • Shared experts overlap: 2.2% throughput improvement with FlashInfer DeepGEMM (v0.12.0)
  • MoE optimizations: grouped topk kernel fusion, NVFP4 improvements, faster cold start with torch.compile (v0.15.0)
  • FP4 kernel optimization: up to 65% faster FP4 quantization on Blackwell (v0.15.0)
  • Mamba prefix caching: ~2x speedup by caching Mamba states directly (v0.15.0)
  • Whisper: ~3x speedup vs v0.11.x with encoder batching and CUDA graph support (v0.13.0)

Model Support

  • New architectures: Kimi-K2.5, Molmo2, Grok-2, PLaMo-3, Step3vl, GLM-Lite, BAGEL, AudioFlamingo3, and many more
  • LoRA expansion: Nemotron-H, InternVL2, MiniMax M2, multimodal tower/connector LoRA, inplace loading for memory efficiency
  • Speculative decoding: EAGLE3, multi-step CUDA graph, DP>1 support, multimodal support
  • Embeddings: BGE-M3 sparse/ColBERT embeddings

Engine

  • Model Runner V2 (experimental): Refactored execution pipeline with GPU-persistent block tables, Triton-native sampler
  • Async scheduling + Pipeline Parallelism now work together (v0.15.0)
  • Session-based streaming input for interactive workloads like ASR (v0.15.0)
  • --max-model-len auto: Automatically fits context length to available GPU memory (v0.14.0)
  • Prefix caching: xxHash high-performance hash option (v0.13.0)
  • Conditional compilation via compile_ranges (v0.13.0)
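As a usage sketch of the automatic context-length fit, an invocation might look like the following; the `auto` value follows the release note above, and the model name is only a placeholder, so verify the flag against `vllm serve --help` for your installed version.

```shell
# Let vLLM fit the context window to available GPU memory (v0.14.0+).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len auto \
  --gpu-memory-utilization 0.90
```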

Hardware

  • NVIDIA Blackwell: FlashInfer MLA default, SM103 (GB300) support, TRTLLM prefill default
  • TPU: Pipeline parallelism support
  • CPU: NUMA-aware acceleration, PyTorch 2.10 support, ARM optimizations

Quantization

  • New: W4A8 Marlin, MXFP4 W4A16, NVFP4 MoE CUTLASS, FP8 KV cache per-tensor/per-head
  • Removed: DeepSpeedFp8, RTN; HQQ deprecated

API & Frontend

  • Responses API with partial message generation and MCP support
  • gRPC server entrypoint (v0.14.0)
  • Reasoning parsers: DeepSeek R1, Qwen3, Granite, Hunyuan A13B
  • Tool calling: New parsers for FunctionGemma, GLM-4.7, DeepSeek-V3.2
  • Security: FIPS 140-3 compliant hashing, --ssl-ciphers, token leak prevention in crash logs

Breaking Changes

  • PyTorch 2.9.1+ required (CUDA 12.8+)
  • V2 block manager is now the only option (use_v2_block_manager removed)
  • Removed deprecated env vars: num_lookahead_slots, best_of, lora_extra_vocab, quantization_param_path, guided_decoding_backend, worker_use_ray, rope_scaling, rope_theta, tokenizer pool args, preemption args, and all individual speculative decoding args (now speculative_config JSON)
  • disable_log_requests deprecated in favor of enable_log_requests
  • Attention config: VLLM_ATTENTION_BACKEND env var replaced with --attention-backend CLI arg
  • Deprecated quantization schemes removed (DeepSpeedFp8, RTN)
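Since the individual speculative-decoding flags were consolidated into a single speculative_config JSON value, a migration sketch looks like this. The keys (`method`, `model`, `num_speculative_tokens`) and the draft-model name are illustrative assumptions; check them against the vLLM 0.15 documentation before use.

```python
import json

# Old style (removed): separate flags such as --num-speculative-tokens.
# New style: one JSON blob passed as speculative_config.
speculative_config = {
    "method": "eagle",                    # example method; verify supported values
    "model": "example-org/draft-model",   # hypothetical draft model
    "num_speculative_tokens": 5,
}

# Serialized form suitable for a CLI arg or env var:
arg = json.dumps(speculative_config)
print(arg)
```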

```diff
   "gpuCount": 1,
-  "allowedCudaVersions": ["12.9", "12.8", "12.7", "12.6", "12.5", "12.4"],
+  "allowedCudaVersions": ["12.9", "12.8"],
   "presets": [
```
Contributor

Is it possible for hub listings to use minimumCudaVersion? This is more sustainable than updating this list. For example, CUDA 13 is already an available target for allowedCudaVersions.
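If hub listings do support such a field, the entry could shrink to something like this hypothetical fragment (the field name `minimumCudaVersion` is taken from the comment above and is not verified against the hub schema):

```json
{
  "gpuCount": 1,
  "minimumCudaVersion": "12.8",
  "presets": []
}
```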

Contributor

This does exist, but since we need to get this version out, I will test it in a different repo first; if it actually works, I will also do this for vLLM.

Contributor

A lot of settings were removed from hub.json but not from this file; are the changes to this file accurate?

Contributor (Author)

The documentation files have not been cleaned up yet.

MadiatorLabs and others added 4 commits February 5, 2026 14:52
Co-authored-by: Dj Isaac <contact@dejaydev.com>
@velaraptor-runpod self-requested a review on February 12, 2026 16:34
@DeJayDev (Contributor) left a comment

Better, thanks!

@TimPietruskyRunPod merged commit c45ac42 into runpod-workers:main on Feb 12, 2026