
Releases: vllm-project/vllm

v0.16.0

25 Feb 19:58


vLLM v0.16.0

Please note that this release was branch-cut on Feb 8, so features added to vLLM after that date are not included.

Highlights

This release features 440 commits from 203 contributors (7 new)!

  • Async scheduling + Pipeline Parallelism is now fully supported, delivering 30.8% E2E throughput improvement and 31.8% TPOT improvement (#32618).
  • Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the Voxtral realtime infrastructure.
  • RLHF workflow improvements: Native NCCL-based weight syncing API (#31943), layerwise weight reloading for QeRL (#32133), and engine pause/resume with request preservation (#32351).
  • Unified Parallel Drafting for speculative decoding (#32887), plus spec decode now works with structured outputs (#33374) and penalty application in Model Runner V2 (#33251).
  • Major XPU platform overhaul: Deprecated IPEX in favor of vllm-xpu-kernels (#33379), adding MoE (#33659), MXFP4 MoE (#33679), WNA16 (#33973), scaled_mm (#34117), and FP8 MoE (#34202) support.

Model Support

  • New architectures: GLM-OCR with MTP (#33005), Qwen3-ASR (#33312), DeepSeek-OCR-2 (#33165), Intern-S1-Pro (#33636), MiniCPM-o 4.5 (#33431), openPangu7B-VL (#32449), NemotronHPuzzle heterogeneous (#32549), MusicFlamingo (#32696), FunAudioChat (#2), ColBERT late interaction (#33686), voyage-4-nano (#33720), GLM-5 (#34124).
  • Speculative decoding: EAGLE3 for Hunyuan/HunyuanVL (#33035), AFMoE (#33111), Mistral3 (#33939).
  • LoRA expansion: Gemma3 vision components (#32764), Nemotron-H MTP models (#32265), Qwen3 output embedding (#29816). Optimized fused MoE-LoRA kernel indexing (#32770, #32774), unpermute-aware fused MoE LoRA path (#32655), reduced kernel overhead for fewer active LoRAs with multiple CUDA graphs (#32005).
  • Features: Qwen3-Omni transcription (#29828), Mistral Large 3 with FlashInfer MoE (#33174), LFM2 SigLIP2 intermediate encoder layers (#33370), Qwen3-Omni/GLM-4.xV MRoPE positioning fixes (#33010, #33039), embedding input for disabled modalities (#32493).
  • Performance: GLM-4.7-GPTQ decode and MTP acceptance rate regression fix (#33771), DeepSeek V3.2 fast detokenization (#33855), DeepSeek V3.2 tokenizer fix (#33832), GLM-5 MTP accuracy fix (#34385).

Engine Core

  • Async scheduling + Pipeline Parallelism: Full support with 30.8% throughput improvement (#32618), optimized spec decode + async scheduling with 1.5% throughput improvement (#33612), deadlock fix for torchrun PP broadcast (#33701).
  • Speculative decoding: Unified Parallel Drafting (#32887), structured output support (#33374), penalty application in MRV2 (#33251), skip softmax for all-greedy rejection sampling (#32852), correctness fix for spec tokens with prefill chunks (#33652).
  • RLHF: Native NCCL weight syncing API (#31943), layerwise reloading for QeRL (#32133), engine pause/resume with request preservation (#32351).
  • Helion kernel framework: ConfigManager (#32740), kernel wrapper (#32964), kernel registry (#33203).
  • PluggableLayer: Applied to linear layers (#33152) and Mamba layers (#33660).
  • Batch invariance: Disable Cascade Attention (#32561), enable Triton attention (#33688).
  • Performance: Grammar bitmask H2D copy on separate stream (#33059), zero-copy GQA for multimodal and CPU (#33732), early-reject oversized MM requests (#33502), CPU memory leak fix from Request reference cycle in prefix caching (#34183).

Hardware & Performance

  • NVIDIA: FlashInfer TRTLLM BF16 MoE integration (#32954), SM100 INT4 W4A16 kernel (#32437), SM121 (DGX Spark) CUTLASS support (#33517), MNNVL protocol for GB series (#33540), FlashInfer MLA concat optimization (#31171), GDN attention layout optimization (#33291), DeepGEMM FP8 MLA performance (#33568), wvSplitK_fp8 performance (#33527, #33493), B200 MoE configs for Nemotron Nano (#32804), Super B200 TP2 (#33510), GLM 4.6 (#32958), Mamba selective scan tuning for B200 (#32873). Fix: DeepSeek R1 CUTLASS MLA on B200 (#33637), QK Norm+RoPE fusion on B200+FP8 (#33967), CUTLASS FP8 blockwise on SM103a (#32224).
  • AMD ROCm: QWEN3-NEXT FP8 tunings (#32042), AITER attention backend for Qwen3-Next (#32492), fused_add_rmsnorm_pad for GPT-OSS (#30976), Qwen3-Omni startup fix (#33077).
  • Intel XPU: Platform overhaul - deprecated IPEX, switched to vllm-xpu-kernels (#33379). New: unquantized MoE (#33659), MXFP4 MoE (#33679), WNA16 kernel (#33973), scaled_mm kernel (#34117), FP8 MoE (#34202).
  • ARM CPU: KleidiAI INT4 dynamic quant with BF16 activations (#33122), NEON BFMMLA BF16 paged attention (#32263), vectorization backend optimization (#30329), attention dispatch by head_dim alignment (#32161).
  • IBM Z: BF16 kernel type for s390x (#33788).
  • torch.compile: Stop compiling identical artifacts (#34003), MoE cold start optimization option (#33735), fix 32-bit indexing assumption (#33113), attention fusion pass fix (#33945).
  • Performance: Chat completion streaming optimization (#33782), ORJSONResponse for faster API responses (#33548), MoE permute optimization for CUTLASS FP8 (#32892), shared/routed overlap for latent MoE on Nemotron-H (#32790), FlashInfer autotune control flag (#34006).

Large Scale Serving

  • Disaggregated serving: Mooncake connector rework with bootstrap server (#31034), cross-layer KV cache layout in NIXL Connector V2 (#33339), delay freeing blocks for aborted async loads (#32255), async double-free fix (#33377), Ray multi-replica single-instance fix (#33604).
  • EPLB: Capture logical experts with router replay (#33013), DP metadata fix for dense models (#32739).
  • Metrics: KV offloading connector metrics (#27942), labeled prompt token metrics for P/D disaggregation (#33290).

Quantization

  • New: FP8 block quant for CompressedTensorsW8A16Fp8 (#33280), ModelOpt MXFP8 for dense models (#33786), NVFP4/FP8 on Turing GPUs (#33076), TP > 4 for FP4 Gemm (#31099).
  • Bugfixes: FP8 online quantization memory fix (#31914), asymmetric W4A16 (ConchLinear) for CT (#33200), DeepSeek V3.2 NVFP4 (#33932), LoRA FP8 (#33879), quantized Falcon-H1 model loading (#32728), quantized Mamba TP with n_groups=1 (#33257), CPU W8A8 with bias (#33582), CPU W8A8 3D input support (#33727).
  • Deprecation: Removed BitBlas (#32683) and Marlin 24 (#32688).

API & Frontend

  • Realtime API: WebSocket-based streaming API (#33187) with Voxtral realtime support.
  • Responses API: Sampling parameters (#32609), return token IDs (#33212), return prompt token IDs (#33378), parser implementation (#32712).
  • Pooling API: Request schema consensus for ScoreRequest (#33060) and final standardization (#31127).
  • Tool calling: Fix multi-turn tool call ID preservation (#32768), fix indexing double-counting (#33141), GLM-4 incremental string streaming (#33218), DSV3.2 fast detokenization fix (#33964), MCP tools non-streaming fix (#32762).
  • Structured outputs: Performance optimization with reasoning (#33557), guidance vocab size fix (#33509).
  • CLI: --disable-access-log-for-endpoints option (#30011).
  • UX: Nested configs in YAML files (#33193), GGUF repo_id:quant_type syntax (#33371), DeepSeek ReasoningParser with thinking enabled by default (#33221), remove noisy CT warning (#33273), early tokenization validation (#31366), reasoning_content backward compatibility (#33635), only include Authorization header when OPENAI_API_KEY is set (#33488).
  • Features: run_batch transcription/translation support (#33934), /server_info collect_env (#33246), OTEL tracing during model loading (#31162), clear MM and encoder cache (#33452), HF Hub LoRA resolver (#20320).
  • Scoring: Fix multi-document scoring returning single result (#33837).
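Nested YAML configs (#33193) let engine sub-options be grouped rather than passed as flat CLI flags or inline JSON strings. A hypothetical sketch (the exact keys depend on your vLLM version; the model name and speculative settings are illustrative):

```yaml
# config.yaml (launch with: vllm serve --config config.yaml)
model: meta-llama/Llama-3.1-8B-Instruct
tensor-parallel-size: 2
# nested sub-config instead of a JSON string on the command line
speculative-config:
  method: eagle
  num-speculative-tokens: 3
```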


Dependencies

  • huggingface-hub updates for Transformers v5 preparation (#33473).
  • Transformers v5 compatibility fixes across multiple models (#33977, #33683).

Deprecation & Breaking Changes

  • Removed BitBlas quantization (#32683) and Marlin 24 (#32688).
  • Removed deprecated reasoning_content message field (#33402).
  • Removed deprecated pooling items (#33477).
  • Removed deprecated VLLM_ALL2ALL_BACKEND environment variable (#33535).
  • Deprecated IPEX for XPU, switched to vllm-xpu-kernels (#33379).

New Contributors 🎉

v0.15.1

04 Feb 20:48


v0.15.1 is a patch release with security fixes, RTX Blackwell GPU support fixes, and other bug fixes.


Highlights

Hardware Support Bugfixes

  • RTX Blackwell (SM120): Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417)
  • FP8 kernel selection: Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs (#33285)

Model Support

  • Step-3.5-Flash: New model support (#33523)

Model Support Bugfixes

  • Qwen3-VL-Reranker: Fixed model loading (#33298)
  • Whisper: Fixed FlashAttention2 with full CUDA graphs (#33360)

Performance

  • torch.compile cold-start: Fixed regression that increased cold-start compilation time (Llama3-70B: ~88s → ~22s) (#33441)
  • MoE forward pass: Optimized by caching layer name computation (#33184)

Bug Fixes

  • Fixed prefix cache hit rate of 0% with GPT-OSS style hybrid attention models (#33524)
  • Enabled Triton MoE backend for FP8 per-tensor dynamic quantization (#33300)
  • Disabled unsupported Renormalize routing methods for TRTLLM per-tensor FP8 MoE (#33620)
  • Fixed speculative decoding metrics crash when no tokens generated (#33729)
  • Disabled fast MoE cold start optimization with speculative decoding (#33624)
  • Fixed ROCm skinny GEMM dispatch logic (#33366)

Dependencies

  • Pinned LMCache >= v0.3.9 for API compatibility (#33440)

New Contributors 🎉

Full Changelog: v0.15.0...v0.15.1

v0.15.0

29 Jan 10:21


Highlights

This release features 335 commits from 158 contributors (39 new)!

Model Support

  • New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
  • LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
  • Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322).
  • Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
  • Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).

Engine Core

  • Async scheduling + Pipeline Parallelism: --async-scheduling now works with pipeline parallelism (#32359).
  • Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with --enable-prefix-caching --mamba-cache-mode align. Achieves ~2x speedup by caching Mamba states directly (#30877).
  • Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing StreamingInput objects while maintaining KV cache alignment (#28973).
  • Model Runner V2: VLM support (#32546), architecture improvements.
  • LoRA: Inplace loading for memory efficiency (#31326).
  • AOT compilation: torch.compile inductor artifacts support (#25205).
  • Performance: KV cache offloading redundant load prevention (#29087), FlashAttn attention/cache update separation (#25954).
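The Mamba prefix-caching mode is opt-in; a minimal launch sketch using the flags from #30877 (the model name is illustrative):

```shell
# Block-aligned prefix caching for a Mamba/hybrid model (~2x speedup on cached prefixes)
vllm serve ibm-granite/granite-4.0-h-small \
    --enable-prefix-caching \
    --mamba-cache-mode align
```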

Hardware & Performance

NVIDIA

  • Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as default prefill (#32615).
  • MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
  • FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
  • Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
  • torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).

AMD ROCm

  • MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
  • Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
  • FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
  • Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).

Other Platforms

  • TPU: Pipeline parallelism support (#28506), backend option (#32438).
  • Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
  • CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
  • Whisper: torch.compile support (#30385).
  • WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).

Quantization

  • MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
  • Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
  • Intel: Quantization Toolkit integration (#31716).
  • FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141).

API & Frontend

  • Responses API: Partial message generation (#32100), include_stop_str_in_output tuning (#32383), prompt_cache_key support (#32824).
  • OpenAI API: skip_special_tokens configuration (#32345).
  • Score endpoint: Flexible input formats with data_1/data_2 and queries/documents (#32577).
  • Render endpoints: New endpoints for prompt preprocessing (#32473).
  • Whisper API: avg_logprob and compression_ratio in verbose_json segments (#31059).
  • Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), --ssl-ciphers CLI argument (#30937).
  • UX improvements: Auto api_server_count based on dp_size (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
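The flexible score-endpoint inputs from #32577 can be exercised with plain JSON; the sketch below only illustrates the documented field aliases (the reranker model name is an assumption, and the payload would be POSTed to `/score` on a running vLLM instance):

```python
import json

# Two spellings of the same scoring request: the newer queries/documents
# aliases and the original data_1/data_2 field names.
payload_aliases = {
    "model": "BAAI/bge-reranker-v2-m3",  # illustrative reranker model
    "queries": "What is vLLM?",
    "documents": [
        "vLLM is a fast LLM inference and serving engine.",
        "Bananas are a fruit.",
    ],
}
payload_legacy = {
    "model": payload_aliases["model"],
    "data_1": payload_aliases["queries"],
    "data_2": payload_aliases["documents"],
}

body = json.dumps(payload_aliases)  # request body for POST /score
print(sorted(payload_aliases))
```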

Dependencies

  • FlashInfer v0.6.1 (#30993)
  • Transformers 4.57.5 (#32287)
  • PyTorch 2.10 for CPU backend (#32869)
  • DeepGEMM newer version (#32479)

Breaking Changes & Deprecations

  • Metrics: Removed deprecated vllm:time_per_output_token_seconds metric - use vllm:inter_token_latency_seconds instead (#32661).
  • Environment variables: Removed deprecated environment variables (#32812).
  • Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
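Dashboards that still query the removed metric can be migrated with a plain string rewrite; this is an illustrative helper, not part of vLLM (the PromQL query is an example):

```python
# Map the removed metric name (#32661) to its replacement.
OLD = "vllm:time_per_output_token_seconds"
NEW = "vllm:inter_token_latency_seconds"

def migrate_promql(query: str) -> str:
    """Rewrite a PromQL query to use the successor metric name."""
    return query.replace(OLD, NEW)

q = "histogram_quantile(0.9, rate(vllm:time_per_output_token_seconds_bucket[5m]))"
print(migrate_promql(q))
```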

Bug Fixes

  • Speculative decoding: Eagle draft_model_config fix (#31753).
  • DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
  • Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
  • EPLB: Possible deadlock fix (#32418).
  • NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
  • Structured output: Outlines byte fallback handling fix (#31391).

New Contributors 🎉

Full Changelog: v0.14.1...v0.15.0

v0.14.1

24 Jan 20:29


This is a patch release on top of v0.14.0 to address a few security and memory leak fixes.

v0.14.0

20 Jan 09:20


Highlights

This release features approximately 660 commits from 251 contributors (86 new)!

Breaking Changes:

  • Async scheduling is now enabled by default - Users who experience issues can disable with --no-async-scheduling.
    • Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding.
  • PyTorch 2.9.1 is now required and the default wheel is compiled against cu129.
  • Deprecated quantization schemes have been removed (#31688, #31285).
  • When using speculative decoding, unsupported sampling parameters will fail rather than being silently ignored (#31982).
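If async scheduling causes problems on your setup, the opt-out is a single flag (the model name is illustrative):

```shell
# Async scheduling is on by default in v0.14.0; disable it explicitly:
vllm serve meta-llama/Llama-3.1-8B-Instruct --no-async-scheduling
```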

Key Improvements:

  • Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
  • gRPC server entrypoint (#30190): Alternative to REST API with binary protocol, HTTP/2 multiplexing.
  • --max-model-len auto (#29431): Automatically fits context length to available GPU memory, eliminating OOM startup failures.
  • Model inspection view (#29450): View the modules, attention backends, and quantization of your model in vLLM by specifying VLLM_LOG_MODEL_INSPECTION=1 or by simply printing the LLM object.
  • Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), logit_bias/allowed_token_ids/min_tokens support (#32163).
    • Please note that Model Runner V2 is still experimental and disabled by default.
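The model inspection view can be enabled at serve time via the environment variable named above (model name illustrative); offline, simply printing the `LLM` object produces the same report:

```shell
# Log modules, attention backends, and quantization at startup
VLLM_LOG_MODEL_INSPECTION=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```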

Model Support

New Model Architectures:

LoRA Support Expansion:

Model Enhancements:

  • Qwen3-VL as reranker (#31890)
  • DeepSeek v3.2 chat prefix completion (#31147)
  • GLM-4.5/GLM-4.7 enable_thinking: false (#31788)
  • Ernie4.5-VL video timestamps (#31274)
  • Score template expansion (#31335)
  • LLaMa4 vision encoder compilation (#30709)
  • NemotronH quantized attention (#31898)

Engine Core

  • Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
  • Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
  • Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
  • Mamba2 prefix cache optimization (#28047)
  • Batch invariant LoRA (#30097)
  • LoRA name in BlockStored for KV-cache reconstruction (#27577)
  • Request ID collision prevention (#27987)
  • Dense model DP without overhead (#30739)
  • Async + spec decode penalties/bad_words (#30495)

Hardware & Performance

CUTLASS MoE Optimizations:

  • 2.9% throughput + 10.8% TTFT via fill(0) optimization (#31754)
  • 5.3% throughput + 2.2% TTFT via problem size calculation (#31830)
  • Fused SiLU+Mul+Quant for NVFP4 (#31832)
  • NVFP4 stride fusion (#31837)

Other Performance:

  • GDN attention decode speedup (Qwen3-Next) (#31722)
  • Fused RoPE + MLA KV-cache write (#25774)
  • Sliding window attention optimization (#31984)
  • FlashInfer DeepGEMM swapAB SM90 (#29213)
  • Unpermute-aware fused MoE + small-batch fallback (#29354)
  • GDN Attention blocking copy removal (#31167)
  • FusedMoE LoRA small rank performance (#32019)
  • EPLB numpy optimization (#29499)
  • FlashInfer rotary for DeepSeek (#30729)
  • Vectorized activations (#29512)
  • NUMA interleaved memory (#30800)
  • Async spec decode logprobs (#31336)

Hardware Configs:

  • SM103 support (#30705, #31150)
  • B300 Blackwell MoE configs (#30629)
  • Qwen3-Next FP8 CUTLASS configs (#29553)
  • Qwen3Moe B200 Triton configs (#31448)
  • GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
  • MiniMax-M2/M2.1 QKNorm (#31493)
  • NVFP4 small batch tuning (#30897)

Platform:

  • ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
  • XPU: FP8 streaming quant (#30944), custom workers (#30935)
  • CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
  • TPU: tpu-inference path (#30808), Sophgo docs (#30949)

Large Scale Serving

  • XBO (Extended Dual-Batch Overlap) (#30120)
  • NIXL asymmetric TP (P > D tensor-parallel-size) (#27274)
  • NIXL heterogeneous BlockSize/kv_layout (#30275)
  • Cross-layers KV layout for MultiConnector (#30761)
  • Mooncake protocol expansion (#30133)
  • LMCache KV cache registration (#31397)
  • EPLB default all2all backend (#30559)

Quantization

  • Marlin for Turing (sm75) (#29901, #31000)
  • Quark int4-fp8 w4a8 MoE (#30071)
  • MXFP4 W4A16 dense models (#31926)
  • ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
  • ModelOpt KV cache quantization update (#31895)
  • NVFP4 Marlin for NVFP4A16 MoEs (#30881)
  • Static quant all group shapes (#30833)
  • Default MXFP4 LoRA backend: Marlin (#30598)
  • compressed-tensors 0.13.0 (#30799)

API & Frontend

New Features:

Tool Calling:

CLI:

  • -ep for --enable-expert-parallel (#30890)
  • Complete help messages (#31226)
  • Bench serve auto-discovery + --input-len (#30816)
  • Spec decode acceptance stats (#31739)
  • --enable-log-deltas (renamed) (#32020)
  • --default-chat-template-kwargs (#31343)

API:

  • /server_info env info (#31899)
  • MCP streaming in Responses API (#31761)
  • /embeddings continue_final_message (#31497)
  • Reranking score templates (#30550)
  • Chat template warmup (#30700)
  • Configurable handshake timeout (#27444)
  • Better 500 errors (#20610)
  • Worker init logging (#29493)
  • Bench error reporting (#31808)
  • Corrupted video recovery (#29197)
  • Spec-decode param validation (#31982)
  • Validation error metadata (#30134)

Security

  • Prevent token leaks in crash logs (#30751)
  • weights_only=True in torch.load (#32045)

Dependencies

  • PyTorch 2.9.1 (#28495)
  • compressed-tensors 0.13.0 (#30799)
  • CUDA 13 LMCache/NIXL in Docker (#30913)
  • Configurable NVSHMEM version (#30732)

Bug Fixes (User-Facing)

  • Invalid UTF-8 tokens (#28874)
  • CPU RoPE gibberish with --enforce-eager (#31643)
  • Tool call streaming finish chunk (#31438)
  • Encoder cache leak CPU scheduling stuck (#31857)
  • Engine crash: tools + response_format (#32127)
  • Voxtral transcription API (#31388)
  • Safetensors download optimization (#30537)

Deprecations

  • Deprecated quantization schemes removed (#31688, #31285)
  • seed_everything deprecated (#31659)

Documentation

  • vllm-metal plugin docs (#31174)
  • Claude Code example (#31188)
  • CustomOp developer guide (#30886)

New Contributors 🎉


v0.13.0

19 Dec 03:02


vLLM v0.13.0 Release Notes

Highlights

This release features 442 commits from 207 contributors (61 new contributors)!

Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.

Model Support

  • New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
  • Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
  • Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper drop_thinking logic (#30490), DeepSeek V3.2 top-k fix (#27568).
  • Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
  • Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
  • Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba selective_state_update spec decode (#29488).

Engine Core

  • Compilation: Conditional compilation via compile_ranges for selective kernel compilation (#24252).
  • Prefix caching: xxHash high-performance hash option (#29163).
  • Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), TRITON_MLA without prefix-caching (#29125).
  • Batch invariance: FA2 and LoRA batch-invariant support (#30018).
  • Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
  • Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
  • Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
  • Whisper: Major performance improvements - V1 is now faster than V0 (~3x speedup vs v0.12.0). Encoder batching (#29421), FULL_DECODE_ONLY CUDA graph (#30072), CPU backend support (#30062).
  • Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).

Hardware & Performance

  • NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
  • DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
    • DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
    • DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
    • DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
    • group_topk kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
    • Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
    • MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
  • CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
  • AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
  • Intel XPU: wNa16 compressed tensors (#29484).
  • Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).

Large Scale Serving & Disaggregated Prefill/Decode

  • KV connectors: Mooncake Transfer Engine (#24718), cache reset via /reset_prefix_cache (#27170), KV events (#28309), failure recovery config (#26813).
  • NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
  • EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
  • Multi-node: External launcher mode (#29833).
  • Hybrid allocator: Optional KV connector integration (#29805).
  • Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).

Quantization

  • New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196), FP8 weight reloading for RLHF (#28480).
  • MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
  • GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
  • Compatibility: Transformers v5 RoPE support (#30046).

API & Frontend

  • Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
  • Configuration: AttentionConfig replaces VLLM_ATTENTION_BACKEND env var (#26315).
  • Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
  • Anthropic API: Streaming fixes (#29971, #30266).
  • Embeddings: Binary format with encoding_format=bytes_only (#30249), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
  • Metrics: Prefill KV compute metric excluding cached tokens (#30189).
  • Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
  • UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).


Dependencies

  • NVSHMEM 3.3.24 + CUDA 13 fix (#30149).
  • TPU tpu-inference 0.12.0 (#30221).

Breaking Changes & Deprecations

  1. PassConfig flags renamed per RFC #27995 (#29646)
  2. Attention env vars → CLI args: VLLM_ATTENTION_BACKEND replaced with --attention-backend (#26315)
  3. Removed -O.xx flag (#29991)
  4. Removed deprecated plugin/compilation fields (#30396)
  5. Removed deprecated task, seed, MM settings (#30397)
  6. Removed embed_input_ids/embed_multimodal fallbacks (#30458)
  7. Removed tokenizer setter (#30400)
  8. Deprecations: merge_by_field_config (#30035, #30170), `--convert reward` → `--convert embed` (#30463)
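A migration sketch for change 2 (the backend value is illustrative):

```shell
# Before (env var, removed):
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
# After (#26315):
vllm serve <model> --attention-backend FLASHINFER
```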

New Contributors 🎉


v0.12.0

03 Dec 09:36


vLLM v0.12.0 Release Notes

Highlights

This release features 474 commits from 213 contributors (57 new)!

Breaking Changes: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including xformers backend, and scheduled removals - please review the changelog carefully.

Major Features:

  • EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
  • Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
  • AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).

Model Support

  • New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
  • Format support: Gemma3 GGUF multimodal support (#27772).
  • Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
  • Performance: QwenVL cos/sin cache optimization (#28798).

Engine Core

  • GPU Model Runner V2 (Experimental) (#25266): Complete refactoring of model execution pipeline:

    • No "reordering" or complex bookkeeping with persistent batch removal
    • GPU-persistent block tables for better scalability with max_model_len and num_kv_groups
    • Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
    • Simplified DP and CUDA graph implementations
    • Efficient structured outputs support
  • Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements existing Decode Context Parallel (DCP). See RFC #25749 for details.

  • RLHF Support: Pause and Resume Generation for Asynchronous RL Training (#28037).

  • KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).

  • Audio support: Audio embeddings support in chat completions (#29059).

  • Speculative Decoding:

    • Multi-step Eagle with CUDA graph (#29559)
    • EAGLE DP>1 support (#26086)
    • EAGLE3 heads without use_aux_hidden_states (#27688)
    • Eagle multimodal CUDA graphs with MRoPE (#28896)
    • Logprobs support with spec decode + async scheduling (#29223)
  • Configuration: Flexible inputs_embeds_size separate from hidden_size (#29741), --fully-sharded-loras for fused_moe (#28761).

Hardware & Performance

  • NVIDIA Performance:

    • Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
    • Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
    • DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
    • DeepEP low-latency with round-robin expert placement (#28449)
    • NVFP4 MoE CUTLASS support for SM120 (#29242)
    • H200 Fused MoE Config improvements (#28992)
  • AMD ROCm:

    • DeepSeek v3.2 and SparseMLA support (#26670)
    • FP8 MLA decode support (#28032)
    • AITER sampling ops integration (#26084)
    • AITER triton attention backend (#28701)
    • Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
    • Fastsafetensors support (#28225)
    • Sliding window support for AiterFlashAttentionBackend (#29234)
    • Whisper v1 with Aiter Unified/Flash Attention (#28376)
  • CPU:

    • Paged attention GEMM acceleration on ARM CPUs with NEON (#29193)
    • Parallelize over tokens in int4 MoE (#29600)
    • CPU all reduce optimization for async_scheduling + DP>1 (#29311)
  • Attention: FlashAttention ViT support, now default backend (#28763).

  • Long Context: Optimized gather_and_maybe_dequant_cache kernel for extremely long sequences (#28029).

  • Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).

  • Docker: Image size reduced by ~200MB (#29060).

Quantization

  • W4A8: Marlin kernel support (#24722).
  • NVFP4:
    • MoE CUTLASS support for SM120 (#29242)
    • TRTLLM MoE NVFP4 kernel (#28892)
    • CuteDSL MoE with NVFP4 DeepEP dispatch (#27141)
    • Non-gated activations support in modelopt path (#29004)
  • AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
  • LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
  • Online quantization: Moved to model.load_weights (#26327).

API & Frontend

  • Responses API:
    • Multi-turn support for non-harmony requests (#29175)
    • Reasoning item input parsing (#28248)
  • Tool Calling:
    • Parsed tool arguments support (#28820)
    • parallel_tool_calls param compliance (#26233)
    • Tool filtering support in ToolServer (#29224)
  • Whisper: verbose_json and timestamp features for transcription/translation (#24209).
  • Sampling: Flat logprob control moved from env var to SamplingParams (#28914).
  • GGUF: Improved HuggingFace loading UX with repo_id:quant_type syntax (#29137).
  • Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
  • Logs: Colorized log output (#29017).
  • Optimization Levels: -O0, -O1, -O2, and -O3 allow trading startup time for performance; more compilation flags will be added in future releases (#26847).

Dependencies

  • PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
  • xgrammar: Updated to 0.1.27 (#28221).
  • Transformers: Updated to 4.57.3 (#29418), preparation for v5 with rope_parameters (#28542).
  • XPU: torch & IPEX 2.9 upgrade (#29307).

V0 Deprecation & Breaking Changes

Removed Parameters:

Deprecated:

Scheduled Removals (will be removed in a future release):

  • ParallelConfig's direct child EPLB fields (#29324)
  • guided_* config fields (#29326)
  • override_pooler_config and disable_log_requests (#29402)
  • CompilationConfig.use_inductor (#29323)
  • Deprecated metrics (#29330)

Other Breaking Changes:

  • PyTorch 2.9.0 upgrade requires a CUDA 12.9 environment
  • Mistral format auto-detection for model loading (#28659)

New Contributors


v0.11.2

20 Nov 07:29


This release includes 4 bug fixes on top of v0.11.1:

  • [BugFix] Ray with multiple nodes (#28873)
  • [BugFix] Fix false assertion with spec-decode=[2,4,..] and TP>2 (#29036)
  • [BugFix] Fix async-scheduling + FlashAttn MLA (#28990)
  • [NVIDIA] Guard SM100 CUTLASS MoE macro to SM100 builds v2 (#28938)

v0.11.1

18 Nov 23:03

Highlights

This release includes 1456 commits from 449 contributors (184 new contributors)!

Key changes include:

  • PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to torch==2.9.0+cu129, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
  • Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
  • Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect --async-scheduling to be enabled by default in the next release.
  • Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
  • Anthropic API Support: Added support for the /v1/messages endpoint, allowing users to interact with vllm serve using Anthropic-compatible clients.
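As a sketch of what a client sends to the new `/v1/messages` endpoint, the request body below follows the field names of Anthropic's Messages API (`model`, `max_tokens`, `messages`); the model name is a placeholder, and the exact set of supported fields is not confirmed here.

```python
import json

# Minimal sketch of a request body for vLLM's Anthropic-compatible
# /v1/messages endpoint. Field names follow Anthropic's Messages API;
# the model name is a placeholder, not a tested value.
payload = {
    "model": "my-served-model",
    "max_tokens": 128,
    "messages": [
        {"role": "user", "content": "Summarize the v0.11.1 release in one line."}
    ],
}
body = json.dumps(payload)
print(body)
```

An Anthropic-compatible client would POST this body to the running `vllm serve` instance instead of an OpenAI-style `/v1/chat/completions` request.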

Detailed release notes will be updated in the next few days.

What's Changed


v0.11.0

02 Oct 19:17


Highlights

This release features 538 commits from 207 contributors (65 new contributors)!

  • This release completes the removal of the V0 engine. V0 engine code, including AsyncLLMEngine, LLMEngine, MQLLMEngine, all attention backends, and related components, has been removed. V1 is the only engine in the codebase now.
  • This release turns on FULL_AND_PIECEWISE as the CUDA graph mode default. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with existing models supporting only PIECEWISE mode.

Note: In v0.11.0 (and v0.10.2), --async-scheduling will produce gibberish output in some cases, such as during preemption. This functionality is correct in v0.10.1. We are actively fixing it for the next version.

Model Support

  • New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
  • Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
  • Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
  • Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
  • Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
  • Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
  • Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
  • Reasoning: SeedOSS reason parser (#24263).

Engine Core

  • KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
  • V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
  • Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
  • Async scheduling: Uniprocessor executor support (#24219).
  • Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
  • Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
  • Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
  • LoRA: Optimized weight loading (#25403).
  • Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
  • torch.compile: CUDA graph Inductor partition integration (#24281).

Hardware & Performance

  • NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
  • DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
  • New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
  • AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
  • Intel XPU: MoE DP accuracy fix (#25465).

Large Scale Serving & Performance

  • Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
  • Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
  • EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
  • Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
  • MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
  • Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).

Quantization

  • FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
  • FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
  • W4A8: Faster preprocessing (#23972).
  • Compressed tensors: Blocked FP8 for MoE (#25219).
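Per-token-group FP8 quantization computes one scale per fixed-size group of elements within each token's activation vector, so the group's max magnitude maps to the FP8 representable range. The pure-Python sketch below illustrates the scale computation only; the group size and the FP8-E4M3 max of 448.0 are typical choices for illustration, not values taken from the kernel.

```python
FP8_E4M3_MAX = 448.0  # typical representable max for FP8 E4M3

def group_scales(row, group_size=4):
    """One scale per contiguous group of `group_size` elements,
    chosen so the group's max magnitude maps onto FP8_E4M3_MAX.

    Conceptual sketch only; not the CUDA kernel's logic.
    """
    scales = []
    for i in range(0, len(row), group_size):
        amax = max(abs(v) for v in row[i:i + group_size])
        # All-zero groups get a neutral scale of 1.0.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales

print(group_scales([1.0, -2.0, 0.5, 4.0, 448.0, 0.0, 0.0, 0.0]))
```

Dequantization then multiplies each quantized group back by its scale, which is why finer-grained (per-group rather than per-tensor) scales reduce clipping error for outlier-heavy activations.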

API & Frontend

  • OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897).
  • Multimodal: Media UUID caching (#23950), image path format (#25081).
  • Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
  • CLI: --enable-logging (#25610), improved --help (#24903).
  • Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
  • Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
  • UX: Removed misleading quantization warning (#25012).
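The `logprobs=-1` sentinel above means "return logprobs for the entire vocabulary rather than a top-k slice". The sketch below illustrates that selection rule conceptually; the function name and dict representation are illustrative, not vLLM internals.

```python
def select_logprobs(logprobs: dict[int, float], k: int) -> dict[int, float]:
    """Return the top-k token logprobs, or all of them when k == -1.

    Illustrative only; not vLLM's actual implementation.
    """
    if k == -1:
        return dict(logprobs)  # full vocabulary
    top = sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)

vocab = {0: -0.1, 1: -2.3, 2: -5.0}
print(select_logprobs(vocab, 2))
print(select_logprobs(vocab, -1))
```

Requesting the full vocabulary is useful for distillation and analysis workloads, at the cost of much larger responses.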

Security

Dependencies

  • PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
  • Build requirements: C++17 now enforced globally (#24823).
  • TPU: Deprecated xm.mark_step in favor of torch_xla.sync (#25254).

V0 Deprecation

What's Changed
