Releases: vllm-project/vllm
v0.16.0
vLLM v0.16.0
Please note that this release was branch-cut on Feb 8, so any features added to vLLM after that date are not included.
Highlights
This release features 440 commits from 203 contributors (7 new)!
- Async scheduling + Pipeline Parallelism is now fully supported, delivering 30.8% E2E throughput improvement and 31.8% TPOT improvement (#32618).
- Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions (#33187), building on the Voxtral realtime infrastructure.
- RLHF workflow improvements: Native NCCL-based weight syncing API (#31943), layerwise weight reloading for QeRL (#32133), and engine pause/resume with request preservation (#32351).
- Unified Parallel Drafting for speculative decoding (#32887), plus spec decode now works with structured outputs (#33374) and penalty application in Model Runner V2 (#33251).
- Major XPU platform overhaul: Deprecated IPEX in favor of vllm-xpu-kernels (#33379), adding MoE (#33659), MXFP4 MoE (#33679), WNA16 (#33973), scaled_mm (#34117), and FP8 MoE (#34202) support.
Model Support
- New architectures: GLM-OCR with MTP (#33005), Qwen3-ASR (#33312), DeepSeek-OCR-2 (#33165), Intern-S1-Pro (#33636), MiniCPM-o 4.5 (#33431), openPangu7B-VL (#32449), NemotronHPuzzle heterogeneous (#32549), MusicFlamingo (#32696), FunAudioChat (#2), ColBERT late interaction (#33686), voyage-4-nano (#33720), GLM-5 (#34124).
- Speculative decoding: EAGLE3 for Hunyuan/HunyuanVL (#33035), AFMoE (#33111), Mistral3 (#33939).
- LoRA expansion: Gemma3 vision components (#32764), Nemotron-H MTP models (#32265), Qwen3 output embedding (#29816). Optimized fused MoE-LoRA kernel indexing (#32770, #32774), unpermute-aware fused MoE LoRA path (#32655), reduced kernel overhead for fewer active LoRAs with multiple CUDA graphs (#32005).
- Features: Qwen3-Omni transcription (#29828), Mistral Large 3 with FlashInfer MoE (#33174), LFM2 SigLIP2 intermediate encoder layers (#33370), Qwen3-Omni/GLM-4.xV MRoPE positioning fixes (#33010, #33039), embedding input for disabled modalities (#32493).
- Performance: GLM-4.7-GPTQ decode and MTP acceptance rate regression fix (#33771), DeepSeek V3.2 fast detokenization (#33855), DeepSeek V3.2 tokenizer fix (#33832), GLM-5 MTP accuracy fix (#34385).
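ColBERT-style late interaction (#33686) scores a query against a document by taking, for each query token embedding, the maximum similarity over all document token embeddings, and summing those maxima. A minimal NumPy sketch of that MaxSim scoring (illustrative only, not vLLM's implementation; `maxsim_score` and the toy embeddings below are hypothetical):

```python
import numpy as np

def maxsim_score(q_emb, d_emb):
    """Late-interaction (ColBERT-style) relevance score.

    q_emb: (num_query_tokens, dim), d_emb: (num_doc_tokens, dim),
    both L2-normalized per token. For each query token, take the max
    cosine similarity over all document tokens, then sum over query tokens.
    """
    sim = q_emb @ d_emb.T              # (q_tokens, d_tokens) similarity matrix
    return float(sim.max(axis=1).sum())

q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.6, 0.8]])
print(maxsim_score(q, d))  # 1.0 (first query token) + 0.8 (second) = 1.8
```

Unlike single-vector embedding models, every token keeps its own vector, which is why these models need the dedicated pooling support listed above.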
Engine Core
- Async scheduling + Pipeline Parallelism: Full support with 30.8% throughput improvement (#32618), optimized spec decode + async scheduling with 1.5% throughput improvement (#33612), deadlock fix for torchrun PP broadcast (#33701).
- Speculative decoding: Unified Parallel Drafting (#32887), structured output support (#33374), penalty application in MRV2 (#33251), skip softmax for all-greedy rejection sampling (#32852), correctness fix for spec tokens with prefill chunks (#33652).
- RLHF: Native NCCL weight syncing API (#31943), layerwise reloading for QeRL (#32133), engine pause/resume with request preservation (#32351).
- Helion kernel framework: ConfigManager (#32740), kernel wrapper (#32964), kernel registry (#33203).
- PluggableLayer: Applied to linear layers (#33152) and Mamba layers (#33660).
- Batch invariance: Disable Cascade Attention (#32561), enable Triton attention (#33688).
- Performance: Grammar bitmask H2D copy on separate stream (#33059), zero-copy GQA for multimodal and CPU (#33732), early-reject oversized MM requests (#33502), CPU memory leak fix from Request reference cycle in prefix caching (#34183).
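The rejection-sampling optimization in #32852 relies on the fact that with temperature 0 the target distribution is a point mass at the argmax, so draft tokens can be verified against the raw logits without computing a softmax. A pure-Python sketch of that verification step (illustrative, not vLLM's kernel; `greedy_verify` is a hypothetical helper):

```python
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Verify speculative draft tokens when every request is greedy.

    A draft token is accepted iff it equals the argmax of the target
    logits at that position -- no softmax needed. On the first mismatch,
    emit the target's token as the correction and stop.
    """
    out = []
    for t, logits in zip(draft_tokens, target_logits):
        best = int(np.argmax(logits))
        out.append(best)
        if best != t:       # first mismatch: correction token, then stop
            return out
    return out              # all drafts accepted

logits = np.array([[0.1, 2.0, 0.3],   # argmax 1
                   [1.5, 0.2, 0.3],   # argmax 0
                   [0.1, 0.2, 3.0]])  # argmax 2
print(greedy_verify([1, 0, 1], logits))  # drafts [1, 0] accepted, third corrected to 2
```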
Hardware & Performance
- NVIDIA: FlashInfer TRTLLM BF16 MoE integration (#32954), SM100 INT4 W4A16 kernel (#32437), SM121 (DGX Spark) CUTLASS support (#33517), MNNVL protocol for GB series (#33540), FlashInfer MLA concat optimization (#31171), GDN attention layout optimization (#33291), DeepGEMM FP8 MLA performance (#33568), wvSplitK_fp8 performance (#33527, #33493), B200 MoE configs for Nemotron Nano (#32804), Super B200 TP2 (#33510), GLM 4.6 (#32958), Mamba selective scan tuning for B200 (#32873). Fix: DeepSeek R1 CUTLASS MLA on B200 (#33637), QK Norm+RoPE fusion on B200+FP8 (#33967), CUTLASS FP8 blockwise on SM103a (#32224).
- AMD ROCm: QWEN3-NEXT FP8 tunings (#32042), AITER attention backend for Qwen3-Next (#32492), fused_add_rmsnorm_pad for GPT-OSS (#30976), Qwen3-Omni startup fix (#33077).
- Intel XPU: Platform overhaul - deprecated IPEX, switched to vllm-xpu-kernels (#33379). New: unquantized MoE (#33659), MXFP4 MoE (#33679), WNA16 kernel (#33973), scaled_mm kernel (#34117), FP8 MoE (#34202).
- ARM CPU: KleidiAI INT4 dynamic quant with BF16 activations (#33122), NEON BFMMLA BF16 paged attention (#32263), vectorization backend optimization (#30329), attention dispatch by head_dim alignment (#32161).
- IBM Z: BF16 kernel type for s390x (#33788).
- torch.compile: Stop compiling identical artifacts (#34003), MoE cold start optimization option (#33735), fix 32-bit indexing assumption (#33113), attention fusion pass fix (#33945).
- Performance: Chat completion streaming optimization (#33782), ORJSONResponse for faster API responses (#33548), MoE permute optimization for CUTLASS FP8 (#32892), shared/routed overlap for latent MoE on Nemotron-H (#32790), FlashInfer autotune control flag (#34006).
Large Scale Serving
- Disaggregated serving: Mooncake connector rework with bootstrap server (#31034), cross-layer KV cache layout at NIXL Connector V2 (#33339), delay freeing blocks for aborted async loads (#32255), async double-free fix (#33377), Ray multi-replica single-instance fix (#33604).
- EPLB: Capture logical experts with router replay (#33013), DP metadata fix for dense models (#32739).
- Metrics: KV offloading connector metrics (#27942), labeled prompt token metrics for P/D disaggregation (#33290).
Quantization
- New: FP8 block quant for CompressedTensorsW8A16Fp8 (#33280), ModelOpt MXFP8 for dense models (#33786), NVFP4/FP8 on Turing GPUs (#33076), TP > 4 for FP4 Gemm (#31099).
- Bugfixes: FP8 online quantization memory fix (#31914), asymmetric W4A16 (ConchLinear) for CT (#33200), DeepSeek V3.2 NVFP4 (#33932), LoRA FP8 (#33879), quantized Falcon-H1 model loading (#32728), quantized Mamba TP with n_groups=1 (#33257), CPU W8A8 with bias (#33582), CPU W8A8 3D input support (#33727).
- Deprecation: Removed BitBlas (#32683) and Marlin 24 (#32688).
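Several items above concern per-tensor FP8 paths. As background, per-tensor dynamic quantization picks a single scale so the largest magnitude in the tensor maps to the format's maximum (±448 for e4m3). A NumPy reference sketch (illustrative only; real kernels emit an actual 8-bit dtype, and `quant_fp8_per_tensor` is a hypothetical name):

```python
import numpy as np

def quant_fp8_per_tensor(x, fp8_max=448.0):
    """Reference per-tensor dynamic quantization (e4m3-style range).

    The scale maps the tensor's max magnitude onto fp8_max; values are
    rounded and clamped in the quantized domain. Sketch only -- no real
    FP8 rounding/mantissa behavior is modeled here.
    """
    scale = np.abs(x).max() / fp8_max
    q = np.clip(np.round(x / scale), -fp8_max, fp8_max)
    return q, scale

x = np.array([0.5, -1.0, 2.0, -4.0])
q, s = quant_fp8_per_tensor(x)
print(np.allclose(q * s, x, atol=s))  # dequantized values within one quant step
```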
API & Frontend
- Realtime API: WebSocket-based streaming API (#33187) with Voxtral realtime support.
- Responses API: Sampling parameters (#32609), return token IDs (#33212), return prompt token IDs (#33378), parser implementation (#32712).
- Pooling API: Request schema consensus for ScoreRequest (#33060) and final standardization (#31127).
- Tool calling: Fix multi-turn tool call ID preservation (#32768), fix indexing double-counting (#33141), GLM-4 incremental string streaming (#33218), DSV3.2 fast detokenization fix (#33964), MCP tools non-streaming fix (#32762).
- Structured outputs: Performance optimization with reasoning (#33557), guidance vocab size fix (#33509).
- CLI: `--disable-access-log-for-endpoints` option (#30011).
- UX: Nested configs in YAML files (#33193), GGUF `repo_id:quant_type` syntax (#33371), DeepSeek ReasoningParser with thinking enabled by default (#33221), remove noisy CT warning (#33273), early tokenization validation (#31366), reasoning_content backward compatibility (#33635), only include Authorization header when OPENAI_API_KEY is set (#33488).
- Features: run_batch transcription/translation support (#33934), /server_info collect_env (#33246), OTEL tracing during model loading (#31162), clear MM and encoder cache (#33452), HF Hub LoRA resolver (#20320).
- Scoring: Fix multi-document scoring returning single result (#33837).
Security
- Patch protobuf for CVE-2026-0994 (#34253).
Dependencies
- huggingface-hub updates for Transformers v5 preparation (#33473).
- Transformers v5 compatibility fixes across multiple models (#33977, #33683).
Deprecation & Breaking Changes
- Removed BitBlas quantization (#32683) and Marlin 24 (#32688).
- Removed deprecated `reasoning_content` message field (#33402).
- Removed deprecated pooling items (#33477).
- Removed deprecated `VLLM_ALL2ALL_BACKEND` environment variable (#33535).
- Deprecated IPEX for XPU, switched to vllm-xpu-kernels (#33379).
New Contributors 🎉
- @aabbccddwasd made their first contribution in #33771
- @Code4me2 made their first contribution in #33517
- @ikchifo made their first contribution in #33967
- @jiangwu300 made their first contribution in #33604
- @pjs102793 made their first contribution in #33963
- @sleepcoo made their first contribution in #33978
- @TundeAtSN made their first contribution in #33939
v0.15.1
v0.15.1 is a patch release with security fixes, RTX Blackwell GPU support fixes, and bug fixes.
Security
- CVE-2025-69223: Updated aiohttp dependency (#33621)
- CVE-2026-0994: Updated Protobuf dependency (#33619)
Highlights
Bugfix Hardware Support
- RTX Blackwell (SM120): Fixed NVFP4 MoE kernel support for RTX Blackwell workstation GPUs. Previously, NVFP4 MoE models would fail to load on these GPUs (#33417)
- FP8 kernel selection: Fixed FP8 CUTLASS group GEMM to properly fall back to Triton kernels on SM120 GPUs (#33285)
Model Support
- Step-3.5-Flash: New model support (#33523)
Bugfix Model Support
- Qwen3-VL-Reranker: Fixed model loading (#33298)
- Whisper: Fixed FlashAttention2 with full CUDA graphs (#33360)
Performance
- torch.compile cold-start: Fixed regression that increased cold-start compilation time (Llama3-70B: ~88s → ~22s) (#33441)
- MoE forward pass: Optimized by caching layer name computation (#33184)
Bug Fixes
- Fixed prefix cache hit rate of 0% with GPT-OSS style hybrid attention models (#33524)
- Enabled Triton MoE backend for FP8 per-tensor dynamic quantization (#33300)
- Disabled unsupported Renormalize routing methods for TRTLLM per-tensor FP8 MoE (#33620)
- Fixed speculative decoding metrics crash when no tokens generated (#33729)
- Disabled fast MoE cold start optimization with speculative decoding (#33624)
- Fixed ROCm skinny GEMM dispatch logic (#33366)
Dependencies
- Pinned LMCache >= v0.3.9 for API compatibility (#33440)
New Contributors 🎉
- @zaristei2 made their first contribution in #33621
Full Changelog: v0.15.0...v0.15.1
v0.15.0
Highlights
This release features 335 commits from 158 contributors (39 new)!
Model Support
- New architectures: Kimi-K2.5 (#33131), Molmo2 (#30997), Step3vl 10B (#32329), Step1 (#32511), GLM-Lite (#31386), Eagle2.5-8B VLM (#32456).
- LoRA expansion: Nemotron-H (#30802), InternVL2 (#32397), MiniMax M2 (#32763).
- Speculative decoding: EAGLE3 for Pixtral/LlavaForConditionalGeneration (#32542), Qwen3 VL MoE (#32048), draft model support (#24322).
- Embeddings: BGE-M3 sparse embeddings and ColBERT embeddings (#14526).
- Model enhancements: Voxtral streaming architecture (#32861), SharedFusedMoE for Qwen3MoE (#32082), dynamic resolution for Nemotron Nano VL (#32121), Molmo2 vision backbone quantization (#32385).
Engine Core
- Async scheduling + Pipeline Parallelism: `--async-scheduling` now works with pipeline parallelism (#32359).
- Mamba prefix caching: Block-aligned prefix caching for Mamba/hybrid models with `--enable-prefix-caching --mamba-cache-mode align`. Achieves ~2x speedup by caching Mamba states directly (#30877).
- Session-based streaming input: New incremental input support for interactive workloads like ASR. Accepts async generators producing `StreamingInput` objects while maintaining KV cache alignment (#28973).
- Model Runner V2: VLM support (#32546), architecture improvements.
- LoRA: Inplace loading for memory efficiency (#31326).
- AOT compilation: torch.compile inductor artifacts support (#25205).
- Performance: KV cache offloading redundant load prevention (#29087), FlashAttn attention/cache update separation (#25954).
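The Mamba prefix caching above (#30877) caches recurrent state at block boundaries, so a new request can resume from the longest block-aligned prefix it shares with a previous one. A simplified pure-Python sketch of the idea (illustrative only; `cache_states`, `run_block`, and the block size are hypothetical, not vLLM's internals):

```python
# State after every full block of tokens is cached under the token
# prefix, so recomputation is skipped up to the longest cached prefix.
BLOCK = 4

def cache_states(tokens, state_cache, run_block):
    """run_block(state, block_tokens) -> new_state; caches per-block states."""
    state, prefix = None, ()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        prefix = prefix + tuple(tokens[i:i + BLOCK])
        if prefix in state_cache:
            state = state_cache[prefix]          # hit: reuse cached state
        else:
            state = run_block(state, tokens[i:i + BLOCK])
            state_cache[prefix] = state
    return state

calls = []
def run_block(state, block):
    calls.append(tuple(block))                   # count real block computations
    return (state or 0) + sum(block)

cache = {}
cache_states([1, 2, 3, 4, 5, 6, 7, 8], cache, run_block)  # computes 2 blocks
cache_states([1, 2, 3, 4, 9, 9, 9, 9], cache, run_block)  # first block is a hit
print(len(calls))  # 3: second request recomputes only its second block
```

This is why the feature requires block alignment: a recurrent state is only reusable at exact block boundaries, unlike attention KV blocks.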
Hardware & Performance
NVIDIA
- Blackwell defaults: FlashInfer MLA is now the default MLA backend on Blackwell, with TRTLLM as default prefill (#32615).
- MoE performance: 1.2-2% E2E throughput improvement via grouped topk kernel fusion (#32058), NVFP4 small-batch decoding improvement (#30885), faster cold start for MoEs with torch.compile (#32805).
- FP4 kernel optimization: Up to 65% faster FP4 quantization on Blackwell (SM100F) using 256-bit loads, ~4% E2E throughput improvement (#32520).
- Kernel improvements: topk_sigmoid kernel for MoE routing (#31246), atomics reduce counting for SplitK skinny GEMMs (#29843), fused cat+quant for FP8 KV cache in MLA (#32950).
- torch.compile: SiluAndMul and QuantFP8 CustomOp compilation (#32806), Triton prefill attention performance (#32403).
AMD ROCm
- MoRI EP: High-performance all2all backend for Expert Parallel (#28664).
- Attention improvements: Shuffle KV cache layout and assembly paged attention kernel for AiterFlashAttentionBackend (#29887).
- FP4 support: MLA projection GEMMs with dynamic quantization (#32238).
- Consumer GPU support: Flash Attention Triton backend on RDNA3/RDNA4 (#32944).
Other Platforms
- TPU: Pipeline parallelism support (#28506), backend option (#32438).
- Intel XPU: AgRsAll2AllManager for distributed communication (#32654).
- CPU: NUMA-aware acceleration for TP/DP inference on ARM (#32792), PyTorch 2.10 (#32869).
- Whisper: torch.compile support (#30385).
- WSL: Platform compatibility fix for Windows Subsystem for Linux (#32749).
Quantization
- MXFP4: W4A16 support for compressed-tensors MoE models (#32285).
- Non-gated MoE: Quantization support with Marlin, NVFP4 CUTLASS, FP8, INT8, and compressed-tensors (#32257).
- Intel: Quantization Toolkit integration (#31716).
- FP8 KV cache: Per-tensor and per-attention-head quantization via llmcompressor (#30141).
API & Frontend
- Responses API: Partial message generation (#32100), `include_stop_str_in_output` tuning (#32383), `prompt_cache_key` support (#32824).
- OpenAI API: `skip_special_tokens` configuration (#32345).
- Score endpoint: Flexible input formats with `data_1`/`data_2` and `queries`/`documents` (#32577).
- Render endpoints: New endpoints for prompt preprocessing (#32473).
- Whisper API: `avg_logprob` and `compression_ratio` in verbose_json segments (#31059).
- Security: FIPS 140-3 compliant hash option for enterprise/government users (#32386), `--ssl-ciphers` CLI argument (#30937).
- UX improvements: Auto `api_server_count` based on `dp_size` (#32525), wheel variant auto-detection during install (#32948), custom profiler URI schemes (#32393).
Dependencies
- FlashInfer v0.6.1 (#30993)
- Transformers 4.57.5 (#32287)
- PyTorch 2.10 for CPU backend (#32869)
- DeepGEMM newer version (#32479)
Breaking Changes & Deprecations
- Metrics: Removed deprecated `vllm:time_per_output_token_seconds` metric - use `vllm:inter_token_latency_seconds` instead (#32661).
- Environment variables: Removed deprecated environment variables (#32812).
- Quantization: DeepSpeedFp8 removed (#32679), RTN removed (#32697), HQQ deprecated (#32681).
Bug Fixes
- Speculative decoding: Eagle draft_model_config fix (#31753).
- DeepSeek: DeepSeek-V3.1 + DeepGEMM incompatible scale shapes fix (#32361).
- Distributed: DP+MoE inference fix via CpuCommunicator (#31867), P/D with non-MoE DP fix (#33037).
- EPLB: Possible deadlock fix (#32418).
- NIXL: UCX memory leak fix by exporting UCX_MEM_MMAP_HOOK_MODE=none (#32181).
- Structured output: Outlines byte fallback handling fix (#31391).
New Contributors 🎉
- @YunzhuLu made their first contribution in #32126
- @emricksini-h made their first contribution in #30784
- @dsfaccini made their first contribution in #32289
- @ofirzaf made their first contribution in #32312
- @seekskyworld made their first contribution in #32321
- @brian033 made their first contribution in #31715
- @TomerBN-Nvidia made their first contribution in #32257
- @vanshilshah97 made their first contribution in #32448
- @George-Polya made their first contribution in #32385
- @T1mn made their first contribution in #32411
- @mritunjaysharma394 made their first contribution in #31492
- @randzero made their first contribution in #32511
- @DemingCheng made their first contribution in #32556
- @iboiko-habana made their first contribution in #32471
- @honglyua-il made their first contribution in #32462
- @hyeongyun0916 made their first contribution in #32473
- @DanielMe made their first contribution in #32560
- @netanel-haber made their first contribution in #32121
- @longregen made their first contribution in #28784
- @jasonyanwenl made their first contribution in #32749
- @Wauplin made their first contribution in #32788
- @ikaadil made their first contribution in #32775
- @alexsun07 made their first contribution in #28664
- @liranschour made their first contribution in #30207
- @AuYang261 made their first contribution in #32844
- @diviramon made their first contribution in #32393
- @RishabhSaini made their first contribution in #32884
- @MatteoFari made their first contribution in #32397
- @peakcrosser7 made their first contribution in #30877
- @orionr made their first contribution in #30443
- @marksverdhei made their first contribution in #32614
- @joninco made their first contribution in #32935
- @monajafi-amd made their first contribution in #32944
- @ruizcrp made their first contribution in #32988
- @sjhddh made their first contribution in #32983
- @HirokenOvo made their first contribution in #32646
- @Chenhao-Guan made their first contribution in #32763
- @joshuadeng made their first contribution in #28973
- @ZhanqiuHu made their first contribution in #33016
Full Changelog: v0.14.1...v0.15.0
v0.14.1
v0.14.0
Highlights
This release features approximately 660 commits from 251 contributors (86 new contributors).
Breaking Changes:
- Async scheduling is now enabled by default - users who experience issues can disable it with `--no-async-scheduling`.
  - Excludes some not-yet-supported configurations: pipeline parallel, CPU backend, non-MTP/Eagle spec decoding.
- PyTorch 2.9.1 is now required and the default wheel is compiled against cu129.
- Deprecated quantization schemes have been removed (#31688, #31285).
- When using speculative decoding, unsupported sampling parameters will fail rather than being silently ignored (#31982).
Key Improvements:
- Async scheduling enabled by default (#27614): Overlaps engine core scheduling with GPU execution, improving throughput without user configuration. Now also works with speculative decoding (#31998) and structured outputs (#29821).
- gRPC server entrypoint (#30190): Alternative to REST API with binary protocol, HTTP/2 multiplexing.
- `--max-model-len auto` (#29431): Automatically fits context length to available GPU memory, eliminating OOM startup failures.
- Model inspection view (#29450): View the modules, attention backends, and quantization of your model in vLLM by specifying `VLLM_LOG_MODEL_INSPECTION=1` or by simply printing the `LLM` object.
- Model Runner V2 enhancements: UVA block tables (#31965), M-RoPE (#32143), `logit_bias`/`allowed_token_ids`/`min_tokens` support (#32163).
  - Please note that Model Runner V2 is still experimental and disabled by default.
Model Support
New Model Architectures:
- Grok-2 with tiktoken tokenizer (#31847)
- LFM2-VL vision-language model (#31758)
- MiMo-V2-Flash (#30836)
- openPangu MoE (#28775)
- IQuestCoder (#31575)
- Nemotron Parse 1.1 (#30864)
- GLM-ASR audio (#31436)
- Isaac vision model v0.1/v0.2 (#28367, #31550)
- Kanana-1.5-v-3b-instruct (#29384)
- K-EXAONE-236B-A23B MoE (#31621)
LoRA Support Expansion:
- Multimodal tower/connector LoRA (#26674): LLaVA (#31513), BLIP2 (#31620), PaliGemma (#31656), Pixtral (#31724), DotsOCR (#31825), GLM4-V (#31652)
- DeepSeek-OCR (#31569), Qwen3-Next (#31719), NemotronH (#31539), PLaMo 2/3 (#31322)
- Vision LoRA mm_processor_cache support (#31927)
- MoE expert base_layer loading (#31104)
Model Enhancements:
- Qwen3-VL as reranker (#31890)
- DeepSeek v3.2 chat prefix completion (#31147)
- GLM-4.5/GLM-4.7
enable_thinking: false(#31788) - Ernie4.5-VL video timestamps (#31274)
- Score template expansion (#31335)
- LLaMa4 vision encoder compilation (#30709)
- NemotronH quantized attention (#31898)
Engine Core
- Async scheduling default with spec decode (#27614, #31998) and structured outputs (#29821)
- Hybrid allocator + KV connector (#30166) with multiple KV cache groups (#31707)
- Triton attention: encoder-only/cross attention (#31406), cross-layer blocks (#30687)
- Mamba2 prefix cache optimization (#28047)
- Batch invariant LoRA (#30097)
- LoRA name in BlockStored for KV-cache reconstruction (#27577)
- Request ID collision prevention (#27987)
- Dense model DP without overhead (#30739)
- Async + spec decode penalties/bad_words (#30495)
Hardware & Performance
CUTLASS MoE Optimizations:
- 2.9% throughput + 10.8% TTFT via fill(0) optimization (#31754)
- 5.3% throughput + 2.2% TTFT via problem size calculation (#31830)
- Fused SiLU+Mul+Quant for NVFP4 (#31832)
- NVFP4 stride fusion (#31837)
Other Performance:
- GDN attention decode speedup (Qwen3-Next) (#31722)
- Fused RoPE + MLA KV-cache write (#25774)
- Sliding window attention optimization (#31984)
- FlashInfer DeepGEMM swapAB SM90 (#29213)
- Unpermute-aware fused MoE + small-batch fallback (#29354)
- GDN Attention blocking copy removal (#31167)
- FusedMoE LoRA small rank performance (#32019)
- EPLB numpy optimization (#29499)
- FlashInfer rotary for DeepSeek (#30729)
- Vectorized activations (#29512)
- NUMA interleaved memory (#30800)
- Async spec decode logprobs (#31336)
Hardware Configs:
- SM103 support (#30705, #31150)
- B300 Blackwell MoE configs (#30629)
- Qwen3-Next FP8 CUTLASS configs (#29553)
- Qwen3Moe B200 Triton configs (#31448)
- GLM-4.5/4.6 RTX Pro 6000 kernels (#31407)
- MiniMax-M2/M2.1 QKNorm (#31493)
- NVFP4 small batch tuning (#30897)
Platform:
- ROCm: AITER RMSNorm fusion (#26575), MTP for AITER MLA (#28624), moriio connector (#29304), xgrammar upstream (#31327)
- XPU: FP8 streaming quant (#30944), custom workers (#30935)
- CPU: Head sizes 80/112 (#31968), async disabled by default (#31525), LoRA MoE CPU pinning (#31317)
- TPU: tpu-inference path (#30808), Sophgo docs (#30949)
Large Scale Serving
- XBO (Extended Dual-Batch Overlap) (#30120)
- NIXL asymmetric TP (P > D tensor-parallel-size) (#27274)
- NIXL heterogeneous BlockSize/kv_layout (#30275)
- Cross-layers KV layout for MultiConnector (#30761)
- Mooncake protocol expansion (#30133)
- LMCache KV cache registration (#31397)
- EPLB default all2all backend (#30559)
Quantization
- Marlin for Turing (sm75) (#29901, #31000)
- Quark int4-fp8 w4a8 MoE (#30071)
- MXFP4 W4A16 dense models (#31926)
- ModelOpt FP8 variants (FP8_PER_CHANNEL_PER_TOKEN, FP8_PB_WO) (#30957)
- ModelOpt KV cache quantization update (#31895)
- NVFP4 Marlin for NVFP4A16 MoEs (#30881)
- Static quant all group shapes (#30833)
- Default MXFP4 LoRA backend: Marlin (#30598)
- compressed-tensors 0.13.0 (#30799)
API & Frontend
New Features:
- gRPC server (#30190)
- `--max-model-len auto` (#29431)
- Model inspection view (#29450)
- Offline FastAPI docs (#30184)
- `attention_config` in LLM() (#30710)
- MFU metrics (#30738)
- Iteration logging + NVTX (#31193)
- `reasoning_effort` parameter (#31956)
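The MFU metric above (#30738) reports Model FLOPs Utilization: achieved FLOPs divided by the hardware's peak. A minimal sketch of the arithmetic (illustrative; the function name and the throughput/peak numbers below are made up, and the `2 * n_params` rule of thumb applies to dense decoder forward passes):

```python
def mfu(tokens_per_s, flops_per_token, peak_flops):
    """Model FLOPs Utilization: achieved FLOP/s over hardware peak FLOP/s.

    flops_per_token ~= 2 * n_params for a dense decoder forward pass.
    """
    return tokens_per_s * flops_per_token / peak_flops

# e.g. an 8B dense model at 10k tok/s on a ~1e15 FLOP/s accelerator
print(round(mfu(10_000, 2 * 8e9, 1e15), 2))  # 0.16
```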
Tool Calling:
CLI:
- `-ep` for `--enable-expert-parallel` (#30890)
- Complete help messages (#31226)
- Bench serve auto-discovery + `--input-len` (#30816)
- Spec decode acceptance stats (#31739)
- `--enable-log-deltas` (renamed) (#32020)
- `--default-chat-template-kwargs` (#31343)
API:
- `/server_info` env info (#31899)
- MCP streaming in Responses API (#31761)
- `/embeddings` `continue_final_message` (#31497)
- Reranking score templates (#30550)
- Chat template warmup (#30700)
- Configurable handshake timeout (#27444)
- Better 500 errors (#20610)
- Worker init logging (#29493)
- Bench error reporting (#31808)
- Corrupted video recovery (#29197)
- Spec-decode param validation (#31982)
- Validation error metadata (#30134)
Security
Dependencies
- PyTorch 2.9.1 (#28495)
- compressed-tensors 0.13.0 (#30799)
- CUDA 13 LMCache/NIXL in Docker (#30913)
- Configurable NVSHMEM version (#30732)
Bug Fixes (User-Facing)
- Invalid UTF-8 tokens (#28874)
- CPU RoPE gibberish with `--enforce-eager` (#31643)
- Tool call streaming finish chunk (#31438)
- Encoder cache leak CPU scheduling stuck (#31857)
- Engine crash: tools + response_format (#32127)
- Voxtral transcription API (#31388)
- Safetensors download optimization (#30537)
Deprecations
Documentation
New Contributors 🎉
- @penfree made their first contribution in #30237
- @jiangkuaixue123 made their first contribution in #30120
- @jr-shen made their first contribution in #29663
- @grzegorz-k-karch made their first contribution in #30795
- @shanjiaz made their first contribution in #30799
- @Somoku made their first contribution in #29569
- @baoqian426 made their first contribution in #30841
- @SongDI911 made their first contribution in #30852
- @www-spam made their first contribution in #30827
- @Xunzhuo made their first contribution in #30844
- @TheCodeWrangler made their first contribution in #30700
- @SungMinCho made their first contribution in #30738
- @sarathc-cerebras made their first contribution in #30188
- @wzyrrr made their first contribution in #30949
- @navmarri14 made their first contribution in #30629
- @HaloWorld made their first contribution in #30867
- @jeffreywang-anyscale made their first contribution in #31013
- @AmeenP made their first contribution in #31093
- @westers made their first contribution in #31071
- @CedricHwong made their first contribution in #30957
- @c0de128 made their first contribution in #31114
- @Bounty-hunter made their first contribution in #30242
- @jzakrzew made their first contribution in #30550
- @1643661061leo made their first contribution in #30760
- @NickCao made their first contribution in https:/...
v0.13.0
vLLM v0.13.0 Release Notes
Highlights
This release features 442 commits from 207 contributors (61 new contributors)!
Breaking Changes: This release includes deprecation removals, PassConfig flag renames, and attention configuration changes from environment variables to CLI arguments. Please review the breaking changes section carefully before upgrading.
Model Support
- New models: BAGEL (AR only) (#28439), AudioFlamingo3 (#30539), JAIS 2 (#30188), latent MoE architecture support (#30203).
- Tool parsers: DeepSeek-V3.2 (#29848), Gigachat 3 (#29905), Holo2 reasoning (#30048).
- Model enhancements: Qwen3-VL embeddings support (#30037), Qwen3-VL EVS (Efficient Video Sampling) (#29752), DeepSeek V3.2 proper `drop_thinking` logic (#30490), DeepSeek V3.2 top-k fix (#27568).
- Task expansion: Automatic TokenClassification model conversion (#30666), Ultravox v0.7 transformer projector (#30089).
- Quantization: BitsAndBytes for Qwen3-Omni-MoE (#29896).
- Speculative decoding: Eagle/Eagle3 Transformers backend (#30340), Mamba `selective_state_update` spec decode (#29488).
Engine Core
- Compilation: Conditional compilation via `compile_ranges` for selective kernel compilation (#24252).
- Prefix caching: xxHash high-performance hash option (#29163).
- Attention: PrefixLM support for FlexAttention (#27938) and TritonAttention (#30386), CUDA graphs for 3D Triton attention (#28306), `TRITON_MLA` without prefix-caching (#29125).
- Batch invariance: FA2 and LoRA batch-invariant support (#30018).
- Pooling: Chunked prefill for ALL pooling tasks (#27145), multi-vector retrieval API (#26686).
- Model Runner V2: Min-p sampling (#30171), NaN detection in logits (#30187).
- Speculative decoding: Medusa GPU-CPU sync avoidance (#29723), async spec-decode improvements (#29624).
- Whisper: Major performance improvements - V1 is now faster than V0 (~3x speedup vs v0.12.0). Encoder batching (#29421), `FULL_DECODE_ONLY` CUDA graph (#30072), CPU backend support (#30062).
- Performance: Fused blockwise quant RMS norm (#27883), MoE LoRA loading reduction (#30243), encoder cache optimization (#30475), CPU KV offloading streams (#29013).
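The xxHash prefix-caching option above (#29163) speeds up the per-block hashing that prefix caching relies on: each block's cache key folds in the previous block's hash, so equal keys imply equal full prefixes. A self-contained sketch of that chaining (illustrative; `hashlib.sha256` stands in here for xxHash to keep the example dependency-free, and `block_hashes` is a hypothetical name):

```python
import hashlib

def block_hashes(tokens, block_size=4):
    """Chained prefix-cache block hashing (reference sketch).

    Each block's key covers the whole prefix by including the previous
    block's digest, so two requests share a key only if they share the
    entire token prefix up to that block.
    """
    keys, parent = [], b""
    for i in range(0, len(tokens) - len(tokens) % block_size, block_size):
        h = hashlib.sha256(parent + repr(tokens[i:i + block_size]).encode())
        parent = h.digest()
        keys.append(h.hexdigest()[:8])
    return keys

a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
print(a[0] == b[0], a[1] == b[1])  # shared first block hits; second diverges
```

Because the hash runs on every scheduled block, swapping a cryptographic hash for a fast non-cryptographic one like xxHash is a meaningful scheduler-side win.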
Hardware & Performance
- NVIDIA Blackwell Ultra: SM103 (GB300) support with CUDA 13 (#30484).
- DeepSeek optimizations (benchmarked on DeepSeek-V3.1):
- DeepEP High-Throughput CUDA graph enabled by default: 5.3% throughput, 4.4% TTFT improvement (#29558)
- DeepGEMM fused layout kernel: 4.3% throughput, 10.7% TTFT improvement (#29546)
- DeepGEMM experts initialization: 3.9% TTFT improvement (#30494)
- `group_topk` kernel: 1.9% throughput, 2.1% TPOT improvement (#30159)
- Sparse prefill kernel for FP8 KV-cache in DeepSeek-V3.2 (#27532)
- MLA FP8 optimization with ReduceScatterSum (#29795), direct k_nope/k_pe copy (#29710)
- CPU: Whisper support (#30062), Arm Optimized Routines vectorized exp (#30068), x86 CPU wheel pipeline (#28848).
- AMD ROCm: Aiter quantization kernels (#25552), torch.compile layernorm/silu + FP8 quant (#25693), Triton ScaledMM fallback (#26668), MXFP4 w4a4 inference (#29775).
- Intel XPU: wNa16 compressed tensors (#29484).
- Build: CUDA 13 aarch64 wheels (#30341), Docker kernel build stage (#29452), Ascend NPU Docker (#30015).
Large Scale Serving & Disaggregated Prefill/Decode
- KV connectors: Mooncake Transfer Engine (#24718), cache reset via `/reset_prefix_cache` (#27170), KV events (#28309), failure recovery config (#26813).
- NIXL: Compatibility checking in handshake (#29503), large batch proxy support (#28782).
- EPLB: NVFP4 support (#29804), algorithm abstraction (#26471).
- Multi-node: External launcher mode (#29833).
- Hybrid allocator: Optional KV connector integration (#29805).
- Performance: silu_mul_per_token_group_quant_fp8 kernel for DP/EP (#29470).
Quantization
- New: W4A8 grouped GEMM on Hopper (#29691), online FP8 with streaming post-processing (#29196), FP8 weight reloading for RLHF (#28480).
- MoE + LoRA: AWQ Marlin (#30442) and GPTQ Marlin (#30254) support.
- GGUF: MoE + GGUF restored for Qwen3 MoE (#30116), Qwen2 MoE (#30307), HF defaults override (#30118).
- Compatibility: Transformers v5 RoPE support (#30046).
API & Frontend
- Responses API: MCP type infrastructure (#30054), Browser/Container MCP tools (#29989), full MCP Python loop (#29798), extra body parameters (#30532).
- Configuration: `AttentionConfig` replaces `VLLM_ATTENTION_BACKEND` env var (#26315).
- Chat templates: DeepSeek-V3.2 (#29837), DeepSeek-V3.2 developer tools (#30040).
- Anthropic API: Streaming fixes (#29971, #30266).
- Embeddings: Binary format with `encoding_format=bytes_only` (#30249), multiple image/audio per request (#29988), tokenization_kwargs override (#29794).
- Metrics: Prefill KV compute metric excluding cached tokens (#30189).
- Profiling: Layer-wise NVTX (#29990), profiling CLI config (#29912).
- UX: Better OOM errors (#28051), ModelConfig validation (#30213), distributed executor errors (#30140).
Security
- Additional protection for CVE-2025-62164 (#30649).
Dependencies
Breaking Changes & Deprecations
- PassConfig flags renamed per RFC #27995 (#29646)
- Attention env vars → CLI args: `VLLM_ATTENTION_BACKEND` replaced with `--attention-backend` (#26315)
- Removed `-O.xx` flag (#29991)
- Removed deprecated plugin/compilation fields (#30396)
- Removed deprecated task, seed, MM settings (#30397)
- Removed `embed_input_ids`/`embed_multimodal` fallbacks (#30458)
- Removed tokenizer setter (#30400)
- Deprecations: `merge_by_field_config` (#30035, #30170), `--convert reward` → `--convert embed` (#30463)
New Contributors 🎉
- @ajpqs made their first contribution in #29905
- @amitz-nv made their first contribution in #29978
- @amrmahdi made their first contribution in #29452
- @andrewbriand made their first contribution in #29804
- @anker-c2 made their first contribution in #30344
- @AuruTus made their first contribution in #30182
- @avigny made their first contribution in #19425
- @Bhanu068 made their first contribution in #30254
- @Copilot made their first contribution in #29025
- @dbotwinick made their first contribution in #30583
- @dependabot[bot] made their first contribution in #30234
- @desertfire made their first contribution in #29919
- @dmitry-tokarev-nv made their first contribution in #30149
- @drslark made their first contribution in #30632
- @dtcccc made their first contribution in #24718
- @elizabetht made their first contribution in #28671
- @Elm8116 made their first contribution in #30068
- @gausah01 made their first contribution in #29604
- @gh-wf made their first contribution in #30285
- @hdlj-h made their first contribution in #30056
- @HF-001 made their first contribution in #30051
- @hzxuzhonghu made their first contribution in #29931
- @JaviS-Rei made their first contribution in #29882
- @johannesflommersfeld made their first contribution in #30390
- @KevinMusgrave made their first contribution in #30529
- @kitaekatt made their first contribution in #30408
- @lashahub made their first contribution in #30539
- @LuminolT made their first contribution in #29163
- @majiayu000 made their first contribution in #30615
- @MaoJianwei made their first contribution in #29797
- @Mercykid-bash made their first contribution in #26471
- @mgehre-amd made their first contribution in #30364
- @mivehk made their first contribution in #30512
- @mondaylord made their first contribution in #30671
- @noa-neria made their first contribution in #29320
- @PatrykSaffer made their first contribution in #30330
- @Peng-YM made their first contribution in #29074
- @realliujiaxu made their first contribution in #30059
- @redwrasse made their first contribution in #29261
- @Ri0S made their first contribution in #30532
- @sarathc-cerebras made their first contribution in #30188
- @scr...
v0.12.0
vLLM v0.12.0 Release Notes
Highlights
This release features 474 commits from 213 contributors (57 new)!
Breaking Changes: This release includes PyTorch 2.9.0 upgrade (CUDA 12.9), V0 deprecations including xformers backend, and scheduled removals - please review the changelog carefully.
Major Features:
- EAGLE Speculative Decoding Improvements: Multi-step CUDA graph support (#29559), DP>1 support (#26086), and multimodal support with Qwen3VL (#29594).
- Significant Performance Optimizations: 18.1% throughput improvement from batch invariant BMM (#29345), 2.2% throughput improvement from shared experts overlap (#28879).
- AMD ROCm Expansion: DeepSeek v3.2 + SparseMLA support (#26670), FP8 MLA decode (#28032), AITER attention backend (#28701).
Model Support
- New model families: PLaMo-3 (#28834), OpenCUA-7B (#29068), HunyuanOCR (#29327), Mistral Large 3 and Ministral 3 (#29757).
- Format support: Gemma3 GGUF multimodal support (#27772).
- Multimodal enhancements: Qwen3 Omni audio-in-video support (#27721), Eagle3 multimodal support for Qwen3VL (#29594).
- Performance: QwenVL cos/sin cache optimization (#28798).
Engine Core
- GPU Model Runner V2 (Experimental) (#25266): Complete refactoring of the model execution pipeline:
  - No "reordering" or complex bookkeeping, with persistent batch removal
  - GPU-persistent block tables for better scalability with `max_model_len` and `num_kv_groups`
  - Triton-native sampler: no -1 temperature hack, efficient per-request seeds, memory-efficient prompt logprobs
  - Simplified DP and CUDA graph implementations
  - Efficient structured outputs support
- Prefill Context Parallel (PCP) (Preparatory) (#28718): Partitions the sequence dimension during prefill for improved long-sequence inference. Complements the existing Decode Context Parallel (DCP). See RFC #25749 for details.
- RLHF Support: Pause and resume generation for asynchronous RL training (#28037).
- KV Cache Enhancements: Cross-layer KV blocks support (#27743), KV cache residency metrics (#27793).
- Audio support: Audio embeddings support in chat completions (#29059).
- Speculative Decoding:
- Configuration: Flexible `inputs_embeds_size` separate from `hidden_size` (#29741), `--fully-sharded-loras` for fused_moe (#28761).
Hardware & Performance
- NVIDIA Performance:
  - Batch invariant BMM optimization: 18.1% throughput improvement, 10.7% TTFT improvement on DeepSeek-V3.1 (#29345)
  - Shared Experts Overlap with FlashInfer DeepGEMM: 2.2% throughput improvement, 3.6% TTFT improvement at batch size 32 (#28879)
  - DeepGEMM N dim restriction reduced from 128 to 64 multiplier (#28687)
  - DeepEP low-latency with round-robin expert placement (#28449)
  - NVFP4 MoE CUTLASS support for SM120 (#29242)
  - H200 Fused MoE Config improvements (#28992)
- AMD ROCm:
  - DeepSeek v3.2 and SparseMLA support (#26670)
  - FP8 MLA decode support (#28032)
  - AITER sampling ops integration (#26084)
  - AITER triton attention backend (#28701)
  - Bitsandbytes quantization on AMD GPUs with warp size 32 (#27307)
  - Fastsafetensors support (#28225)
  - Sliding window support for AiterFlashAttentionBackend (#29234)
  - Whisper v1 with Aiter Unified/Flash Attention (#28376)
- CPU:
  - Attention: FlashAttention ViT support, now the default backend (#28763).
  - Long Context: Optimized `gather_and_maybe_dequant_cache` kernel for extremely long sequences (#28029).
  - Multi-NUMA: Enhanced NUMA functionality for systems with multiple NUMA nodes per socket (#25559).
- Docker: Image size reduced by ~200MB (#29060).
Quantization
- W4A8: Marlin kernel support (#24722).
- NVFP4:
- AWQ: Compressed-tensors AWQ support for Turing GPUs (#29732).
- LoRA: FusedMoE LoRA Triton kernel for MXFP4 (#29708).
- Online quantization: Moved to `model.load_weights` (#26327).
API & Frontend
- Responses API:
- Tool Calling:
- Whisper: `verbose_json` and `timestamp` features for transcription/translation (#24209).
- Sampling: Flat logprob control moved from env var to `SamplingParams` (#28914).
- GGUF: Improved HuggingFace loading UX with `repo_id:quant_type` syntax (#29137).
- Profiling: Iteration-level profiling for Torch and CUDA profiler (#28987).
- Logs: Colorized log output (#29017).
- Optimization Levels: `-O0`, `-O1`, `-O2`, `-O3` allow trading startup time for performance; more compilation flags will be added in future releases (#26847).
Dependencies
- PyTorch 2.9.0 with CUDA 12.9 (#24994) - Breaking change requiring environment updates.
- xgrammar: Updated to 0.1.27 (#28221).
- Transformers: Updated to 4.57.3 (#29418), preparation for v5 with `rope_parameters` (#28542).
- XPU: torch & IPEX 2.9 upgrade (#29307).
V0 Deprecation & Breaking Changes
Removed Parameters:
Deprecated:
Scheduled Removals (will be removed in future release):
- `ParallelConfig`'s direct child EPLB fields (#29324)
- `guided_*` config fields (#29326)
- `override_pooler_config` and `disable_log_requests` (#29402)
- `CompilationConfig.use_inductor` (#29323)
- Deprecated metrics (#29330)
Other Breaking Changes:
- PyTorch 2.9.0 upgrade requires CUDA 12.9 environment
- Mistral format auto-detection for model loading (#28659)
New Contributors
- @jesse996 made their first contribution in #28846
- @Nepherpitou made their first contribution in #28960
- @Samoed made their first contribution in #27329
- @j20120307 made their first contribution in #28999
- @vnadathur made their first contribution in #26468
- @zhyajie made their first contribution in #28942
- @IzzyPutterman made their first contribution in #28896
- @rjrock-amd made their first contribution in #28905
- @zq1997 made their first contribution in #27715
- @shengliangxu made their first contribution in #28076
- @prashanth058 made their first contribution in #28972
- @qgallouedec made their first contribution in #28820
- @zhanggzh made their first contribution in #19347
- @pandalee99 made their first contribution in #26628
- @dsuhinin made their first contribution in #29100
- @xli made their first contribution in #29124
- @jeremyteboul made their first contribution in #29059
- @soodoshll made their first contribution in #28875
- @bhagyashrigai made their first contribution in #28957
- @skaraban3807 made their first contribution in #25559
- @Victor49152 made their first contribution in #28892
- @rjrock made their first contribution in #29205
- @FlintyLemming made their first contribution in #29182
- @madskildegaard made their first contribution in #29175
- @nandan2003 made their first contribution in #29189
- @michaelact made their first contribution in #29173
- @yongming-qin made their first contribution in #28958
- @joshiemoore made their first contribution in #29249
- @lim4349 made their first contribution in #29068
- @apinge made their first contribution in #28376
- @gbyu-amd made their first contribution in #28032
- @kflu made their first contribution in #29364
- @Inokinoki made their first contribution in #29200
- @GOavi101 made their first contribution in #29313
- @sts07142 made their first contribution in #29137
- @ivanium made their first contribution in #29143
- @geodavic...
v0.11.2
v0.11.1
Highlights
This release includes 1456 commits from 449 contributors (184 new contributors)!
Key changes include:
- PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
- Batch-invariant `torch.compile`: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
- Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP/DCP. We expect `--async-scheduling` to be enabled by default in the next release.
- Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
- Anthropic API Support: Added support for the `/v1/messages` endpoint, allowing users to interact with `vllm serve` using Anthropic-compatible clients.
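The new endpoint accepts Anthropic Messages API-style request bodies. A minimal sketch of such a payload, built with the standard library only (the model name and localhost URL are assumptions; any Anthropic-compatible client can send the same shape to `vllm serve`):

```python
import json

# Hedged sketch: an Anthropic Messages API-style request body for vLLM's
# /v1/messages endpoint. Model name and server URL are illustrative only.
payload = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "max_tokens": 128,
    "messages": [
        {"role": "user", "content": "Say hello in one sentence."},
    ],
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/messages with your HTTP client.
print(body)
```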
Detailed release notes will be updated in the next few days.
What's Changed
- [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition (@frankwang28 #25355)
- [Docs] Add Toronto Meetup (@mgoin #25773)
- [CI] Add E2E Blackwell Quantized MoE Test (@mgoin #25723)
- [V1] address post issues related to #20059 (part 1); cascade attention reenable by default (@fhl2000 #23046)
- [CI] Fix FlashInfer AOT in release docker image (@mgoin #25730)
- [spec decode] Consolidate speculative decode method name for MTP (@zixi-qi #25232)
- Reduce the Cuda Graph memory footprint when running with DBO (@SageMoore #25779)
- Kernel-override Determinism [1/n] (@bwasti #25603)
- [Bugfix] Optimize CpuGpuBuffer initialization (@namanlalitnyu #25447)
- [Spec decode] automatically disable mm for text-only draft models (@jmkuebler #25667)
- [Core] Don't count preempted tokens in prefix cache hit rate (@zhuohan123 #25787)
- Add option to restrict media domains (@russellb #25783)
- Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile (@mgoin #25782)
- [Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl (@david6666666 #22872)
- [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (@yewentao256 #25788)
- [CI/Build] Consolidate model loader tests and requirements (@DarkLight1337 #25765)
- [CI/Build] Add timing to Model Executor Test (@22quinn #25799)
- [CI/Build] Reorganize root-level V1 tests (@DarkLight1337 #25767)
- [Misc] Fix codeowners override for v1 sample and attention (@22quinn #25037)
- [Misc] Update openai client example file for multimodal (@ywang96 #25795)
- [Bugfix] Add missing `image_size` for phi4_multimodal (@Renovamen #25796)
- [Bugfix] Merge MM embeddings by index instead of token IDs (@DarkLight1337 #16229)
- Validate API tokens in constant time (@russellb #25781)
- Add filtering for chat template kwargs (@russellb #25794)
- Fix GPTQ model loading in Transformers backend (@hmellor #25770)
- [Bugfix] Fix triton import precommit failure (@tlrmchlsmth #25803)
- [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (@tlrmchlsmth #24982)
- [docs] Resolve transcriptions API TODO (@yyzxw #25446)
- [env] default nixl side port conflicts with kv-event zmq port (@panpan0000 #25056)
- [Core] Refactor self.model() to call a helper for subclassing. (@patrick-toulme #25084)
- [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (@ZJY0516 #25651)
- [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (@smarterclayton #25766)
- [Core] GC Debug callback (@Jialin #24829)
- [Bugfix][NIXL] Fix Async Scheduler timeout issue (@NickLucche #25808)
- [MM] Optimize memory profiling for scattered multimodal embeddings (@ywang96 #25810)
- [Bugfix] Fix Qwen3-VL regression from #24982 (@ywang96 #25814)
- [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (@Isotr0py #25557)
- Fix random dataset mismatched token length with config. (@weireweire #24937)
- Update GLM-4.5 Doc transformers version (@zRzRzRzRzRzRzR #25830)
- [Bugfix] fix Qwen3VLMoe load when pp > 1 (@JJJYmmm #25838)
- Remove redundant cudagraph dispatcher warning (@mgoin #25841)
- [Misc] fix tests failure by using current_platform (@kingsmad #25825)
- [P/D] NIXL Updates (@robertgshaw2-redhat #25844)
- Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (@tdoublep #25832)
- [XPU]Fix xpu spec decoding UTs, avoid using cuda graph (@jikunshang #25847)
- [Bugfix] Fallback ViT attn backend to SDPA for blackwell (@ywang96 #25851)
- [V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge (@Isotr0py #25331)
- [Misc] Remove more `get_input_embeddings_v0` (@DarkLight1337 #25857)
- update to latest deepgemm for dsv3.2 (@youkaichao #25871)
- [Bugfix] Fix requirements paths in install instructions (@yingjun-mou #25827)
- [Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models (@zhoukezi #25854)
- [torch.compile] serialize cudagraph_mode as its enum name instead of value (@ZJY0516 #25868)
- [Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector (@chenxi-yang #24690)
- [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (@rahul-tuli #25883)
- [CI/Build] Include Transformers backend test in nightly transformers test (@Isotr0py #25885)
- [Model] Remove MotifForCausalLM (@jeejeelee #25866)
- [Bugfix] Use correct key "ignore" for config.json non-quantized layers (@leejnau #25706)
- [BugFix][torch.compile] KV scale calculation issues with FP8 quantization (#21640) (@adabeyta #25513)
- [Doc] Add documentation for vLLM continuous benchmarking and profiling (@namanlalitnyu #25819)
- [Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so (@gshtras #25605)
- [Kernel] Chunk-aligned mamba2 (@tdoublep #24683)
- [Doc] Polish example for torchrun dp (@zhuohan123 #25899)
- [NIXL] Increase default KV block eviction timeout on P (@NickLucche #25897)
- [V0 Deprecation] Remove `vllm.worker` and update imports accordingly (@aarnphm #25901)
- Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models (@qthequartermasterman #25717)
- [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (@yewentao256 #25909)
- [Benchmark] Support benchmark throughput for external launcher DP (@zhuohan123 #25913)
- Move `VllmConfig` from `config/__init__.py` to `config/vllm.py` (@hmellor #25271)
- [BugFix] Fix DP/EP hang (@LucasWilkinson #25906)
- [BugFix] Pass config_format via try_get_generation_config (@acisseJZhong #25912)
- [Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect `logical_not` (@zhoukezi #25925)
- [Bugfix]: Clean up chunked prefill logging when using whisper (@simondanielsson #25075)
- [New Model] DeepSeek-V3.2 (Rebased to Main) (@zyongye #25896)
- [Doc] Add Cambricon MLU support (@a120092009 #25942)
- Updated TRL integration docs (@sergiopaniego #25684)
- [Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (@CSWYF3634076 #25936)
- [Model] Move `vision_feature_select_strategy` into `resolve_visual_encoder_outputs` (@DarkLight1337 #25938)
- [perf] Use CPU tensor to reduce GPU->CPU sync (@lhtin #25884)
- [NIXL] Add support for MLA caches with different latent dim (@NickLucche #25902)
- [CI] Move applicable tests to CPU (@rzabarazesh #24080)
- [Fix] Improve CPU backend compatibility for RISC-V (@ihb2032 #25816)
- [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 (@Josephasafg #25858)
- Add Hugging Face Inference Endpoints guide to Deployment docs (@sergiopaniego #25886)
- [Bugfix][Model] Fix inference for Hunyuan dense models (@Anionex #25354)
- [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (@pavanimajety #25895)
- [Bugfix] Token type and position embeddings fail to be applied to `inputs_embeds` (@DarkLight1337 #25922)
- [bugfix][deepseek] fix flashmla kernel selection (@youkaichao #25956)
- [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' (@yewentao256 #25958)
- [Doc] Improve MM Pooling model documentation (@DarkLight1337 #25966)
- [Docs] Add moe kernel features doc (@bnellnm #25297)
- OffloadingConnector: Fix GPU block tracking bug (@orozery #25856)
- [Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in `Llama4VisionRotaryEmbedding` (@cjackal #25889)
- [Bench] Add DeepSeekV32 to MoE benchmark (@jeejeelee #25962)
- [V1] [P/D] Add Support for KV Load Failure Recovery (@sdavidbd #19330)
- Add explicit pooling classes for the Transformers backend (@hmellor #25322)
- [Docs] Remove API Reference from search index (@hmellor #25949)
- [gpt-oss] use vLLM instead of openai types for streaming (@qandrew #25186)
- [Misc] Make EP kernels install script support uv (@LucasWilkinson #25785)
- [Model] MTP fallback to eager for DeepSeek v32 (@luccafong #25982)
- Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (@DrStone1971 #25843)
- [Log] Optimize Log for FP8MOE (@yewentao256 #25709)
- Fix INT8 quantization error on Blackwell GPUs (SM100+) (@certainly-param #25935)
- [MM] Add text-only mode for Qwen3-VL (@ywang96 #26000)
- [Bugfix] Fix `__syncwarp` on ROCm (@zhewenl #25996)
- [BugFix] Fix default kv-cache-dtype for DeepseekV3.2 (@LucasWilkinson #25988)
- Update to Transformers `v4.56.2` (@hmellor #24638)
- [Misc] Allow disabling pynccl (@luccafong #25421)
- [Doc] Update torch.compile doc link (#25989)
- [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker (@wwl2755 #26004)
- [Misc] Factor out common `_apply_feature_select_strategy` (@DarkLight1337 #26003)
- [CI] Only capture a single CUDA graph size in CI by default (@hmellor #25951)
- [MISC] Fix misleading batch_size_capture_lis...
v0.11.0
Highlights
This release features 538 commits, 207 contributors (65 new contributors)!
- This release completes the removal of the V0 engine. V0 engine code, including `AsyncLLMEngine`, `LLMEngine`, `MQLLMEngine`, all attention backends, and related components, has been removed. V1 is now the only engine in the codebase.
- This release turns on `FULL_AND_PIECEWISE` as the default CUDA graph mode. This should provide better out-of-the-box performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that only support `PIECEWISE` mode.
Note: In v0.11.0 (and v0.10.2), `--async-scheduling` can produce gibberish output in some cases, such as under preemption. This functionality is correct in v0.10.1, and we are actively fixing it for the next version.
Model Support
- New architectures: DeepSeek-V3.2-Exp (#25896), Qwen3-VL series (#24727), Qwen3-Next (#24526), OLMo3 (#24534), LongCat-Flash (#23991), Dots OCR (#24645), Ling2.0 (#24627), CWM (#25611).
- Encoders: RADIO encoder support (#24595), Transformers backend support for encoder-only models (#25174).
- Task expansion: BERT token classification/NER (#24872), multimodal models for pooling tasks (#24451).
- Data parallel for vision encoders: InternVL (#23909), Qwen2-VL (#25445), Qwen3-VL (#24955).
- Speculative decoding: EAGLE3 for MiniCPM3 (#24243) and GPT-OSS (#25246).
- Features: Qwen3-VL text-only mode (#26000), EVS video token pruning (#22980), Mamba2 TP+quantization (#24593), MRoPE + YaRN (#25384), Whisper on XPU (#25123), LongCat-Flash-Chat tool calling (#24083).
- Performance: GLM-4.1V 916ms TTFT reduction via fused RMSNorm (#24733), GLM-4 MoE SharedFusedMoE optimization (#24849), Qwen2.5-VL CUDA sync removal (#24741), Qwen3-VL Triton MRoPE kernel (#25055), FP8 checkpoints for Qwen3-Next (#25079).
- Reasoning: SeedOSS reason parser (#24263).
Engine Core
- KV cache offloading: CPU offloading with LRU management (#19848, #20075, #21448, #22595, #24251).
- V1 features: Prompt embeddings (#24278), sharded state loading (#25308), FlexAttention sliding window (#24089), LLM.apply_model (#18465).
- Hybrid allocator: Pipeline parallel (#23974), varying hidden sizes (#25101).
- Async scheduling: Uniprocessor executor support (#24219).
- Architecture: Tokenizer group removal (#24078), shared memory multimodal caching (#20452).
- Attention: Hybrid SSM/Attention in Triton (#21197), FlashAttention 3 for ViT (#24347).
- Performance: FlashInfer RoPE 2x speedup (#21126), fused Q/K RoPE 11% improvement (#24511, #25005), 8x spec decode overhead reduction (#24986), FlashInfer spec decode with 1.14x speedup (#25196), model info caching (#23558), inputs_embeds copy avoidance (#25739).
- LoRA: Optimized weight loading (#25403).
- Defaults: CUDA graph mode FULL_AND_PIECEWISE (#25444), Inductor standalone compile disabled (#25391).
- torch.compile: CUDA graph Inductor partition integration (#24281).
Hardware & Performance
- NVIDIA: FP8 FlashInfer MLA decode (#24705), BF16 fused MoE for Hopper/Blackwell expert parallel (#25503).
- DeepGEMM: Enabled by default (#24462), 5.5% throughput improvement (#24783).
- New architectures: RISC-V 64-bit (#22112), ARM non-x86 CPU (#25166), ARM 4-bit fused MoE (#23809).
- AMD: ROCm 7.0 (#25178), GLM-4.5 MI300X tuning (#25703).
- Intel XPU: MoE DP accuracy fix (#25465).
Large Scale Serving & Performance
- Dual-Batch Overlap (DBO): Overlapping computation mechanism (#23693), DeepEP high throughput + prefill (#24845).
- Data Parallelism: torchrun launcher (#24899), Ray placement groups (#25026), Triton DP/EP kernels (#24588).
- EPLB: Hunyuan V1 (#23078), Mixtral (#22842), static placement (#23745), reduced overhead (#24573).
- Disaggregated serving: KV transfer metrics (#22188), NIXL MLA latent dimension (#25902).
- MoE: Shared expert overlap optimization (#24254), SiLU kernel for DeepSeek-R1 (#24054), Enable Allgather/ReduceScatter backend for NaiveAllToAll (#23964).
- Distributed: NCCL symmetric memory with 3-4% throughput improvement (#24532), enabled by default for TP (#25070).
Quantization
- FP8: Per-token-group quantization (#24342), hardware-accelerated instructions (#24757), torch.compile KV cache (#22758), paged attention update (#22222).
- FP4: NVFP4 for dense models (#25609), Gemma3 (#22771), Llama 3.1 405B (#25135).
- W4A8: Faster preprocessing (#23972).
- Compressed tensors: Blocked FP8 for MoE (#25219).
API & Frontend
- OpenAI: Prompt logprobs for all tokens (#24956), logprobs=-1 for full vocab (#25031), reasoning streaming events (#24938), Responses API MCP tools (#24628, #24985), health 503 on dead engine (#24897).
- Multimodal: Media UUID caching (#23950), image path format (#25081).
- Tool calling: XML parser for Qwen3-Coder (#25028), Hermes-style tokens (#25281).
- CLI: --enable-logging (#25610), improved --help (#24903).
- Config: Speculative model engine args (#25250), env validation (#24761), NVTX profiling (#25501), guided decoding backward compatibility (#25615, #25422).
- Metrics: V1 TPOT histogram (#24015), hidden deprecated gpu_ metrics (#24245), KV cache GiB units (#25204, #25479).
- UX: Removed misleading quantization warning (#25012).
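Several of the OpenAI-layer changes above are request-shape changes; for instance, `logprobs=-1` for full-vocabulary logprobs is just one field in the request body. A hedged sketch, assuming the completions-style integer `logprobs` field (model name and endpoint are illustrative; see #25031 for the authoritative behavior):

```python
import json

# Hedged sketch: a completions-style request asking for logprobs over the
# full vocabulary via logprobs=-1 (per #25031). Model name is a placeholder,
# and the exact parameter semantics are defined by the PR, not this sketch.
request = {
    "model": "my-org/my-model",
    "prompt": "Hello",
    "max_tokens": 8,
    "logprobs": -1,  # -1 => return logprobs for the full vocabulary
}
# POST this body to the server's /v1/completions endpoint.
print(json.dumps(request))
```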
Security
Dependencies
- PyTorch 2.8 for CPU (#25652), FlashInfer 0.3.1 (#24470), CUDA 13 (#24599), ROCm 7.0 (#25178).
- Build requirements: C++17 now enforced globally (#24823).
- TPU: Deprecated `xm.mark_step` in favor of `torch_xla.sync` (#25254).
V0 Deprecation
- Engines: AsyncLLMEngine (#25025), LLMEngine (#25033), MQLLMEngine (#25019), core (#25321), model runner (#25328), MP executor (#25329).
- Components: Attention backends (#25351), encoder-decoder (#24907), output processor (#25320), sampling metadata (#25345), Sequence/Sampler (#25332).
- Interfaces: LoRA (#25686), async output processor (#25334), MultiModalPlaceholderMap (#25366), seq group methods (#25330), placeholder attention (#25510), input embeddings (#25242), multimodal registry (#25362), max_seq_len_to_capture (#25543), attention classes (#25541), hybrid models (#25400), backend suffixes (#25489), compilation fallbacks (#25675), default args (#25409).
What's Changed
- [Qwen3-Next] MoE configs for H20 TP=1,2,4,8 by @jeejeelee in #24707
- [DOCs] Update ROCm installation docs section by @gshtras in #24691
- Enable conversion of multimodal models to pooling tasks by @maxdebayser in #24451
- Fix implementation divergence for BLOOM models between vLLM and HuggingFace when using prompt embeds by @qthequartermasterman in #24686
- [Bugfix] Fix MRoPE dispatch on CPU by @bigPYJ1151 in #24712
- [BugFix] Fix Qwen3-Next PP by @njhill in #24709
- [CI] Fix flaky test v1/worker/test_gpu_model_runner.py::test_kv_cache_stride_order by @heheda12345 in #24640
- [CI] Add ci_envs for convenient local testing by @noooop in #24630
- [CI/Build] Skip prompt embeddings tests on V1-only CPU backend by @bigPYJ1151 in #24721
- [Misc][gpt-oss] Add gpt-oss label to PRs that mention harmony or related to builtin tool call by @heheda12345 in #24717
- [Bugfix] Fix BNB name match by @jeejeelee in #24735
- [Kernel] [CPU] refactor
cpu_attn.py:_run_sdpa_forwardfor better memory access by @ignaciosica in #24701 - [sleep mode] save memory for on-the-fly quantization by @youkaichao in #24731
- [Multi Modal] Add FA3 in VIT by @wwl2755 in #24347
- [Multimodal] Remove legacy multimodal fields in favor of MultiModalFeatureSpec by @sfeng33 in #24548
- [Doc]: fix typos in various files by @didier-durand in #24726
- [Docs] Fix warnings in mkdocs build (continued) by @Zerohertz in #24740
- [Bugfix] Fix MRoPE dispatch on XPU by @yma11 in #24724
- [Qwen3-Next] MoE configs for H100 TP=1,2 and TP2/EP by @elvircrn in #24739
- [Core] Shared memory based object store for Multimodal data caching and IPC by @dongluw in #20452
- [Bugfix][Frontend] Fix
--enable-log-outputsdoes not match the documentation by @kebe7jun in #24626 - [Models] Optimise and simplify
_validate_and_reshape_mm_tensorby @lgeiger in #24742 - [Models] Prevent CUDA sync in Qwen2.5-VL by @lgeiger in #24741
- [Model] Switch to Fused RMSNorm in GLM-4.1V model by @SamitHuang in #24733
- [UX] Remove AsyncLLM torch profiler disabled log by @mgoin in #24609
- [CI] Speed up model unit tests in CI by @afeldman-nm in #24253
- [Bugfix] Fix incompatibility between #20452 and #24548 by @DarkLight1337 in #24754
- [CI] Trigger BC Linter when labels are added/removed by @zhewenl in #24767
- [Benchmark] Allow arbitrary headers to be passed to benchmarked endpoints by @smarterclayton in #23937
- [Compilation Bug] Fix Inductor Graph Output with Shape Issue by...