
v0.11.1


@khluu released this on 18 Nov 23:03 · 268 commits to main since this release · commit 4393684

Highlights

This release includes 1456 commits from 449 contributors (184 new contributors)!

Key changes include:

  • PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to torch==2.9.0+cu129, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
  • Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
  • Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect --async-scheduling to be enabled by default in the next release.
  • Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
  • Anthropic API Support: Added support for the /v1/messages endpoint, allowing users to interact with vllm serve using Anthropic-compatible clients (a minimal usage sketch follows this list).
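
As a quick, hedged illustration of two of the highlights above: the sketch below assumes a server launched with something like `vllm serve <model> --async-scheduling` (the flag remains opt-in in this release) and then talks to the new /v1/messages endpoint through the `anthropic` Python client. The model name, port, and API key are placeholders, not values prescribed by this release.

```python
# Minimal sketch: query a vLLM server through the Anthropic-compatible
# /v1/messages endpoint added in this release.
# Assumes the server was started with something like:
#   vllm serve <your-model> --async-scheduling
# and that the `anthropic` Python package is installed locally.
import anthropic

client = anthropic.Anthropic(
    base_url="http://localhost:8000",  # default vLLM server address; adjust to your deployment
    api_key="EMPTY",                   # placeholder; vLLM does not require a real Anthropic key
)

response = client.messages.create(
    model="your-model",                # placeholder: whatever model `vllm serve` was launched with
    max_tokens=128,
    messages=[{"role": "user", "content": "Hello from an Anthropic-compatible client!"}],
)
print(response.content[0].text)
```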

Detailed release notes will be updated in the next few days.

What's Changed

  • [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition (@frankwang28 #25355)
  • [Docs] Add Toronto Meetup (@mgoin #25773)
  • [CI] Add E2E Blackwell Quantized MoE Test (@mgoin #25723)
  • [V1] address post issues related to #20059 (part 1); cascade attention reenable by default (@fhl2000 #23046)
  • [CI] Fix FlashInfer AOT in release docker image (@mgoin #25730)
  • [spec decode] Consolidate speculative decode method name for MTP (@zixi-qi #25232)
  • Reduce the Cuda Graph memory footprint when running with DBO (@SageMoore #25779)
  • Kernel-override Determinism [1/n] (@bwasti #25603)
  • [Bugfix] Optimize CpuGpuBuffer initialization (@namanlalitnyu #25447)
  • [Spec decode] automatically disable mm for text-only draft models (@jmkuebler #25667)
  • [Core] Don't count preempted tokens in prefix cache hit rate (@zhuohan123 #25787)
  • Add option to restrict media domains (@russellb #25783)
  • Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile (@mgoin #25782)
  • [Multimodal][Speculative Decoding] Eagle/Eagle3 mm support, enablement on qwen2.5vl (@david6666666 #22872)
  • [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (@yewentao256 #25788)
  • [CI/Build] Consolidate model loader tests and requirements (@DarkLight1337 #25765)
  • [CI/Build] Add timing to Model Executor Test (@22quinn #25799)
  • [CI/Build] Reorganize root-level V1 tests (@DarkLight1337 #25767)
  • [Misc] Fix codeowners override for v1 sample and attention (@22quinn #25037)
  • [Misc] Update openai client example file for multimodal (@ywang96 #25795)
  • [Bugfix] Add missing image_size for phi4_multimodal (@Renovamen #25796)
  • [Bugfix] Merge MM embeddings by index instead of token IDs (@DarkLight1337 #16229)
  • Validate API tokens in constant time (@russellb #25781)
  • Add filtering for chat template kwargs (@russellb #25794)
  • Fix GPTQ model loading in Transformers backend (@hmellor #25770)
  • [Bugfix] Fix triton import precommit failure (@tlrmchlsmth #25803)
  • [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (@tlrmchlsmth #24982)
  • [docs] Resolve transcriptions API TODO (@yyzxw #25446)
  • [env] default nixl side port conflicts with kv-event zmq port (@panpan0000 #25056)
  • [Core] Refactor self.model() to call a helper for subclassing. (@patrick-toulme #25084)
  • [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (@ZJY0516 #25651)
  • [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (@smarterclayton #25766)
  • [Core] GC Debug callback (@Jialin #24829)
  • [Bugfix][NIXL] Fix Async Scheduler timeout issue (@NickLucche #25808)
  • [MM] Optimize memory profiling for scattered multimodal embeddings (@ywang96 #25810)
  • [Bugfix] Fix Qwen3-VL regression from #24982 (@ywang96 #25814)
  • [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (@Isotr0py #25557)
  • Fix random dataset mismatched token length with config. (@weireweire #24937)
  • Update GLM-4.5 Doc transformers version (@zRzRzRzRzRzRzR #25830)
  • [Bugfix] fix Qwen3VLMoe load when pp > 1 (@JJJYmmm #25838)
  • Remove redundant cudagraph dispatcher warning (@mgoin #25841)
  • [Misc] fix tests failure by using current_platform (@kingsmad #25825)
  • [P/D] NIXL Updates (@robertgshaw2-redhat #25844)
  • Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (@tdoublep #25832)
  • [XPU]Fix xpu spec decoding UTs, avoid using cuda graph (@jikunshang #25847)
  • [Bugfix] Fallback ViT attn backend to SDPA for blackwell (@ywang96 #25851)
  • [V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge (@Isotr0py #25331)
  • [Misc] Remove more get_input_embeddings_v0 (@DarkLight1337 #25857)
  • update to latest deepgemm for dsv3.2 (@youkaichao #25871)
  • [Bugfix] Fix requirements paths in install instructions (@yingjun-mou #25827)
  • [Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models (@zhoukezi #25854)
  • [torch.compile] serialize cudagraph_mode as its enum name instead of value (@ZJY0516 #25868)
  • [Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector (@chenxi-yang #24690)
  • [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (@rahul-tuli #25883)
  • [CI/Build] Include Transformers backend test in nightly transformers test (@Isotr0py #25885)
  • [Model] Remove MotifForCausalLM (@jeejeelee #25866)
  • [Bugfix] Use correct key "ignore" for config.json non-quantized layers (@leejnau #25706)
  • [BugFix][torch.compile] KV scale calculation issues with FP8 quantization (#21640) (@adabeyta #25513)
  • [Doc] Add documentation for vLLM continuous benchmarking and profiling (@namanlalitnyu #25819)
  • [Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so (@gshtras #25605)
  • [Kernel] Chunk-aligned mamba2 (@tdoublep #24683)
  • [Doc] Polish example for torchrun dp (@zhuohan123 #25899)
  • [NIXL] Increase default KV block eviction timeout on P (@NickLucche #25897)
  • [V0 Deprecation] Remove vllm.worker and update according imports (@aarnphm #25901)
  • Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models (@qthequartermasterman #25717)
  • [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (@yewentao256 #25909)
  • [Benchmark] Support benchmark throughput for external launcher DP (@zhuohan123 #25913)
  • Move VllmConfig from config/__init__.py to config/vllm.py (@hmellor #25271)
  • [BugFix] Fix DP/EP hang (@LucasWilkinson #25906)
  • [BugFix] Pass config_format via try_get_generation_config (@acisseJZhong #25912)
  • [Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect logical_not (@zhoukezi #25925)
  • [Bugfix]: Clean up chunked prefill logging when using whisper (@simondanielsson #25075)
  • [New Model] DeepSeek-V3.2 (Rebased to Main) (@zyongye #25896)
  • [Doc] Add Cambricon MLU support (@a120092009 #25942)
  • Updated TRL integration docs (@sergiopaniego #25684)
  • [Bugfix][Model] fix ernie45 moe gate&bias dtype to float32 (@CSWYF3634076 #25936)
  • [Model] Move vision_feature_select_strategy into resolve_visual_encoder_outputs (@DarkLight1337 #25938)
  • [perf] Use CPU tensor to reduce GPU->CPU sync (@lhtin #25884)
  • [NIXL] Add support for MLA caches with different latent dim (@NickLucche #25902)
  • [CI] Move applicable tests to CPU (@rzabarazesh #24080)
  • [Fix] Improve CPU backend compatibility for RISC-V (@ihb2032 #25816)
  • [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 (@Josephasafg #25858)
  • Add Hugging Face Inference Endpoints guide to Deployment docs (@sergiopaniego #25886)
  • [Bugfix][Model] Fix inference for Hunyuan dense models (@Anionex #25354)
  • [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (@pavanimajety #25895)
  • [Bugfix] Token type and position embeddings fail to be applied to inputs_embeds (@DarkLight1337 #25922)
  • [bugfix][deepseek] fix flashmla kernel selection (@youkaichao #25956)
  • [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' (@yewentao256 #25958)
  • [Doc] Improve MM Pooling model documentation (@DarkLight1337 #25966)
  • [Docs] Add moe kernel features doc (@bnellnm #25297)
  • OffloadingConnector: Fix GPU block tracking bug (@orozery #25856)
  • [Llama4] [multimodal] Fix misplaced dtype cast of cos_sin_cache in Llama4VisionRotaryEmbedding (@cjackal #25889)
  • [Bench] Add DeepSeekV32 to MoE benchmark (@jeejeelee #25962)
  • [V1] [P/D] Add Support for KV Load Failure Recovery (@sdavidbd #19330)
  • Add explicit pooling classes for the Transformers backend (@hmellor #25322)
  • [Docs] Remove API Reference from search index (@hmellor #25949)
  • [gpt-oss] use vLLM instead of openai types for streaming (@qandrew #25186)
  • [Misc] Make EP kernels install script support uv (@LucasWilkinson #25785)
  • [Model] MTP fallback to eager for DeepSeek v32 (@luccafong #25982)
  • Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (@DrStone1971 #25843)
  • [Log] Optimize Log for FP8MOE (@yewentao256 #25709)
  • Fix INT8 quantization error on Blackwell GPUs (SM100+) (@certainly-param #25935)
  • [MM] Add text-only mode for Qwen3-VL (@ywang96 #26000)
  • [Bugfix] Fix __syncwarp on ROCM (@zhewenl #25996)
  • [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (@LucasWilkinson #25988)
  • Update to Transformers v4.56.2 (@hmellor #24638)
  • [Misc]allow disable pynccl (@luccafong #25421)
  • [Doc] updating torch.compile doc link (#25989)
  • [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker (@wwl2755 #26004)
  • [Misc] Factor out common _apply_feature_select_strategy (@DarkLight1337 #26003)
  • [CI] Only capture a single CUDA graph size in CI by default (@hmellor #25951)
  • [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 (@billishyahao #25829)
  • [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type (@natoscott #26007)
  • [Bugfix] Apply same sampling parameters for both n=1 and n>1 (@kmaehashi #26005)
  • [NVIDIA] Blackwell Family (@johnnynunez #24673)
  • Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets (@hl475 #25995)
  • [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (@mgoin #26030)
  • [BugFix][DP/EP] Fix CUTLASS MLA hang under load (@LucasWilkinson #26026)
  • [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (@hyoon1 #25908)
  • [Bug] Fix Negative Cuda Memory Usage (@yewentao256 #25683)
  • [BugFix] ChunkedLocalAttention is currently not CG compatible (@LucasWilkinson #26034)
  • Support RL online quantization with torchao (@jerryzh168 #23014)
  • [ROCm][Bugfix] Add missing parameter to ROCm backend (@gshtras #26029)
  • [Misc] Make handling of SamplingParams clearer in n>1 case (@njhill #26032)
  • Run:ai model streamer add GCS package support (@pwschuurman #24909)
  • Update base image to 22.04 (jammy) (@huydhn #26065)
  • Change size of single CUDA graph for CI to 4 (@tdoublep #26089)
  • [FA/Chore] Bump vllm-flash-attention (@LucasWilkinson #25537)
  • [Model] Use merge_by_field_config for MM models (A-C) (@DarkLight1337 #26073)
  • [Model] Use merge_by_field_config for MM models (D-F) (@DarkLight1337 #26076)
  • [Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU (@leo-pony #25470)
  • [Qwen][ROCm] Flash Attention Rotary Embeddings (@vllmellm #24642)
  • [CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (@mgoin #26040)
  • [CI/Build] Replace vllm.entrypoints.openai.api_server entrypoint with vllm serve command (@DarkLight1337 #25967)
  • [BugFix] Fix FI accuracy issue when used for MLA prefill (@LucasWilkinson #26063)
  • [Small] Prevent bypassing media domain restriction via HTTP redirects (@huachenheli #26035)
  • [Deepseek v3.2] Support indexer prefill chunking (@heheda12345 #25999)
  • EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench (@ekagra-ranjan #25916)
  • [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP (@heheda12345 #25119)
  • [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (@ElizaWszola #25696)
  • Fix MTP with deepep_low_latency (@MatthewBonanni #25904)
  • [Bugfix] Disable cascade attention with FlashInfer (@mgoin #26130)
  • [Log] Optimize DeepGEMM Missing Log (@yewentao256 #26106)
  • [Bug][Benchmark] Fix duplicate req in oversampling (@ekagra-ranjan #26140)
  • [Attention] Move Backend enum into registry (@MatthewBonanni #25893)
  • [CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (@mgoin #26138)
  • [DeepSeek] Improve performance of DS MLA cache kernel (@MatthewBonanni #26132)
  • [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (@benchislett #26144)
  • [gpt-oss] disable tool server initialization if no tool in request (@qandrew #25790)
  • [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (@tlrmchlsmth #26103)
  • [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (@tjtanaa #26104)
  • [Bugfix] Fix import gemm_afp4wfp4 failure on AMD (@zhewenl #26068)
  • [Model] Use merge_by_field_config for MM models (G) (@DarkLight1337 #26117)
  • FusedMoE support for the Transformers backend (@hmellor #22650)
  • [BUG] Reorder model config creation (@ahao-anyscale #26124)
  • [Misc] Remove typing.List (@varun-sundar-rabindranath #26150)
  • [Input] Remove unused prompt field (@DarkLight1337 #26097)
  • [Perf] Optimize reshape_and_cache CUDA Kernel (@ZJY0516 #25955)
  • add(v1): RequestStatesStats to RequestOutput (@huijjj #24947)
  • [Model] Use merge_by_field_config for MM models (InternVL family) (@DarkLight1337 #26153)
  • [test utils] correct wrong typing (@yannicks1 #26159)
  • [CI] Fix distributed hybrid tests in CI (@tdoublep #26155)
  • [NIXL][Misc] Expose metrics from NIXL for logging to CLI (@NickLucche #25388)
  • [openai] Fix missing tool usage check (system message) (@levunet #24768)
  • [Multi Modal] Configurable MM Profiling (@wwl2755 #25631)
  • [Doc] Fixed shape description for fused_batched_moe.py (@Egor-Krivov #25668)
  • Quick fix for IMA with the Prefix Prefill kernel during graph capture (@SageMoore #25983)
  • [Renderer] Move Processor out of AsyncLLM (@KKSK-DON #24138)
  • Re-enable prefill of max model length (@yannicks1 #24446)
  • [backends][short_conv] CUDA graph piecewise edits (@paulpak58 #24215)
  • [Model] Supplement to PR 24862: Pass param prefix to LLMHead (@whx-sjtu #25805)
  • [CI/Build] do not enforce precompilation on tpu ci tests (@sixiang-google #25992)
  • [Model] Fixed stream generator for gpt-oss + spec-decoding (@astralord #26027)
  • [Renderer] Move Processor out of LLMEngine (@DarkLight1337 #26165)
  • Fix undefined symbol: cutlass_moe_mm_sm100 (@jasl #26098)
  • [BugFix][QWEN-VL] fix wrong apply_rotary_emb_torch selection introduced by #24642 (@xuechendi #26123)
  • Stop mergify from keeping stale PRs alive (@hmellor #26169)
  • Avoid division by zero in cache DS MLA kernel (@MatthewBonanni #26174)
  • Fix V1 engine serialization error with Ray distributed executor (@nrghosh #26148)
  • [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn (@pavanimajety #25968)
  • [Perf] Remove hardcoded num_warps=1 (@chelsea0x3b #26183)
  • [Refactor] Optimize FP8 MOE Backend Choice and Log (@yewentao256 #26044)
  • [responsesAPI] add better error messaging for long prompts (@qandrew #25724)
  • [Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.model' (@BowenBao #25964)
  • [CI] Push multiarch manifests as nightly builds (@csahithi #25764)
  • [Misc] Add penalties sampling parameters to serve tool (@southfreebird #25974)
  • [BugFix] Fix de-functionalization pass for rotary_embedding (@angelayi #23953)
  • [CI] Fix Pre-commit Mypy Error (@yewentao256 #26181)
  • [GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (@varun-sundar-rabindranath #25488)
  • Fix issue of using only the part of video frame [Nemotron Nano] (@BloodAxe #26186)
  • [Bugfix] Fix qwen3 vl dummy data generation with overrides (@ywang96 #26193)
  • [BugFix] Use async Mistral Tokenizer in Chat Completions (@bbrowning #26134)
  • Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #25769)
  • [cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (@fadara01 #25948)
  • [V1] [Hybrid] Mamba2 Automatic Prefix Caching (@s3woz #25752)
  • Support expert parallel in Transformers backend (@hmellor #26162)
  • [Model] Support nested structures for TensorSchema (@DarkLight1337 #26212)
  • [Misc] Require merge_by_field_config argument (@DarkLight1337 #26214)
  • [Misc] Remove unused executor.apply_model (@DarkLight1337 #26215)
  • [CI Failure] fix_test_auto_prefix_cache_support (@hl475 #26053)
  • Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" (@DarkLight1337 #26220)
  • Add Olmo 3 reasoning parser (@soldni #26054)
  • [Core] Enable decode of context length equal to max model length (@yannicks1 #26168)
  • [Bugfix] Fix _reqs_to_process leak on abort (@NickLucche #26012)
  • [Model] CLIP Embedding Support (@DarkLight1337 #26010)
  • Fix tensor device and dtype placement in Qwen2VL model (@yuafng #26219)
  • [V1] [Hybrid] Remove code to override default CUDA graph configuration (@tdoublep #26226)
  • [CPU] Refine batch reorder of CPU attention backend (@bigPYJ1151 #26096)
  • [Frontend] Cache chat template kwargs resolution (@Isotr0py #26227)
  • [Renderer] Clean up renderer code (@DarkLight1337 #26216)
  • [Model] Use merge_by_field_config for MM models (H-L) (@DarkLight1337 #26230)
  • [Easy] Add str repr for IterationStats (@22quinn #26232)
  • [Bugfix] Allow --skip-tokenizer-init with echo and return_token_ids (@DarkLight1337 #26238)
  • Add documentation for granite 4 tool calling (@maxdebayser #26175)
  • [Perf][Easy] Early stop in request_block_hasher (@Jialin #26112)
  • [Bugfix]: Assertion error when using FlashInfer backend (@simondanielsson #25933)
  • [Bugfix] Always apply MM processor even when no MM items are passed (@DarkLight1337 #26240)
  • [Bugfix][Hardware][RISC-V] Limit supported dtypes to float32 to avoid scheduler segfault (@ihb2032 #26228)
  • [Refactor][Kernel] support loading kernel from other place (@ILikeIneine #25823)
  • Convert formatting to use ruff instead of yapf + isort (@hmellor #26247)
  • Remove all references to yapf as it's no longer used (@hmellor #26251)
  • Remove all cases of fmt: on/off (@hmellor #26253)
  • fix(tests): Resolve late binding of loop variable in assert message lambda (@ihb2032 #26249)
  • Fix per file ruff ignores related to typing (@hmellor #26254)
  • Update ruff pre-commit hooks version (@hmellor #26255)
  • [CI] fix mamba kernel test (@ZJY0516 #26250)
  • [NVIDIA] flashinfer TRTLLM attention prefill token limit (@jasonlizhengjian #25998)
  • Fix per file ruff ignores related to simplification (@hmellor #26259)
  • [CI] Add Blackwell LM Eval Small Models test to nightly (@mgoin #26052)
  • [DOC] Update production-stack.md (@elieserr #26177)
  • [CI] Add comment about the single cudagraph capture size that is used (@tdoublep #26252)
  • [V1] [Hybrid] Some additional clean-up in Mamba2 prefix caching (@tdoublep #26222)
  • [Doc] Edited minor typo (@orangeng #26266)
  • [MISC] Add heheda12345 to CODEOWNERS of vllm/config/cache.py (@heheda12345 #26270)
  • [CI][gpt-oss] Enable python tool tests in CI (@wuhang2014 #24315)
  • Fix per file ruff ignores related to line length (@hmellor #26262)
  • Bump actions/stale from 10.0.0 to 10.1.0 (@dependabot[bot] #26272)
  • [Benchmarking] Add disable_shuffle option for dataset loading (@ymoslem #26258)
  • [Misc] Clean up unnecessary E501 ignore (@ywang96 #26274)
  • [Docs] Edit HF Inference Endpoints documentation (@ariG23498 #26275)
  • [Doc] add KAITO to integrations (@abhisheksheth28 #25521)
  • [Frontend] Consolidate tokenizer init code (@DarkLight1337 #26276)
  • [Model] Use merge_by_field_config for MM models (Llava family) (@DarkLight1337 #26280)
  • Support expert parallel load balancing in Transformers backend (@hmellor #26287)
  • [Bugfix] Fix mrope in Transformers Backend (@zucchini-nlp #26087)
  • Fix DotsOCR tensor type (@what-in-the-nim #26281)
  • [Model] EVS support for nano_nemotron_vl (@tomeras91 #26269)
  • [Attention] Remove unused reorder_batch method (@MatthewBonanni #24463)
  • [Tests] conftest: Extending VllmRunner and HfRunner to accept token_ids as input (@yannicks1 #26295)
  • [CI Bugfix] Make sure TRTLLM attention is available in test_blackwell_moe (@mgoin #26188)
  • Support llama3 eagle3 head with llama4 verifier (@rahul-tuli #25961)
  • [Misc] auto_tune: kill specific vllm process (@karan #26304)
  • [Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs (@seven-mile #26231)
  • Add bias handling to CPUFusedMOE kernel (@cfRod #26289)
  • [Bugfix] Fix gemma3 with transformers backend (@zucchini-nlp #23178)
  • [Benchmark] Enable MM Embedding benchmarks (@DarkLight1337 #26310)
  • [Docs] Fix broken table in moe_kernel_features doc (@varun-sundar-rabindranath #26314)
  • [BugFix] Pad input buffers in _dummy_run (@varun-sundar-rabindranath #26209)
  • [Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) (@benchislett #25987)
  • [ROCm] Split AITER unified attention into its own backend (@gshtras #25507)
  • [Perf] Add decode full-graph support to FlashInfer-MLA backend (@benchislett #26313)
  • [Misc] Define EP kernel arch list in Dockerfile (@simon-mo #25635)
  • [Docs][DBO] Add initial doc that describes the DBO implementation (@SageMoore #26024)
  • [Core] Simplify the Dp padding/should ubatch coordination logic (@SageMoore #25768)
  • [UX] Support nested dicts in hf_overrides (@mgoin #25727)
  • [BUG] Fix file parsing for load_format runai_streamer_sharded (@ahao-anyscale #26324)
  • [Model] Define merge_by_field_config MM interface (U-Z) (@ayushsatyam146 #26261)
  • [Deprecation] Deprecate LLM.set_tokenizer (@DarkLight1337 #26333)
  • [responsesAPI][bugfix] serialize harmony messages (@qandrew #26185)
  • [Model] Define merge_by_field_config MM interface (R-T) (@ayushsatyam146 #26260)
  • [BugFix] Update KV block hash type from BlockHash to ExternalBlockHash in kv_events_subscriber - #26264 (@atalhens #26265)
  • [V0 Deprecation] Remove VLLM_USE_V1 from docs and scripts (@DarkLight1337 #26336)
  • Optimize KV cache distribution for asymmetric pipeline parallelism (@gholmes829 #25164)
  • Add topk logits torch op for DS3.2. (@dcampora #25945)
  • Add TRL example notebook to RLHF docs (@sergiopaniego #26346)
  • [Docs] add docs for cuda graph v1 (@fhl2000 #24374)
  • [Model] Use merge_by_field_config for MM models (Ovis family) (@Isotr0py #26308)
  • [Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4 (@fxmarty-amd #21166)
  • [Model] Add support for ModernBertForTokenClassification (@antrec #26340)
  • [Misc] Move LRUCache into its own file (@DarkLight1337 #26342)
  • [V0 Deprecation] Remove VLLM_USE_V1 from tests (@DarkLight1337 #26341)
  • [Model] Lfm2Moe (@paulpak58 #26344)
  • [ci] Rename test_mxfp4_moe.py to test_ocp_mx_moe.py (@fxmarty-amd #26364)
  • [CI] Add Qwen3 MoE NVFP4 to Blackwell lm-eval (@mgoin #26316)
  • [deepseek] add EP8 FusedMOE config for H200 and B200 (@heheda12345 #26331)
  • [Bug] Fix Shape Validation for Fallback while Enabling E8M0 for DeepGEMM (@yewentao256 #26322)
  • [Bugfix] Add missing sink tensor into flash attn cascade attn implementation (@plliao #26325)
  • [Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26113)
  • [V1] Logit processors for rejection sampler (@southfreebird #19482)
  • [Spec Decode] Enable efficient speculative decoding with FlashInfer-MLA (@benchislett #25984)
  • [TPU] update TPU benchmark threshold (@jcyang43 #25713)
  • Add more libraries to rlhf.md (@mgoin #26374)
  • [Bugfix] Fix MTP+FlashInfer crash when trtllm kernels are available but disabled (@benchislett #26361)
  • Revert #24446 and #26168 (@tdoublep #26332)
  • [Misc] Clean up cruft from previous FlashMLA sparse implementation (@LucasWilkinson #26125)
  • [torchao] safetensors integration (@liangel-02 #25969)
  • Add SwigluOAI implementation for CPUFusedMOE (@isharif168 #26347)
  • [Core] Simplify setting new_token_ids in CachedRequestData (@njhill #26388)
  • fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (@ayushsatyam146 #23485)
  • Add gather_indexer_k_quant_cache kernel (@Barry-Delaney #25931)
  • [Bugfix] Incorrect MM data format in vllm bench throughput (@DarkLight1337 #26395)
  • fix[DP][v1]: Prevent hangs from mismatched worker configurations (@ayushsatyam146 #26218)
  • [TPU] Rename tpu_commons to tpu_inference (@utkarshsharma1 #26279)
  • [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement (@yewentao256 #26197)
  • [Misc] add usedforsecurity=False in md5 hash call (@dtrifiro #26357)
  • [Model] Allow passing custom number of max tiles to Nano 2 VL (@BloodAxe #26403)
  • [Docs] Have mergify leave a comment with the docs preview link (@hmellor #26412)
  • [CI] Pooling models mteb test disable enforce_eager (@noooop #26408)
  • [Benchmarks] Add support for Qwen 3 VL MoE tuning (@lgeiger #26419)
  • Tidy vllm/config/__init__.py to only add classes and functions (@hmellor #26405)
  • [NIXL][non-cuda] Add install script for nixl with non-cuda ucx (@xuechendi #25959)
  • [Refactor] Refactor FP8 & INT8 Quant Folder inside w8a8 (@yewentao256 #25293)
  • [CI Failure] Fix pre-commit issue for install_nixl_from_source_ubuntu.py (@mgoin #26424)
  • [Bugfix] Fix vllm bench ... on CPU-only head nodes (@Aydin-ab #25283)
  • [Bug] Fix DeepGEMM Attention Test (@yewentao256 #26423)
  • [Benchmarks] Fix imports in FP8 tuning script (@lgeiger #26407)
  • [Bug] Fix Test in Batch Invariant (@yewentao256 #26128)
  • Remove Python 3.9 support ahead of PyTorch 2.9 in v0.11.1 (@hmellor #26416)
  • [Feature] Change cache.py with pydantic validation (@vrdn-23 #26390)
  • [Attention] Implement universal BACKEND_MAP (@MatthewBonanni #25900)
  • [Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters (@elvischenv #25924)
  • [BugFix] Fix failing test quantization/test_compressed_tensors.py::test_compressed_tensors_fp8_block_enabled (@morrison-turnansky #26436)
  • [Kernel] Centralize platform kernel import in current_platform.import_kernels (@NickLucche #26286)
  • [Models] Improve iteration over layers (@lgeiger #26425)
  • [Bugfix] Respect min_tokens in scheduler stop check (@elaineyz #26317)
  • [Kernels] Modular kernel refactor (@bnellnm #24812)
  • [Attention] Register FLASHMLA_SPARSE (@MatthewBonanni #26441)
  • Separate MLAAttention class from Attention (@therealnaveenkamal #25103)
  • [Misc] Redact ray runtime env before logging (@ruisearch42 #26302)
  • [Bugfix] Set the minimum python version for gpt-oss (@jeejeelee #26392)
  • [Minor] Change warning->warning_once in preprocess (@zhuohan123 #26455)
  • [Bugfix] Catch and log invalid token ids in detokenizer #2 (@njhill #26445)
  • [Bugfix] Incorrect another MM data format in vllm bench throughput (@huydhn #26462)
  • [Hardware][AMD] Enable FlexAttention backend on ROCm (@mawong-amd #26439)
  • [MM][Doc] Add documentation for configurable mm profiling (@wwl2755 #26200)
  • [Core][KVConnector] Propagate all tokens on resumed preemptions (@QierLi #24926)
  • [Hybrid]: Decouple Kernel Block Size from KV Page Size (@zhiyuan1i #24486)
  • [CI/Build] Fix model nightly tests (@DarkLight1337 #26466)
  • [Core] Relax the LoRA max rank (@jeejeelee #26461)
  • Update Dockerfile and install runai-model-streamer[gcs] package (@pwschuurman #26464)
  • Bump Flashinfer to v0.4.0 (@elvischenv #26326)
  • [Model] Gemma3: Fix GGUF loading and quantization (@lucianommartins #26189)
  • Enable RMSNorm substitution for Transformers backend (@hmellor #26353)
  • Add: Support for multiple hidden layers in Eagle3 (@rahul-tuli #26164)
  • [torchao] Add support for ModuleFqnToConfig using regex (@jerryzh168 #26001)
  • [Misc] Misc code simplifications (@njhill #26450)
  • [doc] add Volcengine as a compute sponsor (@youkaichao #26477)
  • [Feature] Use pydantic validation in lora.py and load.py configs (@simondanielsson #26413)
  • [Misc] Upgrade more code to Python 3.10 (@DarkLight1337 #26463)
  • [Bugfix] Fix SHM cache initialization (@DarkLight1337 #26427)
  • [Models][Qwen3VL] Optimise _validate_and_reshape_mm_tensor (@lgeiger #26426)
  • [Bugfix] Move current_platform import to avoid python import cache. (@iwzbi #16601)
  • [V0 deprecation] Remove QKVCrossParallelLinear implementation (@Isotr0py #26475)
  • [Feature] Use pydantic validation in parallel.py config (@simondanielsson #26417)
  • Revert #26113 "[Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops" (@ZJY0516 #26472)
  • Upgrade Pydantic to v2.12.0 and remove hack for Python 3.13 (@hmellor #26481)
  • [Models][Qwen] Replace pad with cat for better performance (@lgeiger #26486)
  • [Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 (@minosfuture #25049)
  • [Model] Apply shared experts overlap optimization to all models with shared experts (@bnellnm #26145)
  • [BUGFIX] Add cu_tokens_across_sp to DPMetadata (@SageMoore #26457)
  • [Bugfix] Enable padded FP4 quantization (@roikoren755 #25947)
  • [Bugfix] Disable moe inplace for torch >= 2.9 (@bnellnm #26497)
  • [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention (@elvischenv #25674)
  • [Core] Remove unused prev_sampled_token_ids_invalid_indices input batch field (@njhill #26514)
  • [UX] Add FlashInfer as default CUDA dependency (@mgoin #26443)
  • [Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency (@benchislett #26499)
  • [Bug] Fix modular_kernel: ZeroDivisionError: integer division or modulo by zero (@yewentao256 #26528)
  • [CI] Fix Pre-commit Issue Cannot determine type of "rank" and "world_size" (@yewentao256 #26448)
  • Refactor MistralTokenizer (@juliendenize #26358)
  • [DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY (@ruisearch42 #23849)
  • [Core] Small simplification in GPUModelRunner._update_states() (@njhill #26508)
  • [Chore]: One pythonic tool parser test uses the wrong parser (@bbrowning #26515)
  • [Spec-Decode] Support piecewise cudagraphs for Eagle head (@LucasWilkinson #25109)
  • fix test_simple_inductor_graph_partition (@BoyuanFeng #26522)
  • [deepseek] kernel block size for UniformTypeKVCacheSpecs (@heheda12345 #26559)
  • [Metrics] Log multi-modal cache stats and fix reset (@DarkLight1337 #26285)
  • [GPT-OSS] Add support for arrays at tool message content (@luis5tb #25593)
  • Remove LoRA bias support (@ashwin-phadke #25807)
  • [CI] fix ruff format (@chaunceyjiang #26579)
  • [bugfix][DCP] fix block_size of hash in DCP prefix caching (@heheda12345 #26296)
  • [NIXL] Ignore abort on already-finished request (@markmc #25067)
  • [Bugfix] Convert untraceable GroupShape to list for AMD impl (@Lucaskabela #26535)
  • [BugFix] Fix noop elimination edge case (@andylolu2 #26394)
  • [CI] fix test_run_batch.py::test_completions - AssertionError (@chaunceyjiang #26578)
  • [BugFix][torch.compile] Fix fused_scaled_matmul_reduce_scatter signature for PyTorch 2.8 (@jasonlizhengjian #26038)
  • Added test_top_k_per_row to test-pipeline.yaml. (@dcampora #26569)
  • [Bugfix] Make DP padding optional in coordinate_batch_across_dp (@SageMoore #26375)
  • Silu v2 (@elvircrn #25074)
  • [Metrics] Add test for multi-modal cache stats logging (@markmc #26588)
  • [torch.compile] Make inductor partition rules respect splitting_ops #25691 (@baonudesifeizhai #25845)
  • [Bugfix] fixed top_logprobs: -1 does not appear to work as intended (@chaunceyjiang #26470)
  • [Model][Qwen3VL] Compute cu_seqlens on CPU to remove (@lgeiger #26496)
  • [Model] Add FlexOlmo model implementation (@2015aroras #24923)
  • [Transform] [Quantization] Add QuTLASS support to vLLM (@LopezCastroRoberto #24440)
  • Add Qwen3-Omni moe thinker (@wangxiongts #25550)
  • Update pre-commit hook versions (@hmellor #26591)
  • Update CUDA architecture list in build pipeline for 12.9.1 wheels (@wseaton #26592)
  • Fix some typing issues found by mypy==1.18.2 (@hmellor #26596)
  • [BUG] Qwen3-next MTP. Fix attn metadata build bug (@vadiklyutiy #26564)
  • [BugFix] Fix async scheduling + request preemption (@njhill #26385)
  • Cache the environment variable check for batch invariance (@bwasti #26510)
  • AOT Compilation for torch.compile (Bundled) (@zhxchen17 #24274)
  • [BugFix] Make penalties and bad_words work with async scheduling (@njhill #26467)
  • [Frontend] Improve the performance of is_reasoning_end (@chaunceyjiang #25735)
  • [CI/Build] Fix ppc64le CPU build and tests (@npanpaliya #22443)
  • [XPU] Upgrade NIXL to remove CUDA dependency (@zhenwei-intel #26570)
  • [MM] Move Qwen3Omni MRoPE impl to model file (@ywang96 #26608)
  • [Bugfix][Multi Modal] Fix incorrect Molmo image processing (@sangho-vision #26563)
  • [Refactor]: Use M-RoPE interface directly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (@divyanshsinghvi #24172)
  • fix(nix): Allow local oneDNN path to fix vLLM CPU build failure (@ihb2032 #26401)
  • Add EAGLE-3 Speculative Decoding Support for Qwen3 MoE (@rahul-tuli #26485)
  • [CPU] fix the issue when the node is '-' cause json decode error. (@muzian666 #26562)
  • [Refactor]Reduce duplicate code in serving_chat (@chaunceyjiang #26627)
  • [compile] Add patched_fused_scaled_matmul_reduce_scatter (@angelayi #26604)
  • [Bugfix][Qwen3VL] fix deepstack in qwen3vl (@JJJYmmm #26626)
  • [Bugfix] Fix qwen-moe packed_modules_mapping (@jeejeelee #26634)
  • [Benchmark] Support Infinity API (@DarkLight1337 #26641)
  • CP: make correct_attn_out robust to 4-D views and fix Triton arg binding (@hl475 #26509)
  • [compile] Fix inductor partition config (@angelayi #26645)
  • [EPLB] Support ernie4.5-moe (@HsChen-sys #22100)
  • Add @noooop to codeowner for pooling models (@noooop #26652)
  • [PERF] [Qwen3-next] Speed up gated RMSNorm (@vadiklyutiy #26207)
  • [MISC] Rename the torch profiler filename as instance_id+rank_id for merging the Profiler results of each Rank (@noooop #25867)
  • [Bugfix][CI/Build] Fix failing Mteb CI (@Isotr0py #26638)
  • [Bugfix][DCP] Set default CUDAGraphMode to PIECEWISE for DCP (@FENP #26574)
  • [TEST][BUG FIX] Fix DP GPU_ID issue (@xuechendi #26442)
  • Update Optional[x] -> x | None and Union[x, y] to x | y (@hmellor #26633)
  • [Feature] Add support for naver/splade-v3 (BERT-based sparse embedding model) (@gjgjos #26339)
  • [Models][Qwen3VL] Speedup fast_pos_embed_interpolate (@lgeiger #26647)
  • [easy] fix pre commit error on trunk (@hl475 #26665)
  • [CI/Build] Add tool to build vllm-tpu wheel (@mgoin #19165)
  • [Misc] cache result of disable_inplace (@bnellnm #26666)
  • [Bugfix][Core]Fix block table out-of-range issue in priority scheduling (@quanliu1991 #26661)
  • [FIX] Throwing an exception when the model does not support pool tasks (#25840) (@yyzxw #25855)
  • docs: wrong command in structured_outputs README (@yihong0618 #26677)
  • [Model] Fix Skywork R1V mlp (@jeejeelee #26673)
  • [Model] Add reasoning_parser and tool_parser for Ernie45 thinking (@CSWYF3634076 #25027)
  • Ignore large reformatting PRs in git blame (@hmellor #26690)
  • [Model][0/N] Improve all pooling task | clean up (@noooop #25817)
  • [ResponseAPI] Simplify input/output message serialization (@Jialin #26620)
  • [Bugfix] Fix out of bound index issue for Jina-embedding-v3 RoPE with cuda graph (@Isotr0py #26687)
  • [unrevert] Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #26373)
  • [Hardware][CPU] Disable torch.compile for RISC-V to prevent APIError (@ihb2032 #26693)
  • [FEATURE]: Use pydantic validation in multimodal.py config (@andycandy #26629)
  • [UX] Speedup DeepGEMM warmup with heuristics (@mgoin #25619)
  • [P/D] [NixlConnector] kv load recovery integration (@wseaton #26171)
  • [Misc] Separate prompt logging to debug (@aitsvet #26713)
  • [CI/Build] upgrade compressed-tensors to 0.12.2 to address LGPLv3 (@csy1204 #26501)
  • [Bugfix][Rocm] fix qr error when different inp shape (@haoyangli-amd #25892)
  • [Bugfix][Speculative Decoding] Extend Eagle quantization config fix to llama_eagle.py (@rahul-tuli #26590)
  • [Model] Use merge_by_field_config for MM models (M-N) (@DarkLight1337 #26710)
  • [Log] Optimize Startup Log (@yewentao256 #26601)
  • [CI][Release][Arm64]: Build arm64 release for gpu arch 8.9 (@cyb70289 #26698)
  • [Quantization] [Performance] Enable Marlin GEMM kernels for the calibration-free RTN-based quantization (@sakogan #26051)
  • [Frontend][1/N] Improve all pooling task | Support FP16 Embedding Base64 (Still uses fp32 by default). (@noooop #26414)
  • [CI] Fix mypy for vllm/distributed (@yewentao256 #26593)
  • [CI Perf]Prune Tests in kernel/mamba (@kfhfar #26538)
  • [Bug] Fix Assertion error DeepEP/csrc/kernels/intranode.cu:928: 'false and Unsupported type' (@yewentao256 #26532)
  • [FrontEnd] UNREVERT CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26502)
  • Pruning kernel Core Tests (@kfhfar #26727)
  • [ResponseAPI] Further polish message serialization and unit tests (@Jialin #26728)
  • Add tests for chunked prefill and prefix cache with causal pooling models (@maxdebayser #26526)
  • [Misc][DP] support customized aggregated logger for dp (@luccafong #24354)
  • [UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (@mgoin #26732)
  • [compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (@angelayi #26681)
  • [Easy] Fix env type check errors from VLLM_DEBUG_LOG_API_SERVER_RESPONSE (@Jialin #26742)
  • [build][torch.compile] upgrade depyf version (@youkaichao #26702)
  • [torch.compile] Unwrap fused_marlin_moe custom op (@varun-sundar-rabindranath #26739)
  • [Feature][Quantization] auto_round format add support for regex (@n1ck-guo #24024)
  • Add support for the /rerank endpoint in vllm bench serve (@maxdebayser #26602)
  • [Docs] Add a start tag to build.inc.md (@windsonsea #26747)
  • Fix lora tests failure in TPU CI due to the removal of LoRA bias (@vanbasten23 #26723)
  • [CI] [ROCm] Automate CC list for ROCm related issue (@vllmellm #26753)
  • Adding the test-amd.yaml for test definitions for the AMD backend. (alternative PR) (@Alexei-V-Ivanov-AMD #26718)
  • scheduler.py: Update the name of the default scheduler. (@ryanli #26758)
  • [Model][Bugfix] fix ernie45 load failed due to ernie45 eplb code (@CSWYF3634076 #26684)
  • [CI/Build] Use 127.0.0.1 instead of localhost in utils (@yeqcharlotte #26750)
  • fix(frontend): always include usage, when configured to do so (@max-wittig #20983)
  • [Plugin] Make plugin group clear (@wangxiyuan #26757)
  • [Bugfix] Standardize merging multimodal embeddings (@DarkLight1337 #26771)
  • [Model] Use merge_by_field_config for MM models (O-P) (@DarkLight1337 #26776)
  • [NIXL][HeteroTP] Enable KV transfer from HND prefill to NHD decode (@xuechendi #26556)
  • [Chore] Use max_transformers_version for Qwen-VL test (@DarkLight1337 #26792)
  • Don't allow typos to fix by default (@hmellor #26785)
  • [Doc] ruff format some Python examples (@DarkLight1337 #26767)
  • [CI] Fix test_tool_id_kimi_k2 (@chaunceyjiang #26787)
  • [Chore] Remove SupportsV0Only interface and update supported models docs (@DarkLight1337 #26783)
  • [Feature] Change vllm.py with pydantic validation (@VladOS95-cyber #26726)
  • [CI/Build] Cleanup LoRA test (@jeejeelee #26752)
  • [DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttention (@FENP #24864)
  • Adjusted the model order of the model registration file (@princepride #26798)
  • use combo kernel to fuse qk-norm and qk-rope (@BoyuanFeng #26682)
  • [issues template] Encourage the author implement their own ideas (@noooop #26671)
  • [KVConnector][Metrics] Aggregate scheduler-side KVConnectorStats (@QierLi #26046)
  • [Feature][Responses API] Stream Function Call - harmony (@chaunceyjiang #24317)
  • Revert "[issues template] Encourage the author implement their own ideas" (@noooop #26814)
  • [Config] Remove Unused Environment Variable VLLM_DISABLE_PAD_FOR_CUDAGRAPH (@yewentao256 #26743)
  • Update coveragerc and add codecov.yml for path fixes (@rzabarazesh #26435)
  • [CI] Raise VLLM_MAX_SIZE_MB to 500 due to failing Build wheel - CUDA 12.9 (@mgoin #26722)
  • [Kernel][MoE] Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on NVidia B200 (@zklapow #26818)
  • [CI Failure] Fix tests with missing TinyLlama-1.1B-Chat-v1.0-FP8-e2e (@mgoin #26816)
  • llama4_vision_rope: add HIP override to accept (q, k) and avoid (positions, q, k) mismatch (@hl475 #26790)
  • [Attention][Spec Decode] FlashMLA spec decode support (@MatthewBonanni #26541)
  • [Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs (@Jialin #24964)
  • Notice for deprecation of AutoAWQ (@HDCharles #26820)
  • [Perf] Cache vllm.env.getattr result to avoid recomputation (@Jialin #26146)
  • Added MoE configs for llama 4, H200 device with tp=4/8 tuning (@Dhruvilbhatt #26837)
  • fix: response_format for completion (@Nan2018 #23212)
  • [Minor] Group async_scheduling related fields in model runner init (@njhill #26736)
  • remove attn output view kernel (@BoyuanFeng #26680)
  • [Core] Streamline some structured output related code (@njhill #26737)
  • [CI Failure] Fix torchao dep failure for Quantization Test (@mgoin #26824)
  • [frontend][gptoss] Add per turn stats into Harmony Context (@lacora #25061)
  • [WideEP][P/D] Add usage stats for DP+EP and KV Connector (@tlrmchlsmth #26836)
  • [torch.compile] Fix tests for torch==2.9 inductor partition (@ProExpertProg #26116)
  • [Core][Easy] Unify environment variable access to use envs.getattr (@Jialin #26810)
  • [Bugfix] fix Qwen3 xml tool parser (@Zhikaiiii #26345)
  • [BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not None' in get_kv_connector_stats (@xuechendi #26851)
  • Disable FlashInfer sampler by default (@mgoin #26859)
  • [Frontend][torch.compile] CompilationConfig Overhaul (#20283): name change compilation level to compilation mode, deprecation compilation level (@morrison-turnansky #26355)
  • [Bugfix] Fixes prefix-repetition benchmark script (@kouroshHakha #26828)
  • [Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (@taohui #25589)
  • [Docs] Move build.inc into arm.inc (@windsonsea #26862)
  • [CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (@izhuhaoran #26773)
  • [Feature] default --extra-body param to disable thinking in vllm bench serve (@lengrongfu #26784)
  • [BugFix] Patch inductor partitioning logic (@angelayi #26735)
  • [Bugfix] Fix qwen3-omni audio truncation issue (@Isotr0py #26815)
  • [Graph Partition] pass tests for decorator (@BoyuanFeng #26831)
  • [Bugfix][Multi Modal] Fix incorrect Molmo token processing (@sangho-vision #26873)
  • [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (@MengqingCao #26656)
  • [Misc] Use helper function to generate dummy messages in OpenAI MM tests (@DarkLight1337 #26875)
  • [bugfix] Lazy import cv2 (@angelayi #26869)
  • [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (@zyongye #26456)
  • [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (@zhewenl #21810)
  • [CI] Fix mypy for vllm/executor (@yewentao256 #26845)
  • [Doc] ruff format remaining Python examples (@DarkLight1337 #26795)
  • [doc] add Context Parallel Deployment doc (@youkaichao #26877)
  • [Misc] Update TritonLanguagePlaceholder to have attributes that are used by Flash Linear Attention ops. (@madongfly #26853)
  • [Fix] Remove divisibility requirement between num_kv_heads and tp_size in bailing_moe (@ant-yy #26876)
  • [Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (@Jialin #26842)
  • [Platform] allow platform to init dp group (@wangxiyuan #22243)
  • [Lora]Load tuned multi-lora kernel configs from json files (@li2haipeng #26319)
  • [Model][2/N] Improve all pooling task | Support multi-vector retrieval (@noooop #25370)
  • [Misc] Remove isort and yapf ignores (@DarkLight1337 #26888)
  • [Misc] rename torch_dtype to dtype (@wangxiyuan #26695)
  • chore: remove unused marker (@max-wittig #26890)
  • [BugFix] Patch inductor memory plan logic (@BoyuanFeng #26878)
  • [Chore] Separate out vllm.utils.func (@DarkLight1337 #26904)
  • [Chore] Separate out vllm.utils.async_utils (@DarkLight1337 #26913)
  • Lower severity of log when model info cache misses due to exception (@hmellor #26917)
  • Olmo 3 tool parser and tests (@pdasigi #26143)
  • [Feature]: Use pydantic validation in observability.py config (@cern1710 #26637)
  • [ModelOpt] Remove NVFP4 MoE K%16==0 constraint (@XiaobingSuper #26891)
  • [Chore] Clean up CODEOWNERS (@WoosukKwon #26923)
  • [NVIDIA] Add support for cudnn fp4 gemm via flashinfer (@kaixih #26107)
  • Vectorize RMS norm variance using vectorize_read_with_alignment (@bbeckca #26234)
  • support flashinfer_fp4 moe for 5090 gpu (@XiaobingSuper #26669)
  • [Bug] Temporally Disable VLLM_ALLREDUCE_USE_SYMM_MEM by Default (@yewentao256 #26925)
  • Move query quantization to attention layer for Flashinfer & Triton. (@adabeyta #26534)
  • Adjusting AMD test composition 2025-10-14 (@Alexei-V-Ivanov-AMD #26852)
  • [Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (@felixzhu555 #26887)
  • [Bugfix] reasoning_parser parameter handling in run_batch.py (@inc-jeong #26225)
  • [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (@kliuae #24097)
  • [CI] Enable Blackwell Llama4 MoE tests (@mgoin #26731)
  • [BUG] Allow runai_streamer_sharded in config check (@ahao-anyscale #26958)
  • [bugfix] Fix SP + PP without specifying compile size (@angelayi #26955)
  • [BugFix] Work around graph partition x torch.compile cache issue (@zou3519 #26956)
  • [DOC][XPU] update feature parity with Intel GPU (@xuechendi #26954)
  • [Chore] Rename utils submodules (@DarkLight1337 #26920)
  • [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (@vadiklyutiy #26437)
  • Deepseek-v3 Batch Invariant on 8xH100 (@bwasti #26609)
  • [CI/Build] Update expected beam search output for Phi3V (@DarkLight1337 #26978)
  • [Hardware][CPU][PowerPC] Disable torch.compile() in toptopk sampling (@Akashcodes732 #26987)
  • [CI/Build] Fix AMD import failures in CI (@zhewenl #26841)
  • [Benchmark] Use truncation by default for pooling benchmarks (@DarkLight1337 #26992)
  • [Chore] Separate out vllm.utils.collections (@DarkLight1337 #26990)
  • [Model][Bugfix] fix ernie45 vl run failed from shared experts optimization (@CSWYF3634076 #26885)
  • Cleanup code after Python 3.10 upgrade (@lgeiger #26520)
  • [MISC] fix import violations for re and triton modules (@llsj14 #26654)
  • [Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (@bogdanminko #27008)
  • [Benchmark] Show E2EL by default for pooling models (@DarkLight1337 #27014)
  • [Attention] Tune CUTLASS MLA num_splits (@MatthewBonanni #26846)
  • [NIXL] Improve request_finished() debug logs (@markmc #25665)
  • [docs] standardize Hugging Face env var to HF_TOKEN (deprecates HUGGING_FACE_HUB_TOKEN) (@yankay #27020)
  • [CI] Replace large models with tiny alternatives in tests (@tahsintunan #24057)
  • [Feature] Add process_weights_after_loading to AttentionImpl (@lengrongfu #26870)
  • [Model] Fix Qwen3VL mm mapping (@jeejeelee #27027)
  • Fix Qwen2.5 VL image grid docstring (@skyloevil #27033)
  • Support set in the CLI generation (@hmellor #27031)
  • [gpt-oss][1/N] EZ: refactor serving_responses for modularity (@qandrew #26948)
  • Support block size of 256 used by Intel HPU (@mandy-li #26883)
  • [Compressed Tensors] Always clone output for compile robustness (@kylesayrs #26849)
  • Adding Warmup to Benchmark Serving (@kimbochen #26943)
  • [Bug] Fix batch invariant test has to is (@yewentao256 #27032)
  • [GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kernels (@varun-sundar-rabindranath #25997)
  • [Feature] Migrate DeepGEMM API from get_m_alignment_for_contiguous_layout to get_mk_alignment_for_contiguous_layout (@yewentao256 #26935)
  • [CI] Prune Quantization Tests and skip compilation (@mgoin #27038)
  • [Bug] Add Assertion for random-input-len / random-output-len (@yewentao256 #26834)
  • [small][batch invariance] Rename the env and internal flags to simplify usage (@bwasti #26855)
  • Refactor Transformers backend to use mixins (@hmellor #26906)
  • [NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (@jiahanc #26714)
  • [torch.compile] Passing only necessary compilation config to inductor pass config (@luccafong #27041)
  • [Chore] Separate out vllm.utils.import_utils (@DarkLight1337 #27022)
  • [torch.compile] fix simple inductor graph partition test (@BoyuanFeng #27050)
  • Remove unused imports (@lgeiger #26972)
  • vllm bench serve shows num of failed requests (@tomasruizt #26478)
  • [Docs] Reduce custom syntax used in docs (@hmellor #27009)
  • [Perf] Exploit out-of-band buffers in shm_broadcast (@njhill #26961)
  • disable graph partition in custom op (@BoyuanFeng #26952)
  • [Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually a bfloat16 (@sighingnow #27030)
  • [Core] Change execute_model_with_error_logging() to be a ctx manager (@njhill #27060)
  • [Bugfix] Fix ReplicatedLinearWithLoRA (@jeejeelee #27065)
  • [Kernel] Lazy import FlashInfer (@jeejeelee #26977)
  • [CI/Build] Update Llama4 eval yaml (@zhewenl #27070)
  • [Model] Always use Transformers backend for PaliGemma and Gemma3-MM (@DarkLight1337 #26715)
  • [Model] Add support for LightOnOCR (@staghado #26916)
  • [CI/Build] Update compressed tensor test path to fix CPU CI (@bigPYJ1151 #27068)
  • [Kernel][Performance] Fuse float cast and renormalize to topk softmax kernel (@izhuhaoran #26717)
  • [CI] fix docs build failed (@chaunceyjiang #27082)
  • Update troubleshooting.md and remind VLLM_TRACE_FUNCTION usage (@Prowindy #27069)
  • [VLM][Refactor] Remove useless func get_input_positions in MRotaryEmbedding (@MengqingCao #27088)
  • [Docs] Replace all explicit anchors with real links (@hmellor #27087)
  • [Docs] Replace rst style double-backtick with md single-backtick (@hmellor #27091)
  • [Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapping (@jeejeelee #27096)
  • [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350 and MI355 (@rkarhila-amd #25586)
  • Fix incorrect docstring for stop_profile() method (@hyongtao-code #27101)
  • [torch.compile] Enable attention and allreduce fusion without custom ops enabled (@ProExpertProg #24604)
  • [CI] Nixl integration tests (@NickLucche #27010)
  • [Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (@patrickvonplaten #26367)
  • [bugfix] Qwen3-VL fix video incorrect timestamp calculations while do_sample_frames=True (@wulipc #27104)
  • [CI] Remove forbidden slash (@NickLucche #27112)
  • [ROCM] MoE fp4 CK kernel (@maleksan85 #26545)
  • [ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) (@rasmith #26192)
  • [Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused MoE quant type logic (@vllmellm #27029)
  • [Chore] Remove unused PolyNorm layer (@Isotr0py #27110)
  • [Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 (@mgoin #27114)
  • [Minor] Remove unnecessary error message (@zhuohan123 #27115)
  • [V1][Spec Decode] Fix greedy temperature detection after sampler refactor (@Pradyun92 #27077)
  • [Test] Make test_failure more stable for batch invariance (@yewentao256 #27054)
  • [BugFix][Core] Fix error when enable async-scheduling in multi-node env (@lhtin #25887)
  • [Perf] Add H100 fused MoE config (@skyloevil #25398)
  • [CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests (@hl475 #26663)
  • [GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (@Hanchenli #25515)
  • [Misc] Rev DeepEP (@varun-sundar-rabindranath #27122)
  • [DOC][FEATURES][CPU] update cpu feature for v1 (@xuechendi #27135)
  • [Test] Add test for /health endpoint on engine failure (@dongbo910220 #26074)
  • [Chore] Separate out vllm.utils.mem_utils (@iAmir97 #27143)
  • [Feature] Batch Invariant: Support DeepGEMM and Blackwell (@yewentao256 #27127)
  • [fix][cpu] fix prefill attention in CPU attention backend (@fadara01 #27035)
  • [Misc] Refactor get_kv_cache_spec into AttentionLayerBase (@NickLucche #26587)
  • [Models][QwenVL] Remove unnecessary .contiguous() calls (@lgeiger #27106)
  • [Chore] Clean up pytorch helper functions in vllm.utils (@Isotr0py #26908)
  • Fix incorrect string formatting in barrier timeout exceptions (@hyongtao-code #27149)
  • [Minor] Add some clarifying comments to recent changes (@njhill #27130)
  • [BugFix] Fix failing gemma-3-1b-it test: test_lm_eval_accuracy_v1_engine[google/gemma-3-1b-it] (@LucasWilkinson #27111)
  • [Chore] Separate out profiling utilities from vllm.utils (@dongbo910220 #27150)
  • [BugFix] fix graph partition signature (@BoyuanFeng #27139)
  • [BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (@LucasWilkinson #27121)
  • [V1][Metrics][Plugin] Add plugin support for custom StatLoggerBase implementations (@ptovam #22456)
  • [Minor] Remove unused env variable (@WoosukKwon #27161)
  • [BugFix] Fix lazy imports involving outlines_core (@22quinn #27158)
  • [Chore] Separate out hashing utilities from vllm.utils (@dongbo910220 #27151)
  • [Benchmark] Convenience script for multiple parameter combinations (@DarkLight1337 #27085)
  • output type conversion fix (@jianyuh #27159)
  • [Chore] Separate out vllm.utils.network_utils (@iAmir97 #27164)
  • [Misc] Move utils to avoid conflicts with stdlib, and move tests (@DarkLight1337 #27169)
  • [Bugfix] Fix error with penalties when speculative decoding and structural output are enabled (@southfreebird #26586)
  • Fix typo in ValueError message: use kv_role instead of kv_disagg_role (@hyongtao-code #27166)
  • [Model][VLM] Support Bee-8B Model (@uyzhang #27012)
  • [LoRA] LoRA cuda graph specialization (@andylolu2 #25914)
  • [Kernel] Accelerate solve_tril with TMA (@ZJY0516 #26746)
  • AArch64 CPU Docker pipeline (#26931)
  • Nemotron Nano V2 VL + EVS Video Support (@BloodAxe #27107)
  • [Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) (@shivampr #26268)
  • [Bugfix][CI] Fix Distributed Tests (4 GPUs) async_sched+ray test (@NickLucche #27195)
  • [Feature][Quantization] auto_round support for mixed bits quantization (@n1ck-guo #23812)
  • [ROCm] enable some tests in entrypoints test groups on AMD (@Concurrensee #26725)
  • [ez] add uv lock to gitignore (@qandrew #27212)
  • [Quantization] Automatically infer AWQ modules_to_not_convert field (@Isotr0py #26909)
  • [V0 Deprecation] Remove V0 metrics code (@njhill #27215)
  • [cpu] Dispatch un-quantized linear to oneDNN/ACL by default for AArch64 (@fadara01 #27183)
  • create is_in_the_same_node on cpu (@helunwencser #26832)
  • [Frontend] Enforce tokenize=False when applying chat template (@russellb #27205)
  • [Feature][Kernel]FusedMoE LoRA (@wcwuwc #21229)
  • [BugFix] GPT-OSS Attention DP + MoE TP weight loading issue (@nvpohanh #24032)
  • [ModelOpt] Load w13/w2_input_scale for all experts, nvfp4 (@wenscarl #26135)
  • [Bugfix] Fix gpt-oss w4a8 DP/EP on B200 (@varun-sundar-rabindranath #26729)
  • [Bugfix] Fix broken MTP weight loading for FP8 KV Scales (@benchislett #27227)
  • [Fix][Spec Decode] Fix llama4 draft loading with different quantization (@linzebing #27136)
  • [Nixl] Minor refactor to handshake related metadata (@NickLucche #26410)
  • [MM][Core] Decouple ViT backend from LM backend (@ywang96 #27061)
  • [Deepseek v3.2] Optimize top_k_per_row (@dcampora #26763)
  • [Chore] Separate out NCCL utilities from vllm.utils (@dongbo910220 #27197)
  • [CI] Install pre-release version of apache-tvm-ffi for flashinfer (@hmellor #27262)
  • [ROCM] Enable CompressedTensorsWNA16 (@JartX #27187)
  • Add @pavanimajety to .github/codeowners (@pavanimajety #27213)
  • [ROCm] Update Triton, Torch, and AITER branches for ROCm base Dockerfile (@micah-wil #27206)
  • [Feature] Batch Invariant for R1 TP 8 on Blackwell (@yewentao256 #27229)
  • [Bugfix][P/D] Reduce num_threads used by nixl ucx backend (@dagrayvid #27196)
  • [V0 Deprecation] Remove V0 executors (@njhill #27142)
  • [Bugfix] fixes the decoding metadata of dense mla's fp8 kvcache. (@sighingnow #27144)
  • Update PyTorch to 2.9.0+cu129 (@huydhn #24994)
  • [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (@alexm-redhat #26440)
  • Updated xgrammar backend to not deny supported string formats (@ExtReMLapin #27253)
  • [Bugfix] skip cuda graph for drafter when running with eager (@benchislett #26821)
  • [P/D] KVConnector for decode benchmarking (@tlrmchlsmth #25986)
  • [Deepseek v3.2] Remove extra logics in indexer (@IwakuraRein #26465)
  • [DOC] [ROCm] Add ROCm quickstart guide (@vllmellm #26505)
  • [CI] Nixl integration tests DP-EP (@NickLucche #27199)
  • [Benchmark] Add plot utility for parameter sweep (@DarkLight1337 #27168)
  • [torch.compile] Enable silu_mul_fp8_quant fusion without custom ops enabled (@ZJY0516 #27146)
  • [1/N][Platform] Cleanup useless function (@wangxiyuan #26982)
  • Update release pipeline for PyTorch 2.9.0 (@huydhn #27303)
  • Remove last level references not removed in #26355 (@hmellor #27260)
  • fixed reasoning streaming with tool_choice="required" (@ExtReMLapin #24108)
  • [Frontend][3/N] Improve all pooling task | Support binary embedding response (@noooop #27066)
  • [Bugfix][CPU] Disable dual stream execution for experts on CPU (@bigPYJ1151 #27320)
  • [Bug] Raise error for LLM(data_parallel_size=k) single-process DP Usage (@yewentao256 #27282)
  • Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (@gnovack #27311)
  • [Core] Handle MoE LoRA edge cases (@jeejeelee #27335)
  • [docs] Update v1 metrics design doc (@markmc #27332)
  • Mirroring changes in test-pipeline.yaml into test-amd.yaml (@Alexei-V-Ivanov-AMD #27242)
  • [Chore] Separate out optional dependency checks from vllm.utils (@dongbo910220 #27207)
  • [Model] Upstream Deepseek-OCR model (@Isotr0py #27247)
  • [NIXL] Terminate handshake listener thread in shutdown (@markmc #26404)
  • [Bug] Fix DeepSeek-V2.5-1210-FP8 issue (@yewentao256 #27267)
  • [bugfix] remove unused parameters to reduce unnecessary vram usage (@ReinForce-II #26789)
  • [Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLoRA (@jeejeelee #27351)
  • [NIXL] use Host buffer to support TP_ratio > 1 for XPU (@xuechendi #27140)
  • [Bugfix] Make get_mrope_input_positions instance methods (@DarkLight1337 #27342)
  • [Bugfix] Fix HF format InternVL large variants video processing (@Isotr0py #27330)
  • [Frontend] Require flag for loading text and image embeds (@russellb #27204)
  • [P/D] Dynamic kv_output_aggregator collect size (@NickLucche #26734)
  • Support Anthropic API /v1/messages Endpoint (@LiuLi1998 #22627); see the usage sketch after this list
  • [Bugfix] Disable FlexAttention direct block mask building for encoder-only models (@Isotr0py #27344)
  • [Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… (@lucianommartins #27309)
  • [Doc] Fix numbering sequence in prefix caching (@gigit0000 #27357)
  • [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (@sagiahrac #27211)
  • [Feature] Set the default publisher to zmq in kv_event config (@lengrongfu #26915)
  • [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (@Daisy-Ma-coder #27128)
  • [Chore] Separate out system utilities from vllm.utils (@dongbo910220 #27201)
  • [MLA] Bump FlashMLA (@MatthewBonanni #27354)
  • [Bugfix] Fix deepseek-ocr multi-image inference and add merge_by_field_config=True with tensor schema support (@Isotr0py #27361)
  • [Bugfix] Fix SLA tuner initialization (@DarkLight1337 #27355)
  • [Bugfix] Fix incorrect kv cache metrics in grafana.json (@fangpings #27133)
  • [Bugfix][Core] running queue index leakage exception (@CLFutureX #26754)
  • [CORE] Support Prefix Caching with Prompt Embeds (@qthequartermasterman #27219)
  • [V1][spec decode] return logprobs for spec decoding (@TheEpicDolphin #26060)
  • [Model] Add num_cached_tokens for PoolingRequestOutput (@noooop #27378)
  • [Chore] Remove duplicate has_ functions in vllm.utils (@jonathanc-n #27372)
  • [CI/Build] Fix Prithvi plugin test (@DarkLight1337 #27393)
  • [Bugfix] Fix args settings for guided decoding args (@luccafong #27375)
  • [CI/Build] Fix AMD CI: test_cpu_gpu.py (@zhewenl #27388)
  • add SLA information into comparison graph for vLLM Benchmark Suite (@louie-tsai #25525)
  • [CI] Reorganize entrypoints tests (@chaunceyjiang #27403)
  • [Metrics] [KVConnector] Add connector prefix cache hit rate stats (@ptovam #26245)
  • [Model] Add MoE support for NemotronH (@tomeras91 #25863)
  • Run mypy on the lowest supported Python version instead of system Python (@hmellor #27048)
  • [Bugfix] Honor --mm_encoder_attn_backend when used (@bradleyhd #27124)
  • [Feature] Pydantic validation for speculative.py (@Navya1707 #27156)
  • [Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (@ilmarkov #26709)
  • [CI/Build] Remove unnecessary flags from test registry (@DarkLight1337 #27353)
  • [Frontend][4/N] Improve all pooling task | Add plugin pooling task (@noooop #26973)
  • Mirroring the test definitions (2025-10-22) (@Alexei-V-Ivanov-AMD #27362)
  • [Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (@alexm-redhat #27220)
  • [Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (@gshtras #27373)
  • [Bugfix] Fix AWQ marlin layer skipping (@Isotr0py #27416)
  • [Misc] Add triton_kernels dependency (@varun-sundar-rabindranath #27370)
  • [Chore] Separate out vllm.utils.platform_utils.py (@jonathanc-n #27374)
  • [Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (@MatthewBonanni #27368)
  • [Bugfix][DP] Fix creating too many DP Placement Groups (@kebe7jun #26880)
  • [Model] Siglip Embedding Support (@piood #27324)
  • [Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers/utils.py for Powerpc (@Akashcodes732 #27422)
  • Granite 4.0 quark quantization support (@xiao-llm #26944)
  • Fix pooling adapters for Transformers backend (@hmellor #27338)
  • [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (@xxxxyu #26092)
  • [Misc] Add TPU usage report when using tpu_inference. (@hfan #27423)
  • [Bugfix][CI] Move resolving cudagraph_mode before initializing attn_metadata_builder (@fhl2000 #27427)
  • Fix EventPublisherFactory logic for disabled KV cache events (@usberkeley #27419)
  • [Chore] remove structural tags logging lines (@aarnphm #27451)
  • [Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API (@strinczer #26706)
  • [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (@ruisearch42 #27443)
  • [Docs] remove v1 column for embedding models (@piood #27446)
  • [MM][Bugfix] Replace PatchEmbed's conv3d with a linear layer (@Isotr0py #27418)
  • [BugFix] Fix torchrun DP with LLM class (@22quinn #27395)
  • [Refactor] move tool parsing logic from protocol.py to the tool parser (@chaunceyjiang #27383)
  • [Benchmark] Enable benchmark to run with encoding_format="bytes" (@DarkLight1337 #27467)
  • Fix AArch64 CPU Docker pipeline (#27331)
  • [MISC] cudagraph_capture_sizes related improvements (@fhl2000 #26016)
  • Fix test named tool use (@chaunceyjiang #27458)
  • [Doc] Fix minor issues in docs/design/metrics.md (@draftbk #27436)
  • [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (@fadara01 #27472)
  • [compile] Turn standalone_compile back on (@zou3519 #27460)
  • [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (@xuechendi #27297)
  • [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (@kouroshHakha #27366)
  • [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (@minosfuture #26397)
  • [Bugfix] Fix interns1-vit qk norm code path (@Isotr0py #27480)
  • [CI/Build] Fix test_torch_utils in AMD CI (@zhewenl #27317)
  • [Document] Add ms-swift library to rlhf.md (@hjh0119 #27469)
  • [Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (@lhtin #27455)
  • [Distributed] Basic set of configuration for large EP deployment on GB200 (@wpc #27328)
  • [Log] Optimize Startup Log (@yewentao256 #26740)
  • [Misc][DP] Guard mxfp4 implementation selection (@varun-sundar-rabindranath #27484)
  • [KVConnector] Migrate the LMCache integration code to be vLLM native (@ApostaC #25542)
  • [CI] Add tests for cudagraph (@ZJY0516 #27391)
  • Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (@zhuohan123 #27502)
  • [Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (@KuntaiDu #25712)
  • [Misc] Simplify max tokens in multimodal registry (@DarkLight1337 #27500)
  • [Attention] Add missing kv cache scale setup (@MatthewBonanni #27490)
  • [CI/Build] Refactor processing tests (@DarkLight1337 #27470)
  • [CI/Build] Use CPU for mm processing test on CI (@Isotr0py #27522)
  • [BUGFIX][ROCM] ViT FlashAttention on ROCm (no GFX9) and contiguous on qwen3vl ROCm TORCH_SDPA (@JartX #27190)
  • [Bugfix] Fix processor initialization for model from modelscope instead of HF (@lengrongfu #27461)
  • [Bugfix] fix empty prompts for async-engine mode in benchmark throughput (@luccafong #27494)
  • [Doc] Remove Molmo warning (@DarkLight1337 #27527)
  • [Doc] Fix links to GH projects (@DarkLight1337 #27530)
  • [Chore]: Extract math and argparse utilities to separate modules (@yeshsurya #27188)
  • Revert "[CI/Build] Use CPU for mm processing test on CI (#27522)" (@DarkLight1337 #27531)
  • [CI/Build] Update causal-conv1d installation (@DarkLight1337 #27529)
  • [Model][MiniMax-M2] Support MiniMax-M2 Model (@rogeryoungh #27535)
  • fix m2 test (@youkaichao #27536)
  • Fix MiniMax-M2 copyright (@rogeryoungh #27537)
  • [Model][Bugfix] fix ernie45 moe 300B SharedFusedMoE output tuple (@CSWYF3634076 #27316)
  • [Model] Use merge_by_field_config for MM models (Qwen series) (@DarkLight1337 #27546)
  • [Docs] remove the incorrect enable_reasoning parameter (@yyzxw #27550)
  • [Performance][LoRA] add context varying params to 'do_not_specialize' in fused moe lora (@gnovack #27445)
  • [Model] Deprecate merge_by_field_config=False (@DarkLight1337 #27551)
  • [Doc] Slight improvement to M2 and beyond (@jeejeelee #27554)
  • [Kernel] Adding split_K implementation for fused_moe_lora (@dcmaddix #27291)
  • [Misc] Clean up utils (@DarkLight1337 #27552)
  • [Bugfix] Limit the default value of max_model_len when it is not specified by users (@shen-shanshan #27556)
  • [Bugfix] Fixed issue where the first event still contains prompt_token_ids when return_token_ids=False. (@chaunceyjiang #27561)
  • [cpu][perf] Fix low CPU utilization with VLLM_CPU_OMP_THREADS_BIND on AArch64 (@fadara01 #27415)
  • [Kernel] Enable moe LoRA kernel support FP16 (@jeejeelee #27468)
  • [Hybrid] Added supports_mamba_prefix_caching Protocol (@Josephasafg #27339)
  • [Model] Siglip2 Model Support (@piood #27566)
  • [Bugfix][LoRA][FusedMoE] Select MxFP4 Backend based on LoRA Enablement (@varun-sundar-rabindranath #27487)
  • fixing mm placeholder replacement issue with gemma3 (@tingtingtangmeta #27538)
  • [Chore]: Stream tokens vs characters in tool call parser tests (@bbrowning #26513)
  • [Misc] Clean up more utils (@DarkLight1337 #27567)
  • [ROCm] Update AITER branch for ROCm base docker (@micah-wil #27586)
  • Code quality improvements: version update, type annotation enhancement, and enum usage simplification (@usberkeley #27581)
  • [gpt-oss][2/N] Support input_messages in responsesRequest (@qandrew #26962)
  • [Bugfix][CI] Fix config resolving logic with remote models (@ywang96 #27610)
  • [Stability fix] turn off HMA allocator when connector is set (@KuntaiDu #27592)
  • [Bugfix] fixed inconsistent finish_reason handling between V0 and V1 engines (@chaunceyjiang #27555)
  • [ROCm] [Doc] Update ROCm installation docs (@vllmellm #27327)
  • [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.6 for MI300X (@minatoaquaMK2 #27323)
  • [Bugfix][CPU] Fallback oneDNN linear to torch linear to fix half gemm support on legacy platforms (@bigPYJ1151 #27526)
  • [Core][Bookkeeping Optimization] Update against numpy view of is_token_ids tensor (@Jialin #27618)
  • [CI/Build] Fix amd model executor test (@zhewenl #27612)
  • Fix a parsing robustness issue in KimiK2ToolParser that causes IndexError (@wangln19 #27565)
  • [V0 Deprecation] Remove vestigial V0 logits_processors.py file (@njhill #27601)
  • [Bugfix] In LongRoPE, decide short vs long based on max_model_len (@MatthewBonanni #27431)
  • [Misc] Separate out utils.counter and move utils.Device to engine (@DarkLight1337 #27588)
  • [Bug] Fix shape issue for eplb expert weights (@yewentao256 #27589)
  • [compile] Add enable_prompt_embeds to compile hash. (@zhxchen17 #27285)
  • [Hybrid] Add mamba_block_size to Engine Args (@Josephasafg #27289)
  • [compile] Disable dynamo guards check for AOT compilation. (@zhxchen17 #27288)
  • fix: allow HuggingFace standard chat template params via **kwargs (@wangln19 #27622)
  • [Core] Enable async scheduling for external_launcher mode (@22quinn #27394)
  • [Bugfix][Frontend] validate arg priority in frontend LLM class before add request (@junpuf #27596)
  • [BugFix] Also consider RAY_EXPERIMENTAL_NOSET_* when storing compilation cache (@HollowMan6 #27294)
  • [nit]: lmcache integration import (@sammshen #27600)
  • [FLA] Introduce Kimi Delta Attention(KDA) to VLLM (@zhiyuan1i #27654)
  • [Bugfix] Fix allocation & free logic of SingleWriterShmRingBuffer (@imkero #27117)
  • [Bugfix][CI] Fix v1 attention backend tests and add CI coverage (@mmangkad #26597)
  • [Misc] Make LayerBlockType a Literal instead of Enum (@DarkLight1337 #27658)
  • [compile] Add fallback path to AOT compile when serialization fails. (@zhxchen17 #27350)
  • Add load pattern configuration guide to benchmarks (@mpashkovskii #26886)
  • [Misc] Make reorder batch also separate extends (@LucasWilkinson #27367)
  • [Test] Batch Invariant: Unit test using parameterized backend (@yewentao256 #27478)
  • [Core] Scheduler: Publish connector events after output (@orozery #25875)
  • [AsyncScheduling] Make async overlap work with logprobs (@njhill #27615)
  • [Misc][qwen2_5_vl][torch.compile] Enable supports_torch_compile on generic nn.Module and demonstrate speedup on Qwen Vision model (@Lucaskabela #23207)
  • [Bug] Fix deepep low latency use nvlink by default (@yewentao256 #27677)
  • [Core] Early return in SlidingWindowManager.remove_skipped_blocks (@Jialin #27673)
  • Install pre-built xformers-0.0.32.post2 built with pt-2.9.0 (@huydhn #27598)
  • Revert "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@simon-mo #27714)
  • [Build] Revert triton_kernels requirements (@varun-sundar-rabindranath #27659)
  • [NIXL][XPU] update name of nixl wheel (@zhenwei-intel #27631)
  • [Model] Fix Qwen3VL and Qwen3Omni after torch.compile changes (@lgeiger #27705)
  • [KV cache] Fix lmcache connector (@Shaoting-Feng #27681)
  • [CI/Build][Bugfix]Fix Quantized Models Test on AMD (@zhewenl #27712)
  • [Bugfix] Fix non-contiguous tensor error in rocm_unquantized_gemm_impl (@zhewenl #27605)
  • [Speculators] Move tests + fix integration (@dsikka #27308)
  • [CI/Build] Move pre-commit only scripts to tools/pre_commit (@DarkLight1337 #27657)
  • [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next (@ZJY0516 #27578)
  • [Bugfix] Fix modular kernel tests (@bnellnm #27707)
  • [Frontend] [gpt-oss] Tool json call parsing error retry (@alecsolder #27675)
  • [Frontend] [gpt-oss] Mcp type bug (@alecsolder #27689)
  • [Fix] import get_kv_cache_torch_dtype error in vllm_v1_adapter.py (@KevinCheung2259 #27670)
  • [Misc] Raise error for missing video metadata in MultiModalDataParser (@Isotr0py #27664)
  • Feature/video support in random mm dataset (@BloodAxe #25963)
  • [chore] Remove models weight on S3 logic (@khluu #27725)
  • [VLM] Add Qwen3-VL generation test (@Isotr0py #25185)
  • [CI/Build] Skip cpu offloading test on AMD (@zhewenl #27690)
  • [Frontend] Add vllm bench sweep to CLI (@DarkLight1337 #27639)
  • Fix MiniMax-M2 rmsnorm precision and remove useless code (@rogeryoungh #27627)
  • [ROCm][Platform] Add MI308X device id in _ROCM_DEVICE_ID_NAME_MAP (@sammysun0711 #27623)
  • [CI] Fix flaky test_two_responses_with_same_prev_id test (@NickLucche #27745)
  • [Chore] Optimize P2PNCCLEngine http_address (@yewentao256 #27488)
  • [Core] Exposing engine sleep & wake_up state as prometheus metrics (@dumb0002 #24176)
  • [FIXBUG] Qwen3VL hallucinations without Contiguous on Torch.SDPA (@JartX #27744)
  • use_aot_compile should respect VLLM_DISABLE_COMPILE_CACHE (@BoyuanFeng #27698)
  • [CI/Build] Test torchrun with 8 cards (@22quinn #27548)
  • [Bug] Raise error explicitly if using incompatible backend (@yewentao256 #27424)
  • [KVConnector] Add metrics to Prometheus-Grafana dashboard (@NickLucche #26811)
  • [Bug] Fix DeepEP low latency assert self.batched_router_logits.size(-1) == full_router_logits.size(-1) Bug (@yewentao256 #27682)
  • [BugFix] Fix handling of resumed reqs in SharedStorageConnector (@njhill #27719)
  • [Bug] Fix DBO IMA issue for DeepEPHT (@yewentao256 #27666)
  • [Temp fix] Disable torch.compile for Qwen2.5 VL's VisionBlock temporarily. (@huachenheli #27760)
  • [XPU][bugfix] fix rope for llama4 and deepseek (@yma11 #25145)
  • [Bugfix] mamba-block-size is set for vision language model (@heheda12345 #27773)
  • [XPU] Update latest IPEX 2.8 release (@jikunshang #27735)
  • [BugFix] Handle unscheduled requests properly when async scheduling (@njhill #27756)
  • [Feat] Adds runai distributed streamer (@bbartels #27230)
  • kernels/moe test pruning (@kfhfar #27053)
  • [BugFix] Reordering extend logic fix (@LucasWilkinson #27739)
  • [Benchmark] Cleanup deprecated nightly benchmark and adjust the docstring for performance benchmark (@KuntaiDu #25786)
  • Add more dims for batch invariant shims (@bwasti #27489)
  • use stringData in secret yaml to store huggingface token (@yitingdc #25685)
  • [CI/Build]Add eval config for Qwen3-235B-A22B-Instruct-2507-FP8 (@hl475 #27113)
  • [BugFix][VL] Fix FA selection on Qwen2.5-VL (@zhewenl #27790)
  • [V0 deprecation] Remove VLLM_USE_V1 usage in config module (@wangxiyuan #27784)
  • [CI Failure] fix test_default_mm_loras (@hl475 #27795)
  • [CI] Fix mypy for vllm/v1/core and vllm/v1/engine (@yewentao256 #27108)
  • [Bugfix] Improve GPU validation logging in Ray fallback scenarios (@sairampillai #25775)
  • [Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (@noooop #25524)
  • [CI Failure] Fix test_kv_cache_model_load_and_run (@hl475 #27717)
  • [Model] Introduce Kimi Linear to vLLM (@zhiyuan1i #27809)
  • [KV offload] Enable CPU KV offload on CUDA alike Platforms (@zhewenl #27770)
  • [Model][Ouro] Support Ouro Model (@FlamingoPg #27794)
  • [Bugfix][CPU] Fix MRoPE dispatch on the CPU backend (@bigPYJ1151 #27800)
  • [BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP (@varun-sundar-rabindranath #27762)
  • [Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (@ilmarkov #27564)
  • [Docs] add Shanghai Meetup - 2025/10 (@kebe7jun #27545)
  • Reapply "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@huydhn #27768)
  • [MTP] Refactor mtp predictor to avoid d2h operation (@MengqingCao #27643)
  • [Model] Use the same fused_moe configs for all H200 devices (@bufferoverflow #23642)
  • [Bugfix] Fix 2 precommit issues - (mamba_block_size, kv_cache_config) (@tlrmchlsmth #27811)
  • [Core][Bookkeeping] Update cu_num_accepted_tokens for all req_index (@Jialin #27629)
  • [EP/DP][API Server] Enable DP-aware routing in OpenAI API requests (@Prowindy #24945)
  • [Fix] Skip record_sleep_state logic in PrometheusStatsLogger if not in dev mode (@SumanthRH #27789)
  • [Refactor] Remove VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK (@yewentao256 #27750)
  • [Core][Perf] Only invoke save_new_computed_blocks when computed blocks are not empty (@Jialin #27799)
  • [Feature] Batch invariant torch.compile (@PaulZhang12 #27660)
  • [BugFix] Fix broken import in initialize_ray_cluster() (@njhill #27838)
  • [Misc] Make all tool scripts executable (@MatthewBonanni #27831)
  • [CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (@jakub-sochacki #26919)
  • [CI Test] Add Scheduled Integration Test (@yewentao256 #27765)
  • [benchmark] Make request IDs unique across clients by default (@eicherseiji #27723)
  • [Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (@Akashcodes732 #27734)
  • [Kimi-Linear] Correct prefixes and add compatibility to AWQ quants (@toncao #27834)
  • [Bugfix] Avoid too small block m/n for FlexAttention kernel option (@Isotr0py #27853)
  • [BugFix] Don’t compute reorder threshold when there are no attention groups (@hl475 #27861)
  • [Perf] Decouple torch op from GDA to leverage torch.compile (@ZJY0516 #27871)
  • [CI/Build] Add gpt-oss LoRA test (@jeejeelee #27870)
  • [Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (@shadeMe #27876)
  • [Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (@Isotr0py #27860)
  • [Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (@GuanLuo #26338)
  • Docs update tpu install instructions (@RobMulla #27824)
  • [bugfix] Missing cached item in beam search (@fake0fan #27874)
  • fix incorrect type annotation in KimiMLP (@skyloevil #27885)
  • Flashinfer_CUTLASS_MOE fuses quantization for TP (@wenscarl #27223)
  • [Cleanup] Remove no-longer-used SpeculativeConfig.enable_chunked_prefill (@njhill #27826)
  • [Feature] Pydantic validation for scheduler.py and structured_outputs.py (@vrdn-23 #26519)
  • Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (@MatthewBonanni #27663)
  • Batch invariance doc (@bwasti #27839)
  • [Hybrid] A simpler algorithm to find kernel_block_size (@heheda12345 #26476)
  • [Core] Async scheduling + structured outputs compatibility (@njhill #26866)
  • [Kernel] Enable FusedMoEModularKernel support bias (@jeejeelee #27754)
  • [Bugfix] Fix KDA output (@jeejeelee #27905)
  • [Multimodal][XPU]Enable vision attn backend for xpu platform (@yma11 #27525)
  • Adding SplitK in fused_moe_lora kernel (@yugong333 #27818)
  • [CI/Build] Bump transformers version (@DarkLight1337 #27528)
  • [Bugfix] [Model] Missing MRoPE function definition from KeyeForConditionalGeneration (@tjtanaa #27895)
  • [Add] cmdline argument parsing for KV cache offloading modules (@ApostaC #27621)
  • feat(benchmarks): support HF model names in multi-turn benchmark (@ai-jz #27850)
  • [Docs] Mock all imports for docs (@hmellor #27873)
  • [V0 deprecation] Remove VLLM_USE_V1 usage in platform and v1 module (@wangxiyuan #27798)
  • [Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues (@xiaohajiayou #26779)
  • [Bugfix] Python 3.10 compatibility for Self (@DarkLight1337 #27918)
  • [Core][TPU] Support TPU Data Parallelism (@wenxindongwork #27365)
  • [BugFix] Fix mixed penalties batch with async scheduling (@njhill #27910)
  • Adds anthropic /v1/messages endpoint to openai api_server (@bbartels #27882)
  • [KV offload] Offloading connector async scheduling support (@KevinCheung2259 #27648)
  • [CI/Build] Fix flaky test_transcription_validation.py::test_basic_audio_gemma (@bbrowning #27924)
  • [Bugfix] Fix Qwen Omni audio inference (@DarkLight1337 #27920)
  • Performance fix MistralTokenizer: cache special ids and tokens (@juliendenize #27925)
  • [V1] [Hybrid] Mamba1 Automatic Prefix Caching (@Josephasafg #26377)
  • [Misc] Provide Siglip2 chat template (@DarkLight1337 #27939)
  • [Bugfix][llm]: Abort orphaned requests when llm.chat() batch fails (@Flink-ddd #27420)
  • [BugFix][LoRA] use adapter_id instead of id field of lora_request (@biswapanda #27728)
  • [Frontend] Align finish_reason when tool is called with OpenAI (@n0gu-furiosa #25054)
  • [Hybrid] Pass kernel block size to builders (@tdoublep #27753)
  • [Bugfix] Padded Eagle Specdec with Chunked Prefill (@Flechman #26263)
  • [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (@jikunshang #27964)
  • Add ORCA endpoint load metrics support (@efimki #24905)
  • [CI/Build] Remove the flaky gpt-oss lora test (@jeejeelee #27966)
  • [Model] Add PaddleOCR-VL Model Support (@zhang-prog #27758)
  • Early exit for MoE LoRA kernels (@gnovack #27131)
  • [Bugfix] Skip gs:// model paths for speculator detection (@pwschuurman #27846)
  • [BUG] Make 'binary' default option for saving torch compile artifacts when using standalone_compile (@ahao-anyscale #27616)
  • [CI/Testing] Add basic single node dual batch overlap test (@LucasWilkinson #27235)
  • [Spec Decode] Integrate Suffix Decoding from Arctic Inference (@aurickq #25784)
  • [Feature][Benchmarks] Support inf burstiness (@sducouedic #26941)
  • [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (@Lucaskabela #27764)
  • [Bugfix] change FlashMLA reorder_batch_threshold (@MatthewBonanni #27777)
  • [Docs] add runai_streamer_sharded to LoadConfig (@andyxning #27937)
  • Add TP parameter to attention tests (@MatthewBonanni #27683)
  • [Bugfix][plugin] fla crash on plugin (@ILikeIneine #27322)
  • [Bugfix] Fix MoE Routing Simulation (@tlrmchlsmth #28002)
  • Remove the tpu docker image nightly build. (@QiliangCui #27997)
  • [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (@vllmellm #27748)
  • [LoRA] Lora shrink swizzle (@li2haipeng #27694)
  • [Refactor] Lazy import tool_parser (@chaunceyjiang #27974)
  • [NIXL][XPU] Pin NIXL version to 0.7.0 (@zhenwei-intel #27849)
  • [Metrics] Enable sleep state metric outside of dev mode (@markmc #27867)
  • [Bug] Batch invariant: Fix flash attn MLA RuntimeError: scheduler_metadata must have shape (metadata_size) (@yewentao256 #27884)
  • [CPU]Improve dynamic 4bit moe performance (@xiangze-arm #27240)
  • [CI/Build] Update LM Eval Version in AMD CI (@zhewenl #27944)
  • [KV Connector] Make KVCacheConfig an explicit constructor argument (@markmc #27887)
  • [Model] fix ernie45 reasoning_parser (@CSWYF3634076 #27973)
  • [CI/Build] Fix OpenAI API correctness on AMD CI (@zhewenl #28022)
  • [BugFix][Performance] Restore flashinfer autotuning for all scenarios (@varun-sundar-rabindranath #27904)
  • Load tuned fused_moe_lora shrink and expand kernel configs separately (@yugong333 #27435)
  • Support using Int4PreshuffledTensor after loading (@jerryzh168 #26066)
  • [Core] Enable StatLogger in LLMEngine (@zhuohan123 #28020)
  • [Model][Bugfix] fix pipeline parallelism support for NemotronH (@tomeras91 #27968)
  • [Model] add optimal triton fused moe configs for NemotronH MoE (@tomeras91 #27967)
  • [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (@bnellnm #27123)
  • [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (@njhill #28025)
  • [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (@LyrisZhong #27284)
  • [PERF] Decouple projections from GDN custom op (@vadiklyutiy #27512)
  • [model] Add support for openPangu_Ultra_MoE (@yt0428 #27521)
  • [PerfFix] Avoid separate thread for MP executor shm spin (@njhill #28012)
  • [AsyncScheduling] Don't schedule past request max_tokens (@njhill #27922)
  • Remove deprecated --rope-scaling and --rope-theta (@hmellor #28006)
  • [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (@ganyi1996ppo #25763)
  • Added disable rule to track files under benchmarks/lib (@nadavkluger #28048)
  • [Multimodal] Make MediaConnector extensible. (@huachenheli #27759)
  • [ROCm] gemm_a16w16 upstreaming (@maleksan85 #26969)
  • Revert "[PERF] Decouple projections from GDN custom op" (@vadiklyutiy #28080)
  • [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (@toulzx #27740)
  • [XPU] Add gpt-oss model support for Intel GPU (@jikunshang #27786)
  • [CI/Build] Enable some fixed tests in AMD CI (@zhewenl #28078)
  • [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (@wangxiyuan #27955)
  • [Bugfix] Fix encoder-only model support for transformers backend (@Isotr0py #28021)
  • [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (@LucasWilkinson #28100)
  • [Model, Core] Support Granite Speech & LoRA for STT (@alex-jw-brooks #24455)
  • [Refactor] Lazy-loaded reasoning_parser (@chaunceyjiang #28092)
  • [Refactor] Simplify and extract the shared logic between chat completions and responses (@chaunceyjiang #27961)
  • [bugfix] fix wrong dcp_local_seq_lens calc (@pisceskkk #27518)
  • [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (@KuntaiDu #28011)
  • [Misc] fix import error for DeepSeekR1ReasoningParser (@chaunceyjiang #28114)
  • Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (@minatoaquaMK2 #27635)
  • Bugfix: Cutlass FP8 FusedMoE bad scaling factors (@amirkl94 #27255)
  • [Graph Partition][Cache] Use inductor partition ops config (@BoyuanFeng #27702)
  • [XPU] Enable custom routing functions in IPEX for Llama4 (@frost-intel #28004)
  • add kimi reasoning parser (@MoyanZitto #28128)
  • [DCP] check return_lse for all layers in dcp (@heheda12345 #27929)
  • [BugFix] Support EP/DP + EPLB with MTP (@ilmarkov #25311)
  • Enabling cooperative multi-gpu tests on multi-gpu nodes (@Alexei-V-Ivanov-AMD #27986)
  • [ROCm][MLA] Support block-size > 1 for AITER MLA backend (@ganyi1996ppo #27224)
  • [Bugfix] Validate custom logits processor xargs for online serving (@Isotr0py #27560)
  • [misc] add vLLM Beijing Meetup (@jjzhang #28127)
  • [Kernel] Fuse computation of g and beta for Gated Delta Net (@ZJY0516 #28095)
  • [Core] add support for reasoning parser plugins (@walterbm #28075)
  • [Bugfix] vLLM should check Inductor config for compile cache enablement status (@gmagogsfm #27637)
  • [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (@heheda12345 #27994)
  • [CI]: Add LMCache Unit Tests (@sammshen #27852)
  • [Feature] Extend batch invariant torch.compile to B200 (@PaulZhang12 #27856)
  • [Bugfix] Fix Qwen3-Reranker-8B load (@noooop #28117)
  • [Docs] Clean up README_TUNING.md (@windsonsea #28088)
  • [Hardware][IBM Z] Optimize s390x Dockerfile (@R3hankhan123 #28023)
  • [Chore] Remove Nemotron-Nano-VL config copy (@Isotr0py #28126)
  • [Docs] Add guide to debugging vLLM-torch.compile integration (@zou3519 #28094)
  • [Feature]: Add corrupted request metric to V1 metrics system. (@atalhens #27306)
  • [CI/Build] Update checking logic in cutlass_group_gemm_supported (@zhewenl #27948)
  • [CI/Build] Fix test_defaults_with_usage_context in AMD CI (@zhewenl #27926)
  • [Core][Hybrid allocator + connector 2/n] Unify remove_skipped_blocks by get_last_useful_token (@KuntaiDu #25431)
  • [Debugging] Add annotation for easier trace analysis (@dayeol #22496)
  • [PERF] Decouple projections from GDN custom op. Attempt 2 (@vadiklyutiy #28083)
  • [Bug] Fix cpu disable shared_experts VLLM_DISABLE_SHARED_EXPERTS_STREAM (@yewentao256 #28157)
  • [Bug] Fix env string "0" being treated the same as True (@yewentao256 #28159)
  • [Feature] Enable TP + EP shared_experts overlap with router, 3.7% E2E performance improvement (@yewentao256 #28164)
  • [CI Failure] nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV was removed from HF. Skip it in tests (@vadiklyutiy #28170)
  • [Misc] Remove the duplicate code (@chaunceyjiang #28111)
  • [Chore] Clean up deepseek v2/v3 config copy (@Isotr0py #28055)
  • [Core][MM] Use non-blocking CPU-GPU copy of multimodal data (@lgeiger #28141)
  • Make the cv2 dependency optional (@cmpute #27780)
  • [CI] Add compile/test_multimodal_compile.py to CI (@gmagogsfm #28151)
  • [flashinfer] fix FI all2all with FI cutlass moe (@mxz297 #28166)
  • Patch Mistral Tokenizer (@juliendenize #28146)
  • Fix hard-coded parameter name in gemma3n.py (@seungduk-yanolja #27946)
  • [CPU] Enable torch profiling (@aditew01 #28130)
  • [V0 deprecation] clean up is_v1_supported_oracle (@wangxiyuan #28116)
  • [Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (@courage17340 #28181)
  • [Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (@chaunceyjiang #26874)
  • [CPU]Improve cpu fused moe perf (@xiangze-arm #27244)
  • Disable nm-testing models with issues in CI (@mgoin #28206)
  • [Docs] Switch to directory style URLs (@hmellor #28058)
  • [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (@minatoaquaMK2 #28200)
  • [Doc] Add Arm CPUs are on the list of supported targets in vLLM (@milpuz01 #26018)
  • [HARDWARE][CPU] Add Option for Disabling Binding to Specific CPU Cores (@StanHatko #27953)
  • [Frontend] Fix logging format when enable response logging (@esmeetu #28049)
  • CODEOWNERS: Add myself as reviewer on security docs (@russellb #28216)
  • [Structured outputs] Upgrade llguidance to 1.3.0 (@andylolu2 #28039)
  • Add llama 4 scaling support (@juliendenize #28145)
  • [Chore] eliminate duplicated and unconditional object serialization in anthropic messages api (@vicoooo26 #27792)
  • [ROCm] triton fp8 kernel (@maleksan85 #27058)
  • [Doc]: Make extraInit containers fully configurable in helm chart (@HanFa #27497)
  • [Test] Add non-MoE DP test coverage (@MatthewBonanni #28235)
  • [BugFix] Fix FusedMoELoRA + ModularKernel Integration (@varun-sundar-rabindranath #28237)
  • Fix failing test for CRadio (@BloodAxe #27738)
  • Speed up mm processor kwargs per request by splitting dynamic and static kwargs (@LJH-LBJ #26483)
  • [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (@Lucaskabela #28242)
  • [CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (@alex-jw-brooks #28247)
  • Add runai model streamer e2e test for GCS (@amacaskill #28079)
  • Fix issues from #28242 (@hmellor #28257)
  • [amd][gptoss] Perf gain because of block alignment (@smitkadvani #28024)
  • [Bug] Fix missing token_ids for reasoning parser models in chat completions #28246 (@baonudesifeizhai #28256)
  • [CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (@Copilot #28074)
  • [Kernel] LoRA triton kernels support PDL (@jeejeelee #27402)
  • [Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (@Jialin #28171)
  • [FixBug] Aeala/ShareGPT_Vicuna_unfiltered marked as multimodal benchmark (@princepride #28265)
  • [CPU]Avoid repeated random sample compile (@xiangze-arm #28260)
  • [Misc][Model][Refactor] Pass the prefix into Linear layers (@MengqingCao #28259)
  • [fix] Revert "fixing mm placeholder replacement issue with gemma3" (@khluu #28285)
  • [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (@lgeiger #28168)
  • [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (@pavanimajety #27439)
  • [CLI] add --max-tokens to vllm complete (@Iceber #28109)
  • [Feature] Default ignore_eos True for random dataset (@yewentao256 #28227)
  • [Log] update shm wait time msg (@BoyuanFeng #28255)
  • Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (@NickLucche #28289)
  • [README] Add Arm CPUs to the list of supported targets (@fadara01 #28290)
  • [doc] add guide about the "provided PTX was compiled with an unsupported toolchain" error (@youkaichao #28305)
  • [Build] Fix release pipeline failing annotation (@simon-mo #28272)
  • [Bugfix] Fix and add tests for GptOss reasoning parser (@benchislett #28000)
  • [Core] Rework handling of async scheduling config (@njhill #28250)
  • [PerfFix] Avoid separate thread for MP executor shm spin (take 2) (@njhill #28319)
  • Update Flashinfer from v0.4.1 to v0.5.2 (@hmellor #27952)
  • [XPU] Enable Expert parallel for MoE models (@jikunshang #28263)
  • remove resolve_op_overloads and use splitting_ops directly (@BoyuanFeng #28081)
  • [Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (@xiaohongchen1991 #21068)
  • Update gpu.rocm.inc.md to add support for AMD Ryzen AI MAX / AI 300 Series (gfx1151, gfx1150) (@hammmmy #28308)
  • [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (@mgoin #28124)
  • Bump arctic-inference requirement (@aurickq #28174)
  • [bugfix] support eagle with lora cudagraph specialization (@gnovack #28318)
  • [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (@Isotr0py #28101)
  • Refactor CPU/GPU extension targets for CMake build (@ashahba #28026)
  • [flashinfer][fix] do not check nvcc availability when using pre-downloaded cubins (@mxz297 #27990)
  • [Attention] Remove max cudagraph size limit of 992 (@22quinn #27840)
  • reasoning_content -> reasoning (@hmellor #27752)
  • [Bugfix] Update device name for H200 detection (@robertgshaw2-redhat #28349)
  • [Bugfix] Spec decode + structured output + spec model max len edge case (@andylolu2 #28298)
  • [DCP] Support dcp kv_cache interleave size > 1 (@zhangsicheng5 #26696)
  • Enhance run_cluster.sh for multi-NIC support (@evberrypi #28328)
  • [Feat] Drop-in Torch CUDA Profiler (@benchislett #27841)
  • Remove setuptools upper bound constraint (<80) (@ColeMurray #28337)
  • [Bugfix] Fix test fused quant layernorm tests (@ElizaWszola #27865)
  • [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (@mmangkad #28345)
  • [chore] Move some wikimedia images to S3 (@khluu #28351)
  • fix: close issue 28338 by fixed python version (@yihong0618 #28339)
  • [Misc] fix typo and add detailed log (@andyxning #28178)
  • [ROCm] Add env to enable/disable aiter triton gemm (@sarckk #28321)
  • [Misc] Add some comments in qwen3-next (@ZJY0516 #28267)
  • [CI] Fix flaky test_eagle_correctness test (@NickLucche #28364)
  • [Core] Simplify async KV output aggregation (@njhill #28327)
  • [Core] Separate out attention metadata building logic from prepare inputs (@LucasWilkinson #26764)
  • [BugFix] Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (@usberkeley #28214)
  • [CI/Build] Temporary fix to LM Eval Small Models (@zhewenl #28324)
  • [Kernel] Fix fused_gdn_gating (@ZJY0516 #28343)
  • [ROCm][Platform] Add RX7900XTX device id in _ROCM_DEVICE_ID_NAME_MAP (@JartX #28279)
  • [CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (@varun-sundar-rabindranath #28322)
  • [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X (@sammysun0711 #28373)
  • [V0 deprecation] Remove no longer used get_metadata_cls (@LucasWilkinson #28370)
  • Restore PlaMo2 unit test as pfnet/plamo-2-1b now supports transformers >=4.56 (@Alnusjaponica #28019)
  • [Metrics] Refactor LoRA state tracking (@markmc #26801)
  • [bugfix] fix siglip batch text output error (@piood #28365)
  • [Fix] optimize visual token mask with caching and multi-token support (@bo-ke #28374)
  • Add @tjtanaa to codeowner for ROCm and multi-modal (@tjtanaa #28360)
  • [Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model (@zejunchen-zejun #27474)
  • [Kernel] Optimization of the mm_k operator. (@caozuoba #28280)
  • [RFC][ROCm][AITER] Keep all AITER kernels in _aiter_ops class like _custom_ops and _ipex_ops (@vllmellm #24490)
  • [V0 Deprecation] Remove unused context_len and seq_len from M-RoPE (@DarkLight1337 #28395)
  • [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (@varun-sundar-rabindranath #28366)
  • [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (@jiahanc #27492)
  • [Bugfix] Fix llguidance backend, rollback when EOS was encountered (@Flechman #25905)
  • [FA/Chore] Bump FA version for FP8 two-level accumulation (@jmkuebler #27889)
  • [Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (@SageMoore #28377)
  • [Misc] Add more scoping for improved trace (@frank-wei #28329)
  • [BugFix] Fix DeepGEMM over-allocating workspace (@LucasWilkinson #28254)
  • [Frontend][2/n] remove empty content from _parse_tool_calls_from_content (@qandrew #28331)
  • [CI] Fix Plugin Tests (@robertgshaw2-redhat #28413)
  • [ROCm] Add missing gemm_a8w8_blockscale import (@sarckk #28378)
  • [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (@ilmarkov #24248)
  • [Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for better coverage (@Jialin #27896)
  • [Bugfix] Ensure calculated KV scales are applied in attention. (@adabeyta #27232)
  • [Test] Remove old non-varlen FA2 test (@MatthewBonanni #28420)
  • [Feature] Refactor batch invariant fp8 DeepGEMM (@yewentao256 #27606)
  • [CI/Test Fix] Fix CP tests on Blackwell (@LucasWilkinson #28404)
  • [Feature] Add env var VLLM_MOE_USE_DEEP_GEMM (@yewentao256 #28422)
  • Only register rocm_aiter_ops if aiter is found (@mgoin #28428)
  • Fix rotary embedding benchmark script (@xyang16 #28323)
  • [Misc] FlattenLogprobs -> FlatLogprobs (@zhuohan123 #28335)
  • [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (@zhaozuy #27892)
  • [Bugfix] Fix Stream Sync for Shared Expert Overlap (@robertgshaw2-redhat #28430)
  • [Doc] Sleep mode documentation (@iAmir97 #28357)
  • [BugFix] Avoid calling KV connector layer APIs when metadata is unset (@sdavidbd #28253)
  • [Bugfix] Fix max image size for PaddleOCR-VL (@ywang96 #28442)
  • [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (@SageMoore #28369)
  • [Bugfix] fix qwen3-next crash (@ZJY0516 #28202)
  • [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla' (@faaany #28387)
  • [Model][Qwen3VL] Slighly speedup fast_pos_embed_interpolate (@lgeiger #28434)
  • Multi turn benchmark progress bar for synthetic conversation generation (@segevido #28394)
  • [CI] Add mergify rules for nvidia label (@mgoin #28417)
  • [Attention] Refactor CUDA attention backend selection logic (@MatthewBonanni #24794)
  • Fix Fused MoE LoRA Triton kernel bug (@chaojun-zhang #28450)
  • [Model] Pass mm_features directly into get_mrope_input_positions (@DarkLight1337 #28399)
  • Add request timeout override for multi-turn benchmarks (@segevido #28386)
  • [Docs] Fix grammar in CPU installation guide (@maryamtahhan #28461)
  • [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (@bnellnm #28064)
  • [BugFix] Fix Failing Ruff Check (@jvlunteren #28469)
  • Add @markmc to CODEOWNERS for Observability (@markmc #28457)
  • [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (@faaany #28444)
  • [BugFix] Add test_outputs.py to CI pipeline (@usberkeley #28466)
  • [Doc] Fix typo in serving docs (@the-codeboy #28474)
  • Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (@mgoin #28431)
  • [NIXL] Generalize block-first backend layouts (FlashInfer-like) (@NickLucche #28282)
  • [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (@izhuhaoran #27165)
  • [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (@xuebwang-amd #24239)
  • [Quantization] fix attention quantization of gpt_oss model (@xuebwang-amd #27334)
  • [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (@zhewenl #28424)
  • Prefer FlashAttention MLA as default over FlashMLA (@MatthewBonanni #27363)
  • [Kernel] Optimize rms_norm kernel (@xyang16 #27931)
  • [BugFix] Fix Siglip2Attention on XPU (@faaany #28448)
  • [Misc] Remove unused attention prefix prefill ops functions (@lgeiger #26971)
  • [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (@Jialin #28245)
  • [V0 deprecation] Clean up num_prefill_tokens logic for V0 (@gcanlin #28203)
  • [Misc] fix typo in DCP comment (@Livinfly #28389)
  • [LoRA][1/N]Remove LoRA extra vocab (@jeejeelee #28382)
  • [TPU] Rename path to tpu platform (@kyuyeunk #28452)
  • [Misc] Cleanup Executor interface (@wangxiyuan #28441)
  • Add Zurich vLLM Meetup (@mgoin #28488)
  • [Bugfix] Disable shared expert overlap if Marlin MoE is used (@mgoin #28410)
  • [Feature] Allow configuring FlashInfer workspace size (@maxyanghu #28269)
  • Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (@adabeyta #28491)
  • [BugFix] Graceful handling of torch symm mem errors. (@ilmarkov #27671)
  • [Frontend] Change CompilationMode to a proper Enum (@gmagogsfm #28165)
  • [Performance] Cache loaded custom logitsprocs to avoid overheads (@Isotr0py #28462)
  • [V0 deprecation] Remove VLLM_USE_V1 env (@wangxiyuan #28204)
  • [CPU] Refactor CPU attention backend (@bigPYJ1151 #27954)
  • VLLM_USE_TRITON_FLASH_ATTN V0 variable deprecation (@AndreasKaratzas #27611)
  • [Model][Qwen3VL] Simplify get_mrope_input_positions using numpy (@lgeiger #28302)
  • [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (@fake0fan #25233)
  • [BugFix] Add fallback path in apply_rotary_pos_emb_flashattn for non-cuda platforms (@faaany #28447)
  • [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (@ai-jz #28493)
  • [Core] Cache vllm_is_batch_invariant (@lgeiger #28304)
  • [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (@fake0fan #28521)
  • [CI] Introduce autorun_on_main feature (@hl475 #27836)
  • [BugFix]: --enable-lora with model granite-4.0-micro crash (@yyzxw #27733)
  • [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (@wuyaoxuehun #27597)
  • [XPU]Fix crash due to removed VLLM_USE_V1 attribute (@chaojun-zhang #28520)
  • [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (@ziruiliu #27978)
  • add cpu option for p/d in nixl_connector (@ZhengHongming888 #28356)
  • [ROCm] [Bugfix] Fix fused_qknorm_rope_kernel rocm compatibility (@tjtanaa #28500)
  • [Bugfix] Fix gpt_oss packed_modules_mapping (@jeejeelee #28536)
  • [V0 deprecation] Deprecate use_v1 parameter (@wangxiyuan #28112)
  • Fix pre-commit (and XPU) on main (@hmellor #28556)
  • [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (@alexm-redhat #28492)
  • [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (@gcanlin #28271)
  • [Misc] Refactor Attention kv transfer methods into decorator (@NickLucche #27816)
  • Remove deprecated fields from CompilationConfig (@hmellor #27593)
  • [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (@benchislett #28479)
  • Implement ARC KV cache eviction policy (@albertoperdomo2 #27039)
  • [EPLB][ROCm]: support EPLB for ROCm backend (@PerryZhang01 #27731)
  • [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (@tdoublep #28563)
  • [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (@hmellor #28559)
  • [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (@ApostaC #27902)
  • [BugFix] Priority scheduling and spec tokens preemption (@andylolu2 #28558)
  • [Misc] Fix typo in llm_engine.py (@frank-wei #28584)
  • [Performance][B200] Fix deepgemm prologue (@varun-sundar-rabindranath #27897)
  • [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in _aiter_ops.py (@vllmellm #28464)
  • [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (@QiliangCui #28487)
  • [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (@mgoin #28294)
  • [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (@HollowMan6 #12695)
  • [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (@alexm-redhat #28406)
  • Skip models that cannot currently init on Transformers v5 (@hmellor #28471)
  • [Docs] Update meetups.md description (@mgoin #28583)
  • [ROCm][Bugfix] Revert removing setuptools version restriction (@gshtras #28592)
  • [platform] Move get_cu_count to utils (@wangxiyuan #27005)
  • [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (@mgoin #28561)
  • [BugFix] Fix mm_encoder_attn_backend arg type checking (@njhill #28599)
  • [Docs] Add some details about what the MoE block needs for the Transformers backend (@hmellor #28588)
  • Rename clashing method names for vLLM model protocol (@hmellor #27583)
  • [n-gen] DO NOT repeatedly return finished child requests (@Jialin #28591)
  • [Frontend] split append tool output (@qandrew #28333)
  • [Frontend][responsesAPI][1/n] convert responses API tool input to chat completions tool format (@qandrew #28231)
  • [BugFix][ROCm] Fix get_cu_count missing variable error (@ganyi1996ppo #28608)
  • [XPU] Support Triton path for LoRA operations on XPU (@faaany #28511)
  • Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 (@luccafong #28574)
  • [build][cmake]: Bundle static ACL and torch libgomp for CPU extension builds (@Radu2k #28059)
  • [ROCm][BugFix] Remove the usage of device_info from aiter (@ganyi1996ppo #28383)
  • [Bugfix] Prevent crash on empty grammar string (@tjandy98 #28210)
  • Use official xformers-0.0.33 built for PT 2.9 (@huydhn #28600)
  • Add NUMA node validation for CPU thread binding (@usberkeley #28555)
  • [Bugfix] fix kimi-linear crash (@ZJY0516 #28445)
  • [Frontend] supports interleaved thinking (@chaunceyjiang #28531)
  • Support all interleaved layer types (@sarckk #28485)
  • Fix: Correctly filter special tokens in benchmark_prefix_caching (@dw2761 #28615)
  • [BugFix] Fix type error when assigning a triton kernel tensor to a torch.nn.Parameter (@liuzijing2014 #28603)
  • Fix io processor pooling #28273 (@baonudesifeizhai #28484)
  • [XPU] add sym params to IPEXConfig (@zufangzhu #28611)
  • [Bugfix] Fix FPS value type for Qwen2.5-Omni video processing (@faaany #28630)
  • [Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (@Akashcodes732 #28535)
  • [ROCm][BugFix]Fix get_cu_count in rocm_aiter_fa.py (@ganyi1996ppo #28618)
  • [CI/Build] Install uv for AMD MI300: Language Models Tests (Hybrid) %N (@amdfaa #28142)
  • [CI Failure] Fix backend selection for encoder-only models (@hl475 #28534)
  • [BugFix] DeepSeek-OCR: apply NoRepeatNGramLogitsProcessor to greedy path (@YuanpingSong #28617)
  • Fix get_num_experts when config sets it explicitly to None (@hmellor #28652)
  • [Misc] Turn off encoder torch compile by default (@ywang96 #28634)
  • Rewrite C++ meta funcs to Python (@janeyx99 #28595)
  • [BugFix] Ensure EngineArgs.create_engine_config is idempotent (@njhill #28515)
  • [TPU] patch TPU wheel build script to resolve metadata issue (@jcyang43 #27279)
  • [Performance][B200] silu_mul_quant: pack scales in int32 (@varun-sundar-rabindranath #28358)
  • [Bugfix] Fix validate model input for decoder models (@yannicks1 #27099)
  • [Attention][Bugfix] Fix FA sink support (@MatthewBonanni #28660)
  • [Perf] Support stream interval for reducing host overhead (@elvischenv #27869)
  • [bugfix] correct local_chunk_len for DCP in reorg_kvcache with long context (@pisceskkk #28526)
  • [Bugfix] Eliminate tuple inputs to submodules in graph partitioning (@gmagogsfm #28533)
  • [Bugfix] [CPU] bump torch to 2.9.0 for Darwin to fix segmentation fault (@kebe7jun #27791)
  • [Misc] Update CODEOWNERS for simon-mo and comaniac (@simon-mo #28675)
  • [CI] Bug: Fix ci entrypoint pooling (@yewentao256 #28684)
  • [KV Connector] Test async mode in scheduler tests (@markmc #28550)
  • Mirrored test group definitions for AMD (2025-11-11) (@Alexei-V-Ivanov-AMD #28573)
  • [quantization][config] enable override existing quant_config (@ILikeIneine #28510)
  • [ROCm] Bump up the version of amd-smi to 6.4.3 (@SageMoore #28680)
  • [CPU][Bugfix] Fix Apple Silicon M1 compilation failure (@mgoin #28681)
  • [ci][amd] fix basic models extra init test (@bradleyhd #28676)
  • [Misc] Remove warn_for_unimplemented_methods (@DarkLight1337 #28613)
  • [XPU][CI] disable LMCache unit tests (@jikunshang #28696)
  • [Misc] Update xformers to 0.33.0.post1 (@ywang96 #28678)
  • [Misc] add ignore mapper for quark quantization (@haoyangli-amd #28275)
  • [Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_inference/spec_decode.py (Issue 27619) (@rasmith #28432)
  • [BugFix][CI/Build][ROCM] Fix import error and apply assert in appropriate case in test_struct_output_generate (@rasmith #28311)
  • use default CCL_ZE_IPC_EXCHANGE (@yma11 #28700)
  • [Bugfix] fix dots.ocr pp support (@ZJY0516 #28705)
  • [BugFix] Fix multi-modal async scheduling race condition (@njhill #28706)
  • Add output token counting to gsm8k eval (@mgoin #28594)
  • [Minor] Avoid registering a new custom op; just import silly_attn (@BoyuanFeng #28578)
  • [Misc] fix comment in test_envs (@xingliu14 #28529)
  • [feat]: log number of preempted requests (@610lyn #28522)
  • [Frontend] Added chat-style multimodal support to /classify. (@WorldExplored #27516)
  • [Model][MM] Extract conv layer as CustomOp (@shen-shanshan #28455)
  • [DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer (@gjc0824 #25438)
  • Fix KV sharing fast prefill with cudagraph enabled (@sarckk #28537)
  • [BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (@LucasWilkinson #28702)
  • [Doc] Fix macOS installation dependency resolution issue (@shahfasal #26721)
  • [Model] Fix bailing_moe accuracy problem (@zhaozx-cn #28277)
  • [Bugfix][Nixl] Fix kernel physical<>logical block_size issue (@NickLucche #28677)
  • [Config] Clean up SchedulerConfig initialization (@DarkLight1337 #28665)
  • [Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (@djmmoss #27134)
  • [Fix] improve aspect ratio in dummy image generation and add common VLM tests for PaddleOCR-VL (@dongbo910220 #28711)
  • [Docs] Update the name of Transformers backend -> Transformers modeling backend (@hmellor #28725)
  • [CI][CPU] Smoke test for Apple Silicon using GHA MacOS runner (@mgoin #28688)
  • [DisaggEverything] Tokens in<>out /generate endpoint (@NickLucche #24261)
  • [Attention] Bump FA for removed method (@MatthewBonanni #28429)
  • Fix typo in comment: existance -> existence (@OthmanMohammad #28737)
  • Remove audio optional dependency for mistral-common (@juliendenize #28722)
  • [kernel] Improve FP8 PTPC on Hopper for larger shapes (@czhu-cohere #28692)
  • docs(lora_resolvers): clarify multi-resolver order and storage path requirement (@wangchen615 #28153)
  • LLaMA4 LoRA Adapter Enablement (@kfhfar #28602)
  • [Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with torch compile dynamo (@tjtanaa #28716)
  • [Docs] Enable some more markdown lint rules for the docs (@hmellor #28731)
  • [Chore] Rename SchedulerConfig.chunked_prefill_enabled (@DarkLight1337 #28735)
  • [Bugfix] resolve Qwen3-VL GPTQModel quantized model loading failure (@GuanH #28663)
  • [BugFix] Fix misprint introduced by modular_kernel refactoring. (@halyavin #28728)
  • [ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (@SageMoore #28682)
  • [CI] Fix macos smoke test uv cache issue (@mgoin #28736)
  • [Bugfix] TypeError: 'NoneType' object is not callable (@mostrowskix #27410)
  • [ROCm][CI/Build] Change install location of uv (@gshtras #28741)
  • Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDispatch (@laithsakka #25110)
  • [Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (@alexm-redhat #28740)
  • [Bugfix] Fix ChunkedLocalAttention CUDA Graph setting (@benchislett #28739)
  • [Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 (@tdoublep #28295)
  • [Log] Save profiler results to file instead of stdout (@rasmith #28144)
  • [ROCm][CI/Build] Upgrade to ROCm 7.1 and AITER main (@gshtras #28753)
  • [Test] Rework e2e async scheduling tests (@njhill #28744)
  • [Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization (@Jialin #26368)
  • [TPU] Fix import error in tpu launch (@QiliangCui #28758)
  • [Model][Qwen3VL] Use mm_position to compute mrope positions (@lgeiger #28730)
  • [Bugfix] Build hadacore kernels on >SM90 (@mgoin #28748)
  • Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… (@njhill #28773)
  • Fix IntermediateTensors initialization and add type hints (@OthmanMohammad #28743)
  • [NIXL] heterogeneous block_size support (@xuechendi #26759)
  • [Performance][DeepGEMM] Estimate expected_m (@varun-sundar-rabindranath #28694)
  • [Redo] #26368 (@DarkLight1337 #28771)
  • [RL] [V1] Remove unused device argument from reset_kv_cache (@zhuohan123 #28766)
  • Use narrow over indexing in hadacore_transform to prep for ABI stable (@janeyx99 #28756)
  • [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (@zhewenl #28709)
  • [Misc] Make SchedulerConfig.max_model_len init-only (@DarkLight1337 #28733)
  • [PERF] Remove TRTLLM Gen attn kernel limitation max_seq_len <= 131072 (@vadiklyutiy #28755)
  • [compile] Enable sequence parallelism matching w/o custom ops enabled (@angelayi #27126)
  • Allow Gemma3 to take image embeddings (@tingtingtangmeta #28483)
  • [Doc] Fix failing doc build (@DarkLight1337 #28772)
  • [Model] Fix lmhead init bug of bailing_moe (@hwhaokun #28777)
  • Add support for Eagle with separate lm-head and embed_tokens layers (@eldarkurtic #28549)
  • [CI] Fix broken pipeline (@njhill #28781)
  • [Model][Qwen3VL] Cache positional embedding indices (@lgeiger #28475)
  • [Doc]: fix typos in various files (@didier-durand #28567)
  • [BugFix] Fix AssertionError: DCP not support reorder_batch_threshold > 1 now. (@LucasWilkinson #28751)
  • Adding a benchmark for batch invariance (@bwasti #28161)
  • [Benchmark] Fix client seed synchronization in multi-turn benchmark (@ai-jz #28512)
  • [Model] Allow users to control whether to skip reading the cache per request (@noooop #28194)
  • [V1] Support MP Executor for multi node distributed inference (@luccafong #23691)
  • Fixed gpt-oss _load_weights_other() parameter position bug (@River12 #28715)
  • [Bugfix] Fix host and port join for ipv6 in bench serve (@scottzh8 #28679)
  • Fix gpt oss weight loading with EP + bf16 (@ashors1 #28765)
  • [Doc]: fix typos in various files (@didier-durand #28811)
  • fix comment typo (@andyxning #28802)
  • [Model][QwenVL] Optimize Qwen2_5_VisionAttention q,k preparation (@lgeiger #28769)
  • Feature: Support Relu2 in FusedMoE fp8 cutlass path (@amirkl94 #27261)
  • [BugFix] Fix async scheduling + chunked prefill + preemption (@njhill #28787)
  • [Performance][Fix] update nvfp4 code to support renorm routing (@jiahanc #28569)
  • [NIXL][XPU] update install script of NIXL (@zhenwei-intel #28778)
  • [ROCm][Qwen3-32B] Fix AITER MHA accuracy issue caused by #25763 (@sammysun0711 #28670)
  • [Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode (@jscaldwell55 #28543)
  • [Doc] Add llama4 LoRA tag (@jeejeelee #28825)
  • [CPU][Bugfix] Fix _to_list in CPU model runner (@bigPYJ1151 #28824)
  • [BugFix] Fix glm4_moe_mtp load weights bug (@wuyaoxuehun #28805)
  • [Metrics] Fix KV cache usage percent metric multiproc (@jaywonchung #28792)
  • [XPU] work around for sp, avoid custom op import error (@jikunshang #28822)
  • [BugFix] Temporary fix for IMA with MTP = 2 and full-cg (@LucasWilkinson #28315)
  • [Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models (@ywang96 #28858)
  • Cast return value to int64_t for cache size (@tiehexue #28814)
  • [Bugfix] Fix GPT-OSS on AMD after #28603 (@zhewenl #28816)
  • [Core] Async Scheduling X Spec Decoding Compatibility (@Ronald1995 #24799)
  • [BugFix] Fix PP performance and PP kv connector output regression (@njhill #28768)
  • [Quantization] [Eagle] Add complete quantization support to the draft model in Eagle (@shreyas269 #28435)
  • [Test] Batch Invariant: Rename and organize tests (@yewentao256 #27421)
  • [Model] Add Afmoe architecture implementation (@pranav4501 #28332)
  • [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp > 1 (@bangshengtang #28774)
  • [Misc] Fix wrong comment in scheduler (@zhuohan123 #28880)
  • [Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (@bbartels #28831)
  • Run macos smoke test workflow on main commit (@mgoin #28752)
  • [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (@xuebwang-amd #28638)
  • [Refactor] Remove Unused Func in Batch Invariant (@yewentao256 #28881)
  • [Bugfix] Fix wrong CLI defaults for dynamic SchedulerConfig fields (@DarkLight1337 #28872)
  • [Doc]: fix typos in various files (@didier-durand #28863)
  • [Misc] Remove unnecessary parentheses from log statements (@andyxning #28897)
  • [CI] Fix async scheduling + spec decoding test flake (@njhill #28902)
  • [MISC] Remove format.sh (@KuntaiDu #28906)
  • [CI/Build] Replace wikipedia url with local server ones (@Isotr0py #28908)
  • [BugFix] Fix PP/async scheduling with pooling models (@njhill #28899)

New Contributors

  • @bwasti first commit is #25603
  • @Renovamen first commit is #25796
  • @patrick-toulme first commit is #25084
  • @kingsmad first commit is #25825
  • @yingjun-mou first commit is #25827
  • @zhoukezi first commit is #25854
  • @leejnau first commit is #25706
  • @adabeyta first commit is #25513
  • @acisseJZhong first commit is #25912
  • @a120092009 first commit is #25942
  • @Anionex first commit is #25354
  • @DrStone1971 first commit is #25843
  • @certainly-param first commit is #25935
  • @natoscott first commit is #26007
  • @kmaehashi first commit is #26005
  • @leo-pony first commit is #25470
  • @huijjj first commit is #24947
  • @levunet first commit is #24768
  • @Egor-Krivov first commit is #25668
  • @sixiang-google first commit is #25992
  • @astralord first commit is #26027
  • @jasl first commit is #26098
  • @nrghosh first commit is #26148
  • @southfreebird first commit is #25974
  • @soldni first commit is #26054
  • @yuafng first commit is #26219
  • @ILikeIneine first commit is #25823
  • @jasonlizhengjian first commit is #25998
  • @elieserr first commit is #26177
  • @orangeng first commit is #26266
  • @ymoslem first commit is #26258
  • @abhisheksheth28 first commit is #25521
  • @seven-mile first commit is #26231
  • @cfRod first commit is #26289
  • @atalhens first commit is #26265
  • @gholmes829 first commit is #25164
  • @dcampora first commit is #25945
  • @antrec first commit is #26340
  • @plliao first commit is #26325
  • @morrison-turnansky first commit is #26113
  • @isharif168 first commit is #26347
  • @Barry-Delaney first commit is #25931
  • @utkarshsharma1 first commit is #26279
  • @Aydin-ab first commit is #25283
  • @therealnaveenkamal first commit is #25103
  • @QierLi first commit is #24926
  • @zhiyuan1i first commit is #24486
  • @iwzbi first commit is #16601
  • @roikoren755 first commit is #25947
  • @luis5tb first commit is #25593
  • @wangxiongts first commit is #25550
  • @sangho-vision first commit is #26563
  • @muzian666 first commit is #26562
  • @HsChen-sys first commit is #22100
  • @FENP first commit is #26574
  • @gjgjos first commit is #26339
  • @andycandy first commit is #26629
  • @aitsvet first commit is #26713
  • @cyb70289 first commit is #26698
  • @kfhfar first commit is #26538
  • @n1ck-guo first commit is #24024
  • @ryanli first commit is #26758
  • @VladOS95-cyber first commit is #26726
  • @zklapow first commit is #26818
  • @HDCharles first commit is #26820
  • @Dhruvilbhatt first commit is #26837
  • @madongfly first commit is #26853
  • @li2haipeng first commit is #26319
  • @pdasigi first commit is #26143
  • @cern1710 first commit is #26637
  • @inc-jeong first commit is #26225
  • @bogdanminko first commit is #27008
  • @mandy-li first commit is #26883
  • @kimbochen first commit is #26943
  • @staghado first commit is #26916
  • @rkarhila-amd first commit is #25586
  • @hyongtao-code first commit is #27101
  • @jianyuh first commit is #27159
  • @uyzhang first commit is #27012
  • @shivampr first commit is #26268
  • @helunwencser first commit is #26832
  • @dagrayvid first commit is #27196
  • @ExtReMLapin first commit is #27253
  • @ReinForce-II first commit is #26789
  • @LiuLi1998 first commit is #22627
  • @sagiahrac first commit is #27211
  • @fangpings first commit is #27133
  • @jonathanc-n first commit is #27372
  • @bradleyhd first commit is #27124
  • @Navya1707 first commit is #27156
  • @piood first commit is #27324
  • @xxxxyu first commit is #26092
  • @usberkeley first commit is #27419
  • @strinczer first commit is #26706
  • @hjh0119 first commit is #27469
  • @wpc first commit is #27328
  • @yeshsurya first commit is #27188
  • @rogeryoungh first commit is #27535
  • @dcmaddix first commit is #27291
  • @tingtingtangmeta first commit is #27538
  • @minatoaquaMK2 first commit is #27323
  • @wangln19 first commit is #27565
  • @junpuf first commit is #27596
  • @sammshen first commit is #27600
  • @mpashkovskii first commit is #26886
  • @KevinCheung2259 first commit is #27670
  • @sammysun0711 first commit is #27623
  • @dumb0002 first commit is #24176
  • @sairampillai first commit is #25775
  • @FlamingoPg first commit is #27794
  • @SumanthRH first commit is #27789
  • @PaulZhang12 first commit is #27660
  • @jakub-sochacki first commit is #26919
  • @RobMulla first commit is #27824
  • @yugong333 first commit is #27818
  • @ai-jz first commit is #27850
  • @xiaohajiayou first commit is #26779
  • @biswapanda first commit is #27728
  • @efimki first commit is #24905
  • @zhang-prog first commit is #27758
  • @xiangze-arm first commit is #27240
  • @yt0428 first commit is #27521
  • @ganyi1996ppo first commit is #25763
  • @nadavkluger first commit is #28048
  • @toulzx first commit is #27740
  • @frost-intel first commit is #28004
  • @jjzhang first commit is #28127
  • @walterbm first commit is #28075
  • @dayeol first commit is #22496
  • @cmpute first commit is #27780
  • @seungduk-yanolja first commit is #27946
  • @aditew01 first commit is #28130
  • @milpuz01 first commit is #26018
  • @StanHatko first commit is #27953
  • @vicoooo26 first commit is #27792
  • @HanFa first commit is #27497
  • @amacaskill first commit is #28079
  • @smitkadvani first commit is #28024
  • @xiaohongchen1991 first commit is #21068
  • @hammmmy first commit is #28308
  • @ashahba first commit is #28026
  • @zhangsicheng5 first commit is #26696
  • @evberrypi first commit is #28328
  • @ColeMurray first commit is #28337
  • @bo-ke first commit is #28374
  • @caozuoba first commit is #28280
  • @zhaozuy first commit is #27892
  • @maryamtahhan first commit is #28461
  • @the-codeboy first commit is #28474
  • @xuebwang-amd first commit is #24239
  • @Livinfly first commit is #28389
  • @AndreasKaratzas first commit is #27611
  • @wuyaoxuehun first commit is #27597
  • @ziruiliu first commit is #27978
  • @ZhengHongming888 first commit is #28356
  • @albertoperdomo2 first commit is #27039
  • @PerryZhang01 first commit is #27731
  • @Radu2k first commit is #28059
  • @tjandy98 first commit is #28210
  • @dw2761 first commit is #28615
  • @zufangzhu first commit is #28611
  • @amdfaa first commit is #28142
  • @YuanpingSong first commit is #28617
  • @janeyx99 first commit is #28595
  • @xingliu14 first commit is #28529
  • @610lyn first commit is #28522
  • @WorldExplored first commit is #27516
  • @gjc0824 first commit is #25438
  • @shahfasal first commit is #26721
  • @zhaozx-cn first commit is #28277
  • @OthmanMohammad first commit is #28737
  • @GuanH first commit is #28663
  • @halyavin first commit is #28728
  • @mostrowskix first commit is #27410
  • @laithsakka first commit is #25110
  • @hwhaokun first commit is #28777
  • @River12 first commit is #28715
  • @scottzh8 first commit is #28679
  • @ashors1 first commit is #28765
  • @jscaldwell55 first commit is #28543
  • @tiehexue first commit is #28814
  • @Ronald1995 first commit is #24799
  • @shreyas269 first commit is #28435
  • @pranav4501 first commit is #28332

Full Changelog: v0.11.0...v0.11.1