v0.11.1
Highlights
This release includes 1456 commits from 449 contributors (184 new contributors)!
Key changes include:
- PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
- Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
- Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect `--async-scheduling` to be enabled by default in the next release.
- Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
- Anthropic API Support: Added support for the `/v1/messages` endpoint, allowing users to interact with `vllm serve` using Anthropic-compatible clients (see the sketch after this list).
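As a rough illustration of the new endpoint, here is a minimal sketch of calling `/v1/messages` with the `anthropic` Python SDK. The model name, port, and placeholder API key are assumptions, not part of the release; the server is assumed to have been started with something like `vllm serve Qwen/Qwen2.5-1.5B-Instruct --async-scheduling`.

```python
# Sketch only: assumes a local vLLM server on port 8000 started with, e.g.,
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct --async-scheduling
# and the `anthropic` Python SDK installed (pip install anthropic).
from anthropic import Anthropic

# Point the Anthropic-compatible client at the local vLLM server; the SDK
# appends /v1/messages to the base URL. The key is a placeholder, since vLLM
# only enforces a key when the server is launched with --api-key.
client = Anthropic(base_url="http://localhost:8000", api_key="EMPTY")

message = client.messages.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the served model name
    max_tokens=128,
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(message.content[0].text)
```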
Detailed release notes will be updated in the next few days.
What's Changed
- [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition (@frankwang28 #25355)
- [Docs] Add Toronto Meetup (@mgoin #25773)
- [CI] Add E2E Blackwell Quantized MoE Test (@mgoin #25723)
- [V1] address post issues related to #20059 (part 1); cascade attention reenable by default (@fhl2000 #23046)
- [CI] Fix FlashInfer AOT in release docker image (@mgoin #25730)
- [spec decode] Consolidate speculative decode method name for MTP (@zixi-qi #25232)
- Reduce the Cuda Graph memory footprint when running with DBO (@SageMoore #25779)
- Kernel-override Determinism [1/n] (@bwasti #25603)
- [Bugfix] Optimize CpuGpuBuffer initialization (@namanlalitnyu #25447)
- [Spec decode] automatically disable mm for text-only draft models (@jmkuebler #25667)
- [Core] Don't count preempted tokens in prefix cache hit rate (@zhuohan123 #25787)
- Add option to restrict media domains (@russellb #25783)
- Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile (@mgoin #25782)
- [Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl (@david6666666 #22872)
- [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (@yewentao256 #25788)
- [CI/Build] Consolidate model loader tests and requirements (@DarkLight1337 #25765)
- [CI/Build] Add timing to Model Executor Test (@22quinn #25799)
- [CI/Build] Reorganize root-level V1 tests (@DarkLight1337 #25767)
- [Misc] Fix codeowners override for v1 sample and attention (@22quinn #25037)
- [Misc] Update openai client example file for multimodal (@ywang96 #25795)
- [Bugfix] Add missing `image_size` for phi4_multimodal (@Renovamen #25796)
- [Bugfix] Merge MM embeddings by index instead of token IDs (@DarkLight1337 #16229)
- Validate API tokens in constant time (@russellb #25781)
- Add filtering for chat template kwargs (@russellb #25794)
- Fix GPTQ model loading in Transformers backend (@hmellor #25770)
- [Bugfix] Fix triton import precommit failure (@tlrmchlsmth #25803)
- [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (@tlrmchlsmth #24982)
- [docs] Resolve transcriptions API TODO (@yyzxw #25446)
- [env] default nixl side port conflicts with kv-event zmq port (@panpan0000 #25056)
- [Core] Refactor self.model() to call a helper for subclassing. (@patrick-toulme #25084)
- [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (@ZJY0516 #25651)
- [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (@smarterclayton #25766)
- [Core] GC Debug callback (@Jialin #24829)
- [Bugfix][NIXL] Fix Async Scheduler timeout issue (@NickLucche #25808)
- [MM] Optimize memory profiling for scattered multimodal embeddings (@ywang96 #25810)
- [Bugfix] Fix Qwen3-VL regression from #24982 (@ywang96 #25814)
- [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (@Isotr0py #25557)
- Fix random dataset mismatched token length with config. (@weireweire #24937)
- Update GLM-4.5 Doc transformers version (@zRzRzRzRzRzRzR #25830)
- [Bugfix] fix Qwen3VLMoe load when pp > 1 (@JJJYmmm #25838)
- Remove redundant cudagraph dispatcher warning (@mgoin #25841)
- [Misc] fix tests failure by using current_platform (@kingsmad #25825)
- [P/D] NIXL Updates (@robertgshaw2-redhat #25844)
- Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (@tdoublep #25832)
- [XPU]Fix xpu spec decoding UTs, avoid using cuda graph (@jikunshang #25847)
- [Bugfix] Fallback ViT attn backend to SDPA for blackwell (@ywang96 #25851)
- [V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge (@Isotr0py #25331)
- [Misc] Remove more `get_input_embeddings_v0` (@DarkLight1337 #25857)
- update to latest deepgemm for dsv3.2 (@youkaichao #25871)
- [Bugfix] Fix requirements paths in install instructions (@yingjun-mou #25827)
- [Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models (@zhoukezi #25854)
- [torch.compile] serialize cudagraph_mode as its enum name instead of value (@ZJY0516 #25868)
- [Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector (@chenxi-yang #24690)
- [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (@rahul-tuli #25883)
- [CI/Build] Include Transformers backend test in nightly transformers test (@Isotr0py #25885)
- [Model] Remove MotifForCausalLM (@jeejeelee #25866)
- [Bugfix] Use correct key "ignore" for config.json non-quantized layers (@leejnau #25706)
- [BugFix][torch.compile] KV scale calculation issues with FP8 quantization (#21640) (@adabeyta #25513)
- [Doc] Add documentation for vLLM continuous benchmarking and profiling (@namanlalitnyu #25819)
- [Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so (@gshtras #25605)
- [Kernel] Chunk-aligned mamba2 (@tdoublep #24683)
- [Doc] Polish example for torchrun dp (@zhuohan123 #25899)
- [NIXL] Increase default KV block eviction timeout on P (@NickLucche #25897)
- [V0 Deprecation] Remove `vllm.worker` and update according imports (@aarnphm #25901)
- Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models (@qthequartermasterman #25717)
- [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (@yewentao256 #25909)
- [Benchmark] Support benchmark throughput for external launcher DP (@zhuohan123 #25913)
- Move `VllmConfig` from `config/__init__.py` to `config/vllm.py` (@hmellor #25271)
- [BugFix] Fix DP/EP hang (@LucasWilkinson #25906)
- [BugFix] Pass config_format via try_get_generation_config (@acisseJZhong #25912)
- [Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect `logical_not` (@zhoukezi #25925)
- [Bugfix]: Clean up chunked prefill logging when using whisper (@simondanielsson #25075)
- [New Model] DeepSeek-V3.2 (Rebased to Main) (@zyongye #25896)
- [Doc] Add Cambricon MLU support (@a120092009 #25942)
- Updated TRL integration docs (@sergiopaniego #25684)
- [Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (@CSWYF3634076 #25936)
- [Model] Move `vision_feature_select_strategy` into `resolve_visual_encoder_outputs` (@DarkLight1337 #25938)
- [perf] Use CPU tensor to reduce GPU->CPU sync (@lhtin #25884)
- [NIXL] Add support for MLA caches with different latent dim (@NickLucche #25902)
- [CI] Move applicable tests to CPU (@rzabarazesh #24080)
- [Fix] Improve CPU backend compatibility for RISC-V (@ihb2032 #25816)
- [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 (@Josephasafg #25858)
- Add Hugging Face Inference Endpoints guide to Deployment docs (@sergiopaniego #25886)
- [Bugfix][Model] Fix inference for Hunyuan dense models (@Anionex #25354)
- [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (@pavanimajety #25895)
- [Bugfix] Token type and position embeddings fail to be applied to `inputs_embeds` (@DarkLight1337 #25922)
- [bugfix][deepseek] fix flashmla kernel selection (@youkaichao #25956)
- [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' (@yewentao256 #25958)
- [Doc] Improve MM Pooling model documentation (@DarkLight1337 #25966)
- [Docs] Add moe kernel features doc (@bnellnm #25297)
- OffloadingConnector: Fix GPU block tracking bug (@orozery #25856)
- [Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in `Llama4VisionRotaryEmbedding` (@cjackal #25889)
- [Bench] Add DeepSeekV32 to MoE benchmark (@jeejeelee #25962)
- [V1] [P/D] Add Support for KV Load Failure Recovery (@sdavidbd #19330)
- Add explicit pooling classes for the Transformers backend (@hmellor #25322)
- [Docs] Remove API Reference from search index (@hmellor #25949)
- [gpt-oss] use vLLM instead of openai types for streaming (@qandrew #25186)
- [Misc] Make EP kernels install script support uv (@LucasWilkinson #25785)
- [Model] MTP fallback to eager for DeepSeek v32 (@luccafong #25982)
- Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (@DrStone1971 #25843)
- [Log] Optimize Log for FP8MOE (@yewentao256 #25709)
- Fix INT8 quantization error on Blackwell GPUs (SM100+) (@certainly-param #25935)
- [MM] Add text-only mode for Qwen3-VL (@ywang96 #26000)
- [Bugfix] Fix `__syncwarp` on ROCM (@zhewenl #25996)
- [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (@LucasWilkinson #25988)
- Update to Transformers `v4.56.2` (@hmellor #24638)
- [Misc]allow disable pynccl (@luccafong #25421)
- [Doc] updating torch.compile doc link (#25989)
- [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker (@wwl2755 #26004)
- [Misc] Factor out common `_apply_feature_select_strategy` (@DarkLight1337 #26003)
- [CI] Only capture a single CUDA graph size in CI by default (@hmellor #25951)
- [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 (@billishyahao #25829)
- [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type (@natoscott #26007)
- [Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (@kmaehashi #26005)
- [NVIDIA] Blackwell Family (@johnnynunez #24673)
- Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets (@hl475 #25995)
- [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (@mgoin #26030)
- [BugFix][DP/EP] Fix CUTLASS MLA hang under load (@LucasWilkinson #26026)
- [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (@hyoon1 #25908)
- [Bug] Fix Negative Cuda Memory Usage (@yewentao256 #25683)
- [BugFix] ChunkedLocalAttention is currently not CG compatible (@LucasWilkinson #26034)
- Support RL online quantization with torchao (@jerryzh168 #23014)
- [ROCm][Bugfix] Add missing parameter to ROCm backend (@gshtras #26029)
- [Misc] Make handling of SamplingParams clearer in n>1 case (@njhill #26032)
- Run:ai model streamer add GCS package support (@pwschuurman #24909)
- Update base image to 22.04 (jammy) (@huydhn #26065)
- Change size of single CUDA graph for CI to 4 (@tdoublep #26089)
- [FA/Chore] Bump vllm-flash-attention (@LucasWilkinson #25537)
- [Model] Use `merge_by_field_config` for MM models (A-C) (@DarkLight1337 #26073)
- [Model] Use `merge_by_field_config` for MM models (D-F) (@DarkLight1337 #26076)
- [Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU (@leo-pony #25470)
- [Qwen][ROCm] Flash Attention Rotary Embeddings (@vllmellm #24642)
- [CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (@mgoin #26040)
- [CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint with `vllm serve` command (@DarkLight1337 #25967)
- [BugFix] Fix FI accuracy issue when used for MLA prefill (@LucasWilkinson #26063)
- [Small] Prevent bypassing media domain restriction via HTTP redirects (@huachenheli #26035)
- [Deepseek v3.2] Support indexer prefill chunking (@heheda12345 #25999)
- EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench (@ekagra-ranjan #25916)
- [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP (@heheda12345 #25119)
- [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (@ElizaWszola #25696)
- Fix MTP with deepep_low_latency (@MatthewBonanni #25904)
- [Bugfix] Disable cascade attention with FlashInfer (@mgoin #26130)
- [Log] Optimize DeepGEMM Missing Log (@yewentao256 #26106)
- [Bug][Benchmark] Fix duplicate req in oversampling (@ekagra-ranjan #26140)
- [Attention] Move Backend enum into registry (@MatthewBonanni #25893)
- [CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (@mgoin #26138)
- [DeepSeek] Improve performance of DS MLA cache kernel (@MatthewBonanni #26132)
- [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (@benchislett #26144)
- [gpt-oss] disable tool server initialization if no tool in request (@qandrew #25790)
- [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (@tlrmchlsmth #26103)
- [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (@tjtanaa #26104)
- [Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (@zhewenl #26068)
- [Model] Use `merge_by_field_config` for MM models (G) (@DarkLight1337 #26117)
- `FusedMoE` support for the Transformers backend (@hmellor #22650)
- [BUG] Reorder model config creation (@ahao-anyscale #26124)
- [Misc] Remove typing.List (@varun-sundar-rabindranath #26150)
- [Input] Remove unused `prompt` field (@DarkLight1337 #26097)
- [Perf] Optimize `reshape_and_cache` CUDA Kernel (@ZJY0516 #25955)
- add(v1): RequestStatesStats to RequestOutput (@huijjj #24947)
- [Model] Use `merge_by_field_config` for MM models (InternVL family) (@DarkLight1337 #26153)
- [test utils] correct wrong typing (@yannicks1 #26159)
- [CI] Fix distributed hybrid tests in CI (@tdoublep #26155)
- [NIXL][Misc] Expose metrics from NIXL for logging to CLI (@NickLucche #25388)
- [openai] Fix missing tool usage check (system message) (@levunet #24768)
- [Multi Modal] Configurable MM Profiling (@wwl2755 #25631)
- [Doc] Fixed shape description for fused_batched_moe.py (@Egor-Krivov #25668)
- Quick fix for IMA with the Prefix Prefill kernel during graph capture (@SageMoore #25983)
- [Renderer] Move Processor out of AsyncLLM (@KKSK-DON #24138)
- Re-enable prefill of max model length (@yannicks1 #24446)
- [backends][short_conv] CUDA graph piecewise edits (@paulpak58 #24215)
- [Model] Supplement to PR 24862: Pass param prefix to LLMHead (@whx-sjtu #25805)
- [CI/Build] do not enforce precompilation on tpu ci tests (@sixiang-google #25992)
- [Model] Fixed stream generator for gpt-oss + spec-decoding (@astralord #26027)
- [Renderer] Move Processor out of LLMEngine (@DarkLight1337 #26165)
- Fix undefined symbol: cutlass_moe_mm_sm100 (@jasl #26098)
- [BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 (@xuechendi #26123)
- Stop mergify from keeping stale PRs alive (@hmellor #26169)
- Avoid division by zero in cache DS MLA kernel (@MatthewBonanni #26174)
- Fix V1 engine serialization error with Ray distributed executor (@nrghosh #26148)
- [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn (@pavanimajety #25968)
- [Perf] Remove hardcoded num_warps=1 (@chelsea0x3b #26183)
- [Refactor] Optimize FP8 MOE Backend Choice and Log (@yewentao256 #26044)
- [responsesAPI] add better error messaging for long prompts (@qandrew #25724)
- [Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.model' (@BowenBao #25964)
- [CI] Push multiarch manifests as nightly builds (@csahithi #25764)
- [Misc] Add penalties sampling parameters to serve tool (@southfreebird #25974)
- [BugFix] Fix de-functionalization pass for rotary_embedding (@angelayi #23953)
- [CI] Fix Pre-commit Mypy Error (@yewentao256 #26181)
- [GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (@varun-sundar-rabindranath #25488)
- Fix issue of using only the part of video frame [Nemotron Nano] (@BloodAxe #26186)
- [Bugfix] Fix qwen3 vl dummy data generation with overrides (@ywang96 #26193)
- [BugFix] Use async Mistral Tokenizer in Chat Completions (@bbrowning #26134)
- Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #25769)
- [cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (@fadara01 #25948)
- [V1] [Hybrid] Mamba2 Automatic Prefix Caching (@s3woz #25752)
- Support expert parallel in Transformers backend (@hmellor #26162)
- [Model] Support nested structures for TensorSchema (@DarkLight1337 #26212)
- [Misc] Require `merge_by_field_config` argument (@DarkLight1337 #26214)
- [Misc] Remove unused `executor.apply_model` (@DarkLight1337 #26215)
- [CI Failure] fix_test_auto_prefix_cache_support (@hl475 #26053)
- Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" (@DarkLight1337 #26220)
- Add Olmo 3 reasoning parser (@soldni #26054)
- [Core] Enable decode of context length equal to max model length (@yannicks1 #26168)
- [Bugfix] Fix `_reqs_to_process` leak on abort (@NickLucche #26012)
- [Model] CLIP Embedding Support (@DarkLight1337 #26010)
- Fix tensor device and dtype placement in Qwen2VL model (@yuafng #26219)
- [V1] [Hybrid] Remove code to override default CUDA graph configuration (@tdoublep #26226)
- [CPU] Refine batch reorder of CPU attention backend (@bigPYJ1151 #26096)
- [Frontend] Cache chat template kwargs resolution (@Isotr0py #26227)
- [Renderer] Clean up renderer code (@DarkLight1337 #26216)
- [Model] Use `merge_by_field_config` for MM models (H-L) (@DarkLight1337 #26230)
- [Easy] Add str repr for IterationStats (@22quinn #26232)
- [Bugfix] Allow `--skip-tokenizer-init` with `echo and return_token_ids` (@DarkLight1337 #26238)
- Add documentation for granite 4 tool calling (@maxdebayser #26175)
- [Perf][Easy] Early stop in request_block_hasher (@Jialin #26112)
- [Bugfix]: Assertion error when using FlashInfer backend (@simondanielsson #25933)
- [Bugfix] Always apply MM processor even when no MM items are passed (@DarkLight1337 #26240)
- [Bugfix][Hardware][RISC-V] Limit supported dtypes to float32 to avoid scheduler segfault (@ihb2032 #26228)
- [Refactor][Kernel] support loading kernel from other place (@ILikeIneine #25823)
- Convert formatting to use `ruff` instead of `yapf`+`isort` (@hmellor #26247)
- Remove all references to `yapf` as it's no longer used (@hmellor #26251)
- Remove all cases of `fmt: on/off` (@hmellor #26253)
- fix(tests): Resolve late binding of loop variable in assert message lambda (@ihb2032 #26249)
- Fix per file ruff ignores related to typing (@hmellor #26254)
- Update `ruff` pre-commit hooks version (@hmellor #26255)
- [CI] fix mamba kernel test (@ZJY0516 #26250)
- [NVIDIA] flashinfer TRTLLM attention prefill token limit (@jasonlizhengjian #25998)
- Fix per file ruff ignores related to simplification (@hmellor #26259)
- [CI] Add Blackwell LM Eval Small Models test to nightly (@mgoin #26052)
- [DOC] Update production-stack.md (@elieserr #26177)
- [CI] Add comment about the single cudagraph capture size that is used (@tdoublep #26252)
- [V1] [Hybrid] Some additional clean-up in Mamba2 prefix caching (@tdoublep #26222)
- [Doc] Edited minor typo (@orangeng #26266)
- [MISC] Add heheda12345 to CODEOWNERS of vllm/config/cache.py (@heheda12345 #26270)
- [CI][gpt-oss] Enable python tool tests in CI (@wuhang2014 #24315)
- Fix per file ruff ignores related to line length (@hmellor #26262)
- Bump actions/stale from 10.0.0 to 10.1.0 (@dependabot[bot] #26272)
- [Benchmarking] Add disable_shuffle option for dataset loading (@ymoslem #26258)
- [Misc] Clean up unnecessary E501 ignore (@ywang96 #26274)
- [Docs] Edit HF Inference Endpoints documentation (@ariG23498 #26275)
- [Doc] add KAITO to integrations (@abhisheksheth28 #25521)
- [Frontend] Consolidate tokenizer init code (@DarkLight1337 #26276)
- [Model] Use `merge_by_field_config` for MM models (Llava family) (@DarkLight1337 #26280)
- Support expert parallel load balancing in Transformers backend (@hmellor #26287)
- [Bugfix] Fix mrope in Transformers Backend (@zucchini-nlp #26087)
- Fix `DotsOCR` tensor type (@what-in-the-nim #26281)
- [Model] EVS support for nano_nemotron_vl (@tomeras91 #26269)
- [Attention] Remove unused reorder_batch method (@MatthewBonanni #24463)
- [Tests] conftest: Extending VllmRunner and HfRunner to accept token_ids as input (@yannicks1 #26295)
- [CI Bugfix] Make sure TRTLLM attention is available in test_blackwell_moe (@mgoin #26188)
- Support llama3 eagle3 head with llama4 verifier (@rahul-tuli #25961)
- [Misc] auto_tune: kill specific vllm process (@karan #26304)
- [Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs (@seven-mile #26231)
- Add bias handling to CPUFusedMOE kernel (@cfRod #26289)
- [Bugfix] Fix gemma3 with transformers backend (@zucchini-nlp #23178)
- [Benchmark] Enable MM Embedding benchmarks (@DarkLight1337 #26310)
- [Docs] Fix broken table in moe_kernel_features doc (@varun-sundar-rabindranath #26314)
- [BugFix] Pad input buffers in _dummy_run (@varun-sundar-rabindranath #26209)
- [Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) (@benchislett #25987)
- [ROCm] Split AITER unified attention into its own backend (@gshtras #25507)
- [Perf] Add decode full-graph support to FlashInfer-MLA backend (@benchislett #26313)
- [Misc] Define EP kernel arch list in Dockerfile (@simon-mo #25635)
- [Docs][DBO] Add initial doc that describes the DBO implementation (@SageMoore #26024)
- [Core] Simplify the Dp padding/should ubatch coordination logic (@SageMoore #25768)
- [UX] Support nested dicts in hf_overrides (@mgoin #25727)
- [BUG] Fix file parsing for load_format runai_streamer_sharded (@ahao-anyscale #26324)
- [Model] Define merge_by_field_config MM interface (U-Z) (@ayushsatyam146 #26261)
- [Deprecation] Deprecate `LLM.set_tokenizer` (@DarkLight1337 #26333)
- [responsesAPI][bugfix] serialize harmony messages (@qandrew #26185)
- [Model] Define merge_by_field_config MM interface (R-T) (@ayushsatyam146 #26260)
- [BugFix] Update KV block hash type from BlockHash to ExternalBlockHash in kv_events_subscriber - #26264 (@atalhens #26265)
- [V0 Deprecation] Remove `VLLM_USE_V1` from docs and scripts (@DarkLight1337 #26336)
- Optimize KV cache distribution for asymmetric pipeline parallelism (@gholmes829 #25164)
- Add topk logits torch op for DS3.2. (@dcampora #25945)
- Add TRL example notebook to RLHF docs (@sergiopaniego #26346)
- [Docs] add docs for cuda graph v1 (@fhl2000 #24374)
- [Model] Use `merge_by_field_config` for MM models (Ovis family) (@Isotr0py #26308)
- [Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4 (@fxmarty-amd #21166)
- [Model] Add support for ModernBertForTokenClassification (@antrec #26340)
- [Misc] Move `LRUCache` into its own file (@DarkLight1337 #26342)
- [V0 Deprecation] Remove `VLLM_USE_V1` from tests (@DarkLight1337 #26341)
- [Model] Lfm2Moe (@paulpak58 #26344)
- [ci] Rename `test_mxfp4_moe.py` to `test_ocp_mx_moe.py` (@fxmarty-amd #26364)
- [CI] Add Qwen3 MoE NVFP4 to Blackwell lm-eval (@mgoin #26316)
- [deepseek] add EP8 FusedMOE config for H200 and B200 (@heheda12345 #26331)
- [Bug] Fix Shape Validation for Fallback while Enabling E8M0 for DeepGEMM (@yewentao256 #26322)
- [Bugfix] Add missing sink tensor into flash attn cascade attn implementation (@plliao #26325)
- [Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26113)
- [V1] Logit processors for rejection sampler (@southfreebird #19482)
- [Spec Decode] Enable efficient speculative decoding with FlashInfer-MLA (@benchislett #25984)
- [TPU] update TPU benchmark threshold (@jcyang43 #25713)
- Add more libraries to rlhf.md (@mgoin #26374)
- [Bugfix] Fix MTP+FlashInfer crash when trtllm kernels are available but disabled (@benchislett #26361)
- Revert #24446 and #26168 (@tdoublep #26332)
- [Misc] Clean up cruft from previous FlashMLA sparse implementation (@LucasWilkinson #26125)
- [torchao] safetensors integration (@liangel-02 #25969)
- Add SwigluOAI implementation for CPUFusedMOE (@isharif168 #26347)
- [Core] Simplify setting new_token_ids in CachedRequestData (@njhill #26388)
- fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (@ayushsatyam146 #23485)
- Add gather_indexer_k_quant_cache kernel (@Barry-Delaney #25931)
- [Bugfix] Incorrect MM data format in `vllm bench throughput` (@DarkLight1337 #26395)
- fix[DP][v1]: Prevent hangs from mismatched worker configurations (@ayushsatyam146 #26218)
- [TPU] Rename tpu_commons to tpu_inference (@utkarshsharma1 #26279)
- [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement (@yewentao256 #26197)
- [Misc] add usedforsecurity=False in md5 hash call (@dtrifiro #26357)
- [Model] Allow passing custom number of max tiles to Nano 2 VL (@BloodAxe #26403)
- [Docs] Have mergify leave a comment with the docs preview link (@hmellor #26412)
- [CI] Pooling models mteb test disable enforce_eager (@noooop #26408)
- [Benchmarks] Add support for Qwen 3 VL MoE tuning (@lgeiger #26419)
- Tidy `vllm/config/__init__.py` to only add classes and functions (@hmellor #26405)
- [NIXL][non-cuda] Add install script for nixl with non-cuda ucx (@xuechendi #25959)
- [Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` (@yewentao256 #25293)
- [CI Failure] Fix pre-commit issue for install_nixl_from_source_ubuntu.py (@mgoin #26424)
- [Bugfix] Fix `vllm bench ...` on CPU-only head nodes (@Aydin-ab #25283)
- [Bug] Fix DeepGEMM Attention Test (@yewentao256 #26423)
- [Benchmarks] Fix imports in FP8 tuning script (@lgeiger #26407)
- [Bug] Fix Test in Batch Invariant (@yewentao256 #26128)
- Remove Python 3.9 support ahead of PyTorch 2.9 in v0.11.1 (@hmellor #26416)
- [Feature] Change cache.py with pydantic validation (@vrdn-23 #26390)
- [Attention] Implement universal BACKEND_MAP (@MatthewBonanni #25900)
- [Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters (@elvischenv #25924)
- [BugFix] Fix failing test quantization/test_compressed_tensors.py::test_compressed_tensors_fp8_block_enabled (@morrison-turnansky #26436)
- [Kernel] Centralize platform kernel import in `current_platform.import_kernels` (@NickLucche #26286)
- [Models] Improve iteration over layers (@lgeiger #26425)
- [Bugfix] Respect min_tokens in scheduler stop check (@elaineyz #26317)
- [Kernels] Modular kernel refactor (@bnellnm #24812)
- [Attention] Register FLASHMLA_SPARSE (@MatthewBonanni #26441)
- Separate MLAAttention class from Attention (@therealnaveenkamal #25103)
- [Misc] Redact ray runtime env before logging (@ruisearch42 #26302)
- [Bugfix] Set the minimum python version for gpt-oss (@jeejeelee #26392)
- [Minor] Change warning->warning_once in preprocess (@zhuohan123 #26455)
- [Bugfix] Catch and log invalid token ids in detokenizer #2 (@njhill #26445)
- [Bugfix] Incorrect another MM data format in vllm bench throughput (@huydhn #26462)
- [Hardware][AMD] Enable FlexAttention backend on ROCm (@mawong-amd #26439)
- [MM][Doc] Add documentation for configurable mm profiling (@wwl2755 #26200)
- [Core][KVConnector] Propagate all tokens on resumed preemptions (@QierLi #24926)
- [Hybrid]: Decouple Kernel Block Size from KV Page Size (@zhiyuan1i #24486)
- [CI/Build] Fix model nightly tests (@DarkLight1337 #26466)
- [Core] Relax the LoRA max rank (@jeejeelee #26461)
- Update Dockerfile and install runai-model-streamer[gcs] package (@pwschuurman #26464)
- Bump Flashinfer to v0.4.0 (@elvischenv #26326)
- [Model] Gemma3: Fix GGUF loading and quantization (@lucianommartins #26189)
- Enable `RMSNorm` substitution for Transformers backend (@hmellor #26353)
- Add: Support for multiple hidden layers in Eagle3 (@rahul-tuli #26164)
- [torchao] Add support for ModuleFqnToConfig using regex (@jerryzh168 #26001)
- [Misc] Misc code simplifications (@njhill #26450)
- [doc] add Volcengine as a compute sponsor (@youkaichao #26477)
- [Feature] Use pydantic validation in lora.py and load.py configs (@simondanielsson #26413)
- [Misc] Upgrade more code to Python 3.10 (@DarkLight1337 #26463)
- [Bugfix] Fix SHM cache initialization (@DarkLight1337 #26427)
- [Models][Qwen3VL] Optimise `_validate_and_reshape_mm_tensor` (@lgeiger #26426)
- [Bugfix] Move current_platform import to avoid python import cache. (@iwzbi #16601)
- [V0 deprecation] Remove `QKVCrossParallelLinear` implementation (@Isotr0py #26475)
- [Feature] Use pydantic validation in parallel.py config (@simondanielsson #26417)
- Revert #26113 "[Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops" (@ZJY0516 #26472)
- Upgrade Pydantic to v2.12.0 and remove hack for Python 3.13 (@hmellor #26481)
- [Models][Qwen] Replace `pad` with `cat` for better performance (@lgeiger #26486)
- [Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 (@minosfuture #25049)
- [Model] Apply shared experts overlap optimization to all models with shared experts (@bnellnm #26145)
- [BUGFIX] Add cu_tokens_across_sp to DPMetadata (@SageMoore #26457)
- [Bugfix] Enable padded FP4 quantization (@roikoren755 #25947)
- [Bugfix] Disable moe inplace for torch >= 2.9 (@bnellnm #26497)
- [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention (@elvischenv #25674)
- [Core] Remove unused `prev_sampled_token_ids_invalid_indices` input batch field (@njhill #26514)
- [UX] Add FlashInfer as default CUDA dependency (@mgoin #26443)
- [Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency (@benchislett #26499)
- [Bug] Fix modular_kernel: ZeroDivisionError: integer division or modulo by zero (@yewentao256 #26528)
- [CI] Fix Pre-commit Issue Cannot determine type of "rank" and "world_size" (@yewentao256 #26448)
- Refactor MistralTokenizer (@juliendenize #26358)
- [DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY (@ruisearch42 #23849)
- [Core] Small simplification in `GPUModelRunner._update_states()` (@njhill #26508)
- [Chore]: One pythonic tool parser test uses the wrong parser (@bbrowning #26515)
- [Spec-Decode] Support piecewise cudagraphs for Eagle head (@LucasWilkinson #25109)
- fix test_simple_inductor_graph_partition (@BoyuanFeng #26522)
- [deepseek] kernel block size for UniformTypeKVCacheSpecs (@heheda12345 #26559)
- [Metrics] Log multi-modal cache stats and fix reset (@DarkLight1337 #26285)
- [GPT-OSS] Add support for arrays at tool message content (@luis5tb #25593)
- Remove LoRA bias support (@ashwin-phadke #25807)
- [CI] fix ruff format (@chaunceyjiang #26579)
- [bugfix][DCP] fix block_size of hash in DCP prefix caching (@heheda12345 #26296)
- [NIXL] Ignore abort on already-finished request (@markmc #25067)
- [Bugfix] Convert untraceable GroupShape to list for AMD impl (@Lucaskabela #26535)
- [BugFix] Fix noop elimination edge case (@andylolu2 #26394)
- [CI] fix test_run_batch.py::test_completions - AssertionError (@chaunceyjiang #26578)
- [BugFix][torch.compile] Fix fused_scaled_matmul_reduce_scatter signature for PyTorch 2.8 (@jasonlizhengjian #26038)
- Added test_top_k_per_row to test-pipeline.yaml. (@dcampora #26569)
- [Bugfix] Make DP padding optional in coordinate_batch_across_dp (@SageMoore #26375)
- Silu v2 (@elvircrn #25074)
- [Metrics] Add test for multi-modal cache stats logging (@markmc #26588)
- [torch.compile] Make inductor partition rules respect splitting_ops #25691 (@baonudesifeizhai #25845)
- [Bugfix] fixed top_logprobs: -1 does not appear to work as intended (@chaunceyjiang #26470)
- [Model][Qwen3VL] Compute `cu_seqlens` on CPU to remove (@lgeiger #26496)
- [Model] Add FlexOlmo model implementation (@2015aroras #24923)
- [Transform] [Quantization] Add QuTLASS support to vLLM (@LopezCastroRoberto #24440)
- Add Qwen3-Omni moe thinker (@wangxiongts #25550)
- Update `pre-commit` hook versions (@hmellor #26591)
- Update CUDA architecture list in build pipeline for 12.9.1 wheels (@wseaton #26592)
- Fix some typing issues found by `mypy==1.18.2` (@hmellor #26596)
- [BUG] Qwen3-next MTP. Fix attn metadata build bug (@vadiklyutiy #26564)
- [BugFix] Fix async scheduling + request preemption (@njhill #26385)
- Cache the environment variable check for batch invariance (@bwasti #26510)
- AOT Compilation for torch.compile (Bundled) (@zhxchen17 #24274)
- [BugFix] Make penalties and bad_words work with async scheduling (@njhill #26467)
- [Frontend] Improve the performance of `is_reasoning_end` (@chaunceyjiang #25735)
- [CI/Build] Fix ppc64le CPU build and tests (@npanpaliya #22443)
- [XPU] Upgrade NIXL to remove CUDA dependency (@zhenwei-intel #26570)
- [MM] Move Qwen3Omni MRoPE impl to model file (@ywang96 #26608)
- [Bugfix][Multi Modal] Fix incorrect Molmo image processing (@sangho-vision #26563)
- [Refactor]: Use M-RoPE interface directly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (@divyanshsinghvi #24172)
- fix(nix): Allow local oneDNN path to fix vLLM CPU build failure (@ihb2032 #26401)
- Add EAGLE-3 Speculative Decoding Support for Qwen3 MoE (@rahul-tuli #26485)
- [CPU] fix the issue when the node is '-' cause json decode error. (@muzian666 #26562)
- [Refactor]Reduce duplicate code in serving_chat (@chaunceyjiang #26627)
- [compile] Add patched_fused_scaled_matmul_reduce_scatter (@angelayi #26604)
- [Bugfix][Qwen3VL] fix deepstack in qwen3vl (@JJJYmmm #26626)
- [Bugfix] Fix qwen-moe packed_modules_mapping (@jeejeelee #26634)
- [Benchmark] Support Infinity API (@DarkLight1337 #26641)
- CP: make correct_attn_out robust to 4‑D views and fix Triton arg binding (@hl475 #26509)
- [compile] Fix inductor partition config (@angelayi #26645)
- [EPLB] Support ernie4.5-moe (@HsChen-sys #22100)
- Add @noooop to codeowner for pooling models (@noooop #26652)
- [PERF] [Qwen3-next] Speed up gated RMSNorm (@vadiklyutiy #26207)
- [MISC] Rename the torch profiler filename as instance_id+rank_id for merging the Profiler results of each Rank (@noooop #25867)
- [Bugfix][CI/Build] Fix failing Mteb CI (@Isotr0py #26638)
- [Bugfix][DCP] Set default CUDAGraphMode to PIECEWISE for DCP (@FENP #26574)
- [TEST][BUG FIX] Fix DP GPU_ID issue (@xuechendi #26442)
- Update `Optional[x]` -> `x | None` and `Union[x, y]` to `x | y` (@hmellor #26633)
- [Feature] Add support for naver/splade-v3 (BERT-based sparse embedding model) (@gjgjos #26339)
- [Models][Qwen3VL] Speedup `fast_pos_embed_interpolate` (@lgeiger #26647)
- [easy] fix pre commit error on trunk (@hl475 #26665)
- [CI/Build] Add tool to build vllm-tpu wheel (@mgoin #19165)
- [Misc] cache result of disable_inplace (@bnellnm #26666)
- [Bugfix][Core]Fix block table out-of-range issue in priority scheduling (@quanliu1991 #26661)
- [FIX] Throwing an exception when the model does not support pool tasks (#25840) (@yyzxw #25855)
- docs: wrong command in structured_outputs README (@yihong0618 #26677)
- [Model] Fix Skywork R1V mlp (@jeejeelee #26673)
- [Model] Add reasoning_parser and tool_parser for Ernie45 thinking (@CSWYF3634076 #25027)
- Ignore large reformatting PRs in `git blame` (@hmellor #26690)
- [Model][0/N] Improve all pooling task | clean up (@noooop #25817)
- [ResponseAPI] Simplify input/output message serialization (@Jialin #26620)
- [Bugfix] Fix out of bound index issue for Jina-embedding-v3 RoPE with cuda graph (@Isotr0py #26687)
- [unrevert] Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #26373)
- [Hardware][CPU] Disable torch.compile for RISC-V to prevent APIError (@ihb2032 #26693)
- [FEATURE]: Use pydantic validation in `multimodal.py` config (@andycandy #26629)
- [UX] Speedup DeepGEMM warmup with heuristics (@mgoin #25619)
- [P/D] [NixlConnector] kv load recovery integration (@wseaton #26171)
- [Misc] Separate prompt logging to debug (@aitsvet #26713)
- [CI/Build] upgrade compressed-tensors to 0.12.2 to address LGPLv3 (@csy1204 #26501)
- [Bugfix][Rocm] fix qr error when different inp shape (@haoyangli-amd #25892)
- [Bugfix][Speculative Decoding] Extend Eagle quantization config fix to llama_eagle.py (@rahul-tuli #26590)
- [Model] Use merge_by_field_config for MM models (M-N) (@DarkLight1337 #26710)
- [Log] Optimize Startup Log (@yewentao256 #26601)
- [CI][Release][Arm64]: Build arm64 release for gpu arch 8.9 (@cyb70289 #26698)
- [Quantization] [Performance] Enable Marlin GEMM kernels for the calibration-free RTN-based quantization (@sakogan #26051)
- [Frontend][1/N] Improve all pooling task | Support FP16 Embedding Base64 (Still uses fp32 by default). (@noooop #26414)
- [CI] Fix mypy for `vllm/distributed` (@yewentao256 #26593)
- [CI Perf]Prune Tests in kernel/mamba (@kfhfar #26538)
- [Bug] Fix Assertion error DeepEP/csrc/kernels/intranode.cu:928: 'false and Unsupported type' (@yewentao256 #26532)
- [FrontEnd] UNREVERT CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26502)
- Pruning kernel Core Tests (@kfhfar #26727)
- [ResponseAPI] Further polish message serialization and unit tests (@Jialin #26728)
- Add tests for chunked prefill and prefix cache with causal pooling models (@maxdebayser #26526)
- [Misc][DP] support customized aggregated logger for dp (@luccafong #24354)
- [UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (@mgoin #26732)
- [compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (@angelayi #26681)
- [Easy] Fix env type check errors from VLLM_DEBUG_LOG_API_SERVER_RESPONSE (@Jialin #26742)
- [build][torch.compile] upgrade depyf version (@youkaichao #26702)
- [torch.compile] Unwrap fused_marlin_moe custom op (@varun-sundar-rabindranath #26739)
- [Feature][Quantization] auto_round format add support for regex (@n1ck-guo #24024)
- Add support for the /rerank endpoint in vllm bench serve (@maxdebayser #26602)
- [Docs] Add a start tag to build.inc.md (@windsonsea #26747)
- Fix lora tests failure in TPU CI due to the removal of LoRA bias (@vanbasten23 #26723)
- [CI] [ROCm] Automate CC list for ROCm related issue (@vllmellm #26753)
- Adding the test-amd.yaml for test definitions for the AMD backend. (alternative PR) (@Alexei-V-Ivanov-AMD #26718)
- scheduler.py: Update the name of the default scheduler. (@ryanli #26758)
- [Model][Bugfix]fix ernie45 load failed due to ernie45 eplb code (@CSWYF3634076 #26684)
- [CI/Build] Use 127.0.0.1 instead of localhost in utils (@yeqcharlotte #26750)
- fix(frontend): always include usage, when configured to do so (@max-wittig #20983)
- [Plugin] Make plugin group clear (@wangxiyuan #26757)
- [Bugfix] Standardize merging multimodal embeddings (@DarkLight1337 #26771)
- [Model] Use merge_by_field_config for MM models (O-P) (@DarkLight1337 #26776)
- [NIXL][HeteroTP]Enable KV transfer from HND prefill to NHD decode (@xuechendi #26556)
- [Chore] Use `max_transformers_version` for Qwen-VL test (@DarkLight1337 #26792)
- Don't allow `typos` to fix by default (@hmellor #26785)
- [Doc] ruff format some Python examples (@DarkLight1337 #26767)
- [CI] Fix test_tool_id_kimi_k2 (@chaunceyjiang #26787)
- [Chore] Remove `SupportsV0Only` interface and update supported models docs (@DarkLight1337 #26783)
- [Feature] Change vllm.py with pydantic validation (@VladOS95-cyber #26726)
- [CI/Build] Cleanup LoRA test (@jeejeelee #26752)
- [DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttention (@FENP #24864)
- Adjusted the model order of the model registration file (@princepride #26798)
- use combo kernel to fuse qk-norm and qk-rope (@BoyuanFeng #26682)
- [issues template] Encourage the author implement their own ideas (@noooop #26671)
- [KVConnector][Metrics] Aggregate scheduler-side KVConnectorStats (@QierLi #26046)
- [Feature][Responses API] Stream Function Call - harmony (@chaunceyjiang #24317)
- Revert "[issues template] Encourage the author implement their own ideas" (@noooop #26814)
- [Config] Remove Unused Environment Variable `VLLM_DISABLE_PAD_FOR_CUDAGRAPH` (@yewentao256 #26743)
- Update coveragerc and add codecov.yml for path fixes (@rzabarazesh #26435)
- [CI] Raise VLLM_MAX_SIZE_MB to 500 due to failing Build wheel - CUDA 12.9 (@mgoin #26722)
- [Kernel][MoE] Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on NVidia B200 (@zklapow #26818)
- [CI Failure] Fix tests with missing TinyLlama-1.1B-Chat-v1.0-FP8-e2e (@mgoin #26816)
- llama4_vision_rope: add HIP override to accept (q, k) and avoid (positions, q, k) mismatch (@hl475 #26790)
- [Attention][Spec Decode] FlashMLA spec decode support (@MatthewBonanni #26541)
- [Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs (@Jialin #24964)
- Notice for deprecation of AutoAWQ (@HDCharles #26820)
- [Perf] Cache vllm.env.getattr result to avoid recomputation (@Jialin #26146)
- Added MoE configs for llama 4, H200 device with tp=4/8 tuning (@Dhruvilbhatt #26837)
- fix: response_format for completion (@Nan2018 #23212)
- [Minor] Group async_scheduling related fields in model runner init (@njhill #26736)
- remove attn output view kernel (@BoyuanFeng #26680)
- [Core] Streamline some structured output related code (@njhill #26737)
- [CI Failure] Fix torchao dep failure for Quantization Test (@mgoin #26824)
- [frontend][gptoss] Add per turn stats into Harmony Context (@lacora #25061)
- [WideEP][P/D] Add usage stats for DP+EP and KV Connector (@tlrmchlsmth #26836)
- [torch.compile] Fix tests for torch==2.9 inductor partition (@ProExpertProg #26116)
- [Core][Easy] Use envs.getattr for all Unify to environment variable access (@Jialin #26810)
- [Bugfix]fix Qwen3 xml tool parser (@Zhikaiiii #26345)
- [BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not None' in get_kv_connector_stats (@xuechendi #26851)
- Disable FlashInfer sampler by default (@mgoin #26859)
- [Frontend][torch.compile] CompilationConfig Overhaul (#20283): name change compilation level to compilation mode, deprecation compilation level (@morrison-turnansky #26355)
- [Bugfix] Fixes prefix-repetition benchmark script (@kouroshHakha #26828)
- [Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (@taohui #25589)
- [Docs] Move build.inc into arm.inc (@windsonsea #26862)
- [CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (@izhuhaoran #26773)
- [Feature] default --extra-body param to disable thinking in vllm bench serve (@lengrongfu #26784)
- [BugFix] Patch inductor partitioning logic (@angelayi #26735)
- [Bugfix] Fix qwen3-omni audio truncation issue (@Isotr0py #26815)
- [Graph Partition] pass tests for decorator (@BoyuanFeng #26831)
- [Bugfix][Multi Modal] Fix incorrect Molmo token processing (@sangho-vision #26873)
- [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (@MengqingCao #26656)
- [Misc] Use helper function to generate dummy messages in OpenAI MM tests (@DarkLight1337 #26875)
- [bugfix] Lazy import cv2 (@angelayi #26869)
- [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (@zyongye #26456)
- [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (@zhewenl #21810)
- [CI] Fix mypy for `vllm/executor` (@yewentao256 #26845)
- [Doc] ruff format remaining Python examples (@DarkLight1337 #26795)
- [doc] add Context Parallel Deployment doc (@youkaichao #26877)
- [Misc] Update TritonLanguagePlaceholder to have attributes that are used by Flash Linear Attention ops. (@madongfly #26853)
- [Fix] Remove divisibility requirement between num_kv_heads and tp_size in bailing_moe (@ant-yy #26876)
- [Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (@Jialin #26842)
- [Platform] allow platform to init dp group (@wangxiyuan #22243)
- [Lora]Load tuned multi-lora kernel configs from json files (@li2haipeng #26319)
- [Model][2/N] Improve all pooling task | Support multi-vector retrieval (@noooop #25370)
- [Misc] Remove `isort` and `yapf` ignores (@DarkLight1337 #26888)
- [Misc] rename torch_dtype to dtype (@wangxiyuan #26695)
- chore: remove unused marker (@max-wittig #26890)
- [BugFix] Patch inductor memory plan logic (@BoyuanFeng #26878)
- [Chore] Separate out `vllm.utils.func` (@DarkLight1337 #26904)
- [Chore] Separate out `vllm.utils.async_utils` (@DarkLight1337 #26913)
- Lower severity of log when model info cache misses due to exception (@hmellor #26917)
- Olmo 3 tool parser and tests (@pdasigi #26143)
- [Feature]: Use pydantic validation in observability.py config (@cern1710 #26637)
- [ModelOpt] Remove NVFP4 MoE K%16==0 constraint (@XiaobingSuper #26891)
- [Chore] Clean up CODEOWNERS (@WoosukKwon #26923)
- [NVIDIA] Add support for cudnn fp4 gemm via flashinfer (@kaixih #26107)
- Vectorize RMS norm variance using vectorize_read_with_alignment (@bbeckca #26234)
- support flashinfer_fp4 moe for 5090 gpu (@XiaobingSuper #26669)
- [Bug] Temporally Disable `VLLM_ALLREDUCE_USE_SYMM_MEM` by Default (@yewentao256 #26925)
- Move query quantization to attention layer for Flashinfer & Triton. (@adabeyta #26534)
- Adjusting AMD test composition 2025-10-14 (@Alexei-V-Ivanov-AMD #26852)
- [Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (@felixzhu555 #26887)
- [Bugfix] reasoning_parser parameter handling in run_batch.py (@inc-jeong #26225)
- [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (@kliuae #24097)
- [CI] Enable Blackwell Llama4 MoE tests (@mgoin #26731)
- [BUG] Allow runai_streamer_sharded in config check (@ahao-anyscale #26958)
- [bugfix] Fix SP + PP without specifying compile size (@angelayi #26955)
- [BugFix] Work around graph partition x torch.compile cache issue (@zou3519 #26956)
- [DOC][XPU]update feature parity with Intel GPU (@xuechendi #26954)
- [Chore] Rename `utils` submodules (@DarkLight1337 #26920)
- [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (@vadiklyutiy #26437)
- Deepseek-v3 Batch Invariant on 8xH100 (@bwasti #26609)
- [CI/Build] Update expected beam search output for Phi3V (@DarkLight1337 #26978)
- [Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (@Akashcodes732 #26987)
- [CI/Build] Fix AMD import failures in CI (@zhewenl #26841)
- [Benchmark] Use truncation by default for pooling benchmarks (@DarkLight1337 #26992)
- [Chore] Separate out `vllm.utils.collections` (@DarkLight1337 #26990)
- [Model][Bugfix] fix ernie45 vl run failed from shared experts optimization (@CSWYF3634076 #26885)
- Cleanup code after Python 3.10 upgrade (@lgeiger #26520)
- [MISC] fix import violations for re and triton modules (@llsj14 #26654)
- [Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (@bogdanminko #27008)
- [Benchmark] Show E2EL by default for pooling models (@DarkLight1337 #27014)
- [Attention] Tune CUTLASS MLA num_splits (@MatthewBonanni #26846)
- [NIXL] Improve request_finished() debug logs (@markmc #25665)
- [docs] standardize Hugging Face env var to `HF_TOKEN` (deprecates `HUGGING_FACE_HUB_TOKEN`) (@yankay #27020)
- [CI] Replace large models with tiny alternatives in tests (@tahsintunan #24057)
- [Feature] Add process_weights_after_loading to AttentionImpl (@lengrongfu #26870)
- [Model] Fix Qwen3VL mm mapping (@jeejeelee #27027)
- Fix Qwen2.5 VL image grid docstring (@skyloevil #27033)
- Support `set` in the CLI generation (@hmellor #27031)
- [gpt-oss][1/N] EZ: refactor serving_responses for modularity (@qandrew #26948)
- Support block size of 256 used by Intel HPU (@mandy-li #26883)
- [Compressed Tensors] Always clone output for compile robustness (@kylesayrs #26849)
- Adding Warmup to Benchmark Serving (@kimbochen #26943)
- [Bug] Fix batch invariant test `has` to `is` (@yewentao256 #27032)
- [GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kernels (@varun-sundar-rabindranath #25997)
- [Feature] Migrate DeepGEMM API from `get_m_alignment_for_contiguous_layout` to `get_mk_alignment_for_contiguous_layout` (@yewentao256 #26935)
- [CI] Prune Quantization Tests and skip compilation (@mgoin #27038)
- [Bug] Add Assertion for `random-input-len`/`random-output-len` (@yewentao256 #26834)
- [small][batch invariance] Rename the env and internal flags to simplify usage (@bwasti #26855)
- Refactor Transformers backend to use mixins (@hmellor #26906)
- [NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (@jiahanc #26714)
- [torch.compile] Passing only necessary compilation config to inductor pass config (@luccafong #27041)
- [Chore] Separate out `vllm.utils.import_utils` (@DarkLight1337 #27022)
- [torch.compile] fix simple inductor graph partition test (@BoyuanFeng #27050)
- Remove unused imports (@lgeiger #26972)
- vllm bench serve shows num of failed requests (@tomasruizt #26478)
- [Docs] Reduce custom syntax used in docs (@hmellor #27009)
- [Perf] Exploit out-of-band buffers in shm_broadcast (@njhill #26961)
- disable graph partition in custom op (@BoyuanFeng #26952)
- [Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually a bfloat16 (@sighingnow #27030)
- [Core] Change `execute_model_with_error_logging()` to be a ctx manager (@njhill #27060)
- [Bugfix] Fix ReplicatedLinearWithLoRA (@jeejeelee #27065)
- [Kernel] Lazy import FlashInfer (@jeejeelee #26977)
- [CI/Build] Update Llama4 eval yaml (@zhewenl #27070)
- [Model] Always use Transformers backend for PaliGemma and Gemma3-MM (@DarkLight1337 #26715)
- [Model] Add support for LightOnOCR (@staghado #26916)
- [CI/Build] Update compressed tensor test path to fix CPU CI (@bigPYJ1151 #27068)
- [Kernel][Performance] Fuse float cast and renormalize to topk softmax kernel (@izhuhaoran #26717)
- [CI] fix docs build failed (@chaunceyjiang #27082)
- Update troubleshooting.md and remind VLLM_TRACE_FUNCTION usage (@Prowindy #27069)
- [VLM][Refactor] Remove useless func `get_input_positions` in `MRotaryEmbedding` (@MengqingCao #27088)
- [Docs] Replace all explicit anchors with real links (@hmellor #27087)
- [Docs] Replace `rst` style double-backtick with `md` single-backtick (@hmellor #27091)
- [Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapping (@jeejeelee #27096)
- [Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350 and MI355 (@rkarhila-amd #25586)
- Fix incorrect docstring for stop_profile() method (@hyongtao-code #27101)
- [torch.compile] Enable attention and allreduce fusion without custom ops enabled (@ProExpertProg #24604)
- [CI] Nixl integration tests (@NickLucche #27010)
- [Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (@patrickvonplaten #26367)
- [bugfix] Qwen3-VL fix video incorrect timestamp calculations while do_sample_frames=True (@wulipc #27104)
- [CI] Remove forbidden slash (@NickLucche #27112)
- [ROCM] MoE fp4 CK kernel (@maleksan85 #26545)
- [ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) (@rasmith #26192)
- [Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused MoE quant type logic (@vllmellm #27029)
- [Chore] Remove unused `PolyNorm` layer (@Isotr0py #27110)
- [Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 (@mgoin #27114)
- [Minor] Remove unnecessary error message (@zhuohan123 #27115)
- [V1][Spec Decode] Fix greedy temperature detection after sampler refactor (@Pradyun92 #27077)
- [Test] Make `test_failure` more stable for batch invariance (@yewentao256 #27054)
- [BugFix][Core] Fix error when enable async-scheduling in multi-node env (@lhtin #25887)
- [Perf] Add H100 fused MoE config (@skyloevil #25398)
- [CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests (@hl475 #26663)
- [GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (@Hanchenli #25515)
- [Misc] Rev DeepEP (@varun-sundar-rabindranath #27122)
- [DOC][FEATURES][CPU]update cpu feature for v1 (@xuechendi #27135)
- [Test] Add test for /health endpoint on engine failure (@dongbo910220 #26074)
- [Chore] Separate out `vllm.utils.mem_utils` (@iAmir97 #27143)
- [Feature] Batch Invariant: Support DeepGEMM and Blackwell (@yewentao256 #27127)
- [fix][cpu] fix prefill attention in CPU attention backend (@fadara01 #27035)
- [Misc] Refactor `get_kv_cache_spec` into `AttentionLayerBase` (@NickLucche #26587)
- [Models][QwenVL] Remove unnecessary `.contiguous()` calls (@lgeiger #27106)
- [Chore] Clean up pytorch helper functions in `vllm.utils` (@Isotr0py #26908)
- Fix incorrect string formatting in barrier timeout exceptions (@hyongtao-code #27149)
- [Minor] Add some clarifying comments to recent changes (@njhill #27130)
- [BugFix] Fix failing gemma-3-1b-it test: `test_lm_eval_accuracy_v1_engine[google/gemma-3-1b-it]` (@LucasWilkinson #27111)
- [Chore] Separate out profiling utilities from vllm.utils (@dongbo910220 #27150)
- [BugFix] fix graph partition signature (@BoyuanFeng #27139)
- [BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (@LucasWilkinson #27121)
- [V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations (@ptovam #22456)
- [Minor] Remove unused env variable (@WoosukKwon #27161)
- [BugFix] Fix lazy imports involving outlines_core (@22quinn #27158)
- [Chore] Separate out hashing utilities from vllm.utils (@dongbo910220 #27151)
- [Benchmark] Convenience script for multiple parameter combinations (@DarkLight1337 #27085)
- output type conversion fix (@jianyuh #27159)
- [Chore] Separate out `vllm.utils.network_utils` (@iAmir97 #27164)
- [Misc] Move utils to avoid conflicts with stdlib, and move tests (@DarkLight1337 #27169)
- [Bugfix] Fix error with penalties when speculative decoding and structural output are enabled (@southfreebird #26586)
- Fix typo in ValueError message: use `kv_role` instead of `kv_disagg_role` (@hyongtao-code #27166)
- [Model][VLM] Support Bee-8B Model (@uyzhang #27012)
- [LoRA] LoRA cuda graph specialization (@andylolu2 #25914)
- [Kernel] Accelerate solve_tril with TMA (@ZJY0516 #26746)
- AArch64 CPU Docker pipeline (#26931)
- Nemotron Nano V2 VL + EVS Video Support (@BloodAxe #27107)
- [Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) (@shivampr #26268)
- [Bugfix][CI] Fix `Distributed Tests (4 GPUs)` async_sched+ray test (@NickLucche #27195)
- [Feature][Quantization] auto_round support for mixed bits quantization (@n1ck-guo #23812)
- [ROCm] enable some tests in entrypoints test groups on AMD (@Concurrensee #26725)
- [ez] add uv lock to gitignore (@qandrew #27212)
- [Quantization] Automatically infer AWQ `modules_to_not_convert` field (@Isotr0py #26909)
- [V0 Deprecation] Remove V0 metrics code (@njhill #27215)
- [cpu] Dispatch un-quantized linear to oneDNN/ACL by default for AArch64 (@fadara01 #27183)
- create is_in_the_same_node on cpu (@helunwencser #26832)
- [Frontend] Enforce tokenize=False when applying chat template (@russellb #27205)
- [Feature][Kernel]FusedMoE LoRA (@wcwuwc #21229)
- [BugFix] GPT-OSS Attention DP + MoE TP weight loading issue (@nvpohanh #24032)
- [ModelOpt] Load w13/w2_input_scale for all experts, nvfp4 (@wenscarl #26135)
- [Bugfix] Fix gpt-oss w4a8 DP/EP on B200 (@varun-sundar-rabindranath #26729)
- [Bugfix] Fix broken MTP weight loading for FP8 KV Scales (@benchislett #27227)
- [Fix][Spec Decode] Fix llama4 draft loading with different quantization (@linzebing #27136)
- [Nixl] Minor refactor to handshake related metadata (@NickLucche #26410)
- [MM][Core] Decouple ViT backend from LM backend (@ywang96 #27061)
- [Deepseek v3.2] Optimize top_k_per_row (@dcampora #26763)
- [Chore] Separate out NCCL utilities from vllm.utils (@dongbo910220 #27197)
- [CI] Install pre-release version of `apache-tvm-ffi` for `flashinfer` (@hmellor #27262)
- [ROCM] Enable CompressedTensorsWNA16 (@JartX #27187)
- Add @pavanimajety to .github/codeowners (@pavanimajety #27213)
- [ROCm] Update Triton, Torch, and AITER branches for ROCm base Dockerfile (@micah-wil #27206)
- [Feature] Batch Invariant for R1 TP 8 on Blackwell (@yewentao256 #27229)
- [Bugfix][P/D] Reduce num_threads used by nixl ucx backend (@dagrayvid #27196)
- [V0 Deprecation] Remove V0 executors (@njhill #27142)
- [Bugfix] fixes the decoding metadata of dense mla's fp8 kvcache. (@sighingnow #27144)
- Update PyTorch to 2.9.0+cu129 (@huydhn #24994)
- [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (@alexm-redhat #26440)
- Updated xgrammar backend to not deny supported string formats (@ExtReMLapin #27253)
- [Bugfix] skip cuda graph for drafter when running with eager (@benchislett #26821)
- [P/D] KVConnector for decode benchmarking (@tlrmchlsmth #25986)
- [Deepseek v3.2] Remove extra logics in indexer (@IwakuraRein #26465)
- [DOC] [ROCm] Add ROCm quickstart guide (@vllmellm #26505)
- [CI] Nixl integration tests DP-EP (@NickLucche #27199)
- [Benchmark] Add plot utility for parameter sweep (@DarkLight1337 #27168)
- [torch.compile] Enable silu_mul_fp8_quant fusion without custom ops enabled (@ZJY0516 #27146)
- [1/N][Platform] Cleanup useless function (@wangxiyuan #26982)
- Update release pipeline for PyTorch 2.9.0 (@huydhn #27303)
- Remove last `level` references not removed #26355 (@hmellor #27260)
- fixed reasoning streaming with tool_choice="required" (@ExtReMLapin #24108)
- [Frontend][3/N] Improve all pooling task | Support binary embedding response (@noooop #27066)
- [Bugfix][CPU] Disable dual stream execution for experts on CPU (@bigPYJ1151 #27320)
- [Bug] Raise error for `LLM(data_parallel_size=k)` single-process DP Usage (@yewentao256 #27282)
- Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (@gnovack #27311)
- [Core] Handle MoE LoRA edge cases (@jeejeelee #27335)
- [docs] Update v1 metrics design doc (@markmc #27332)
- Mirroring changes in test-pipeline.yaml into test-amd.yaml (@Alexei-V-Ivanov-AMD #27242)
- [Chore] Separate out optional dependency checks from vllm.utils (@dongbo910220 #27207)
- [Model] Upstream Deepseek-OCR model (@Isotr0py #27247)
- [NIXL] Terminate handshake listener thread in shutdown (@markmc #26404)
- [Bug] Fix DeepSeek-V2.5-1210-FP8 issue (@yewentao256 #27267)
- [bugfix] remove unused parameters to reduce unnecessary vram usage (@ReinForce-II #26789)
- [Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLoRA (@jeejeelee #27351)
- [NIXL] use Host buffer to support TP_ratio > 1 for XPU (@xuechendi #27140)
- [Bugfix] Make `get_mrope_input_positions` instance methods (@DarkLight1337 #27342)
- [Bugfix] Fix HF format InternVL large variants video processing (@Isotr0py #27330)
- [Frontend] Require flag for loading text and image embeds (@russellb #27204)
- [P/D] Dynamic `kv_output_aggregator` collect size (@NickLucche #26734)
- Support Anthropic API /v1/messages Endpoint (@LiuLi1998 #22627) (see the client sketch after this list)
- [Bugfix] Disable FlexAttention direct block mask building for encoder-only models (@Isotr0py #27344)
- [Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… (@lucianommartins #27309)
- [Doc] Fix numbering sequence in prefix caching (@gigit0000 #27357)
- [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (@sagiahrac #27211)
- [Feature] publisher default set zmq in kv_event config (@lengrongfu #26915)
- [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (@Daisy-Ma-coder #27128)
- [Chore] Separate out system utilities from vllm.utils (@dongbo910220 #27201)
- [MLA] Bump FlashMLA (@MatthewBonanni #27354)
- [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (@Isotr0py #27361)
- [Bugfix] Fix SLA tuner initialization (@DarkLight1337 #27355)
- [Bugfix] Fix incorrect kv cache metrics in grafana.json (@fangpings #27133)
- [Bugfix][Core] running queue index leakage exception (@CLFutureX #26754)
- [CORE] Support Prefix Caching with Prompt Embeds (@qthequartermasterman #27219)
- [V1][spec decode] return logprobs for spec decoding (@TheEpicDolphin #26060)
- [Model] Add num_cached_tokens for PoolingRequestOutput (@noooop #27378)
- [Chore] Remove duplicate `has_` functions in vllm.utils (@jonathanc-n #27372)
- [CI/Build] Fix Prithvi plugin test (@DarkLight1337 #27393)
- [Bugfix] Fix args settings for guided decoding args (@luccafong #27375)
- [CI/Build] Fix AMD CI: test_cpu_gpu.py (@zhewenl #27388)
- add SLA information into comparison graph for vLLM Benchmark Suite (@louie-tsai #25525)
- [CI] Reorganize entrypoints tests (@chaunceyjiang #27403)
- [Metrics] [KVConnector] Add connector prefix cache hit rate stats (@ptovam #26245)
- [Model] Add MoE support for NemotronH (@tomeras91 #25863)
- Run mypy on the lowest supported Python version instead of system Python (@hmellor #27048)
- [Bugfix] Honor --mm_encoder_attn_backend when used (@bradleyhd #27124)
- [Feature] Pydantic validation for speculative.py (@Navya1707 #27156)
- [Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (@ilmarkov #26709)
- [CI/Build] Remove unnecessary flags from test registry (@DarkLight1337 #27353)
- [Frontend][4/N] Improve all pooling task | Add plugin pooling task (@noooop #26973)
- Mirroring the test definitions (2025-10-22) (@Alexei-V-Ivanov-AMD #27362)
- [Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (@alexm-redhat #27220)
- [Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (@gshtras #27373)
- [Bugfix] Fix AWQ marlin layer skipping (@Isotr0py #27416)
- [Misc] Add triton_kernels dependency (@varun-sundar-rabindranath #27370)
- [Chore] Separate out `vllm.utils.platform_utils.py` (@jonathanc-n #27374)
- [Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (@MatthewBonanni #27368)
- [Bugfix][DP] Fix creating too many DP Placement Groups (@kebe7jun #26880)
- [Model] Siglip Embedding Support (@piood #27324)
- [Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers/utils.py for Powerpc (@Akashcodes732 #27422)
- Granite 4.0 quark quantization support (@xiao-llm #26944)
- Fix pooling adapters for Transformers backend (@hmellor #27338)
- [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (@xxxxyu #26092)
- [Misc] Add TPU usage report when using tpu_inference. (@hfan #27423)
- [Bugfix][CI] Move resolving cudagraph_mode before initializing attn_metadata_builder (@fhl2000 #27427)
- Fix EventPublisherFactory logic for disabled KV cache events (@usberkeley #27419)
- [Chore] remove structural tags logging lines (@aarnphm #27451)
- [Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API (@strinczer #26706)
- [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (@ruisearch42 #27443)
- [Docs] remove v1 column for embedding models (@piood #27446)
- [MM][Bugfix] Replace `PatchEmbed`'s conv3d to linear layer (@Isotr0py #27418)
- [BugFix] Fix torchrun DP with LLM class (@22quinn #27395)
- [Refactor] move tool parsing logic from protocol.py to the tool parser (@chaunceyjiang #27383)
- [Benchmark] Enable benchmark to run with `encoding_format="bytes"` (@DarkLight1337 #27467)
- Fix AArch64 CPU Docker pipeline (#27331)
- [MISC] `cudagraph_capture_sizes` related improvements (@fhl2000 #26016)
- Fix test named tool use (@chaunceyjiang #27458)
- [Doc] Fix minor issues in docs/design/metrics.md (@draftbk #27436)
- [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (@fadara01 #27472)
- [compile] Turn standalone_compile back on (@zou3519 #27460)
- [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (@xuechendi #27297)
- [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (@kouroshHakha #27366)
- [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (@minosfuture #26397)
- [Bugfix] Fix interns1-vit qk norm code path (@Isotr0py #27480)
- [CI/Build] Fix test_torch_utils in AMD CI (@zhewenl #27317)
- [Document] Add ms-swift library to rlhf.md (@hjh0119 #27469)
- [Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (@lhtin #27455)
- [Distributed] Basic set of configuration for large EP deployment on GB200 (@wpc #27328)
- [Log] Optimize Startup Log (@yewentao256 #26740)
- [Misc][DP] Guard mxfp4 implementation selection (@varun-sundar-rabindranath #27484)
- [KVConnector] Migrate the LMCache integration code to be vLLM native (@ApostaC #25542)
- [CI] Add tests for cudagraph (@ZJY0516 #27391)
- Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (@zhuohan123 #27502)
- [Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (@KuntaiDu #25712)
- [Misc] Simplify max tokens in multimodal registry (@DarkLight1337 #27500)
- [Attention] Add missing kv cache scale setup (@MatthewBonanni #27490)
- [CI/Build] Refactor processing tests (@DarkLight1337 #27470)
- [CI/Build] Use CPU for mm processing test on CI (@Isotr0py #27522)
- [BUGFIX][ROCM] ViT FlashAttention on ROCm (no GFX9) and contiguous on qwen3vl ROCm TORCH_SDPA (@JartX #27190)
- [Bugfix] Fix processor initialization for model from modelscope instead of HF (@lengrongfu #27461)
- [Bugfix] fix empty prompts for async-engine mode in benchmark throughput (@luccafong #27494)
- [Doc] Remove Molmo warning (@DarkLight1337 #27527)
- [Doc] Fix links to GH projects (@DarkLight1337 #27530)
- [Chore]:Extract math and argparse utilities to separate modules (@yeshsurya #27188)
- Revert "[CI/Build] Use CPU for mm processing test on CI (#27522)" (@DarkLight1337 #27531)
- [CI/Build] Update causal-conv1d installation (@DarkLight1337 #27529)
- [Model][MiniMax-M2] Support MiniMax-M2 Model (@rogeryoungh #27535)
- fix m2 test (@youkaichao #27536)
- Fix MiniMax-M2 copyright (@rogeryoungh #27537)
- [Model][Bugfix] fix ernie45 moe 300B SharedFusedMoE output tuple (@CSWYF3634076 #27316)
- [Model] Use merge_by_field_config for MM models (Qwen series) (@DarkLight1337 #27546)
- [Docs] remove the incorrect `enable_reasoning` parameter (@yyzxw #27550)
- [Performance][LoRA] add context varying params to 'do_not_specialize' in fused moe lora (@gnovack #27445)
- [Model] Deprecate `merge_by_field_config=False` (@DarkLight1337 #27551)
- [Doc] Slight improvement to M2 and beyond (@jeejeelee #27554)
- [Kernel] Adding split_K implementation for fused_moe_lora (@dcmaddix #27291)
- [Misc] Clean up utils (@DarkLight1337 #27552)
- [Bugfix] Limit the default value of `max_model_len` when it is not specified by users (@shen-shanshan #27556)
- [Bugfix] Fixed when return_token_ids=False, the first event still contains prompt_token_ids. (@chaunceyjiang #27561)
- [cpu][perf] Fix low CPU utilization with VLLM_CPU_OMP_THREADS_BIND on AArch64 (@fadara01 #27415)
- [Kernel] Enable moe LoRA kernel support FP16 (@jeejeelee #27468)
- [Hybrid] Added supports_mamba_prefix_caching Protocol (@Josephasafg #27339)
- [Model] Siglip2 Model Support (@piood #27566)
- [Bugfix][LoRA][FusedMoE] Select MxFP4 Backend based on LoRA Enablement (@varun-sundar-rabindranath #27487)
- fixing mm placeholder replacement issue with gemma3 (@tingtingtangmeta #27538)
- [Chore]: Stream tokens vs characters in tool call parser tests (@bbrowning #26513)
- [Misc] Clean up more utils (@DarkLight1337 #27567)
- [ROCm] Update AITER branch for ROCm base docker (@micah-wil #27586)
- Code quality improvements: version update, type annotation enhancement, and enum usage simplification (@usberkeley #27581)
- [gpt-oss][2/N] Support input_messages in responsesRequest (@qandrew #26962)
- [Bugfix][CI] Fix config resolving logic with remote models (@ywang96 #27610)
- [Stability fix] turn off HMA allocator when connector is set (@KuntaiDu #27592)
- [Bugfix] fixed inconsistent finish_reason handling between V0 and V1 engines (@chaunceyjiang #27555)
- [ROCm] [Doc] Update ROCm installation docs (@vllmellm #27327)
- [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.6 for MI300X (@minatoaquaMK2 #27323)
- [Bugfix][CPU] Fallback oneDNN linear to torch linear to fix half gemm support on legacy platforms (@bigPYJ1151 #27526)
- [Core][Bookkeeping Optimization] Update against numpy view of is_token_ids tensor (@Jialin #27618)
- [CI/Build] Fix amd model executor test (@zhewenl #27612)
- Fix a robust parsing issue in KimiK2ToolParser that causes IndexError (@wangln19 #27565)
- [V0 Deprecation] Remove vestigial V0 logits_processors.py file (@njhill #27601)
- [Bugfix] In LongRoPE, decide short vs long based on max_model_len (@MatthewBonanni #27431)
- [Misc] Separate out `utils.counter` and move `utils.Device` to engine (@DarkLight1337 #27588)
- [Bug] Fix shape issue for eplb expert weights (@yewentao256 #27589)
- [compile] Add enable_prompt_embeds to compile hash. (@zhxchen17 #27285)
- [Hybrid] Add mamba_block_size to Engine Args (@Josephasafg #27289)
- [compile] Disable dynamo guards check for AOT compilation. (@zhxchen17 #27288)
- fix: allow HuggingFace standard chat template params via **kwargs (@wangln19 #27622)
- [Core] Enable async scheduling for external_launcher mode (@22quinn #27394)
- [Bugfix][Frontend] validate arg priority in frontend LLM class before add request (@junpuf #27596)
- [BugFix] Also consider RAY_EXPERIMENTAL_NOSET_* when storing compilation cache (@HollowMan6 #27294)
- [nit]: lmcache integration import (@sammshen #27600)
- [FLA] Introduce Kimi Delta Attention(KDA) to VLLM (@zhiyuan1i #27654)
- [Bugfix] Fix allocation & free logic of SingleWriterShmRingBuffer (@imkero #27117)
- [Bugfix][CI] Fix v1 attention backend tests and add CI coverage (@mmangkad #26597)
- [Misc] Make `LayerBlockType` a `Literal` instead of `Enum` (@DarkLight1337 #27658)
- [compile] Add fallback path to AOT compile when serialization fails. (@zhxchen17 #27350)
- Add load pattern configuration guide to benchmarks (@mpashkovskii #26886)
- [Misc] Make reorder batch also separate extends (@LucasWilkinson #27367)
- [Test] Batch Invariant: Unit test using parameterized backend (@yewentao256 #27478)
- [Core] Scheduler: Publish connector events after output (@orozery #25875)
- [AsyncScheduling] Make async overlap work with logprobs (@njhill #27615)
- [Misc][qwen2_5_vl][torch.compile] Enable `supports_torch_compile` on generic nn.Module and demonstrate speedup on Qwen Vision model (@Lucaskabela #23207)
- [Bug] Fix deepep low latency use nvlink by default (@yewentao256 #27677)
- [Core] Early return in SlidingWindowManager.remove_skipped_blocks (@Jialin #27673)
- Install pre-built xformers-0.0.32.post2 built with pt-2.9.0 (@huydhn #27598)
- Revert "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@simon-mo #27714)
- [Build] Revert triton_kernels requirements (@varun-sundar-rabindranath #27659)
- [NIXL][XPU] update name of nixl wheel (@zhenwei-intel #27631)
- [Model] Fix Qwen3VL and Qwen3Omni after torch.compile changes (@lgeiger #27705)
- [KV cache] Fix lmcache connector (@Shaoting-Feng #27681)
- [CI/Build][Bugfix]Fix Quantized Models Test on AMD (@zhewenl #27712)
- [Bugfix] Fix non-contiguous tensor error in `rocm_unquantized_gemm_impl` (@zhewenl #27605)
- [Speculators] Move tests + fix integration (@dsikka #27308)
- [CI/Build] Move pre-commit only scripts to `tools/pre_commit` (@DarkLight1337 #27657)
- [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next (@ZJY0516 #27578)
- [Bugfix] Fix modular kernel tests (@bnellnm #27707)
- [Frontend] [gpt-oss] Tool json call parsing error retry (@alecsolder #27675)
- [Frontend] [gpt-oss] Mcp type bug (@alecsolder #27689)
- [Fix] import get_kv_cache_torch_dtype error in vllm_v1_adapter.py (@KevinCheung2259 #27670)
- [Misc] Raise error for missing video metadata in `MultiModalDataParser` (@Isotr0py #27664)
- Feature/video support in random mm dataset (@BloodAxe #25963)
- [chore] Remove models weight on S3 logic (@khluu #27725)
- [VLM] Add Qwen3-VL generation test (@Isotr0py #25185)
- [CI/Build] Skip cpu offloading test on AMD (@zhewenl #27690)
- [Frontend] Add `vllm bench sweep` to CLI (@DarkLight1337 #27639)
- Fix MiniMax-M2 rmsnorm precision and remove useless code (@rogeryoungh #27627)
- [ROCm][Platform] Add MI308X device id in _ROCM_DEVICE_ID_NAME_MAP (@sammysun0711 #27623)
- [CI] Fix flaky `test_two_responses_with_same_prev_id` test (@NickLucche #27745)
- [Chore] Optimize P2PNCCLEngine `http_address` (@yewentao256 #27488)
- [Core] Exposing engine sleep & wake_up state as prometheus metrics (@dumb0002 #24176)
- [FIXBUG] Qwen3VL hallucinations without Contiguous on Torch.SDPA (@JartX #27744)
- `use_aot_compile` should respect `VLLM_DISABLE_COMPILE_CACHE` (@BoyuanFeng #27698)
- [CI/Build] Test torchrun with 8 cards (@22quinn #27548)
- [Bug] Raise error explicitly if using incompatible backend (@yewentao256 #27424)
- [KVConnector] Add metrics to Prometheus-Grafana dashboard (@NickLucche #26811)
- [Bug] Fix DeepEP low latency `assert self.batched_router_logits.size(-1) == full_router_logits.size(-1)` Bug (@yewentao256 #27682)
- [BugFix] Fix handling of resumed reqs in `SharedStorageConnector` (@njhill #27719)
- [Bug] Fix DBO IMA issue for DeepEPHT (@yewentao256 #27666)
- [Temp fix] Disable torch.compile for Qwen2.5 VL's VisionBlock temporarily. (@huachenheli #27760)
- [XPU][bugfix] fix rope for llama4 and deepseek (@yma11 #25145)
- [Bugfix] mamba-block-size is set for vision language model (@heheda12345 #27773)
- [XPU] Update latest IPEX 2.8 release (@jikunshang #27735)
- [BugFix] Handle unscheduled requests properly when async scheduling (@njhill #27756)
- [Feat] Adds runai distributed streamer (@bbartels #27230)
- kernels/moe test pruning (@kfhfar #27053)
- [BugFix] Reordering extend logic fix (@LucasWilkinson #27739)
- [Benchmark] Cleanup deprecated nightly benchmark and adjust the docstring for performance benchmark (@KuntaiDu #25786)
- Add more dims for batch invariant shims (@bwasti #27489)
- use stringData in secret yaml to store huggingface token (@yitingdc #25685)
- [CI/Build]Add eval config for Qwen3-235B-A22B-Instruct-2507-FP8 (@hl475 #27113)
- [BugFix][VL] Fix FA selection on Qwen2.5-VL (@zhewenl #27790)
- [V0 deprecation] Remove VLLM_USE_V1 usage in config module (@wangxiyuan #27784)
- [CI Failure] fix test_default_mm_loras (@hl475 #27795)
- [CI] Fix mypy for `vllm/v1/core` and `vllm/v1/engine` (@yewentao256 #27108)
- [Bugfix] Improve GPU validation logging in Ray fallback scenarios (@sairampillai #25775)
- [Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (@noooop #25524)
- [CI Failure] Fix test_kv_cache_model_load_and_run (@hl475 #27717)
- [Model] Introduce Kimi Linear to vLLM (@zhiyuan1i #27809)
- [KV offload] Enable CPU KV offload on CUDA alike Platforms (@zhewenl #27770)
- [Model][Ouro] Support Ouro Model (@FlamingoPg #27794)
- [Bugfix][CPU] Fix MRoPE dispatch on the CPU backend (@bigPYJ1151 #27800)
- [BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP (@varun-sundar-rabindranath #27762)
- [Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (@ilmarkov #27564)
- [Docs] add Shanghai Meetup - 2025/10 (@kebe7jun #27545)
- Reapply "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@huydhn #27768)
- [MTP] Refactor mtp predictor to avoid d2h operation (@MengqingCao #27643)
- [Model] Use the same fused_moe configs for all H200 devices (@bufferoverflow #23642)
- [Bugfix] Fix 2 precommit issues - (mamba_block_size, kv_cache_config) (@tlrmchlsmth #27811)
- [Core][Bookkeeping] Update cu_num_accepted_tokens for all req_index (@Jialin #27629)
- [EP/DP][API Server] Enable DP-aware routing in OpenAI API requests (@Prowindy #24945)
- [Fix] Skip `record_sleep_state` logic in `PrometheusStatsLogger` if not in dev mode (@SumanthRH #27789)
- [Refactor] Remove `VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK` (@yewentao256 #27750)
- [Core][Perf] Only invoke save_new_computed_blocks when computed blocks are not empty (@Jialin #27799)
- [Feature] Batch invariant torch.compile (@PaulZhang12 #27660)
- [BugFix] Fix broken import in initialize_ray_cluster() (@njhill #27838)
- [Misc] Make all tool scripts executable (@MatthewBonanni #27831)
- [CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (@jakub-sochacki #26919)
- [CI Test] Add Scheduled Integration Test (@yewentao256 #27765)
- [benchmark] Make request IDs unique across clients by default (@eicherseiji #27723)
- [Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (@Akashcodes732 #27734)
- [Kimi-Linear] Correct prefixes and add compatibility to AWQ quants (@toncao #27834)
- [Bugfix] Avoid too small block m/n for FlexAttention kernel option (@Isotr0py #27853)
- [BugFix] Don’t compute reorder threshold when there are no attention groups (@hl475 #27861)
- [Perf] Decouple torch op from GDA to leverage torch.compile (@ZJY0516 #27871)
- [CI/Build] Add gpt-oss LoRA test (@jeejeelee #27870)
- [Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (@shadeMe #27876)
- [Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (@Isotr0py #27860)
- [Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (@GuanLuo #26338)
- Docs update tpu install instructions (@RobMulla #27824)
- [bugfix] Missing cached item in beam search (@fake0fan #27874)
- fix incorrect type annotation in KimiMLP (@skyloevil #27885)
- Flashinfer_CUTLASS_MOE fuses quantization for TP (@wenscarl #27223)
- [Cleanup] Remove no-longer-used `SpeculativeConfig.enable_chunked_prefill` (@njhill #27826)
- [Feature] Pydantic validation for scheduler.py and structured_outputs.py (@vrdn-23 #26519)
- Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (@MatthewBonanni #27663)
- Batch invariance doc (@bwasti #27839)
- [Hybrid] A simpler algorithm to find kernel_block_size (@heheda12345 #26476)
- [Core] Async scheduling + structured outputs compatibility (@njhill #26866)
- [Kernel] Enable FusedMoEModularKernel support bias (@jeejeelee #27754)
- [Bugfix] Fix KDA output (@jeejeelee #27905)
- [Multimodal][XPU]Enable vision attn backend for xpu platform (@yma11 #27525)
- Adding SplitK in fused_moe_lora kernel (@yugong333 #27818)
- [CI/Build] Bump transformers version (@DarkLight1337 #27528)
- [Bugfix] [Model] Missing MRoPE function definition from `KeyeForConditionalGeneration` (@tjtanaa #27895)
- [Add] cmdline argument parsing for KV cache offloading modules (@ApostaC #27621)
- feat(benchmarks): support HF model names in multi-turn benchmark (@ai-jz #27850)
- [Docs] Mock all imports for docs (@hmellor #27873)
- [V0 deprecation] Remove VLLM_USE_V1 usage in platform and v1 module (@wangxiyuan #27798)
- [Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues (@xiaohajiayou #26779)
- [Bugfix] Python 3.10 compatibility for `Self` (@DarkLight1337 #27918)
- [Core][TPU] Support TPU Data Parallelism (@wenxindongwork #27365)
- [BugFix] Fix mixed penalties batch with async scheduling (@njhill #27910)
- Adds anthropic /v1/messages endpoint to openai api_server (@bbartels #27882)
- [KV offload] Offloading connector async scheduling support (@KevinCheung2259 #27648)
- [CI/Build] Fix flaky test_transcription_validation.py::test_basic_audio_gemma (@bbrowning #27924)
- [Bugfix] Fix Qwen Omni audio inference (@DarkLight1337 #27920)
- Performance fix MistralTokenizer: cache special ids and tokens (@juliendenize #27925)
- [V1] [Hybrid] Mamba1 Automatic Prefix Caching (@Josephasafg #26377)
- [Misc] Provide Siglip2 chat template (@DarkLight1337 #27939)
- [Bugfix][llm]: Abort orphaned requests when llm.chat() batch fails (@Flink-ddd #27420)
- [BugFix][LoRA] use adapter_id instead of id field of lora_request (@biswapanda #27728)
- [Frontend] Align finish_reason when tool is called with OpenAI (@n0gu-furiosa #25054)
- [Hybrid] Pass kernel block size to builders (@tdoublep #27753)
- [Bugfix] Padded Eagle Specdec with Chunked Prefill (@Flechman #26263)
- [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (@jikunshang #27964)
- Add ORCA endpoint load metrics support (@efimki #24905)
- [CI/Build] Remove the flaky gpt-oss lora test (@jeejeelee #27966)
- [Model] Add PaddleOCR-VL Model Support (@zhang-prog #27758)
- Early exit for MoE LoRA kernels (@gnovack #27131)
- [Bugfix] Skip gs:// model paths for speculator detection (@pwschuurman #27846)
- [BUG] Make 'binary' default option for saving torch compile artifacts when using standalone_compile (@ahao-anyscale #27616)
- [CI/Testing] Add basic single node dual batch overlap test (@LucasWilkinson #27235)
- [Spec Decode] Integrate Suffix Decoding from Arctic Inference (@aurickq #25784)
- [Feature][Benchmarks] Support `inf` burstiness (@sducouedic #26941)
- [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (@Lucaskabela #27764)
- [Bugfix] change FlashMLA reorder_batch_threshold (@MatthewBonanni #27777)
- [Docs] add runai_streamer_sharded to LoadConfig (@andyxning #27937)
- Add TP parameter to attention tests (@MatthewBonanni #27683)
- [Bugfix][plugin] fla crash on plugin (@ILikeIneine #27322)
- [Bugfix] Fix MoE Routing Simulation (@tlrmchlsmth #28002)
- Remove the tpu docker image nightly build. (@QiliangCui #27997)
- [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (@vllmellm #27748)
- [LoRA] Lora shrink swizzle (@li2haipeng #27694)
- [Refactor] Lazy import tool_parser (@chaunceyjiang #27974)
- [NIXL][XPU] Pin NIXL version to 0.7.0 (@zhenwei-intel #27849)
- [Metrics] Enable sleep state metric outside of dev mode (@markmc #27867)
- [Bug] Batch invariant: Fix flash attn MLA `RuntimeError: scheduler_metadata must have shape (metadata_size)` (@yewentao256 #27884)
- [CPU]Improve dynamic 4bit moe performance (@xiangze-arm #27240)
- [CI/Build] Update LM Eval Version in AMD CI (@zhewenl #27944)
- [KV Connector] Make KVCacheConfig an explicit constructor argument (@markmc #27887)
- [Model] fix ernie45 reasoning_parser (@CSWYF3634076 #27973)
- [CI/Build] Fix OpenAI API correctness on AMD CI (@zhewenl #28022)
- [BugFix][Performance] Restore flashinfer autotuning for all scenarios (@varun-sundar-rabindranath #27904)
- Load tuned fused_moe_lora shrink and expand kernel configs separately (@yugong333 #27435)
- Support using Int4PreshuffledTensor after loading (@jerryzh168 #26066)
- [Core] Enable StatLogger in LLMEngine (@zhuohan123 #28020)
- [Model][Bugfix] fix pipeline parallelism support for NemotronH (@tomeras91 #27968)
- [Model] add optimal triton fused moe configs for NemotronH MoE (@tomeras91 #27967)
- [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (@bnellnm #27123)
- [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (@njhill #28025)
- [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (@LyrisZhong #27284)
- [PERF] Decouple projections from GDN custom op (@vadiklyutiy #27512)
- [model] Add support for openPangu_Ultra_MoE (@yt0428 #27521)
- [PerfFix] Avoid separate thread for MP executor shm spin (@njhill #28012)
- [AsyncScheduling] Don't schedule past request max_tokens (@njhill #27922)
- Remove deprecated `--rope-scaling` and `--rope-theta` (@hmellor #28006)
- [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (@ganyi1996ppo #25763)
- Added disable rule to track files under benchmarks/lib (@nadavkluger #28048)
- [Multimodal] Make MediaConnector extensible. (@huachenheli #27759)
- [ROCm] gemm_a16w16 upstreaming (@maleksan85 #26969)
- Revert "[PERF] Decouple projections from GDN custom op" (@vadiklyutiy #28080)
- [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (@toulzx #27740)
- [XPU] Add gpt-oss model support for Intel GPU (@jikunshang #27786)
- [CI/Build] Enable some fixed tests in AMD CI (@zhewenl #28078)
- [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (@wangxiyuan #27955)
- [Bugfix] Fix encoder-only model support for transformers backend (@Isotr0py #28021)
- [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (@LucasWilkinson #28100)
- [Model, Core] Support Granite Speech & LoRA for STT (@alex-jw-brooks #24455)
- [Refactor] Lazy-loaded reasoning_parser (@chaunceyjiang #28092)
- [Refactor] to simplify and extract the shared logic between chat completion and responses (@chaunceyjiang #27961)
- [bugfix] fix wrong `dcp_local_seq_lens` calc (@pisceskkk #27518)
- [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (@KuntaiDu #28011)
- [Misc] fix import error for DeepSeekR1ReasoningParser (@chaunceyjiang #28114)
- Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (@minatoaquaMK2 #27635)
- Bugfix: Cutlass FP8 FusedMoE bad scaling factors (@amirkl94 #27255)
- [Graph Partition][Cache] Use inductor partition ops config (@BoyuanFeng #27702)
- [XPU] Enable custom routing functions in IPEX for Llama4 (@frost-intel #28004)
- add kimi reasoning parser (@MoyanZitto #28128)
- [DCP] check return_lse for all layers in dcp (@heheda12345 #27929)
- [BugFix] Support EP/DP + EPLB with MTP (@ilmarkov #25311)
- Enabling cooperative multi-gpu tests on multi-gpu nodes (@Alexei-V-Ivanov-AMD #27986)
- [ROCm][MLA] Support block-size > 1 for AITER MLA backend (@ganyi1996ppo #27224)
- [Bugfix] Validate custom logits processor xargs for online serving (@Isotr0py #27560)
- [misc] add vLLM Beijing Meetup (@jjzhang #28127)
- [Kernel] Fuse computation of g and beta for Gated Delta Net (@ZJY0516 #28095)
- [Core] add support for reasoning parser plugins (@walterbm #28075)
- [Bugfix] vLLM should check Inductor config for compile cache enablement status (@gmagogsfm #27637)
- [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (@heheda12345 #27994)
- [CI]: Add LMCache Unit Tests (@sammshen #27852)
- [Feature] Extend batch invariant torch.compile to B200 (@PaulZhang12 #27856)
- [Bugfix] Fix Qwen3-Reranker-8B load (@noooop #28117)
- [Docs] Clean up README_TUNING.md (@windsonsea #28088)
- [Hardware][IBM Z] Optimize s390x Dockerfile (@R3hankhan123 #28023)
- [Chore] Remove Nemotron-Nano-VL config copy (@Isotr0py #28126)
- [Docs] Add guide to debugging vLLM-torch.compile integration (@zou3519 #28094)
- [Feature]: Add corrupted request metric to V1 metrics system. (@atalhens #27306)
- [CI/Build] Update checking logic in cutlass_group_gemm_supported (@zhewenl #27948)
- [CI/Build] Fix `test_defaults_with_usage_context` in AMD CI (@zhewenl #27926)
- [Core][Hybrid allocator + connector 2/n] Unify `remove_skipped_blocks` by `get_last_useful_token` (@KuntaiDu #25431)
- [Debugging] Add annotation for easier trace analysis (@dayeol #22496)
- [PERF] Decouple projections from GDN custom op. Attempt 2 (@vadiklyutiy #28083)
- [Bug] Fix cpu disable shared_experts `VLLM_DISABLE_SHARED_EXPERTS_STREAM` (@yewentao256 #28157)
- [Bug] Fix env string `"0"` same to `True` (@yewentao256 #28159)
- [Feature] Enable TP + EP `shared_experts` overlap with router, 3.7% E2E performance improvement (@yewentao256 #28164)
- [CI Failure] `nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV` was removed from HF. Skip it in tests (@vadiklyutiy #28170)
- [Misc] Remove the duplicate code (@chaunceyjiang #28111)
- [Chore] Clean up deepseek v2/v3 config copy (@Isotr0py #28055)
- [Core][MM] Use non-blocking CPU-GPU copy of multimodal data (@lgeiger #28141)
- Make the cv2 dependency optional (@cmpute #27780)
- [CI] Add compile/test_multimodal_compile.py to CI (@gmagogsfm #28151)
- [flashinfer] fix FI all2all with FI cutlass moe (@mxz297 #28166)
- Patch Mistral Tokenizer (@juliendenize #28146)
- Fix hard-coded parameter name in gemma3n.py (@seungduk-yanolja #27946)
- [CPU] Enable torch profiling (@aditew01 #28130)
- [V0 deprecation]clean up is_v1_supported_oracle (@wangxiyuan #28116)
- [Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (@courage17340 #28181)
- [Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (@chaunceyjiang #26874)
- [CPU]Improve cpu fused moe perf (@xiangze-arm #27244)
- Disable nm-testing models with issues in CI (@mgoin #28206)
- [Docs] Switch to directory style URLs (@hmellor #28058)
- [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (@minatoaquaMK2 #28200)
- [Doc] Add Arm CPUs are on the list of supported targets in vLLM (@milpuz01 #26018)
- [HARDWARE][CPU] Add Option for Disabling Binding to Specific CPU Cores (@StanHatko #27953)
- [Frontend] Fix logging format when enable response logging (@esmeetu #28049)
- CODEOWNERS: Add myself as reviewer on security docs (@russellb #28216)
- [Structured outputs] Upgrade llguidance to 1.3.0 (@andylolu2 #28039)
- Add llama 4 scaling support (@juliendenize #28145)
- [Chore] eliminate duplicated and unconditional object serialization in anthropic messages api (@vicoooo26 #27792)
- [ROCm] triton fp8 kernel (@maleksan85 #27058)
- [Doc]: Make extraInit containers fully configurable in helm chart (@HanFa #27497)
- [Test] Add non-MoE DP test coverage (@MatthewBonanni #28235)
- [BugFix] Fix FusedMoELoRA + ModularKernel Integration (@varun-sundar-rabindranath #28237)
- Fix failing test for CRadio (@BloodAxe #27738)
- Speed up mm processor kwargs per request by splitting dynamic and static kwargs (@LJH-LBJ #26483)
- [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (@Lucaskabela #28242)
- [CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (@alex-jw-brooks #28247)
- Add runai model streamer e2e test for GCS (@amacaskill #28079)
- Fix issues from #28242 (@hmellor #28257)
- [amd][gptoss] Perf gain because of block alignment (@smitkadvani #28024)
- [Bug] Fix missing token_ids for reasoning parser models in chat completions #28246 (@baonudesifeizhai #28256)
- [CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (@Copilot #28074)
- [Kernel] LoRA triton kernels support PDL (@jeejeelee #27402)
- [Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (@Jialin #28171)
- [FixBug]Aeala/ShareGPT_Vicuna_unfiltered marked as multimodal benchmark (@princepride #28265)
- [CPU]Avoid repeated random sample compile (@xiangze-arm #28260)
- [Misc][Model][Refactor] Pass the prefix into Linear layers (@MengqingCao #28259)
- [fix] Revert "fixing mm placeholder replacement issue with gemma3" (@khluu #28285)
- [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (@lgeiger #28168)
- [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (@pavanimajety #27439)
- [CLI] add --max-tokens to `vllm complete` (@Iceber #28109)
- [Feature] Default `ignore_eos` True for `random` dataset (@yewentao256 #28227)
- [Log] update shm wait time msg (@BoyuanFeng #28255)
- Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (@NickLucche #28289)
- [README] Add Arm CPUs to the list of supported targets (@fadara01 #28290)
- [doc] add guide about the provided PTX was compiled with an unsupported toolchain (@youkaichao #28305)
- [Build] Fix release pipeline failing annotation (@simon-mo #28272)
- [Bugfix] Fix and add tests for GptOss reasoning parser (@benchislett #28000)
- [Core] Rework handling of async scheduling config (@njhill #28250)
- [PerfFix] Avoid separate thread for MP executor shm spin (take 2) (@njhill #28319)
- Update Flashinfer from `v0.4.1` to `v0.5.2` (@hmellor #27952)
- [XPU] Enable Expert parallel for MoE models (@jikunshang #28263)
- remove resolve_op_overloads and use splitting_ops directly (@BoyuanFeng #28081)
- [Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (@xiaohongchen1991 #21068)
- Update gpu.rocm.inc.md to add support for AMD Ryzen AI MAX / AI 300 Series (gfx1151, gfx1150) (@hammmmy #28308)
- [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (@mgoin #28124)
- Bump arctic-inference requirement (@aurickq #28174)
- [bugfix] support eagle with lora cudagraph specialization (@gnovack #28318)
- [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (@Isotr0py #28101)
- Refactor CPU/GPU extension targets for CMake build (@ashahba #28026)
- [flashinfer][fix] do not check nvcc availability when using pre-downloaded cubins (@mxz297 #27990)
- [Attention] Remove max cudagraph size limit of 992 (@22quinn #27840)
- `reasoning_content` -> `reasoning` (@hmellor #27752)
- [Bugfix] Update device name for H200 detection (@robertgshaw2-redhat #28349)
- [Bugfix] Spec decode + structured output + spec model max len edge case (@andylolu2 #28298)
- [DCP] Support dcp kv_cache interleave size > 1 (@zhangsicheng5 #26696)
- Enhance run_cluster.sh for multi-NIC support (@evberrypi #28328)
- [Feat] Drop-in Torch CUDA Profiler (@benchislett #27841)
- Remove setuptools upper bound constraint (<80) (@ColeMurray #28337)
- [Bugfix] Fix test fused quant layernorm tests (@ElizaWszola #27865)
- [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (@mmangkad #28345)
- [chore] Move some wikimedia images to S3 (@khluu #28351)
- fix: close issue 28338 by fixed python version (@yihong0618 #28339)
- [Misc] fix typo and add detailed log (@andyxning #28178)
- [ROCm] Add env to enable/disable aiter triton gemm (@sarckk #28321)
- [Misc] Add some comments in qwen3-next (@ZJY0516 #28267)
- [CI] Fix flaky `test_eagle_correctness` test (@NickLucche #28364)
- [Core] Simplify async KV output aggregation (@njhill #28327)
- [Core] Separate out attention metadata building logic from prepare inputs (@LucasWilkinson #26764)
- [BugFix] Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (@usberkeley #28214)
- [CI/Build] Temporary fix to LM Eval Small Models (@zhewenl #28324)
- [Kernel] Fix fused_gdn_gating (@ZJY0516 #28343)
- [ROCm][Platform] Add RX7900XTX device id in _ROCM_DEVICE_ID_NAME_MAP (@JartX #28279)
- [CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (@varun-sundar-rabindranath #28322)
- [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X (@sammysun0711 #28373)
- [V0 deprecation] Remove no longer used `get_metadata_cls` (@LucasWilkinson #28370)
- Restore PlaMo2 unit test as `pfnet/plamo-2-1b` now supports `transformers >=4.56` (@Alnusjaponica #28019)
- [Metrics] Refactor LoRA state tracking (@markmc #26801)
- [bugfix] fix siglip batch text output error (@piood #28365)
- [Fix] optimize visual token mask with caching and multi-token support (@bo-ke #28374)
- Add @tjtanaa to codeowner for ROCm and multi-modal (@tjtanaa #28360)
- [Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model (@zejunchen-zejun #27474)
- [Kernel] Optimization of the mm_k operator. (@caozuoba #28280)
- [RFC][ROCm][AITER] Keep all AITER kernels in `_aiter_ops` class like `_custom_ops` and `_ipex_ops` (@vllmellm #24490)
- [V0 Deprecation] Remove unused `context_len` and `seq_len` from M-RoPE (@DarkLight1337 #28395)
- [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (@varun-sundar-rabindranath #28366)
- [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (@jiahanc #27492)
- [Bugfix] Fix llguidance backend, rollback when EOS was encountered (@Flechman #25905)
- [FA/Chore] Bump FA version for FP8 two-level accumulation (@jmkuebler #27889)
- [Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (@SageMoore #28377)
- [Misc] Add more scoping for improved trace (@frank-wei #28329)
- [BugFix] Fix DeepGEMM over-allocating workspace (@LucasWilkinson #28254)
- [Frontend][2/n] remove empty content from _parse_tool_calls_from_content (@qandrew #28331)
- [CI] Fix Plugin Tests Tests (@robertgshaw2-redhat #28413)
- [ROCm] Add missing gemm_a8w8_blockscale import (@sarckk #28378)
- [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (@ilmarkov #24248)
- [Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for better coverage (@Jialin #27896)
- [Bugfix] Ensure calculated KV scales are applied in attention. (@adabeyta #27232)
- [Test] Remove old non-varlen FA2 test (@MatthewBonanni #28420)
- [Feature] Refactor batch invariant fp8 DeepGEMM (@yewentao256 #27606)
- [CI/Test Fix] Fix CP tests on Blackwell (@LucasWilkinson #28404)
- [Feature] Add env var `VLLM_MOE_USE_DEEP_GEMM` (@yewentao256 #28422)
- Only register rocm_aiter_ops if aiter is found (@mgoin #28428)
- Fix rotary embedding benchmark script (@xyang16 #28323)
- [Misc] FlattenLogprobs -> FlatLogprobs (@zhuohan123 #28335)
- [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (@zhaozuy #27892)
- [Bugfix] Fix Stream Sync for Shared Expert Overlap (@robertgshaw2-redhat #28430)
- [Doc] Sleep mode documentation (@iAmir97 #28357)
- [BugFix] Avoid calling KV connector layer APIs when metadata is unset (@sdavidbd #28253)
- [Bugfix] Fix max image size for PaddleOCR-VL (@ywang96 #28442)
- [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (@SageMoore #28369)
- [Bugfix] fix qwen3-next crash (@ZJY0516 #28202)
- [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` (@faaany #28387)
- [Model][Qwen3VL] Slightly speedup `fast_pos_embed_interpolate` (@lgeiger #28434)
- Multi turn benchmark progress bar for synthetic conversation generation (@segevido #28394)
- [CI] Add mergify rules for `nvidia` label (@mgoin #28417)
- [Attention] Refactor CUDA attention backend selection logic (@MatthewBonanni #24794)
- Fix Fused MoE LoRA Triton kernel bug (@chaojun-zhang #28450)
- [Model] Pass `mm_features` directly into `get_mrope_input_positions` (@DarkLight1337 #28399)
- Add request timeout override for multi-turn benchmarks (@segevido #28386)
- [Docs] Fix grammar in CPU installation guide (@maryamtahhan #28461)
- [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (@bnellnm #28064)
- [BugFix] Fix Failing Ruff Check (@jvlunteren #28469)
- Add @markmc to CODEOWNERS for Observability (@markmc #28457)
- [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (@faaany #28444)
- [BugFix] Add test_outputs.py to CI pipeline (@usberkeley #28466)
- [Doc] Fix typo in serving docs (@the-codeboy #28474)
- Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (@mgoin #28431)
- [NIXL] Generalize block-first backend layouts (FlashInfer-like) (@NickLucche #28282)
- [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (@izhuhaoran #27165)
- [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (@xuebwang-amd #24239)
- [Quantization] fix attention quantization of gpt_oss model (@xuebwang-amd #27334)
- [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (@zhewenl #28424)
- Prefer FlashAttention MLA as default over FlashMLA (@MatthewBonanni #27363)
- [Kernel] Optimize rms_norm kernel (@xyang16 #27931)
- [BugFix] Fix Siglip2Attention on XPU (@faaany #28448)
- [Misc] Remove unused attention prefix prefill ops functions (@lgeiger #26971)
- [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (@Jialin #28245)
- [V0 deprecation] Clean up num_prefill_tokens logic for V0 (@gcanlin #28203)
- [Misc] fix typo in DCP comment (@Livinfly #28389)
- [LoRA][1/N]Remove LoRA extra vocab (@jeejeelee #28382)
- [TPU] Rename path to tpu platform (@kyuyeunk #28452)
- [Misc] Cleanup Executor interface (@wangxiyuan #28441)
- Add Zurich vLLM Meetup (@mgoin #28488)
- [Bugfix] Disable shared expert overlap if Marlin MoE is used (@mgoin #28410)
- [Feature] Allow configuring FlashInfer workspace size (@maxyanghu #28269)
- Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (@adabeyta #28491)
- [BugFix] Graceful handling of torch symm mem errors. (@ilmarkov #27671)
- [Frontend] Change CompilationMode to a proper Enum (@gmagogsfm #28165)
- [Performance] Cache loaded custom logitsprocs to avoid overheads (@Isotr0py #28462)
- [V0 deprecation] Remove VLLM_USE_V1 env (@wangxiyuan #28204)
- [CPU] Refactor CPU attention backend (@bigPYJ1151 #27954)
- `VLLM_USE_TRITON_FLASH_ATTN` V0 variable deprecation (@AndreasKaratzas #27611)
- [Model][Qwen3VL] Simplify `get_mrope_input_positions` using numpy (@lgeiger #28302)
- [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (@fake0fan #25233)
- [BugFix] Add fallback path in `apply_rotary_pos_emb_flashattn` for non-cuda platforms (@faaany #28447)
- [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (@ai-jz #28493)
- [Core] Cache `vllm_is_batch_invariant` (@lgeiger #28304)
- [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (@fake0fan #28521)
- [CI] Introduce autorun_on_main feature (@hl475 #27836)
- [BugFix]: --enable-lora with model granite-4.0-micro crash (@yyzxw #27733)
- [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (@wuyaoxuehun #27597)
- [XPU]Fix crash due to removed VLLM_USE_V1 attribute (@chaojun-zhang #28520)
- [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (@ziruiliu #27978)
- add cpu option for p/d in nixl_connector (@ZhengHongming888 #28356)
- [ROCm] [Bugfix] Fix `fused_qknorm_rope_kernel` rocm compatibility (@tjtanaa #28500)
- [Bugfix] Fix gpt_oss packed_modules_mapping (@jeejeelee #28536)
- [V0 deprecation] Deprecate use_v1 parameter (@wangxiyuan #28112)
- Fix pre-commit (and XPU) on `main` (@hmellor #28556)
- [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (@alexm-redhat #28492)
- [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (@gcanlin #28271)
- [Misc] Refactor Attention kv transfer methods into decorator (@NickLucche #27816)
- Remove deprecated fields from `CompilationConfig` (@hmellor #27593)
- [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (@benchislett #28479)
- Implement ARC KV cache eviction policy (@albertoperdomo2 #27039)
- [EPLB][ROCm]: support EPBL for ROCm backend (@PerryZhang01 #27731)
- [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (@tdoublep #28563)
- [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (@hmellor #28559)
- [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (@ApostaC #27902)
- [BugFix] Priority scheduling and spec tokens preemption (@andylolu2 #28558)
- [Misc]Fix typo in llm_engine.py (@frank-wei #28584)
- [Performance][B200] Fix deepgemm prologue (@varun-sundar-rabindranath #27897)
- [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in `_aiter_ops.py` (@vllmellm #28464)
- [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (@QiliangCui #28487)
- [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (@mgoin #28294)
- [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (@HollowMan6 #12695)
- [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (@alexm-redhat #28406)
- Skip models that cannot currently init on Transformers v5 (@hmellor #28471)
- [Docs] Update meetups.md description (@mgoin #28583)
- [ROCm][Bugfix] Revert removing setuptools version restriction (@gshtras #28592)
- [platform] Move get_cu_count to utils (@wangxiyuan #27005)
- [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (@mgoin #28561)
- [BugFix] Fix `mm_encoder_attn_backend` arg type checking (@njhill #28599)
- [Docs] Add some details about what the MoE block needs for the Transformers backend (@hmellor #28588)
- Rename clashing method names for vLLM model protocol (@hmellor #27583)
- [n-gen] DO NOT repeatedly return finished child requests (@Jialin #28591)
- [Frontend] split append tool output (@qandrew #28333)
- [Frontend][responsesAPI][1/n] convert responses API tool input to chat completions tool format (@qandrew #28231)
- [BugFix][ROCm] Fix `get_cu_count` missing variable error (@ganyi1996ppo #28608)
- [XPU] Support Triton path for LoRA operations on XPU (@faaany #28511)
- Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 (@luccafong #28574)
- [build][cmake]: Bundle static ACL and torch libgomp for CPU extension builds (@Radu2k #28059)
- [ROCm][BugFix] Remove the usage of `device_info` from aiter (@ganyi1996ppo #28383)
- [Bugfix] Prevent crash on empty grammar string (@tjandy98 #28210)
- Use official xformers-0.0.33 built for PT 2.9 (@huydhn #28600)
- Add NUMA node validation for CPU thread binding (@usberkeley #28555)
- [Bugfix] fix kimi-linear crash (@ZJY0516 #28445)
- [Frontend] supports interleaved thinking (@chaunceyjiang #28531)
- Support all interleaved layer types (@sarckk #28485)
- Fix: Correctly filter special tokens in benchmark_prefix_caching (@dw2761 #28615)
- [BugFix] Fix type error when assign a trition kernel tensor to a torch.nn.Parameter (@liuzijing2014 #28603)
- Fix io processor pooling #28273 (@baonudesifeizhai #28484)
- [XPU] add sym params to IPEXConfig (@zufangzhu #28611)
- [Bugfix] Fix FPS value type for Qwen2.5-Omni video processing (@faaany #28630)
- [Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (@Akashcodes732 #28535)
- [ROCm][BugFix] Fix `get_cu_count` in rocm_aiter_fa.py (@ganyi1996ppo #28618)
- [CI/Build] Install uv for AMD MI300: Language Models Tests (Hybrid) %N (@amdfaa #28142)
- [CI Failure] Fix backend selection for encoder-only models (@hl475 #28534)
- [BugFix] DeepSeek-OCR: apply NoRepeatNGramLogitsProcessor to greedy path (@YuanpingSong #28617)
- Fix `get_num_experts` when config sets it explicitly to `None` (@hmellor #28652)
- [Misc] Turn off encoder torch compile by default (@ywang96 #28634)
- Rewrite C++ meta funcs to Python (@janeyx99 #28595)
- [BugFix] Ensure `EngineArgs.create_engine_config` is idempotent (@njhill #28515)
- [TPU] patch TPU wheel build script to resolve metadata issue (@jcyang43 #27279)
- [Performance][B200] silu_mul_quant: pack scales in int32 (@varun-sundar-rabindranath #28358)
- [Bugfix] Fix validate model input for decoder models (@yannicks1 #27099)
- [Attention][Bugfix] Fix FA sink support (@MatthewBonanni #28660)
- [Perf] Support stream interval for reducing host overhead (@elvischenv #27869)
- [bugfix] correct local_chunk_len for DCP in reorg_kvcache with long context (@pisceskkk #28526)
- [Bugfix] Eliminate tuple inputs to submodules in graph partitioning (@gmagogsfm #28533)
- [Bugfix] [CPU] bump torch to 2.9.0 for Darwin to fix segmentation fault (@kebe7jun #27791)
- [Misc] Update CODEOWNERS for simon-mo and comaniac (@simon-mo #28675)
- [CI] Bug: Fix ci entrypoint pooling (@yewentao256 #28684)
- [KV Connector] Test async mode in scheduler tests (@markmc #28550)
- Mirrored test group definitions for AMD (2025-11-11) (@Alexei-V-Ivanov-AMD #28573)
- [quantization][config] enable override existing quant_config (@ILikeIneine #28510)
- [ROCm] Bump up the version of amd-smi to 6.4.3 (@SageMoore #28680)
- [CPU][Bugfix] Fix Apple Silicon M1 compilation failure (@mgoin #28681)
- [ci][amd] fix basic models extra init test (@bradleyhd #28676)
- [Misc] Remove `warn_for_unimplemented_methods` (@DarkLight1337 #28613)
- [XPU][CI]disable lm cache uts (@jikunshang #28696)
- [Misc] Update xformers to 0.33.0.post1 (@ywang96 #28678)
- [Misc] add ignore mapper for quark quantization (@haoyangli-amd #28275)
- [Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_inference/spec_decode.py (Issue 27619) (@rasmith #28432)
- [BugFix][CI/Build][ROCM] Fix import error and apply assert in appropriate case in test_struct_output_generate (@rasmith #28311)
- use default CCL_ZE_IPC_EXCHANGE (@yma11 #28700)
- [Bugfix] fix dots.ocr pp support (@ZJY0516 #28705)
- [BugFix] Fix multi-modal async scheduling race condition (@njhill #28706)
- Add output token counting to gsm8k eval (@mgoin #28594)
- [Minor] avoid register new custom and just import silly_attn (@BoyuanFeng #28578)
- [Misc] fix comment in test_envs (@xingliu14 #28529)
- [feat]: log number of preempted requests (@610lyn #28522)
- [Frontend] Added chat-style multimodal support to /classify. (@WorldExplored #27516)
- [Model][MM] Extract conv layer as CustomOp (@shen-shanshan #28455)
- [DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer (@gjc0824 #25438)
- Fix KV sharing fast prefill with cudagraph enabled (@sarckk #28537)
- [BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (@LucasWilkinson #28702)
- [Doc] Fix macOS installation dependency resolution issue (@shahfasal #26721)
- [Model] Fix bailing_moe accuracy problem (@zhaozx-cn #28277)
- [Bugfix][Nixl] Fix kernel physical<>logical block_size issue (@NickLucche #28677)
- [Config] Clean up SchedulerConfig initialization (@DarkLight1337 #28665)
- [Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (@djmmoss #27134)
- [Fix] improve aspect ratio in dummy image generation and add common VLM tests for PaddleOCR-VL (@dongbo910220 #28711)
- [Docs] Update the name of `Transformers backend` -> `Transformers modeling backend` (@hmellor #28725)
- [CI][CPU] Smoke test for Apple Silicon using GHA MacOS runner (@mgoin #28688)
- [DisaggEverything] Tokens in<>out `/generate` endpoint (@NickLucche #24261)
- [Attention] Bump FA for removed method (@MatthewBonanni #28429)
- Fix typo in comment: existance -> existence (@OthmanMohammad #28737)
- Remove audio optional dependency for mistral-common (@juliendenize #28722)
- [kernel] Improve FP8 PTPC on Hopper for larger shapes (@czhu-cohere #28692)
- docs(lora_resolvers): clarify multi-resolver order and storage path requirement (@wangchen615 #28153)
- LLaMA4 LoRA Adapter Enablement (@kfhfar #28602)
- [Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with torch compile dynamo (@tjtanaa #28716)
- [Docs] Enable some more markdown lint rules for the docs (@hmellor #28731)
- [Chore] Rename `SchedulerConfig.chunked_prefill_enabled` (@DarkLight1337 #28735)
- [Bugfix] resolve Qwen3-VL GPTQModel quantized model loading failure (@GuanH #28663)
- [BugFix] Fix misprint introduced by modular_kernel refactoring. (@halyavin #28728)
- [ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (@SageMoore #28682)
- [CI] Fix macos smoke test uv cache issue (@mgoin #28736)
- [Bugfix] TypeError: 'NoneType' object is not callable (@mostrowskix #27410)
- [ROCm][CI/Build] Change install location of uv (@gshtras #28741)
- Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDipatch (@laithsakka #25110)
- [Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (@alexm-redhat #28740)
- [Bugfix] Fix ChunkedLocalAttention CUDA Graph setting (@benchislett #28739)
- [Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 (@tdoublep #28295)
- [Log] Save profiler results to file instead of stdout (@rasmith #28144)
- [ROCm][CI/Build] Upgrade to ROCm 7.1 and AITER main (@gshtras #28753)
- [Test] Rework e2e async scheduling tests (@njhill #28744)
- [Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization (@Jialin #26368)
- [TPU] Fix import error in tpu launch (@QiliangCui #28758)
- [Model][Qwen3VL] Use `mm_position` to compute mrope positions (@lgeiger #28730)
- [Bugfix] Build hadacore kernels on >SM90 (@mgoin #28748)
- Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… (@njhill #28773)
- Fix IntermediateTensors initialization and add type hints (@OthmanMohammad #28743)
- [NIXL] heterogeneous block_size support (@xuechendi #26759)
- [Performance][DeepGEMM] Estimate expected_m (@varun-sundar-rabindranath #28694)
- [Redo] #26368 (@DarkLight1337 #28771)
- [RL] [V1] Remove unused device argument from reset_kv_cache (@zhuohan123 #28766)
- Use narrow over indexing in `hadacore_transform` to prep for ABI stable (@janeyx99 #28756)
- [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (@zhewenl #28709)
- [Misc] Make `SchedulerConfig.max_model_len` init-only (@DarkLight1337 #28733)
- [PERF] Remove TRTLLM Gen attn kernel limitation `max_seq_len <= 131072` (@vadiklyutiy #28755)
- [compile] Enable sequence parallelism matching w/o custom ops enabled (@angelayi #27126)
- Allow Gemma3 to take image embeddings (@tingtingtangmeta #28483)
- [Doc] Fix failing doc build (@DarkLight1337 #28772)
- [Model] Fix lmhead init bug of bailing_moe (@hwhaokun #28777)
- Add support for Eagle with separate lm-head and embed_tokens layers (@eldarkurtic #28549)
- [CI] Fix broken pipeline (@njhill #28781)
- [Model][Qwen3VL] Cache positional embedding indices (@lgeiger #28475)
- [Doc]: fix typos in various files (@didier-durand #28567)
- [BugFix] Fix `AssertionError: DCP not support reorder_batch_threshold > 1 now.` (@LucasWilkinson #28751)
- Adding a benchmark for batch invariance (@bwasti #28161)
- [Benchmark] Fix client seed synchronization in multi-turn benchmark (@ai-jz #28512)
- [Model] Allow users to control skip reading cache per request. (@noooop #28194)
- [V1] Support MP Executor for multi node distributed inference (@luccafong #23691)
- Fixed gpt-oss _load_weights_other() parameter position bug (@River12 #28715)
- [Bugfix] Fix host and port join for ipv6 in bench serve (@scottzh8 #28679)
- Fix gpt oss weight loading with EP + bf16 (@ashors1 #28765)
- [Doc]: fix typos in various files (@didier-durand #28811)
- fix comment typo (@andyxning #28802)
- [Model][QwenVL] Optimize `Qwen2_5_VisionAttention` q,k preparation (@lgeiger #28769)
- Feature: Support Relu2 in FusedMoE fp8 cutlass path (@amirkl94 #27261)
- [BugFix] Fix async scheduling + chunked prefill + preemption (@njhill #28787)
- [Performance][Fix] update nvfp4 code to support renorm routing (@jiahanc #28569)
- [NIXL][XPU] update install script of NIXL (@zhenwei-intel #28778)
- [ROCm][Qwen3-32B] Fix AITER MHA accuracy issue cause by #25763 (@sammysun0711 #28670)
- [Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode (@jscaldwell55 #28543)
- [Doc] Add llama4 LoRA tag (@jeejeelee #28825)
- [CPU][Bugfix] Fix _to_list in CPU model runner (@bigPYJ1151 #28824)
- [BugFix] Fix glm4_moe_mtp load weights bug (@wuyaoxuehun #28805)
- [Metrics] Fix KV cache usage percent metric multiproc (@jaywonchung #28792)
- [XPU] work around for sp, avoid custom op import error (@jikunshang #28822)
- [BugFix] Temporary fix for IMA with MTP = 2 and full-cg (@LucasWilkinson #28315)
- [Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models (@ywang96 #28858)
- Cast return value to int64_t for cache size (@tiehexue #28814)
- [Bugfix] Fix GPT-OSS on AMD after #28603 (@zhewenl #28816)
- [Core] Async Scheduling X Spec Decoding Compatibility (@Ronald1995 #24799)
- [BugFix] Fix PP performance and PP kv connector output regression (@njhill #28768)
- [Quantization] [Eagle] Add complete quantization support to the draft model in Eagle (@shreyas269 #28435)
- [Test] Batch Invariant: Rename and organize tests (@yewentao256 #27421)
- [Model] Add Afmoe architecture implementation (@pranav4501 #28332)
- [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp >1 (@bangshengtang #28774)
- [Misc] Fix wrong comment in scheduler (@zhuohan123 #28880)
- [Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (@bbartels #28831)
- Run macos smoke test workflow on main commit (@mgoin #28752)
- [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (@xuebwang-amd #28638)
- [Refactor] Remove Unused Func in Batch Invariant (@yewentao256 #28881)
- [Bugfix] Fix wrong CLI defaults for dynamic `SchedulerConfig` fields (@DarkLight1337 #28872)
- [Doc]: fix typos in various files (@didier-durand #28863)
- [Misc] Remove unnecessary parentheses from log statements (@andyxning #28897)
- [CI] Fix async scheduling + spec decoding test flake (@njhill #28902)
- [MISC] Remove format.sh (@KuntaiDu #28906)
- [CI/Build] Replace wikipedia url with local server ones (@Isotr0py #28908)
- [BugFix] Fix PP/async scheduling with pooling models (@njhill #28899)
New Contributors
- @bwasti first commit is #25603
- @Renovamen first commit is #25796
- @patrick-toulme first commit is #25084
- @kingsmad first commit is #25825
- @yingjun-mou first commit is #25827
- @zhoukezi first commit is #25854
- @leejnau first commit is #25706
- @adabeyta first commit is #25513
- @acisseJZhong first commit is #25912
- @a120092009 first commit is #25942
- @Anionex first commit is #25354
- @DrStone1971 first commit is #25843
- @certainly-param first commit is #25935
- @natoscott first commit is #26007
- @kmaehashi first commit is #26005
- @leo-pony first commit is #25470
- @huijjj first commit is #24947
- @levunet first commit is #24768
- @Egor-Krivov first commit is #25668
- @sixiang-google first commit is #25992
- @astralord first commit is #26027
- @jasl first commit is #26098
- @nrghosh first commit is #26148
- @southfreebird first commit is #25974
- @soldni first commit is #26054
- @yuafng first commit is #26219
- @ILikeIneine first commit is #25823
- @jasonlizhengjian first commit is #25998
- @elieserr first commit is #26177
- @orangeng first commit is #26266
- @ymoslem first commit is #26258
- @abhisheksheth28 first commit is #25521
- @seven-mile first commit is #26231
- @cfRod first commit is #26289
- @atalhens first commit is #26265
- @gholmes829 first commit is #25164
- @dcampora first commit is #25945
- @antrec first commit is #26340
- @plliao first commit is #26325
- @morrison-turnansky first commit is #26113
- @isharif168 first commit is #26347
- @Barry-Delaney first commit is #25931
- @utkarshsharma1 first commit is #26279
- @Aydin-ab first commit is #25283
- @therealnaveenkamal first commit is #25103
- @QierLi first commit is #24926
- @zhiyuan1i first commit is #24486
- @iwzbi first commit is #16601
- @roikoren755 first commit is #25947
- @luis5tb first commit is #25593
- @wangxiongts first commit is #25550
- @sangho-vision first commit is #26563
- @muzian666 first commit is #26562
- @HsChen-sys first commit is #22100
- @FENP first commit is #26574
- @gjgjos first commit is #26339
- @andycandy first commit is #26629
- @aitsvet first commit is #26713
- @cyb70289 first commit is #26698
- @kfhfar first commit is #26538
- @n1ck-guo first commit is #24024
- @ryanli first commit is #26758
- @VladOS95-cyber first commit is #26726
- @zklapow first commit is #26818
- @HDCharles first commit is #26820
- @Dhruvilbhatt first commit is #26837
- @madongfly first commit is #26853
- @li2haipeng first commit is #26319
- @pdasigi first commit is #26143
- @cern1710 first commit is #26637
- @inc-jeong first commit is #26225
- @bogdanminko first commit is #27008
- @mandy-li first commit is #26883
- @kimbochen first commit is #26943
- @staghado first commit is #26916
- @rkarhila-amd first commit is #25586
- @hyongtao-code first commit is #27101
- @jianyuh first commit is #27159
- @uyzhang first commit is #27012
- @shivampr first commit is #26268
- @helunwencser first commit is #26832
- @dagrayvid first commit is #27196
- @ExtReMLapin first commit is #27253
- @ReinForce-II first commit is #26789
- @LiuLi1998 first commit is #22627
- @sagiahrac first commit is #27211
- @fangpings first commit is #27133
- @jonathanc-n first commit is #27372
- @bradleyhd first commit is #27124
- @Navya1707 first commit is #27156
- @piood first commit is #27324
- @xxxxyu first commit is #26092
- @usberkeley first commit is #27419
- @strinczer first commit is #26706
- @hjh0119 first commit is #27469
- @wpc first commit is #27328
- @yeshsurya first commit is #27188
- @rogeryoungh first commit is #27535
- @dcmaddix first commit is #27291
- @tingtingtangmeta first commit is #27538
- @minatoaquaMK2 first commit is #27323
- @wangln19 first commit is #27565
- @junpuf first commit is #27596
- @sammshen first commit is #27600
- @mpashkovskii first commit is #26886
- @KevinCheung2259 first commit is #27670
- @sammysun0711 first commit is #27623
- @dumb0002 first commit is #24176
- @sairampillai first commit is #25775
- @FlamingoPg first commit is #27794
- @SumanthRH first commit is #27789
- @PaulZhang12 first commit is #27660
- @jakub-sochacki first commit is #26919
- @RobMulla first commit is #27824
- @yugong333 first commit is #27818
- @ai-jz first commit is #27850
- @xiaohajiayou first commit is #26779
- @biswapanda first commit is #27728
- @efimki first commit is #24905
- @zhang-prog first commit is #27758
- @xiangze-arm first commit is #27240
- @yt0428 first commit is #27521
- @ganyi1996ppo first commit is #25763
- @nadavkluger first commit is #28048
- @toulzx first commit is #27740
- @frost-intel first commit is #28004
- @jjzhang first commit is #28127
- @walterbm first commit is #28075
- @dayeol first commit is #22496
- @cmpute first commit is #27780
- @seungduk-yanolja first commit is #27946
- @aditew01 first commit is #28130
- @milpuz01 first commit is #26018
- @StanHatko first commit is #27953
- @vicoooo26 first commit is #27792
- @HanFa first commit is #27497
- @amacaskill first commit is #28079
- @smitkadvani first commit is #28024
- @xiaohongchen1991 first commit is #21068
- @hammmmy first commit is #28308
- @ashahba first commit is #28026
- @zhangsicheng5 first commit is #26696
- @evberrypi first commit is #28328
- @ColeMurray first commit is #28337
- @bo-ke first commit is #28374
- @caozuoba first commit is #28280
- @zhaozuy first commit is #27892
- @maryamtahhan first commit is #28461
- @the-codeboy first commit is #28474
- @xuebwang-amd first commit is #24239
- @Livinfly first commit is #28389
- @AndreasKaratzas first commit is #27611
- @wuyaoxuehun first commit is #27597
- @ziruiliu first commit is #27978
- @ZhengHongming888 first commit is #28356
- @albertoperdomo2 first commit is #27039
- @PerryZhang01 first commit is #27731
- @Radu2k first commit is #28059
- @tjandy98 first commit is #28210
- @dw2761 first commit is #28615
- @zufangzhu first commit is #28611
- @amdfaa first commit is #28142
- @YuanpingSong first commit is #28617
- @janeyx99 first commit is #28595
- @xingliu14 first commit is #28529
- @610lyn first commit is #28522
- @WorldExplored first commit is #27516
- @gjc0824 first commit is #25438
- @shahfasal first commit is #26721
- @zhaozx-cn first commit is #28277
- @OthmanMohammad first commit is #28737
- @GuanH first commit is #28663
- @halyavin first commit is #28728
- @mostrowskix first commit is #27410
- @laithsakka first commit is #25110
- @hwhaokun first commit is #28777
- @River12 first commit is #28715
- @scottzh8 first commit is #28679
- @ashors1 first commit is #28765
- @jscaldwell55 first commit is #28543
- @tiehexue first commit is #28814
- @Ronald1995 first commit is #24799
- @shreyas269 first commit is #28435
- @pranav4501 first commit is #28332
Full Changelog: v0.11.0...v0.11.1