v0.11.1
Highlights
This release includes 1456 commits from 449 contributors (184 new contributors)!
Key changes include:
- PyTorch 2.9.0 + CUDA 12.9.1: Updated the default CUDA build to `torch==2.9.0+cu129`, enabling Inductor partitioning and landing multiple fixes in graph-partition rules and compile-cache integration.
- Batch-invariant torch.compile: Generalized batch-invariant support across attention and MoE backends, with explicit support for DeepGEMM and FlashInfer on Hopper and Blackwell GPUs.
- Robust async scheduling: Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, and DeepEP / DCP. We expect `--async-scheduling` to be enabled by default in the next release.
- Stronger scheduler + KV ecosystem: Improved test coverage in CI and made scheduler behavior more robust with KV connectors, prefix caching, and multi-node deployments.
- Anthropic API Support: Added support for the `/v1/messages` endpoint, allowing users to interact with `vllm serve` using Anthropic-compatible clients (see the sketch after this list).
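As a rough illustration of the new endpoint, here is a minimal sketch of calling `/v1/messages` with the `anthropic` Python SDK. The model name, port, and placeholder API key are assumptions, not part of the release; the server is assumed to have been started with something like `vllm serve Qwen/Qwen2.5-1.5B-Instruct --async-scheduling`.

```python
# Sketch only: assumes a local vLLM server on port 8000 started with, e.g.,
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct --async-scheduling
# and the `anthropic` Python SDK installed (pip install anthropic).
from anthropic import Anthropic

# Point the Anthropic-compatible client at the local vLLM server; the SDK
# appends /v1/messages to the base URL. The key is a placeholder, since vLLM
# only enforces a key when the server is launched with --api-key.
client = Anthropic(base_url="http://localhost:8000", api_key="EMPTY")

message = client.messages.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # must match the served model name
    max_tokens=128,
    messages=[{"role": "user", "content": "Summarize vLLM in one sentence."}],
)
print(message.content[0].text)
```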
Detailed release notes will be updated in the next few days.
What's Changed
- [Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Condition (@frankwang28 #25355)
- [Docs] Add Toronto Meetup (@mgoin #25773)
- [CI] Add E2E Blackwell Quantized MoE Test (@mgoin #25723)
- [V1] address post issues related to #20059 (part 1); cascade attention reenable by default (@fhl2000 #23046)
- [CI] Fix FlashInfer AOT in release docker image (@mgoin #25730)
- [spec decode] Consolidate speculative decode method name for MTP (@zixi-qi #25232)
- Reduce the Cuda Graph memory footprint when running with DBO (@SageMoore #25779)
- Kernel-override Determinism [1/n] (@bwasti #25603)
- [Bugfix] Optimize CpuGpuBuffer initialization (@namanlalitnyu #25447)
- [Spec decode] automatically disable mm for text-only draft models (@jmkuebler #25667)
- [Core] Don't count preempted tokens in prefix cache hit rate (@zhuohan123 #25787)
- Add option to restrict media domains (@russellb #25783)
- Add flashinfer-build.sh and register precompiled cu128 wheel in Dockerfile (@mgoin #25782)
- [Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement on qwen2.5vl (@david6666666 #22872)
- [Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (@yewentao256 #25788)
- [CI/Build] Consolidate model loader tests and requirements (@DarkLight1337 #25765)
- [CI/Build] Add timing to Model Executor Test (@22quinn #25799)
- [CI/Build] Reorganize root-level V1 tests (@DarkLight1337 #25767)
- [Misc] Fix codeowners override for v1 sample and attention (@22quinn #25037)
- [Misc] Update openai client example file for multimodal (@ywang96 #25795)
- [Bugfix] Add missing `image_size` for phi4_multimodal (@Renovamen #25796)
- [Bugfix] Merge MM embeddings by index instead of token IDs (@DarkLight1337 #16229)
- Validate API tokens in constant time (@russellb #25781)
- Add filtering for chat template kwargs (@russellb #25794)
- Fix GPTQ model loading in Transformers backend (@hmellor #25770)
- [Bugfix] Fix triton import precommit failure (@tlrmchlsmth #25803)
- [Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (@tlrmchlsmth #24982)
- [docs] Resolve transcriptions API TODO (@yyzxw #25446)
- [env] default nixl side port conflicts with kv-event zmq port (@panpan0000 #25056)
- [Core] Refactor self.model() to call a helper for subclassing. (@patrick-toulme #25084)
- [torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (@ZJY0516 #25651)
- [Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (@smarterclayton #25766)
- [Core] GC Debug callback (@Jialin #24829)
- [Bugfix][NIXL] Fix Async Scheduler timeout issue (@NickLucche #25808)
- [MM] Optimize memory profiling for scattered multimodal embeddings (@ywang96 #25810)
- [Bugfix] Fix Qwen3-VL regression from #24982 (@ywang96 #25814)
- [VLM] Update Qwen3-VL max_num_video_tokens calculation for configurable video profiling (@Isotr0py #25557)
- Fix random dataset mismatched token length with config. (@weireweire #24937)
- Update GLM-4.5 Doc transformers version (@zRzRzRzRzRzRzR #25830)
- [Bugfix] fix Qwen3VLMoe load when pp > 1 (@JJJYmmm #25838)
- Remove redundant cudagraph dispatcher warning (@mgoin #25841)
- [Misc] fix tests failure by using current_platform (@kingsmad #25825)
- [P/D] NIXL Updates (@robertgshaw2-redhat #25844)
- Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (@tdoublep #25832)
- [XPU]Fix xpu spec decoding UTs, avoid using cuda graph (@jikunshang #25847)
- [Bugfix] Fallback ViT attn backend to SDPA for blackwell (@ywang96 #25851)
- [V0 Deprecation][Models] Remove all V0 condition for mm embeddings merge (@Isotr0py #25331)
- [Misc] Remove more `get_input_embeddings_v0` (@DarkLight1337 #25857)
- update to latest deepgemm for dsv3.2 (@youkaichao #25871)
- [Bugfix] Fix requirements paths in install instructions (@yingjun-mou #25827)
- [Model][Bugfix] Fix issues in MiDashengLM implementation for quantized models (@zhoukezi #25854)
- [torch.compile] serialize cudagraph_mode as its enum name instead of value (@ZJY0516 #25868)
- [Cuda2CPU][P/D] Add cuda2cpu support in NixlConnector (@chenxi-yang #24690)
- [Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (@rahul-tuli #25883)
- [CI/Build] Include Transformers backend test in nightly transformers test (@Isotr0py #25885)
- [Model] Remove MotifForCausalLM (@jeejeelee #25866)
- [Bugfix] Use correct key "ignore" for config.json non-quantized layers (@leejnau #25706)
- [BugFix][torch.compile] KV scale calculation issues with FP8 quantization (#21640) (@adabeyta #25513)
- [Doc] Add documentation for vLLM continuous benchmarking and profiling (@namanlalitnyu #25819)
- [Bugfix][ROCm] Fixing trying to import non-existent symbols from libnccl.so (@gshtras #25605)
- [Kernel] Chunk-aligned mamba2 (@tdoublep #24683)
- [Doc] Polish example for torchrun dp (@zhuohan123 #25899)
- [NIXL] Increase default KV block eviction timeout on P (@NickLucche #25897)
- [V0 Deprecation] Remove `vllm.worker` and update according imports (@aarnphm #25901)
- Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT Models (@qthequartermasterman #25717)
- [Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (@yewentao256 #25909)
- [Benchmark] Support benchmark throughput for external launcher DP (@zhuohan123 #25913)
- Move `VllmConfig` from `config/__init__.py` to `config/vllm.py` (@hmellor #25271)
- [BugFix] Fix DP/EP hang (@LucasWilkinson #25906)
- [BugFix] Pass config_format via try_get_generation_config (@acisseJZhong #25912)
- [Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorrect `logical_not` (@zhoukezi #25925)
- [Bugfix]: Clean up chunked prefill logging when using whisper (@simondanielsson #25075)
- [New Model] DeepSeek-V3.2 (Rebased to Main) (@zyongye #25896)
- [Doc] Add Cambricon MLU support (@a120092009 #25942)
- Updated TRL integration docs (@sergiopaniego #25684)
- [Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (@CSWYF3634076 #25936)
- [Model] Move `vision_feature_select_strategy` into `resolve_visual_encoder_outputs` (@DarkLight1337 #25938)
- [perf] Use CPU tensor to reduce GPU->CPU sync (@lhtin #25884)
- [NIXL] Add support for MLA caches with different latent dim (@NickLucche #25902)
- [CI] Move applicable tests to CPU (@rzabarazesh #24080)
- [Fix] Improve CPU backend compatibility for RISC-V (@ihb2032 #25816)
- [Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 and FP8 (@Josephasafg #25858)
- Add Hugging Face Inference Endpoints guide to Deployment docs (@sergiopaniego #25886)
- [Bugfix][Model] Fix inference for Hunyuan dense models (@Anionex #25354)
- [Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (@pavanimajety #25895)
- [Bugfix] Token type and position embeddings fail to be applied to `inputs_embeds` (@DarkLight1337 #25922)
- [bugfix][deepseek] fix flashmla kernel selection (@youkaichao #25956)
- [Bug] Fix AttributeError: 'QKVParallelLinear' object has no attribute 'orig_dtype' (@yewentao256 #25958)
- [Doc] Improve MM Pooling model documentation (@DarkLight1337 #25966)
- [Docs] Add moe kernel features doc (@bnellnm #25297)
- OffloadingConnector: Fix GPU block tracking bug (@orozery #25856)
- [Llama4] [multimodal] Fix misplaced dtype cast of `cos_sin_cache` in `Llama4VisionRotaryEmbedding` (@cjackal #25889)
- [Bench] Add DeepSeekV32 to MoE benchmark (@jeejeelee #25962)
- [V1] [P/D] Add Support for KV Load Failure Recovery (@sdavidbd #19330)
- Add explicit pooling classes for the Transformers backend (@hmellor #25322)
- [Docs] Remove API Reference from search index (@hmellor #25949)
- [gpt-oss] use vLLM instead of openai types for streaming (@qandrew #25186)
- [Misc] Make EP kernels install script support uv (@LucasWilkinson #25785)
- [Model] MTP fallback to eager for DeepSeek v32 (@luccafong #25982)
- Update launch_bounds_utils.h for correct compile on Multiple Cuda Arch - PTXAS out of range Warning (@DrStone1971 #25843)
- [Log] Optimize Log for FP8MOE (@yewentao256 #25709)
- Fix INT8 quantization error on Blackwell GPUs (SM100+) (@certainly-param #25935)
- [MM] Add text-only mode for Qwen3-VL (@ywang96 #26000)
- [Bugfix] Fix `__syncwarp` on ROCM (@zhewenl #25996)
- [BugFix] Fix default kv-cache-dtype default for DeepseekV3.2 (@LucasWilkinson #25988)
- Update to Transformers `v4.56.2` (@hmellor #24638)
- [Misc]allow disable pynccl (@luccafong #25421)
- [Doc] updating torch.compile doc link (#25989)
- [BugFix][MM] Fix Nonetype error when video is cache in qwen2.5-omni-thinker (@wwl2755 #26004)
- [Misc] Factor out common `_apply_feature_select_strategy` (@DarkLight1337 #26003)
- [CI] Only capture a single CUDA graph size in CI by default (@hmellor #25951)
- [MISC] Fix misleading batch_size_capture_list when cuda_graph_sizes < 4 (@billishyahao #25829)
- [Benchmark] Finish documented v0.11.0 deprecation of --endpoint-type (@natoscott #26007)
- [Bugfix] Apply same sampling parameters for both `n=1` and `n>1` (@kmaehashi #26005)
- [NVIDIA] Blackwell Family (@johnnynunez #24673)
- Fix test_mamba_ssm_ssd.py due to missing _query_start_loc_to_chunk_indices_offsets (@hl475 #25995)
- [CI] Tweaks to GPT-OSS Eval (Blackwell) for stability (@mgoin #26030)
- [BugFix][DP/EP] Fix CUTLASS MLA hang under load (@LucasWilkinson #26026)
- [ROCm][Build] Add support for AMD Ryzen AI MAX / AI 300 Series (@hyoon1 #25908)
- [Bug] Fix Negative Cuda Memory Usage (@yewentao256 #25683)
- [BugFix] ChunkedLocalAttention is currently not CG compatible (@LucasWilkinson #26034)
- Support RL online quantization with torchao (@jerryzh168 #23014)
- [ROCm][Bugfix] Add missing parameter to ROCm backend (@gshtras #26029)
- [Misc] Make handling of SamplingParams clearer in n>1 case (@njhill #26032)
- Run:ai model streamer add GCS package support (@pwschuurman #24909)
- Update base image to 22.04 (jammy) (@huydhn #26065)
- Change size of single CUDA graph for CI to 4 (@tdoublep #26089)
- [FA/Chore] Bump vllm-flash-attention (@LucasWilkinson #25537)
- [Model] Use `merge_by_field_config` for MM models (A-C) (@DarkLight1337 #26073)
- [Model] Use `merge_by_field_config` for MM models (D-F) (@DarkLight1337 #26076)
- [Platform][CI] Added OOT platform interface e2e test that running on Ascend NPU (@leo-pony #25470)
- [Qwen][ROCm] Flash Attention Rotary Embeddings (@vllmellm #24642)
- [CI] Add Blackwell DeepSeek FP8 FlashInfer MoE tests (@mgoin #26040)
- [CI/Build] Replace `vllm.entrypoints.openai.api_server` entrypoint with `vllm serve` command (@DarkLight1337 #25967)
- [BugFix] Fix FI accuracy issue when used for MLA prefill (@LucasWilkinson #26063)
- [Small] Prevent bypassing media domain restriction via HTTP redirects (@huachenheli #26035)
- [Deepseek v3.2] Support indexer prefill chunking (@heheda12345 #25999)
- EAGLE 3: Fix preamble so that measured speedup over Eagle 1 becomes 32% instead of 5% on MTBench (@ekagra-ranjan #25916)
- [Mamba][KVCacheManager] Simplify kv cache manage logic for mamba + MTP (@heheda12345 #25119)
- [Perf] Fix and reapply move apply w8a8 block fp8 linear to class (@ElizaWszola #25696)
- Fix MTP with deepep_low_latency (@MatthewBonanni #25904)
- [Bugfix] Disable cascade attention with FlashInfer (@mgoin #26130)
- [Log] Optimize DeepGEMM Missing Log (@yewentao256 #26106)
- [Bug][Benchmark] Fix duplicate req in oversampling (@ekagra-ranjan #26140)
- [Attention] Move Backend enum into registry (@MatthewBonanni #25893)
- [CI/Build] Conditionally register cutlass_fp4_group_mm to fix building on Hopper (@mgoin #26138)
- [DeepSeek] Improve performance of DS MLA cache kernel (@MatthewBonanni #26132)
- [Bug]: Limit num_reqs in dummy_run when max_num_seqs is small (@benchislett #26144)
- [gpt-oss] disable tool server initialization if no tool in request (@qandrew #25790)
- [Build/CI] Revert back to Ubuntu 20.04, install python 3.12 with uv (@tlrmchlsmth #26103)
- [ROCm] [VL] [Bugfix] Fix vit flash attn dispatcher logic for ROCm (@tjtanaa #26104)
- [Bugfix] Fix import `gemm_afp4wfp4` failure on AMD (@zhewenl #26068)
- [Model] Use `merge_by_field_config` for MM models (G) (@DarkLight1337 #26117)
- `FusedMoE` support for the Transformers backend (@hmellor #22650)
- [BUG] Reorder model config creation (@ahao-anyscale #26124)
- [Misc] Remove typing.List (@varun-sundar-rabindranath #26150)
- [Input] Remove unused `prompt` field (@DarkLight1337 #26097)
- [Perf] Optimize `reshape_and_cache` CUDA Kernel (@ZJY0516 #25955)
- add(v1): RequestStatesStats to RequestOutput (@huijjj #24947)
- [Model] Use `merge_by_field_config` for MM models (InternVL family) (@DarkLight1337 #26153)
- [test utils] correct wrong typing (@yannicks1 #26159)
- [CI] Fix distributed hybrid tests in CI (@tdoublep #26155)
- [NIXL][Misc] Expose metrics from NIXL for logging to CLI (@NickLucche #25388)
- [openai] Fix missing tool usage check (system message) (@levunet #24768)
- [Multi Modal] Configurable MM Profiling (@wwl2755 #25631)
- [Doc] Fixed shape description for fused_batched_moe.py (@Egor-Krivov #25668)
- Quick fix for IMA with the Prefix Prefill kernel during graph capture (@SageMoore #25983)
- [Renderer] Move Processor out of AsyncLLM (@KKSK-DON #24138)
- Re-enable prefill of max model length (@yannicks1 #24446)
- [backends][short_conv] CUDA graph piecewise edits (@paulpak58 #24215)
- [Model] Supplement to PR 24862: Pass param prefix to LLMHead (@whx-sjtu #25805)
- [CI/Build] do not enforce precompilation on tpu ci tests (@sixiang-google #25992)
- [Model] Fixed stream generator for gpt-oss + spec-decoding (@astralord #26027)
- [Renderer] Move Processor out of LLMEngine (@DarkLight1337 #26165)
- Fix undefined symbol: cutlass_moe_mm_sm100 (@jasl #26098)
- [BugFix][QWEN-VL]fix wrong apply_rotary_emb_torch selection introduced by #24642 (@xuechendi #26123)
- Stop mergify from keeping stale PRs alive (@hmellor #26169)
- Avoid division by zero in cache DS MLA kernel (@MatthewBonanni #26174)
- Fix V1 engine serialization error with Ray distributed executor (@nrghosh #26148)
- [Quantization/NVFP4] Speed up TRTLLM NVFP4 MOE weight loading and fix K/V scale loading for MLA Attn (@pavanimajety #25968)
- [Perf] Remove hardcoded num_warps=1 (@chelsea0x3b #26183)
- [Refactor] Optimize FP8 MOE Backend Choice and Log (@yewentao256 #26044)
- [responsesAPI] add better error messaging for long prompts (@qandrew #25724)
- [Bugfix] Relax tokenizer regex for mixtral to include 'tokenizer.model' (@BowenBao #25964)
- [CI] Push multiarch manifests as nightly builds (@csahithi #25764)
- [Misc] Add penalties sampling parameters to serve tool (@southfreebird #25974)
- [BugFix] Fix de-functionalization pass for rotary_embedding (@angelayi #23953)
- [CI] Fix Pre-commit Mypy Error (@yewentao256 #26181)
- [GPTOSS][DP/EP][Marlin] Enable GPTOSS DP/EP using Marlin kernels (@varun-sundar-rabindranath #25488)
- Fix issue of using only the part of video frame [Nemotron Nano] (@BloodAxe #26186)
- [Bugfix] Fix qwen3 vl dummy data generation with overrides (@ywang96 #26193)
- [BugFix] Use async Mistral Tokenizer in Chat Completions (@bbrowning #26134)
- Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #25769)
- [cpu][perf] Accelerate unquantized-linear for AArch64 through oneDNN/ACL and weight prepack (@fadara01 #25948)
- [V1] [Hybrid] Mamba2 Automatic Prefix Caching (@s3woz #25752)
- Support expert parallel in Transformers backend (@hmellor #26162)
- [Model] Support nested structures for TensorSchema (@DarkLight1337 #26212)
- [Misc] Require `merge_by_field_config` argument (@DarkLight1337 #26214)
- [Misc] Remove unused `executor.apply_model` (@DarkLight1337 #26215)
- [CI Failure] fix_test_auto_prefix_cache_support (@hl475 #26053)
- Revert "Add batch invariant kernel override for FlashInfer backend [2/n]" (@DarkLight1337 #26220)
- Add Olmo 3 reasoning parser (@soldni #26054)
- [Core] Enable decode of context length equal to max model length (@yannicks1 #26168)
- [Bugfix] Fix `_reqs_to_process` leak on abort (@NickLucche #26012)
- [Model] CLIP Embedding Support (@DarkLight1337 #26010)
- Fix tensor device and dtype placement in Qwen2VL model (@yuafng #26219)
- [V1] [Hybrid] Remove code to override default CUDA graph configuration (@tdoublep #26226)
- [CPU] Refine batch reorder of CPU attention backend (@bigPYJ1151 #26096)
- [Frontend] Cache chat template kwargs resolution (@Isotr0py #26227)
- [Renderer] Clean up renderer code (@DarkLight1337 #26216)
- [Model] Use `merge_by_field_config` for MM models (H-L) (@DarkLight1337 #26230)
- [Easy] Add str repr for IterationStats (@22quinn #26232)
- [Bugfix] Allow `--skip-tokenizer-init` with `echo and return_token_ids` (@DarkLight1337 #26238)
- Add documentation for granite 4 tool calling (@maxdebayser #26175)
- [Perf][Easy] Early stop in request_block_hasher (@Jialin #26112)
- [Bugfix]: Assertion error when using FlashInfer backend (@simondanielsson #25933)
- [Bugfix] Always apply MM processor even when no MM items are passed (@DarkLight1337 #26240)
- [Bugfix][Hardware][RISC-V] Limit supported dtypes to float32 to avoid scheduler segfault (@ihb2032 #26228)
- [Refactor][Kernel] support loading kernel from other place (@ILikeIneine #25823)
- Convert formatting to use `ruff` instead of `yapf`+`isort` (@hmellor #26247)
- Remove all references to `yapf` as it's no longer used (@hmellor #26251)
- Remove all cases of `fmt: on/off` (@hmellor #26253)
- fix(tests): Resolve late binding of loop variable in assert message lambda (@ihb2032 #26249)
- Fix per file ruff ignores related to typing (@hmellor #26254)
- Update `ruff` pre-commit hooks version (@hmellor #26255)
- [CI] fix mamba kernel test (@ZJY0516 #26250)
- [NVIDIA] flashinfer TRTLLM attention prefill token limit (@jasonlizhengjian #25998)
- Fix per file ruff ignores related to simplification (@hmellor #26259)
- [CI] Add Blackwell LM Eval Small Models test to nightly (@mgoin #26052)
- [DOC] Update production-stack.md (@elieserr #26177)
- [CI] Add comment about the single cudagraph capture size that is used (@tdoublep #26252)
- [V1] [Hybrid] Some additional clean-up in Mamba2 prefix caching (@tdoublep #26222)
- [Doc] Edited minor typo (@orangeng #26266)
- [MISC] Add heheda12345 to CODEOWNERS of vllm/config/cache.py (@heheda12345 #26270)
- [CI][gpt-oss] Enable python tool tests in CI (@wuhang2014 #24315)
- Fix per file ruff ignores related to line length (@hmellor #26262)
- Bump actions/stale from 10.0.0 to 10.1.0 (@dependabot[bot] #26272)
- [Benchmarking] Add disable_shuffle option for dataset loading (@ymoslem #26258)
- [Misc] Clean up unnecessary E501 ignore (@ywang96 #26274)
- [Docs] Edit HF Inference Endpoints documentation (@ariG23498 #26275)
- [Doc] add KAITO to integrations (@abhisheksheth28 #25521)
- [Frontend] Consolidate tokenizer init code (@DarkLight1337 #26276)
- [Model] Use `merge_by_field_config` for MM models (Llava family) (@DarkLight1337 #26280)
- Support expert parallel load balancing in Transformers backend (@hmellor #26287)
- [Bugfix] Fix mrope in Transformers Backend (@zucchini-nlp #26087)
- Fix `DotsOCR` tensor type (@what-in-the-nim #26281)
- [Model] EVS support for nano_nemotron_vl (@tomeras91 #26269)
- [Attention] Remove unused reorder_batch method (@MatthewBonanni #24463)
- [Tests] conftest: Extending VllmRunner and HfRunner to accept token_ids as input (@yannicks1 #26295)
- [CI Bugfix] Make sure TRTLLM attention is available in test_blackwell_moe (@mgoin #26188)
- Support llama3 eagle3 head with llama4 verifier (@rahul-tuli #25961)
- [Misc] auto_tune: kill specific vllm process (@karan #26304)
- [Bugfix][Spec Decode] Fix wrong valid_mask for padded speculation when chunked prefill occurs (@seven-mile #26231)
- Add bias handling to CPUFusedMOE kernel (@cfRod #26289)
- [Bugfix] Fix gemma3 with transformers backend (@zucchini-nlp #23178)
- [Benchmark] Enable MM Embedding benchmarks (@DarkLight1337 #26310)
- [Docs] Fix broken table in moe_kernel_features doc (@varun-sundar-rabindranath #26314)
- [BugFix] Pad input buffers in _dummy_run (@varun-sundar-rabindranath #26209)
- [Bugfix] Allow skipping MoE in NVFP4 (fix for MTP) (@benchislett #25987)
- [ROCm] Split AITER unified attention into its own backend (@gshtras #25507)
- [Perf] Add decode full-graph support to FlashInfer-MLA backend (@benchislett #26313)
- [Misc] Define EP kernel arch list in Dockerfile (@simon-mo #25635)
- [Docs][DBO] Add initial doc that describes the DBO implementation (@SageMoore #26024)
- [Core] Simplify the Dp padding/should ubatch coordination logic (@SageMoore #25768)
- [UX] Support nested dicts in hf_overrides (@mgoin #25727)
- [BUG] Fix file parsing for load_format runai_streamer_sharded (@ahao-anyscale #26324)
- [Model] Define merge_by_field_config MM interface (U-Z) (@ayushsatyam146 #26261)
- [Deprecation] Deprecate `LLM.set_tokenizer` (@DarkLight1337 #26333)
- [responsesAPI][bugfix] serialize harmony messages (@qandrew #26185)
- [Model] Define merge_by_field_config MM interface (R-T) (@ayushsatyam146 #26260)
- [BugFix] Update KV block hash type from BlockHash to ExternalBlockHash in kv_events_subscriber - #26264 (@atalhens #26265)
- [V0 Deprecation] Remove `VLLM_USE_V1` from docs and scripts (@DarkLight1337 #26336)
- Optimize KV cache distribution for asymmetric pipeline parallelism (@gholmes829 #25164)
- Add topk logits torch op for DS3.2. (@dcampora #25945)
- Add TRL example notebook to RLHF docs (@sergiopaniego #26346)
- [Docs] add docs for cuda graph v1 (@fhl2000 #24374)
- [Model] Use `merge_by_field_config` for MM models (Ovis family) (@Isotr0py #26308)
- [Feature][OCP MX] Support mxfp6 and mixed mxfp6-mxfp4 (@fxmarty-amd #21166)
- [Model] Add support for ModernBertForTokenClassification (@antrec #26340)
- [Misc] Move `LRUCache` into its own file (@DarkLight1337 #26342)
- [V0 Deprecation] Remove `VLLM_USE_V1` from tests (@DarkLight1337 #26341)
- [Model] Lfm2Moe (@paulpak58 #26344)
- [ci] Rename `test_mxfp4_moe.py` to `test_ocp_mx_moe.py` (@fxmarty-amd #26364)
- [CI] Add Qwen3 MoE NVFP4 to Blackwell lm-eval (@mgoin #26316)
- [deepseek] add EP8 FusedMOE config for H200 and B200 (@heheda12345 #26331)
- [Bug] Fix Shape Validation for Fallback while Enabling E8M0 for DeepGEMM (@yewentao256 #26322)
- [Bugfix] Add missing sink tensor into flash attn cascade attn implementation (@plliao #26325)
- [Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26113)
- [V1] Logit processors for rejection sampler (@southfreebird #19482)
- [Spec Decode] Enable efficient speculative decoding with FlashInfer-MLA (@benchislett #25984)
- [TPU] update TPU benchmark threshold (@jcyang43 #25713)
- Add more libraries to rlhf.md (@mgoin #26374)
- [Bugfix] Fix MTP+FlashInfer crash when trtllm kernels are available but disabled (@benchislett #26361)
- Revert #24446 and #26168 (@tdoublep #26332)
- [Misc] Clean up cruft from previous FlashMLA sparse implementation (@LucasWilkinson #26125)
- [torchao] safetensors integration (@liangel-02 #25969)
- Add SwigluOAI implementation for CPUFusedMOE (@isharif168 #26347)
- [Core] Simplify setting new_token_ids in CachedRequestData (@njhill #26388)
- fix(v1/kv_cache): resolve async KV transfer bug in cascade attention (@ayushsatyam146 #23485)
- Add gather_indexer_k_quant_cache kernel (@Barry-Delaney #25931)
- [Bugfix] Incorrect MM data format in `vllm bench throughput` (@DarkLight1337 #26395)
- fix[DP][v1]: Prevent hangs from mismatched worker configurations (@ayushsatyam146 #26218)
- [TPU] Rename tpu_commons to tpu_inference (@utkarshsharma1 #26279)
- [Feature] Enable E8M0 by Default on Hopper for DeepGEMM, 5% E2E throughput improvement (@yewentao256 #26197)
- [Misc] add usedforsecurity=False in md5 hash call (@dtrifiro #26357)
- [Model] Allow passing custom number of max tiles to Nano 2 VL (@BloodAxe #26403)
- [Docs] Have mergify leave a comment with the docs preview link (@hmellor #26412)
- [CI] Pooling models mteb test disable enforce_eager (@noooop #26408)
- [Benchmarks] Add support for Qwen 3 VL MoE tuning (@lgeiger #26419)
- Tidy `vllm/config/__init__.py` to only add classes and functions (@hmellor #26405)
- [NIXL][non-cuda] Add install script for nixl with non-cuda ucx (@xuechendi #25959)
- [Refactor] Refactor FP8 & INT8 Quant Folder inside `w8a8` (@yewentao256 #25293)
- [CI Failure] Fix pre-commit issue for install_nixl_from_source_ubuntu.py (@mgoin #26424)
- [Bugfix] Fix `vllm bench ...` on CPU-only head nodes (@Aydin-ab #25283)
- [Bug] Fix DeepGEMM Attention Test (@yewentao256 #26423)
- [Benchmarks] Fix imports in FP8 tuning script (@lgeiger #26407)
- [Bug] Fix Test in Batch Invariant (@yewentao256 #26128)
- Remove Python 3.9 support ahead of PyTorch 2.9 in v0.11.1 (@hmellor #26416)
- [Feature] Change cache.py with pydantic validation (@vrdn-23 #26390)
- [Attention] Implement universal BACKEND_MAP (@MatthewBonanni #25900)
- [Bugfix][Flashinfer] fix VLLM_USE_TRTLLM_ATTENTION issue for models with diff hyperparameters (@elvischenv #25924)
- [BugFix] Fix failing test quantization/test_compressed_tensors.py::test_compressed_tensors_fp8_block_enabled (@morrison-turnansky #26436)
- [Kernel] Centralize platform kernel import in `current_platform.import_kernels` (@NickLucche #26286)
- [Models] Improve iteration over layers (@lgeiger #26425)
- [Bugfix] Respect min_tokens in scheduler stop check (@elaineyz #26317)
- [Kernels] Modular kernel refactor (@bnellnm #24812)
- [Attention] Register FLASHMLA_SPARSE (@MatthewBonanni #26441)
- Separate MLAAttention class from Attention (@therealnaveenkamal #25103)
- [Misc] Redact ray runtime env before logging (@ruisearch42 #26302)
- [Bugfix] Set the minimum python version for gpt-oss (@jeejeelee #26392)
- [Minor] Change warning->warning_once in preprocess (@zhuohan123 #26455)
- [Bugfix] Catch and log invalid token ids in detokenizer #2 (@njhill #26445)
- [Bugfix] Incorrect another MM data format in vllm bench throughput (@huydhn #26462)
- [Hardware][AMD] Enable FlexAttention backend on ROCm (@mawong-amd #26439)
- [MM][Doc] Add documentation for configurable mm profiling (@wwl2755 #26200)
- [Core][KVConnector] Propagate all tokens on resumed preemptions (@QierLi #24926)
- [Hybrid]: Decouple Kernel Block Size from KV Page Size (@zhiyuan1i #24486)
- [CI/Build] Fix model nightly tests (@DarkLight1337 #26466)
- [Core] Relax the LoRA max rank (@jeejeelee #26461)
- Update Dockerfile and install runai-model-streamer[gcs] package (@pwschuurman #26464)
- Bump Flashinfer to v0.4.0 (@elvischenv #26326)
- [Model] Gemma3: Fix GGUF loading and quantization (@lucianommartins #26189)
- Enable `RMSNorm` substitution for Transformers backend (@hmellor #26353)
- Add: Support for multiple hidden layers in Eagle3 (@rahul-tuli #26164)
- [torchao] Add support for ModuleFqnToConfig using regex (@jerryzh168 #26001)
- [Misc] Misc code simplifications (@njhill #26450)
- [doc] add Volcengine as a compute sponsor (@youkaichao #26477)
- [Feature] Use pydantic validation in lora.py and load.py configs (@simondanielsson #26413)
- [Misc] Upgrade more code to Python 3.10 (@DarkLight1337 #26463)
- [Bugfix] Fix SHM cache initialization (@DarkLight1337 #26427)
- [Models][Qwen3VL] Optimise `_validate_and_reshape_mm_tensor` (@lgeiger #26426)
- [Bugfix] Move current_platform import to avoid python import cache. (@iwzbi #16601)
- [V0 deprecation] Remove `QKVCrossParallelLinear` implementation (@Isotr0py #26475)
- [Feature] Use pydantic validation in parallel.py config (@simondanielsson #26417)
- Revert #26113 "[Frontend] CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops" (@ZJY0516 #26472)
- Upgrade Pydantic to v2.12.0 and remove hack for Python 3.13 (@hmellor #26481)
- [Models][Qwen] Replace `pad` with `cat` for better performance (@lgeiger #26486)
- [Attention][DCP] Support DCP with query length > 1 (MTP) with FA3 (@minosfuture #25049)
- [Model] Apply shared experts overlap optimization to all models with shared experts (@bnellnm #26145)
- [BUGFIX] Add cu_tokens_across_sp to DPMetadata (@SageMoore #26457)
- [Bugfix] Enable padded FP4 quantization (@roikoren755 #25947)
- [Bugfix] Disable moe inplace for torch >= 2.9 (@bnellnm #26497)
- [Flashinfer][gpt-oss] Support FP8-qkv Flashinfer TRTLLM Sinks Attention (@elvischenv #25674)
- [Core] Remove unused `prev_sampled_token_ids_invalid_indices` input batch field (@njhill #26514)
- [UX] Add FlashInfer as default CUDA dependency (@mgoin #26443)
- [Bugfix] Fix CUDA graph selection bug in FlashInfer at high concurrency (@benchislett #26499)
- [Bug] Fix modular_kernel: ZeroDivisionError: integer division or modulo by zero (@yewentao256 #26528)
- [CI] Fix Pre-commit Issue Cannot determine type of "rank" and "world_size" (@yewentao256 #26448)
- Refactor MistralTokenizer (@juliendenize #26358)
- [DP][ray] Support different VLLM_RAY_DP_PACK_STRATEGY (@ruisearch42 #23849)
- [Core] Small simplification in `GPUModelRunner._update_states()` (@njhill #26508)
- [Chore]: One pythonic tool parser test uses the wrong parser (@bbrowning #26515)
- [Spec-Decode] Support piecewise cudagraphs for Eagle head (@LucasWilkinson #25109)
- fix test_simple_inductor_graph_partition (@BoyuanFeng #26522)
- [deepseek] kernel block size for UniformTypeKVCacheSpecs (@heheda12345 #26559)
- [Metrics] Log multi-modal cache stats and fix reset (@DarkLight1337 #26285)
- [GPT-OSS] Add support for arrays at tool message content (@luis5tb #25593)
- Remove LoRA bias support (@ashwin-phadke #25807)
- [CI] fix ruff format (@chaunceyjiang #26579)
- [bugfix][DCP] fix block_size of hash in DCP prefix caching (@heheda12345 #26296)
- [NIXL] Ignore abort on already-finished request (@markmc #25067)
- [Bugfix] Convert untraceable GroupShape to list for AMD impl (@Lucaskabela #26535)
- [BugFix] Fix noop elimination edge case (@andylolu2 #26394)
- [CI] fix test_run_batch.py::test_completions - AssertionError (@chaunceyjiang #26578)
- [BugFix][torch.compile] Fix fused_scaled_matmul_reduce_scatter signature for PyTorch 2.8 (@jasonlizhengjian #26038)
- Added test_top_k_per_row to test-pipeline.yaml. (@dcampora #26569)
- [Bugfix] Make DP padding optional in coordinate_batch_across_dp (@SageMoore #26375)
- Silu v2 (@elvircrn #25074)
- [Metrics] Add test for multi-modal cache stats logging (@markmc #26588)
- [torch.compile] Make inductor partition rules respect splitting_ops #25691 (@baonudesifeizhai #25845)
- [Bugfix] fixed top_logprobs: -1 does not appear to work as intended (@chaunceyjiang #26470)
- [Model][Qwen3VL] Compute `cu_seqlens` on CPU to remove (@lgeiger #26496)
- [Model] Add FlexOlmo model implementation (@2015aroras #24923)
- [Transform] [Quantization] Add QuTLASS support to vLLM (@LopezCastroRoberto #24440)
- Add Qwen3-Omni moe thinker (@wangxiongts #25550)
- Update `pre-commit` hook versions (@hmellor #26591)
- Update CUDA architecture list in build pipeline for 12.9.1 wheels (@wseaton #26592)
- Fix some typing issues found by `mypy==1.18.2` (@hmellor #26596)
- [BUG] Qwen3-next MTP. Fix attn metadata build bug (@vadiklyutiy #26564)
- [BugFix] Fix async scheduling + request preemption (@njhill #26385)
- Cache the environment variable check for batch invariance (@bwasti #26510)
- AOT Compilation for torch.compile (Bundled) (@zhxchen17 #24274)
- [BugFix] Make penalties and bad_words work with async scheduling (@njhill #26467)
- [Frontend] Improve the performance of `is_reasoning_end` (@chaunceyjiang #25735)
- [CI/Build] Fix ppc64le CPU build and tests (@npanpaliya #22443)
- [XPU] Upgrade NIXL to remove CUDA dependency (@zhenwei-intel #26570)
- [MM] Move Qwen3Omni MRoPE impl to model file (@ywang96 #26608)
- [Bugfix][Multi Modal] Fix incorrect Molmo image processing (@sangho-vision #26563)
- [Refactor]: Use M-RoPE interface directly while defining model class instead of maintaining model specific M-RoPE implementation in mrope.py (@divyanshsinghvi #24172)
- fix(nix): Allow local oneDNN path to fix vLLM CPU build failure (@ihb2032 #26401)
- Add EAGLE-3 Speculative Decoding Support for Qwen3 MoE (@rahul-tuli #26485)
- [CPU] fix the issue when the node is '-' cause json decode error. (@muzian666 #26562)
- [Refactor]Reduce duplicate code in serving_chat (@chaunceyjiang #26627)
- [compile] Add patched_fused_scaled_matmul_reduce_scatter (@angelayi #26604)
- [Bugfix][Qwen3VL] fix deepstack in qwen3vl (@JJJYmmm #26626)
- [Bugfix] Fix qwen-moe packed_modules_mapping (@jeejeelee #26634)
- [Benchmark] Support Infinity API (@DarkLight1337 #26641)
- CP: make correct_attn_out robust to 4‑D views and fix Triton arg binding (@hl475 #26509)
- [compile] Fix inductor partition config (@angelayi #26645)
- [EPLB] Support ernie4.5-moe (@HsChen-sys #22100)
- Add @noooop to codeowner for pooling models (@noooop #26652)
- [PERF] [Qwen3-next] Speed up gated RMSNorm (@vadiklyutiy #26207)
- [MISC] Rename the torch profiler filename as instance_id+rank_id for merging the Profiler results of each Rank (@noooop #25867)
- [Bugfix][CI/Build] Fix failing Mteb CI (@Isotr0py #26638)
- [Bugfix][DCP] Set default CUDAGraphMode to PIECEWISE for DCP (@FENP #26574)
- [TEST][BUG FIX] Fix DP GPU_ID issue (@xuechendi #26442)
- Update `Optional[x]` -> `x | None` and `Union[x, y]` to `x | y` (@hmellor #26633)
- [Feature] Add support for naver/splade-v3 (BERT-based sparse embedding model) (@gjgjos #26339)
- [Models][Qwen3VL] Speedup `fast_pos_embed_interpolate` (@lgeiger #26647)
- [easy] fix pre commit error on trunk (@hl475 #26665)
- [CI/Build] Add tool to build vllm-tpu wheel (@mgoin #19165)
- [Misc] cache result of disable_inplace (@bnellnm #26666)
- [Bugfix][Core]Fix block table out-of-range issue in priority scheduling (@quanliu1991 #26661)
- [FIX] Throwing an exception when the model does not support pool tasks (#25840) (@yyzxw #25855)
- docs: wrong command in structured_outputs README (@yihong0618 #26677)
- [Model] Fix Skywork R1V mlp (@jeejeelee #26673)
- [Model] Add reasoning_parser and tool_parser for Ernie45 thinking (@CSWYF3634076 #25027)
- Ignore large reformatting PRs in `git blame` (@hmellor #26690)
- [Model][0/N] Improve all pooling task | clean up (@noooop #25817)
- [ResponseAPI] Simplify input/output message serialization (@Jialin #26620)
- [Bugfix] Fix out of bound index issue for Jina-embedding-v3 RoPE with cuda graph (@Isotr0py #26687)
- [unrevert] Add batch invariant kernel override for FlashInfer backend [2/n] (@bwasti #26373)
- [Hardware][CPU] Disable torch.compile for RISC-V to prevent APIError (@ihb2032 #26693)
- [FEATURE]: Use pydantic validation in `multimodal.py` config (@andycandy #26629)
- [UX] Speedup DeepGEMM warmup with heuristics (@mgoin #25619)
- [P/D] [NixlConnector] kv load recovery integration (@wseaton #26171)
- [Misc] Separate prompt logging to debug (@aitsvet #26713)
- [CI/Build] upgrade compressed-tensors to 0.12.2 to address LGPLv3 (@csy1204 #26501)
- [Bugfix][Rocm] fix qr error when different inp shape (@haoyangli-amd #25892)
- [Bugfix][Speculative Decoding] Extend Eagle quantization config fix to llama_eagle.py (@rahul-tuli #26590)
- [Model] Use merge_by_field_config for MM models (M-N) (@DarkLight1337 #26710)
- [Log] Optimize Startup Log (@yewentao256 #26601)
- [CI][Release][Arm64]: Build arm64 release for gpu arch 8.9 (@cyb70289 #26698)
- [Quantization] [Performance] Enable Marlin GEMM kernels for the calibration-free RTN-based quantization (@sakogan #26051)
- [Frontend][1/N] Improve all pooling task | Support FP16 Embedding Base64 (Still uses fp32 by default). (@noooop #26414)
- [CI] Fix mypy for `vllm/distributed` (@yewentao256 #26593)
- [CI Perf]Prune Tests in kernel/mamba (@kfhfar #26538)
- [Bug] Fix Assertion error DeepEP/csrc/kernels/intranode.cu:928: 'false and Unsupported type' (@yewentao256 #26532)
- [FrontEnd] UNREVERT CompilationConfig overhaul (#20283): deprecate use_inductor in favor of backend, simplify custom_ops (@morrison-turnansky #26502)
- Pruning kernel Core Tests (@kfhfar #26727)
- [ResponseAPI] Further polish message serialization and unit tests (@Jialin #26728)
- Add tests for chunked prefill and prefix cache with causal pooling models (@maxdebayser #26526)
- [Misc][DP] support customized aggregated logger for dp (@luccafong #24354)
- [UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (@mgoin #26732)
- [compile] Enable sequence parallelism for full cuda graph without specifying compile sizes (@angelayi #26681)
- [Easy] Fix env type check errors from VLLM_DEBUG_LOG_API_SERVER_RESPONSE (@Jialin #26742)
- [build][torch.compile] upgrade depyf version (@youkaichao #26702)
- [torch.compile] Unwrap fused_marlin_moe custom op (@varun-sundar-rabindranath #26739)
- [Feature][Quantization] auto_round format add support for regex (@n1ck-guo #24024)
- Add support for the /rerank endpoint in vllm bench serve (@maxdebayser #26602)
- [Docs] Add a start tag to build.inc.md (@windsonsea #26747)
- Fix lora tests failure in TPU CI due to the removal of LoRA bias (@vanbasten23 #26723)
- [CI] [ROCm] Automate CC list for ROCm related issue (@vllmellm #26753)
- Adding the test-amd.yaml for test definitions for the AMD backend. (alternative PR) (@Alexei-V-Ivanov-AMD #26718)
- scheduler.py: Update the name of the default scheduler. (@ryanli #26758)
- [Model][Bugfix]fix ernie45 load failed due to ernie45 eplb code (@CSWYF3634076 #26684)
- [CI/Build] Use 127.0.0.1 instead of localhost in utils (@yeqcharlotte #26750)
- fix(frontend): always include usage, when configured to do so (@max-wittig #20983)
- [Plugin] Make plugin group clear (@wangxiyuan #26757)
- [Bugfix] Standardize merging multimodal embeddings (@DarkLight1337 #26771)
- [Model] Use merge_by_field_config for MM models (O-P) (@DarkLight1337 #26776)
- [NIXL][HeteroTP]Enable KV transfer from HND prefill to NHD decode (@xuechendi #26556)
- [Chore] Use `max_transformers_version` for Qwen-VL test (@DarkLight1337 #26792)
- Don't allow `typos` to fix by default (@hmellor #26785)
- [Doc] ruff format some Python examples (@DarkLight1337 #26767)
- [CI] Fix test_tool_id_kimi_k2 (@chaunceyjiang #26787)
- [Chore] Remove `SupportsV0Only` interface and update supported models docs (@DarkLight1337 #26783)
- [Feature] Change vllm.py with pydantic validation (@VladOS95-cyber #26726)
- [CI/Build] Cleanup LoRA test (@jeejeelee #26752)
- [DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttention (@FENP #24864)
- Adjusted the model order of the model registration file (@princepride #26798)
- use combo kernel to fuse qk-norm and qk-rope (@BoyuanFeng #26682)
- [issues template] Encourage the author implement their own ideas (@noooop #26671)
- [KVConnector][Metrics] Aggregate scheduler-side KVConnectorStats (@QierLi #26046)
- [Feature][Responses API] Stream Function Call - harmony (@chaunceyjiang #24317)
- Revert "[issues template] Encourage the author implement their own ideas" (@noooop #26814)
- [Config] Remove Unused Environment Variable `VLLM_DISABLE_PAD_FOR_CUDAGRAPH` (@yewentao256 #26743)
- Update coveragerc and add codecov.yml for path fixes (@rzabarazesh #26435)
- [CI] Raise VLLM_MAX_SIZE_MB to 500 due to failing Build wheel - CUDA 12.9 (@mgoin #26722)
- [Kernel][MoE] Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on NVidia B200 (@zklapow #26818)
- [CI Failure] Fix tests with missing TinyLlama-1.1B-Chat-v1.0-FP8-e2e (@mgoin #26816)
- llama4_vision_rope: add HIP override to accept (q, k) and avoid (positions, q, k) mismatch (@hl475 #26790)
- [Attention][Spec Decode] FlashMLA spec decode support (@MatthewBonanni #26541)
- [Core] Reuse empty block lists whenever possible in KVCacheBlocks to mitigate GC costs (@Jialin #24964)
- Notice for deprecation of AutoAWQ (@HDCharles #26820)
- [Perf] Cache vllm.env.getattr result to avoid recomputation (@Jialin #26146)
- Added MoE configs for llama 4, H200 device with tp=4/8 tuning (@Dhruvilbhatt #26837)
- fix: response_format for completion (@Nan2018 #23212)
- [Minor] Group async_scheduling related fields in model runner init (@njhill #26736)
- remove attn output view kernel (@BoyuanFeng #26680)
- [Core] Streamline some structured output related code (@njhill #26737)
- [CI Failure] Fix torchao dep failure for Quantization Test (@mgoin #26824)
- [frontend][gptoss] Add per turn stats into Harmony Context (@lacora #25061)
- [WideEP][P/D] Add usage stats for DP+EP and KV Connector (@tlrmchlsmth #26836)
- [torch.compile] Fix tests for torch==2.9 inductor partition (@ProExpertProg #26116)
- [Core][Easy] Use envs.getattr for all Unify to environment variable access (@Jialin #26810)
- [Bugfix]fix Qwen3 xml tool parser (@Zhikaiiii #26345)
- [BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not None' in get_kv_connector_stats (@xuechendi #26851)
- Disable FlashInfer sampler by default (@mgoin #26859)
- [Frontend][torch.compile] CompilationConfig Overhaul (#20283): name change compilation level to compilation mode, deprecation compilation level (@morrison-turnansky #26355)
- [Bugfix] Fixes prefix-repetition benchmark script (@kouroshHakha #26828)
- [Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (@taohui #25589)
- [Docs] Move build.inc into arm.inc (@windsonsea #26862)
- [CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (@izhuhaoran #26773)
- [Feature] default --extra-body param to disable thinking in vllm bench serve (@lengrongfu #26784)
- [BugFix] Patch inductor partitioning logic (@angelayi #26735)
- [Bugfix] Fix qwen3-omni audio truncation issue (@Isotr0py #26815)
- [Graph Partition] pass tests for decorator (@BoyuanFeng #26831)
- [Bugfix][Multi Modal] Fix incorrect Molmo token processing (@sangho-vision #26873)
- [DSA][MLA] Tiny refactor on DeepSeek to make it reusable for different backends (@MengqingCao #26656)
- [Misc] Use helper function to generate dummy messages in OpenAI MM tests (@DarkLight1337 #26875)
- [bugfix] Lazy import cv2 (@angelayi #26869)
- [Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (@zyongye #26456)
- [CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (@zhewenl #21810)
- [CI] Fix mypy for `vllm/executor` (@yewentao256 #26845)
- [Doc] ruff format remaining Python examples (@DarkLight1337 #26795)
- [doc] add Context Parallel Deployment doc (@youkaichao #26877)
- [Misc] Update TritonLanguagePlaceholder to have attributes that are used by Flash Linear Attention ops. (@madongfly #26853)
- [Fix] Remove divisibility requirement between num_kv_heads and tp_size in bailing_moe (@ant-yy #26876)
- [Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (@Jialin #26842)
- [Platform] allow platform to init dp group (@wangxiyuan #22243)
- [Lora]Load tuned multi-lora kernel configs from json files (@li2haipeng #26319)
- [Model][2/N] Improve all pooling task | Support multi-vector retrieval (@noooop #25370)
- [Misc] Remove `isort` and `yapf` ignores (@DarkLight1337 #26888)
- [Misc] rename torch_dtype to dtype (@wangxiyuan #26695)
- chore: remove unused marker (@max-wittig #26890)
- [BugFix] Patch inductor memory plan logic (@BoyuanFeng #26878)
- [Chore] Separate out `vllm.utils.func` (@DarkLight1337 #26904)
- [Chore] Separate out `vllm.utils.async_utils` (@DarkLight1337 #26913)
- Lower severity of log when model info cache misses due to exception (@hmellor #26917)
- Olmo 3 tool parser and tests (@pdasigi #26143)
- [Feature]: Use pydantic validation in observability.py config (@cern1710 #26637)
- [ModelOpt] Remove NVFP4 MoE K%16==0 constraint (@XiaobingSuper #26891)
- [Chore] Clean up CODEOWNERS (@WoosukKwon #26923)
- [NVIDIA] Add support for cudnn fp4 gemm via flashinfer (@kaixih #26107)
- Vectorize RMS norm variance using vectorize_read_with_alignment (@bbeckca #26234)
- support flashinfer_fp4 moe for 5090 gpu (@XiaobingSuper #26669)
- [Bug] Temporally Disable `VLLM_ALLREDUCE_USE_SYMM_MEM` by Default (@yewentao256 #26925)
- Move query quantization to attention layer for Flashinfer & Triton. (@adabeyta #26534)
- Adjusting AMD test composition 2025-10-14 (@Alexei-V-Ivanov-AMD #26852)
- [Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (@felixzhu555 #26887)
- [Bugfix] reasoning_parser parameter handling in run_batch.py (@inc-jeong #26225)
- [ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (@kliuae #24097)
- [CI] Enable Blackwell Llama4 MoE tests (@mgoin #26731)
- [BUG] Allow runai_streamer_sharded in config check (@ahao-anyscale #26958)
- [bugfix] Fix SP + PP without specifying compile size (@angelayi #26955)
- [BugFix] Work around graph partition x torch.compile cache issue (@zou3519 #26956)
- [DOC][XPU]update feature parity with Intel GPU (@xuechendi #26954)
- [Chore] Rename `utils` submodules (@DarkLight1337 #26920)
- [PERF] Qwen3-next MTP speedup (change bool mask indexing to index_select / index_copy to reduce d2h) (@vadiklyutiy #26437)
- Deepseek-v3 Batch Invariant on 8xH100 (@bwasti #26609)
- [CI/Build] Update expected beam search output for Phi3V (@DarkLight1337 #26978)
- [Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (@Akashcodes732 #26987)
- [CI/Build] Fix AMD import failures in CI (@zhewenl #26841)
- [Benchmark] Use truncation by default for pooling benchmarks (@DarkLight1337 #26992)
- [Chore] Separate out `vllm.utils.collections` (@DarkLight1337 #26990)
- [Model][Bugfix] fix ernie45 vl run failed from shared experts optimization (@CSWYF3634076 #26885)
- Cleanup code after Python 3.10 upgrade (@lgeiger #26520)
- [MISC] fix import violations for re and triton modules (@llsj14 #26654)
- [Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (@bogdanminko #27008)
- [Benchmark] Show E2EL by default for pooling models (@DarkLight1337 #27014)
- [Attention] Tune CUTLASS MLA num_splits (@MatthewBonanni #26846)
- [NIXL] Improve request_finished() debug logs (@markmc #25665)
- [docs] standardize Hugging Face env var to `HF_TOKEN` (deprecates `HUGGING_FACE_HUB_TOKEN`) (@yankay #27020)
- [CI] Replace large models with tiny alternatives in tests (@tahsintunan #24057)
- [Feature] Add process_weights_after_loading to AttentionImpl (@lengrongfu #26870)
- [Model] Fix Qwen3VL mm mapping (@jeejeelee #27027)
- Fix Qwen2.5 VL image grid docstring (@skyloevil #27033)
- Support `set` in the CLI generation (@hmellor #27031)
- [gpt-oss][1/N] EZ: refactor serving_responses for modularity (@qandrew #26948)
- Support block size of 256 used by Intel HPU (@mandy-li #26883)
- [Compressed Tensors] Always clone output for compile robustness (@kylesayrs #26849)
- Adding Warmup to Benchmark Serving (@kimbochen #26943)
- [Bug] Fix batch invariant test `has` to `is` (@yewentao256 #27032)
- [GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kernels (@varun-sundar-rabindranath #25997)
- [Feature] Migrate DeepGEMM API from `get_m_alignment_for_contiguous_layout` to `get_mk_alignment_for_contiguous_layout` (@yewentao256 #26935)
- [CI] Prune Quantization Tests and skip compilation (@mgoin #27038)
- [Bug] Add Assertion for `random-input-len`/`random-output-len` (@yewentao256 #26834)
- [small][batch invariance] Rename the env and internal flags to simplify usage (@bwasti #26855)
- Refactor Transformers backend to use mixins (@hmellor #26906)
- [NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughput kernel (@jiahanc #26714)
- [torch.compile] Passing only necessary compilation config to inductor pass config (@luccafong #27041)
- [Chore] Separate out `vllm.utils.import_utils` (@DarkLight1337 #27022)
- [torch.compile] fix simple inductor graph partition test (@BoyuanFeng #27050)
- Remove unused imports (@lgeiger #26972)
- vllm bench serve shows num of failed requests (@tomasruizt #26478)
- [Docs] Reduce custom syntax used in docs (@hmellor #27009)
- [Perf] Exploit out-of-band buffers in shm_broadcast (@njhill #26961)
- disable graph partition in custom op (@BoyuanFeng #26952)
- [Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually a bfloat16 (@sighingnow #27030)
- [Core] Change `execute_model_with_error_logging()` to be a ctx manager (@njhill #27060)
- [Bugfix] Fix ReplicatedLinearWithLoRA (@jeejeelee #27065)
- [Kernel] Lazy import FlashInfer (@jeejeelee #26977)
- [CI/Build] Update Llama4 eval yaml (@zhewenl #27070)
- [Model] Always use Transformers backend for PaliGemma and Gemma3-MM (@DarkLight1337 #26715)
- [Model] Add support for LightOnOCR (@staghado #26916)
- [CI/Build] Update compressed tensor test path to fix CPU CI (@bigPYJ1151 #27068)
- [Kernel][Performance] Fuse float cast and renormalize to topk softmax kernel (@izhuhaoran #26717)
- [CI] fix docs build failed (@chaunceyjiang #27082)
- Update troubleshooting.md and remind VLLM_TRACE_FUNCTION usage (@Prowindy #27069)
- [VLM][Refactor] Remove useless func `get_input_positions` in `MRotaryEmbedding` (@MengqingCao #27088)
- [Docs] Replace all explicit anchors with real links (@hmellor #27087)
- [Docs] Replace `rst` style double-backtick with `md` single-backtick (@hmellor #27091)
- [Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapping (@jeejeelee #27096)
- [Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350 and MI355 (@rkarhila-amd #25586)
- Fix incorrect docstring for stop_profile() method (@hyongtao-code #27101)
- [torch.compile] Enable attention and allreduce fusion without custom ops enabled (@ProExpertProg #24604)
- [CI] Nixl integration tests (@NickLucche #27010)
- [Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (@patrickvonplaten #26367)
- [bugfix] Qwen3-VL fix video incorrect timestamp calculations while do_sample_frames=True (@wulipc #27104)
- [CI] Remove forbidden slash (@NickLucche #27112)
- [ROCM] MoE fp4 CK kernel (@maleksan85 #26545)
- [ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_moe models with rms_norm (Qwen3-235B-A22B, Qwen3-30B-A3B, etc.) (@rasmith #26192)
- [Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused MoE quant type logic (@vllmellm #27029)
- [Chore] Remove unused `PolyNorm` layer (@Isotr0py #27110)
- [Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131072 (@mgoin #27114)
- [Minor] Remove unnecessary error message (@zhuohan123 #27115)
- [V1][Spec Decode] Fix greedy temperature detection after sampler refactor (@Pradyun92 #27077)
- [Test] Make `test_failure` more stable for batch invariance (@yewentao256 #27054)
- [BugFix][Core] Fix error when enable async-scheduling in multi-node env (@lhtin #25887)
- [Perf] Add H100 fused MoE config (@skyloevil #25398)
- [CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV cache layout in backend-correctness tests (@hl475 #26663)
- [GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (@Hanchenli #25515)
- [Misc] Rev DeepEP (@varun-sundar-rabindranath #27122)
- [DOC][FEATURES][CPU]update cpu feature for v1 (@xuechendi #27135)
- [Test] Add test for /health endpoint on engine failure (@dongbo910220 #26074)
- [Chore] Separate out `vllm.utils.mem_utils` (@iAmir97 #27143)
- [Feature] Batch Invariant: Support DeepGEMM and Blackwell (@yewentao256 #27127)
- [fix][cpu] fix prefill attention in CPU attention backend (@fadara01 #27035)
- [Misc] Refactor `get_kv_cache_spec` into `AttentionLayerBase` (@NickLucche #26587)
- [Models][QwenVL] Remove unnecessary `.contiguous()` calls (@lgeiger #27106)
- [Chore] Clean up pytorch helper functions in `vllm.utils` (@Isotr0py #26908)
- Fix incorrect string formatting in barrier timeout exceptions (@hyongtao-code #27149)
- [Minor] Add some clarifying comments to recent changes (@njhill #27130)
- [BugFix] Fix failing gemma-3-1b-it test: `test_lm_eval_accuracy_v1_engine[google/gemma-3-1b-it]` (@LucasWilkinson #27111)
- [Chore] Separate out profiling utilities from vllm.utils (@dongbo910220 #27150)
- [BugFix] fix graph partition signature (@BoyuanFeng #27139)
- [BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (@LucasWilkinson #27121)
- [V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` implementations (@ptovam #22456)
- [Minor] Remove unused env variable (@WoosukKwon #27161)
- [BugFix] Fix lazy imports involving outlines_core (@22quinn #27158)
- [Chore] Separate out hashing utilities from vllm.utils (@dongbo910220 #27151)
- [Benchmark] Convenience script for multiple parameter combinations (@DarkLight1337 #27085)
- output type conversion fix (@jianyuh #27159)
- [Chore] Separate out `vllm.utils.network_utils` (@iAmir97 #27164)
- [Misc] Move utils to avoid conflicts with stdlib, and move tests (@DarkLight1337 #27169)
- [Bugfix] Fix error with penalties when speculative decoding and structural output are enabled (@southfreebird #26586)
- Fix typo in ValueError message: use `kv_role` instead of `kv_disagg_role` (@hyongtao-code #27166)
- [Model][VLM] Support Bee-8B Model (@uyzhang #27012)
- [LoRA] LoRA cuda graph specialization (@andylolu2 #25914)
- [Kernel] Accelerate solve_tril with TMA (@ZJY0516 #26746)
- AArch64 CPU Docker pipeline (#26931)
- Nemotron Nano V2 VL + EVS Video Support (@BloodAxe #27107)
- [Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on H100 (FP8/BF16) (@shivampr #26268)
- [Bugfix][CI] Fix `Distributed Tests (4 GPUs)` async_sched+ray test (@NickLucche #27195)
- [Feature][Quantization] auto_round support for mixed bits quantization (@n1ck-guo #23812)
- [ROCm] enable some tests in entrypoints test groups on AMD (@Concurrensee #26725)
- [ez] add uv lock to gitignore (@qandrew #27212)
- [Quantization] Automatically infer AWQ `modules_to_not_convert` field (@Isotr0py #26909)
- [V0 Deprecation] Remove V0 metrics code (@njhill #27215)
- [cpu] Dispatch un-quantized linear to oneDNN/ACL by default for AArch64 (@fadara01 #27183)
- create is_in_the_same_node on cpu (@helunwencser #26832)
- [Frontend] Enforce tokenize=False when applying chat template (@russellb #27205)
- [Feature][Kernel]FusedMoE LoRA (@wcwuwc #21229)
- [BugFix] GPT-OSS Attention DP + MoE TP weight loading issue (@nvpohanh #24032)
- [ModelOpt] Load w13/w2_input_scale for all experts, nvfp4 (@wenscarl #26135)
- [Bugfix] Fix gpt-oss w4a8 DP/EP on B200 (@varun-sundar-rabindranath #26729)
- [Bugfix] Fix broken MTP weight loading for FP8 KV Scales (@benchislett #27227)
- [Fix][Spec Decode] Fix llama4 draft loading with different quantization (@linzebing #27136)
- [Nixl] Minor refactor to handshake related metadata (@NickLucche #26410)
- [MM][Core] Decouple ViT backend from LM backend (@ywang96 #27061)
- [Deepseek v3.2] Optimize top_k_per_row (@dcampora #26763)
- [Chore] Separate out NCCL utilities from vllm.utils (@dongbo910220 #27197)
- [CI] Install pre-release version of `apache-tvm-ffi` for `flashinfer` (@hmellor #27262)
- [ROCM] Enable CompressedTensorsWNA16 (@JartX #27187)
- Add @pavanimajety to .github/codeowners (@pavanimajety #27213)
- [ROCm] Update Triton, Torch, and AITER branches for ROCm base Dockerfile (@micah-wil #27206)
- [Feature] Batch Invariant for R1 TP 8 on Blackwell (@yewentao256 #27229)
- [Bugfix][P/D] Reduce num_threads used by nixl ucx backend (@dagrayvid #27196)
- [V0 Deprecation] Remove V0 executors (@njhill #27142)
- [Bugfix] fixes the decoding metadata of dense mla's fp8 kvcache. (@sighingnow #27144)
- Update PyTorch to 2.9.0+cu129 (@huydhn #24994)
- [Performance] Dual stream execution of "shared_experts" and "selected_experts" inside FusedMoE (@alexm-redhat #26440)
- Updated xgrammar backend to not deny supported string formats (@ExtReMLapin #27253)
- [Bugfix] skip cuda graph for drafter when running with eager (@benchislett #26821)
- [P/D] KVConnector for decode benchmarking (@tlrmchlsmth #25986)
- [Deepseek v3.2] Remove extra logics in indexer (@IwakuraRein #26465)
- [DOC] [ROCm] Add ROCm quickstart guide (@vllmellm #26505)
- [CI] Nixl integration tests DP-EP (@NickLucche #27199)
- [Benchmark] Add plot utility for parameter sweep (@DarkLight1337 #27168)
- [torch.compile] Enable silu_mul_fp8_quant fusion without custom ops enabled (@ZJY0516 #27146)
- [1/N][Platform] Cleanup useless function (@wangxiyuan #26982)
- Update release pipeline for PyTorch 2.9.0 (@huydhn #27303)
- Remove last `level` references not removed #26355 (@hmellor #27260)
- fixed reasoning streaming with tool_choice="required" (@ExtReMLapin #24108)
- [Frontend][3/N] Improve all pooling task | Support binary embedding response (@noooop #27066)
- [Bugfix][CPU] Disable dual stream execution for experts on CPU (@bigPYJ1151 #27320)
- [Bug] Raise error for `LLM(data_parallel_size=k)` single-process DP Usage (@yewentao256 #27282)
- Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size' (@gnovack #27311)
- [Core] Handle MoE LoRA edge cases (@jeejeelee #27335)
- [docs] Update v1 metrics design doc (@markmc #27332)
- Mirroring changes in test-pipeline.yaml into test-amd.yaml (@Alexei-V-Ivanov-AMD #27242)
- [Chore] Separate out optional dependency checks from vllm.utils (@dongbo910220 #27207)
- [Model] Upstream Deepseek-OCR model (@Isotr0py #27247)
- [NIXL] Terminate handshake listener thread in shutdown (@markmc #26404)
- [Bug] Fix DeepSeek-V2.5-1210-FP8 issue (@yewentao256 #27267)
- [bugfix] remove unused parameters to reduce unnecessary vram usage (@ReinForce-II #26789)
- [Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLoRA (@jeejeelee #27351)
- [NIXL] use Host buffer to support TP_ratio > 1 for XPU (@xuechendi #27140)
- [Bugfix] Make `get_mrope_input_positions` instance methods (@DarkLight1337 #27342)
- [Bugfix] Fix HF format InternVL large variants video processing (@Isotr0py #27330)
- [Frontend] Require flag for loading text and image embeds (@russellb #27204)
- [P/D] Dynamic `kv_output_aggregator` collect size (@NickLucche #26734)
- Support Anthropic API /v1/messages Endpoint (@LiuLi1998 #22627) (see the client sketch after this list)
- [Bugfix] Disable FlexAttention direct block mask building for encoder-only models (@Isotr0py #27344)
- [Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl… (@lucianommartins #27309)
- [Doc] Fix numbering sequence in prefix caching (@gigit0000 #27357)
- [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (@sagiahrac #27211)
- [Feature] publisher default set zmq in kv_event config (@lengrongfu #26915)
- [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (@Daisy-Ma-coder #27128)
- [Chore] Separate out system utilities from vllm.utils (@dongbo910220 #27201)
- [MLA] Bump FlashMLA (@MatthewBonanni #27354)
- [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (@Isotr0py #27361)
- [Bugfix] Fix SLA tuner initialization (@DarkLight1337 #27355)
- [Bugfix] Fix incorrect kv cache metrics in grafana.json (@fangpings #27133)
- [Bugfix][Core] running queue index leakage exception (@CLFutureX #26754)
- [CORE] Support Prefix Caching with Prompt Embeds (@qthequartermasterman #27219)
- [V1][spec decode] return logprobs for spec decoding (@TheEpicDolphin #26060)
- [Model] Add num_cached_tokens for PoolingRequestOutput (@noooop #27378)
- [Chore] Remove duplicate `has_` functions in vllm.utils (@jonathanc-n #27372)
- [CI/Build] Fix Prithvi plugin test (@DarkLight1337 #27393)
- [Bugfix] Fix args settings for guided decoding args (@luccafong #27375)
- [CI/Build] Fix AMD CI: test_cpu_gpu.py (@zhewenl #27388)
- add SLA information into comparison graph for vLLM Benchmark Suite (@louie-tsai #25525)
- [CI] Reorganize entrypoints tests (@chaunceyjiang #27403)
- [Metrics] [KVConnector] Add connector prefix cache hit rate stats (@ptovam #26245)
- [Model] Add MoE support for NemotronH (@tomeras91 #25863)
- Run mypy on the lowest supported Python version instead of system Python (@hmellor #27048)
- [Bugfix] Honor --mm_encoder_attn_backend when used (@bradleyhd #27124)
- [Feature] Pydantic validation for speculative.py (@Navya1707 #27156)
- [Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix DP slow startup time &c) (@ilmarkov #26709)
- [CI/Build] Remove unnecessary flags from test registry (@DarkLight1337 #27353)
- [Frontend][4/N] Improve all pooling task | Add plugin pooling task (@noooop #26973)
- Mirroring the test definitions (2025-10-22) (@Alexei-V-Ivanov-AMD #27362)
- [Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (@alexm-redhat #27220)
- [Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (@gshtras #27373)
- [Bugfix] Fix AWQ marlin layer skipping (@Isotr0py #27416)
- [Misc] Add triton_kernels dependency (@varun-sundar-rabindranath #27370)
- [Chore] Separate out `vllm.utils.platform_utils.py` (@jonathanc-n #27374)
- [Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (@MatthewBonanni #27368)
- [Bugfix][DP] Fix creating too many DP Placement Groups (@kebe7jun #26880)
- [Model] Siglip Embedding Support (@piood #27324)
- [Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers/utils.py for Powerpc (@Akashcodes732 #27422)
- Granite 4.0 quark quantization support (@xiao-llm #26944)
- Fix pooling adapters for Transformers backend (@hmellor #27338)
- [Kernel] Add GPTQv2 format support for low-bit or asymmetric quantization, by adapting gptq_gemm (@xxxxyu #26092)
- [Misc] Add TPU usage report when using tpu_inference. (@hfan #27423)
- [Bugfix][CI] Move resolving cudagraph_mode before initializing attn_metadata_builder (@fhl2000 #27427)
- Fix EventPublisherFactory logic for disabled KV cache events (@usberkeley #27419)
- [Chore] remove structural tags logging lines (@aarnphm #27451)
- [Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall in Responses API (@strinczer #26706)
- [Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunicator (@ruisearch42 #27443)
- [Docs] remove v1 column for embedding models (@piood #27446)
- [MM][Bugfix] Replace `PatchEmbed`'s conv3d to linear layer (@Isotr0py #27418)
- [BugFix] Fix torchrun DP with LLM class (@22quinn #27395)
- [Refactor] move tool parsing logic from protocol.py to the tool parser (@chaunceyjiang #27383)
- [Benchmark] Enable benchmark to run with `encoding_format="bytes"` (@DarkLight1337 #27467)
- Fix AArch64 CPU Docker pipeline (#27331)
- [MISC] `cudagraph_capture_sizes` related improvements (@fhl2000 #26016)
- Fix test named tool use (@chaunceyjiang #27458)
- [Doc] Fix minor issues in docs/design/metrics.md (@draftbk #27436)
- [cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N and different dtype (@fadara01 #27472)
- [compile] Turn standalone_compile back on (@zou3519 #27460)
- [NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_finished (@xuechendi #27297)
- [Bugfix] Fix MultiConnector stats reconstruction across process boundaries (@kouroshHakha #27366)
- [Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek (@minosfuture #26397)
- [Bugfix] Fix interns1-vit qk norm code path (@Isotr0py #27480)
- [CI/Build] Fix test_torch_utils in AMD CI (@zhewenl #27317)
- [Document] Add ms-swift library to rlhf.md (@hjh0119 #27469)
- [Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (@lhtin #27455)
- [Distributed] Basic set of configuration for large EP deployment on GB200 (@wpc #27328)
- [Log] Optimize Startup Log (@yewentao256 #26740)
- [Misc][DP] Guard mxfp4 implementation selection (@varun-sundar-rabindranath #27484)
- [KVConnector] Migrate the LMCache integration code to be vLLM native (@ApostaC #25542)
- [CI] Add tests for cudagraph (@ZJY0516 #27391)
- Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio… (@zhuohan123 #27502)
- [Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator + KV cache connector (@KuntaiDu #25712)
- [Misc] Simplify max tokens in multimodal registry (@DarkLight1337 #27500)
- [Attention] Add missing kv cache scale setup (@MatthewBonanni #27490)
- [CI/Build] Refactor processing tests (@DarkLight1337 #27470)
- [CI/Build] Use CPU for mm processing test on CI (@Isotr0py #27522)
- [BUGFIX][ROCM] ViT FlashAttention on ROCm (no GFX9) and contiguous on qwen3vl ROCm TORCH_SDPA (@JartX #27190)
- [Bugfix] Fix processor initialization for model from modelscope instead of HF (@lengrongfu #27461)
- [Bugfix] fix empty prompts for async-engine mode in benchmark throughput (@luccafong #27494)
- [Doc] Remove Molmo warning (@DarkLight1337 #27527)
- [Doc] Fix links to GH projects (@DarkLight1337 #27530)
- [Chore]:Extract math and argparse utilities to separate modules (@yeshsurya #27188)
- Revert "[CI/Build] Use CPU for mm processing test on CI (#27522)" (@DarkLight1337 #27531)
- [CI/Build] Update causal-conv1d installation (@DarkLight1337 #27529)
- [Model][MiniMax-M2] Support MiniMax-M2 Model (@rogeryoungh #27535)
- fix m2 test (@youkaichao #27536)
- Fix MiniMax-M2 copyright (@rogeryoungh #27537)
- [Model][Bugfix] fix ernie45 moe 300B SharedFusedMoE output tuple (@CSWYF3634076 #27316)
- [Model] Use merge_by_field_config for MM models (Qwen series) (@DarkLight1337 #27546)
- [Docs] remove the incorrect `enable_reasoning` parameter (@yyzxw #27550)
- [Performance][LoRA] add context varying params to 'do_not_specialize' in fused moe lora (@gnovack #27445)
- [Model] Deprecate `merge_by_field_config=False` (@DarkLight1337 #27551)
- [Doc] Slight improvement to M2 and beyond (@jeejeelee #27554)
- [Kernel] Adding split_K implementation for fused_moe_lora (@dcmaddix #27291)
- [Misc] Clean up utils (@DarkLight1337 #27552)
- [Bugfix] Limit the default value of `max_model_len` when it is not specified by users (@shen-shanshan #27556)
- [Bugfix] Fixed when return_token_ids=False, the first event still contains prompt_token_ids. (@chaunceyjiang #27561)
- [cpu][perf] Fix low CPU utilization with VLLM_CPU_OMP_THREADS_BIND on AArch64 (@fadara01 #27415)
- [Kernel] Enable moe LoRA kernel support FP16 (@jeejeelee #27468)
- [Hybrid] Added supports_mamba_prefix_caching Protocol (@Josephasafg #27339)
- [Model] Siglip2 Model Support (@piood #27566)
- [Bugfix][LoRA][FusedMoE] Select MxFP4 Backend based on LoRA Enablement (@varun-sundar-rabindranath #27487)
- fixing mm placeholder replacement issue with gemma3 (@tingtingtangmeta #27538)
- [Chore]: Stream tokens vs characters in tool call parser tests (@bbrowning #26513)
- [Misc] Clean up more utils (@DarkLight1337 #27567)
- [ROCm] Update AITER branch for ROCm base docker (@micah-wil #27586)
- Code quality improvements: version update, type annotation enhancement, and enum usage simplification (@usberkeley #27581)
- [gpt-oss][2/N] Support input_messages in responsesRequest (@qandrew #26962)
- [Bugfix][CI] Fix config resolving logic with remote models (@ywang96 #27610)
- [Stability fix] turn off HMA allocator when connector is set (@KuntaiDu #27592)
- [Bugfix] fixed inconsistent finish_reason handling between V0 and V1 engines (@chaunceyjiang #27555)
- [ROCm] [Doc] Update ROCm installation docs (@vllmellm #27327)
- [Hardware][AMD][Model] Triton MoE tuning configs for GLM-4.6 for MI300X (@minatoaquaMK2 #27323)
- [Bugfix][CPU] Fallback oneDNN linear to torch linear to fix half gemm support on legacy platforms (@bigPYJ1151 #27526)
- [Core][Bookkeeping Optimization] Update against numpy view of is_token_ids tensor (@Jialin #27618)
- [CI/Build] Fix amd model executor test (@zhewenl #27612)
- Fix a robust parsing issue in KimiK2ToolParser that causes IndexError (@wangln19 #27565)
- [V0 Deprecation] Remove vestigial V0 logits_processors.py file (@njhill #27601)
- [Bugfix] In LongRoPE, decide short vs long based on max_model_len (@MatthewBonanni #27431)
- [Misc] Separate out `utils.counter` and move `utils.Device` to engine (@DarkLight1337 #27588)
- [Bug] Fix shape issue for eplb expert weights (@yewentao256 #27589)
- [compile] Add enable_prompt_embeds to compile hash. (@zhxchen17 #27285)
- [Hybrid] Add mamba_block_size to Engine Args (@Josephasafg #27289)
- [compile] Disable dynamo guards check for AOT compilation. (@zhxchen17 #27288)
- fix: allow HuggingFace standard chat template params via **kwargs (@wangln19 #27622)
- [Core] Enable async scheduling for external_launcher mode (@22quinn #27394)
- [Bugfix][Frontend] validate arg priority in frontend LLM class before add request (@junpuf #27596)
- [BugFix] Also consider RAY_EXPERIMENTAL_NOSET_* when storing compilation cache (@HollowMan6 #27294)
- [nit]: lmcache integration import (@sammshen #27600)
- [FLA] Introduce Kimi Delta Attention(KDA) to VLLM (@zhiyuan1i #27654)
- [Bugfix] Fix allocation & free logic of SingleWriterShmRingBuffer (@imkero #27117)
- [Bugfix][CI] Fix v1 attention backend tests and add CI coverage (@mmangkad #26597)
- [Misc] Make `LayerBlockType` a `Literal` instead of `Enum` (@DarkLight1337 #27658)
- [compile] Add fallback path to AOT compile when serialization fails. (@zhxchen17 #27350)
- Add load pattern configuration guide to benchmarks (@mpashkovskii #26886)
- [Misc] Make reorder batch also separate extends (@LucasWilkinson #27367)
- [Test] Batch Invariant: Unit test using parameterized backend (@yewentao256 #27478)
- [Core] Scheduler: Publish connector events after output (@orozery #25875)
- [AsyncScheduling] Make async overlap work with logprobs (@njhill #27615)
- [Misc][qwen2_5_vl][torch.compile] Enable `supports_torch_compile` on generic nn.Module and demonstrate speedup on Qwen Vision model (@Lucaskabela #23207)
- [Bug] Fix deepep low latency use nvlink by default (@yewentao256 #27677)
- [Core] Early return in SlidingWindowManager.remove_skipped_blocks (@Jialin #27673)
- Install pre-built xformers-0.0.32.post2 built with pt-2.9.0 (@huydhn #27598)
- Revert "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@simon-mo #27714)
- [Build] Revert triton_kernels requirements (@varun-sundar-rabindranath #27659)
- [NIXL][XPU] update name of nixl wheel (@zhenwei-intel #27631)
- [Model] Fix Qwen3VL and Qwen3Omni after torch.compile changes (@lgeiger #27705)
- [KV cache] Fix lmcache connector (@Shaoting-Feng #27681)
- [CI/Build][Bugfix]Fix Quantized Models Test on AMD (@zhewenl #27712)
- [Bugfix] Fix non-contiguous tensor error in `rocm_unquantized_gemm_impl` (@zhewenl #27605)
- [Speculators] Move tests + fix integration (@dsikka #27308)
- [CI/Build] Move pre-commit only scripts to `tools/pre_commit` (@DarkLight1337 #27657)
- [perf] Enable concurrent execution of "shared_experts" and "selected_experts" in qwen3-next (@ZJY0516 #27578)
- [Bugfix] Fix modular kernel tests (@bnellnm #27707)
- [Frontend] [gpt-oss] Tool json call parsing error retry (@alecsolder #27675)
- [Frontend] [gpt-oss] Mcp type bug (@alecsolder #27689)
- [Fix] import get_kv_cache_torch_dtype error in vllm_v1_adapter.py (@KevinCheung2259 #27670)
- [Misc] Raise error for missing video metadata in `MultiModalDataParser` (@Isotr0py #27664)
- Feature/video support in random mm dataset (@BloodAxe #25963)
- [chore] Remove models weight on S3 logic (@khluu #27725)
- [VLM] Add Qwen3-VL generation test (@Isotr0py #25185)
- [CI/Build] Skip cpu offloading test on AMD (@zhewenl #27690)
- [Frontend] Add `vllm bench sweep` to CLI (@DarkLight1337 #27639)
- Fix MiniMax-M2 rmsnorm precision and remove useless code (@rogeryoungh #27627)
- [ROCm][Platform] Add MI308X device id in _ROCM_DEVICE_ID_NAME_MAP (@sammysun0711 #27623)
- [CI] Fix flaky `test_two_responses_with_same_prev_id` test (@NickLucche #27745)
- [Chore] Optimize P2PNCCLEngine `http_address` (@yewentao256 #27488)
- [Core] Exposing engine sleep & wake_up state as prometheus metrics (@dumb0002 #24176)
- [FIXBUG] Qwen3VL hallucinations without Contiguous on Torch.SDPA (@JartX #27744)
- `use_aot_compile` should respect `VLLM_DISABLE_COMPILE_CACHE` (@BoyuanFeng #27698)
- [CI/Build] Test torchrun with 8 cards (@22quinn #27548)
- [Bug] Raise error explicitly if using incompatible backend (@yewentao256 #27424)
- [KVConnector] Add metrics to Prometheus-Grafana dashboard (@NickLucche #26811)
- [Bug] Fix DeepEP low latency `assert self.batched_router_logits.size(-1) == full_router_logits.size(-1)` Bug (@yewentao256 #27682)
- [BugFix] Fix handling of resumed reqs in `SharedStorageConnector` (@njhill #27719)
- [Bug] Fix DBO IMA issue for DeepEPHT (@yewentao256 #27666)
- [Temp fix] Disable torch.compile for Qwen2.5 VL's VisionBlock temporarily. (@huachenheli #27760)
- [XPU][bugfix] fix rope for llama4 and deepseek (@yma11 #25145)
- [Bugfix] mamba-block-size is set for vision language model (@heheda12345 #27773)
- [XPU] Update latest IPEX 2.8 release (@jikunshang #27735)
- [BugFix] Handle unscheduled requests properly when async scheduling (@njhill #27756)
- [Feat] Adds runai distributed streamer (@bbartels #27230)
- kernels/moe test pruning (@kfhfar #27053)
- [BugFix] Reordering extend logic fix (@LucasWilkinson #27739)
- [Benchmark] Cleanup deprecated nightly benchmark and adjust the docstring for performance benchmark (@KuntaiDu #25786)
- Add more dims for batch invariant shims (@bwasti #27489)
- use stringData in secret yaml to store huggingface token (@yitingdc #25685)
- [CI/Build]Add eval config for Qwen3-235B-A22B-Instruct-2507-FP8 (@hl475 #27113)
- [BugFix][VL] Fix FA selection on Qwen2.5-VL (@zhewenl #27790)
- [V0 deprecation] Remove VLLM_USE_V1 usage in config module (@wangxiyuan #27784)
- [CI Failure] fix test_default_mm_loras (@hl475 #27795)
- [CI] Fix mypy for `vllm/v1/core` and `vllm/v1/engine` (@yewentao256 #27108)
- [Bugfix] Improve GPU validation logging in Ray fallback scenarios (@sairampillai #25775)
- [Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. (@noooop #25524)
- [CI Failure] Fix test_kv_cache_model_load_and_run (@hl475 #27717)
- [Model] Introduce Kimi Linear to vLLM (@zhiyuan1i #27809)
- [KV offload] Enable CPU KV offload on CUDA alike Platforms (@zhewenl #27770)
- [Model][Ouro] Support Ouro Model (@FlamingoPg #27794)
- [Bugfix][CPU] Fix MRoPE dispatch on the CPU backend (@bigPYJ1151 #27800)
- [BugFix] Stopgap - Flashinfer Autotuner + GPT-OSS + DP/TP (@varun-sundar-rabindranath #27762)
- [Misc] Replace CUDA_VISIBLE_DEVICES in DP with torch.cuda.set_device for device selection on cuda-like devices (@ilmarkov #27564)
- [Docs] add Shanghai Meetup - 2025/10 (@kebe7jun #27545)
- Reapply "Install pre-built xformers-0.0.32.post2 built with pt-2.9.0" (@huydhn #27768)
- [MTP] Refactor mtp predictor to avoid d2h operation (@MengqingCao #27643)
- [Model] Use the same fused_moe configs for all H200 devices (@bufferoverflow #23642)
- [Bugfix] Fix 2 precommit issues - (mamba_block_size, kv_cache_config) (@tlrmchlsmth #27811)
- [Core][Bookkeeping] Update cu_num_accepted_tokens for all req_index (@Jialin #27629)
- [EP/DP][API Server] Enable DP-aware routing in OpenAI API requests (@Prowindy #24945)
- [Fix] Skip `record_sleep_state` logic in `PrometheusStatsLogger` if not in dev mode (@SumanthRH #27789)
- [Refactor] Remove `VLLM_DEEPEP_LOW_LATENCY_ALLOW_NVLINK` (@yewentao256 #27750)
- [Core][Perf] Only invoke save_new_computed_blocks when computed blocks are not empty (@Jialin #27799)
- [Feature] Batch invariant torch.compile (@PaulZhang12 #27660)
- [BugFix] Fix broken import in initialize_ray_cluster() (@njhill #27838)
- [Misc] Make all tool scripts executable (@MatthewBonanni #27831)
- [CI/Build][Intel] Enable performance benchmarks for Intel Gaudi 3 (@jakub-sochacki #26919)
- [CI Test] Add Scheduled Integration Test (@yewentao256 #27765)
- [benchmark] Make request IDs unique across clients by default (@eicherseiji #27723)
- [Hardware][Powerpc] Fix VLLM_CPU_OMP_THREADS_BIND="auto" low CPU utilization for Power (@Akashcodes732 #27734)
- [Kimi-Linear] Correct prefixes and add compatibility to AWQ quants (@toncao #27834)
- [Bugfix] Avoid too small block m/n for FlexAttention kernel option (@Isotr0py #27853)
- [BugFix] Don’t compute reorder threshold when there are no attention groups (@hl475 #27861)
- [Perf] Decouple torch op from GDA to leverage torch.compile (@ZJY0516 #27871)
- [CI/Build] Add gpt-oss LoRA test (@jeejeelee #27870)
- [Bugfix] Allow 64-bit integer values for LoRA IDs to avoid overflow/truncation (@shadeMe #27876)
- [Bugfix] Fix broken MRoPE for GLM-4.1V/GLM-4.5V (@Isotr0py #27860)
- [Bugfix] Missing NIXL metadata for handshake initialization if instance spans multi-node (@GuanLuo #26338)
- Docs update tpu install instructions (@RobMulla #27824)
- [bugfix] Missing cached item in beam search (@fake0fan #27874)
- fix incorrect type annotation in KimiMLP (@skyloevil #27885)
- Flashinfer_CUTLASS_MOE fuses quantization for TP (@wenscarl #27223)
- [Cleanup] Remove no-longer-used `SpeculativeConfig.enable_chunked_prefill` (@njhill #27826)
- [Feature] Pydantic validation for scheduler.py and structured_outputs.py (@vrdn-23 #26519)
- Add FLASHINFER_MLA to test_mla_backends and add B200 CI run (@MatthewBonanni #27663)
- Batch invariance doc (@bwasti #27839)
- [Hybrid] A simpler algorithm to find kernel_block_size (@heheda12345 #26476)
- [Core] Async scheduling + structured outputs compatibility (@njhill #26866)
- [Kernel] Enable FusedMoEModularKernel support bias (@jeejeelee #27754)
- [Bugfix] Fix KDA output (@jeejeelee #27905)
- [Multimodal][XPU]Enable vision attn backend for xpu platform (@yma11 #27525)
- Adding SplitK in fused_moe_lora kernel (@yugong333 #27818)
- [CI/Build] Bump transformers version (@DarkLight1337 #27528)
- [Bugfix] [Model] Missing MRoPE function definition from `KeyeForConditionalGeneration` (@tjtanaa #27895)
- [Add] cmdline argument parsing for KV cache offloading modules (@ApostaC #27621)
- feat(benchmarks): support HF model names in multi-turn benchmark (@ai-jz #27850)
- [Docs] Mock all imports for docs (@hmellor #27873)
- [V0 deprecation] Remove VLLM_USE_V1 usage in platform and v1 module (@wangxiyuan #27798)
- [Bugfix] DeepSeek V3.2 MTP metadata & CUDA graph issues (@xiaohajiayou #26779)
- [Bugfix] Python 3.10 compatibility for `Self` (@DarkLight1337 #27918)
- [Core][TPU] Support TPU Data Parallelism (@wenxindongwork #27365)
- [BugFix] Fix mixed penalties batch with async scheduling (@njhill #27910)
- Adds anthropic /v1/messages endpoint to openai api_server (@bbartels #27882)
- [KV offload] Offloading connector async scheduling support (@KevinCheung2259 #27648)
- [CI/Build] Fix flaky test_transcription_validation.py::test_basic_audio_gemma (@bbrowning #27924)
- [Bugfix] Fix Qwen Omni audio inference (@DarkLight1337 #27920)
- Performance fix MistralTokenizer: cache special ids and tokens (@juliendenize #27925)
- [V1] [Hybrid] Mamba1 Automatic Prefix Caching (@Josephasafg #26377)
- [Misc] Provide Siglip2 chat template (@DarkLight1337 #27939)
- [Bugfix][llm]: Abort orphaned requests when llm.chat() batch fails (@Flink-ddd #27420)
- [BugFix][LoRA] use adapter_id instead of id field of lora_request (@biswapanda #27728)
- [Frontend] Align finish_reason when tool is called with OpenAI (@n0gu-furiosa #25054)
- [Hybrid] Pass kernel block size to builders (@tdoublep #27753)
- [Bugfix] Padded Eagle Specdec with Chunked Prefill (@Flechman #26263)
- [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (@jikunshang #27964)
- Add ORCA endpoint load metrics support (@efimki #24905)
- [CI/Build] Remove the flaky gpt-oss lora test (@jeejeelee #27966)
- [Model] Add PaddleOCR-VL Model Support (@zhang-prog #27758)
- Early exit for MoE LoRA kernels (@gnovack #27131)
- [Bugfix] Skip gs:// model paths for speculator detection (@pwschuurman #27846)
- [BUG] Make 'binary' default option for saving torch compile artifacts when using standalone_compile (@ahao-anyscale #27616)
- [CI/Testing] Add basic single node dual batch overlap test (@LucasWilkinson #27235)
- [Spec Decode] Integrate Suffix Decoding from Arctic Inference (@aurickq #25784)
- [Feature][Benchmarks] Support `inf` burstiness (@sducouedic #26941)
- [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (@Lucaskabela #27764)
- [Bugfix] change FlashMLA reorder_batch_threshold (@MatthewBonanni #27777)
- [Docs] add runai_streamer_sharded to LoadConfig (@andyxning #27937)
- Add TP parameter to attention tests (@MatthewBonanni #27683)
- [Bugfix][plugin] fla crash on plugin (@ILikeIneine #27322)
- [Bugfix] Fix MoE Routing Simulation (@tlrmchlsmth #28002)
- Remove the tpu docker image nightly build. (@QiliangCui #27997)
- [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (@vllmellm #27748)
- [LoRA] Lora shrink swizzle (@li2haipeng #27694)
- [Refactor] Lazy import tool_parser (@chaunceyjiang #27974)
- [NIXL][XPU] Pin NIXL version to 0.7.0 (@zhenwei-intel #27849)
- [Metrics] Enable sleep state metric outside of dev mode (@markmc #27867)
- [Bug] Batch invariant: Fix flash attn MLA `RuntimeError: scheduler_metadata must have shape (metadata_size)` (@yewentao256 #27884)
- [CPU]Improve dynamic 4bit moe performance (@xiangze-arm #27240)
- [CI/Build] Update LM Eval Version in AMD CI (@zhewenl #27944)
- [KV Connector] Make KVCacheConfig an explicit constructor argument (@markmc #27887)
- [Model] fix ernie45 reasoning_parser (@CSWYF3634076 #27973)
- [CI/Build] Fix OpenAI API correctness on AMD CI (@zhewenl #28022)
- [BugFix][Performance] Restore flashinfer autotuning for all scenarios (@varun-sundar-rabindranath #27904)
- Load tuned fused_moe_lora shrink and expand kernel configs separately (@yugong333 #27435)
- Support using Int4PreshuffledTensor after loading (@jerryzh168 #26066)
- [Core] Enable StatLogger in LLMEngine (@zhuohan123 #28020)
- [Model][Bugfix] fix pipeline parallelism support for NemotronH (@tomeras91 #27968)
- [Model] add optimal triton fused moe configs for NemotronH MoE (@tomeras91 #27967)
- [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (@bnellnm #27123)
- [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (@njhill #28025)
- [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (@LyrisZhong #27284)
- [PERF] Decouple projections from GDN custom op (@vadiklyutiy #27512)
- [model] Add support for openPangu_Ultra_MoE (@yt0428 #27521)
- [PerfFix] Avoid separate thread for MP executor shm spin (@njhill #28012)
- [AsyncScheduling] Don't schedule past request max_tokens (@njhill #27922)
- Remove deprecated `--rope-scaling` and `--rope-theta` (@hmellor #28006)
- [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (@ganyi1996ppo #25763)
- Added disable rule to track files under benchmarks/lib (@nadavkluger #28048)
- [Multimodal] Make MediaConnector extensible. (@huachenheli #27759)
- [ROCm] gemm_a16w16 upstreaming (@maleksan85 #26969)
- Revert "[PERF] Decouple projections from GDN custom op" (@vadiklyutiy #28080)
- [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (@toulzx #27740)
- [XPU] Add gpt-oss model support for Intel GPU (@jikunshang #27786)
- [CI/Build] Enable some fixed tests in AMD CI (@zhewenl #28078)
- [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (@wangxiyuan #27955)
- [Bugfix] Fix encoder-only model support for transformers backend (@Isotr0py #28021)
- [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (@LucasWilkinson #28100)
- [Model, Core] Support Granite Speech & LoRA for STT (@alex-jw-brooks #24455)
- [Refactor] Lazy-loaded reasoning_parser (@chaunceyjiang #28092)
- [Refactor] to simplify and extract the shared logic between chat completion and responses (@chaunceyjiang #27961)
- [bugfix] fix wrong `dcp_local_seq_lens` calc (@pisceskkk #27518)
- [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (@KuntaiDu #28011)
- [Misc] fix import error for DeepSeekR1ReasoningParser (@chaunceyjiang #28114)
- Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (@minatoaquaMK2 #27635)
- Bugfix: Cutlass FP8 FusedMoE bad scaling factors (@amirkl94 #27255)
- [Graph Partition][Cache] Use inductor partition ops config (@BoyuanFeng #27702)
- [XPU] Enable custom routing functions in IPEX for Llama4 (@frost-intel #28004)
- add kimi reasoning parser (@MoyanZitto #28128)
- [DCP] check return_lse for all layers in dcp (@heheda12345 #27929)
- [BugFix] Support EP/DP + EPLB with MTP (@ilmarkov #25311)
- Enabling cooperative multi-gpu tests on multi-gpu nodes (@Alexei-V-Ivanov-AMD #27986)
- [ROCm][MLA] Support block-size > 1 for AITER MLA backend (@ganyi1996ppo #27224)
- [Bugfix] Validate custom logits processor xargs for online serving (@Isotr0py #27560)
- [misc] add vLLM Beijing Meetup (@jjzhang #28127)
- [Kernel] Fuse computation of g and beta for Gated Delta Net (@ZJY0516 #28095)
- [Core] add support for reasoning parser plugins (@walterbm #28075)
- [Bugfix] vLLM should check Inductor config for compile cache enablement status (@gmagogsfm #27637)
- [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (@heheda12345 #27994)
- [CI]: Add LMCache Unit Tests (@sammshen #27852)
- [Feature] Extend batch invariant torch.compile to B200 (@PaulZhang12 #27856)
- [Bugfix] Fix Qwen3-Reranker-8B load (@noooop #28117)
- [Docs] Clean up README_TUNING.md (@windsonsea #28088)
- [Hardware][IBM Z] Optimize s390x Dockerfile (@R3hankhan123 #28023)
- [Chore] Remove Nemotron-Nano-VL config copy (@Isotr0py #28126)
- [Docs] Add guide to debugging vLLM-torch.compile integration (@zou3519 #28094)
- [Feature]: Add corrupted request metric to V1 metrics system. (@atalhens #27306)
- [CI/Build] Update checking logic in cutlass_group_gemm_supported (@zhewenl #27948)
- [CI/Build] Fix `test_defaults_with_usage_context` in AMD CI (@zhewenl #27926)
- [Core][Hybrid allocator + connector 2/n] Unify `remove_skipped_blocks` by `get_last_useful_token` (@KuntaiDu #25431)
- [Debugging] Add annotation for easier trace analysis (@dayeol #22496)
- [PERF] Decouple projections from GDN custom op. Attempt 2 (@vadiklyutiy #28083)
- [Bug] Fix cpu disable shared_experts `VLLM_DISABLE_SHARED_EXPERTS_STREAM` (@yewentao256 #28157)
- [Bug] Fix env string `"0"` same to `True` (@yewentao256 #28159)
- [Feature] Enable TP + EP `shared_experts` overlap with router, 3.7% E2E performance improvement (@yewentao256 #28164)
- [CI Failure] `nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV` was removed from HF. Skip it in tests (@vadiklyutiy #28170)
- [Misc] Remove the duplicate code (@chaunceyjiang #28111)
- [Chore] Clean up deepseek v2/v3 config copy (@Isotr0py #28055)
- [Core][MM] Use non-blocking CPU-GPU copy of multimodal data (@lgeiger #28141)
- Make the cv2 dependency optional (@cmpute #27780)
- [CI] Add compile/test_multimodal_compile.py to CI (@gmagogsfm #28151)
- [flashinfer] fix FI all2all with FI cutlass moe (@mxz297 #28166)
- Patch Mistral Tokenizer (@juliendenize #28146)
- Fix hard-coded parameter name in gemma3n.py (@seungduk-yanolja #27946)
- [CPU] Enable torch profiling (@aditew01 #28130)
- [V0 deprecation]clean up is_v1_supported_oracle (@wangxiyuan #28116)
- [Bugfix][Kernel] fix merge attn states when both prefix and suffix are empty (@courage17340 #28181)
- [Frontend] OpenAI Responses API supports Tool/Function calling - non-harmony (@chaunceyjiang #26874)
- [CPU]Improve cpu fused moe perf (@xiangze-arm #27244)
- Disable nm-testing models with issues in CI (@mgoin #28206)
- [Docs] Switch to directory style URLs (@hmellor #28058)
- [Kernel][Model] Tune fused_moe Triton configs for MiniMax-M2 on H100 (@minatoaquaMK2 #28200)
- [Doc] Add Arm CPUs are on the list of supported targets in vLLM (@milpuz01 #26018)
- [HARDWARE][CPU] Add Option for Disabling Binding to Specific CPU Cores (@StanHatko #27953)
- [Frontend] Fix logging format when enable response logging (@esmeetu #28049)
- CODEOWNERS: Add myself as reviewer on security docs (@russellb #28216)
- [Structured outputs] Upgrade llguidance to 1.3.0 (@andylolu2 #28039)
- Add llama 4 scaling support (@juliendenize #28145)
- [Chore] eliminate duplicated and unconditional object serialization in anthropic messages api (@vicoooo26 #27792)
- [ROCm] triton fp8 kernel (@maleksan85 #27058)
- [Doc]: Make extraInit containers fully configurable in helm chart (@HanFa #27497)
- [Test] Add non-MoE DP test coverage (@MatthewBonanni #28235)
- [BugFix] Fix FusedMoELoRA + ModularKernel Integration (@varun-sundar-rabindranath #28237)
- Fix failing test for CRadio (@BloodAxe #27738)
- Speed up mm processor kwargs per request by splitting dynamic and static kwargs (@LJH-LBJ #26483)
- [Multimodal][torch.compile] Add compilation config field for turning off ViT/MM compile (@Lucaskabela #28242)
- [CI/Build] Loosen STT LoRA Translate Check (Flaky Test) (@alex-jw-brooks #28247)
- Add runai model streamer e2e test for GCS (@amacaskill #28079)
- Fix issues from #28242 (@hmellor #28257)
- [amd][gptoss] Perf gain because of block alignment (@smitkadvani #28024)
- [Bug] Fix missing token_ids for reasoning parser models in chat completions #28246 (@baonudesifeizhai #28256)
- [CI] Reduce Blackwell Fusion test runtime by filtering tests and only run all tests in nightly (@Copilot #28074)
- [Kernel] LoRA triton kernels support PDL (@jeejeelee #27402)
- [Perf] Introduce FlattenLogprobs to store logprobs results to reduce GC overhead (@Jialin #28171)
- [FixBug]Aeala/ShareGPT_Vicuna_unfiltered marked as multimodal benchmark (@princepride #28265)
- [CPU]Avoid repeated random sample compile (@xiangze-arm #28260)
- [Misc][Model][Refactor] Pass the prefix into Linear layers (@MengqingCao #28259)
- [fix] Revert "fixing mm placeholder replacement issue with gemma3" (@khluu #28285)
- [Core][MM] Add mechanism to configure multimodal fields which should stay on CPU (@lgeiger #28168)
- [Bugfix] Use latency MOE backend as default for Flashinfer and other misc fixes (@pavanimajety #27439)
- [CLI] add --max-tokens to `vllm complete` (@Iceber #28109)
- [Feature] Default `ignore_eos` True for `random` dataset (@yewentao256 #28227)
- [Log] update shm wait time msg (@BoyuanFeng #28255)
- Revert "[PerfFix] Avoid separate thread for MP executor shm spin (#28012)" (@NickLucche #28289)
- [README] Add Arm CPUs to the list of supported targets (@fadara01 #28290)
- [doc] add guide about the provided PTX was compiled with an unsupported toolchain (@youkaichao #28305)
- [Build] Fix release pipeline failing annotation (@simon-mo #28272)
- [Bugfix] Fix and add tests for GptOss reasoning parser (@benchislett #28000)
- [Core] Rework handling of async scheduling config (@njhill #28250)
- [PerfFix] Avoid separate thread for MP executor shm spin (take 2) (@njhill #28319)
- Update Flashinfer from `v0.4.1` to `v0.5.2` (@hmellor #27952)
- [XPU] Enable Expert parallel for MoE models (@jikunshang #28263)
- remove resolve_op_overloads and use splitting_ops directly (@BoyuanFeng #28081)
- [Bugfix][LoRA][Spec Decode] Support LoRA with speculative decoding (@xiaohongchen1991 #21068)
- Update gpu.rocm.inc.md to add support for AMD Ryzen AI MAX / AI 300 Series (gfx1151, gfx1150) (@hammmmy #28308)
- [Perf][DeepSeek] Add sigmoid+bias fusion to fused_grouped_topk from TRTLLM (@mgoin #28124)
- Bump arctic-inference requirement (@aurickq #28174)
- [bugfix] support eagle with lora cudagraph specialization (@gnovack #28318)
- [Model] Consolidate Deepseek-MoE implementation with DeepSeek-v2 (@Isotr0py #28101)
- Refactor CPU/GPU extension targets for CMake build (@ashahba #28026)
- [flashinfer][fix] do not check nvcc availability when using pre-downloaded cubins (@mxz297 #27990)
- [Attention] Remove max cudagraph size limit of 992 (@22quinn #27840)
- `reasoning_content` -> `reasoning` (@hmellor #27752)
- [Bugfix] Update device name for H200 detection (@robertgshaw2-redhat #28349)
- [Bugfix] Spec decode + structured output + spec model max len edge case (@andylolu2 #28298)
- [DCP] Support dcp kv_cache interleave size > 1 (@zhangsicheng5 #26696)
- Enhance run_cluster.sh for multi-NIC support (@evberrypi #28328)
- [Feat] Drop-in Torch CUDA Profiler (@benchislett #27841)
- Remove setuptools upper bound constraint (<80) (@ColeMurray #28337)
- [Bugfix] Fix test fused quant layernorm tests (@ElizaWszola #27865)
- [Performance][gpt-oss] Revert gpt-oss max cudagraph size to 1024 (@mmangkad #28345)
- [chore] Move some wikimedia images to S3 (@khluu #28351)
- fix: close issue 28338 by fixed python version (@yihong0618 #28339)
- [Misc] fix typo and add detailed log (@andyxning #28178)
- [ROCm] Add env to enable/disable aiter triton gemm (@sarckk #28321)
- [Misc] Add some comments in qwen3-next (@ZJY0516 #28267)
- [CI] Fix flaky `test_eagle_correctness` test (@NickLucche #28364)
- [Core] Simplify async KV output aggregation (@njhill #28327)
- [Core] Separate out attention metadata building logic from prepare inputs (@LucasWilkinson #26764)
- [BugFix] Fix cu_num_generated_tokens slicing logic in LogprobsLists.slice() method (@usberkeley #28214)
- [CI/Build] Temporary fix to LM Eval Small Models (@zhewenl #28324)
- [Kernel] Fix fused_gdn_gating (@ZJY0516 #28343)
- [ROCm][Platform] Add RX7900XTX device id in _ROCM_DEVICE_ID_NAME_MAP (@JartX #28279)
- [CI] lora/test_mixtral.py : Add additional expected outputs due to flakiness (@varun-sundar-rabindranath #28322)
- [Hardware][AMD][Model] Add Triton MoE tuning support and optimized configs for Qwen3 omni for MI308X (@sammysun0711 #28373)
- [V0 deprecation] Remove no longer used `get_metadata_cls` (@LucasWilkinson #28370)
- Restore PlaMo2 unit test as `pfnet/plamo-2-1b` now supports `transformers >=4.56` (@Alnusjaponica #28019)
- [Metrics] Refactor LoRA state tracking (@markmc #26801)
- [bugfix] fix siglip batch text output error (@piood #28365)
- [Fix] optimize visual token mask with caching and multi-token support (@bo-ke #28374)
- Add @tjtanaa to codeowner for ROCm and multi-modal (@tjtanaa #28360)
- [Rocm][fused_moe][fp4] view weight to torch.float4_e2m1fn_x2 when running aiter fused moe for fp4 model (@zejunchen-zejun #27474)
- [Kernel] Optimization of the mm_k operator. (@caozuoba #28280)
- [RFC][ROCm][AITER] Keep all AITER kernels in `_aiter_ops` class like `_custom_ops` and `_ipex_ops` (@vllmellm #24490)
- [V0 Deprecation] Remove unused `context_len` and `seq_len` from M-RoPE (@DarkLight1337 #28395)
- [Bugfix] Fix persistent_masked_m_silu_mul_quant tests (@varun-sundar-rabindranath #28366)
- [Performance] Support FP8 flashinfer TRTLLM MOE on Qwen3 and Qwen-3next (@jiahanc #27492)
- [Bugfix] Fix llguidance backend, rollback when EOS was encountered (@Flechman #25905)
- [FA/Chore] Bump FA version for FP8 two-level accumulation (@jmkuebler #27889)
- [Bugfix][EPLB] Disabled shared expert overlap when EPLB is enabled (@SageMoore #28377)
- [Misc] Add more scoping for improved trace (@frank-wei #28329)
- [BugFix] Fix DeepGEMM over-allocating workspace (@LucasWilkinson #28254)
- [Frontend][2/n] remove empty content from _parse_tool_calls_from_content (@qandrew #28331)
- [CI] Fix Plugin Tests Tests (@robertgshaw2-redhat #28413)
- [ROCm] Add missing gemm_a8w8_blockscale import (@sarckk #28378)
- [PERF] Allreduce fusion. Support torch native matching. Tuning of the thresholds (@ilmarkov #24248)
- [Perf] Move gc.freeze logic from EngineCoreProc to EngineCore for better coverage (@Jialin #27896)
- [Bugfix] Ensure calculated KV scales are applied in attention. (@adabeyta #27232)
- [Test] Remove old non-varlen FA2 test (@MatthewBonanni #28420)
- [Feature] Refactor batch invariant fp8 DeepGEMM (@yewentao256 #27606)
- [CI/Test Fix] Fix CP tests on Blackwell (@LucasWilkinson #28404)
- [Feature] Add env var `VLLM_MOE_USE_DEEP_GEMM` (@yewentao256 #28422)
- Only register rocm_aiter_ops if aiter is found (@mgoin #28428)
- Fix rotary embedding benchmark script (@xyang16 #28323)
- [Misc] FlattenLogprobs -> FlatLogprobs (@zhuohan123 #28335)
- [Frontend] Add sagemaker_standards dynamic lora adapter and stateful session management decorators to vLLM OpenAI API server (@zhaozuy #27892)
- [Bugfix] Fix Stream Sync for Shared Expert Overlap (@robertgshaw2-redhat #28430)
- [Doc] Sleep mode documentation (@iAmir97 #28357)
- [BugFix] Avoid calling KV connector layer APIs when metadata is unset (@sdavidbd #28253)
- [Bugfix] Fix max image size for PaddleOCR-VL (@ywang96 #28442)
- [EPLB] Refactor balance_packing to use numpy and optimize GPU-CPU transfers in EPLB (@SageMoore #28369)
- [Bugfix] fix qwen3-next crash (@ZJY0516 #28202)
- [BugFix] 'DeepseekV2Config' object has no attribute 'use_mla'` (@faaany #28387)
- [Model][Qwen3VL] Slightly speedup `fast_pos_embed_interpolate` (@lgeiger #28434)
- Multi turn benchmark progress bar for synthetic conversation generation (@segevido #28394)
- [CI] Add mergify rules for `nvidia` label (@mgoin #28417)
- [Attention] Refactor CUDA attention backend selection logic (@MatthewBonanni #24794)
- Fix Fused MoE LoRA Triton kernel bug (@chaojun-zhang #28450)
- [Model] Pass `mm_features` directly into `get_mrope_input_positions` (@DarkLight1337 #28399)
- Add request timeout override for multi-turn benchmarks (@segevido #28386)
- [Docs] Fix grammar in CPU installation guide (@maryamtahhan #28461)
- [Kernels] Split up fused_moe/layer.py, isolate more modular kernel code (@bnellnm #28064)
- [BugFix] Fix Failing Ruff Check (@jvlunteren #28469)
- Add @markmc to CODEOWNERS for Observability (@markmc #28457)
- [BugFix] Fix RuntimeError in PixtralHFAttention on CPU/XPU (@faaany #28444)
- [BugFix] Add test_outputs.py to CI pipeline (@usberkeley #28466)
- [Doc] Fix typo in serving docs (@the-codeboy #28474)
- Remove weight_scale.T special case for SM90 Block FP8 CUTLASS kernel (@mgoin #28431)
- [NIXL] Generalize block-first backend layouts (FlashInfer-like) (@NickLucche #28282)
- [Kernel][Perf] fuse QK Norm and RoPE into one cuda kernel for Qwen Model (@izhuhaoran #27165)
- [ROCm][Quantization] extend AMD Quark to support mixed-precision quantized model (@xuebwang-amd #24239)
- [Quantization] fix attention quantization of gpt_oss model (@xuebwang-amd #27334)
- [CI/Build] Refactor Attention backend for test_prefix_prefill from xformers to SDPA (@zhewenl #28424)
- Prefer FlashAttention MLA as default over FlashMLA (@MatthewBonanni #27363)
- [Kernel] Optimize rms_norm kernel (@xyang16 #27931)
- [BugFix] Fix Siglip2Attention on XPU (@faaany #28448)
- [Misc] Remove unused attention prefix prefill ops functions (@lgeiger #26971)
- [Perf] Use np.ndarray instead of list[list[int]] to reduce GC overhead (@Jialin #28245)
- [V0 deprecation] Clean up num_prefill_tokens logic for V0 (@gcanlin #28203)
- [Misc] fix typo in DCP comment (@Livinfly #28389)
- [LoRA][1/N]Remove LoRA extra vocab (@jeejeelee #28382)
- [TPU] Rename path to tpu platform (@kyuyeunk #28452)
- [Misc] Cleanup Executor interface (@wangxiyuan #28441)
- Add Zurich vLLM Meetup (@mgoin #28488)
- [Bugfix] Disable shared expert overlap if Marlin MoE is used (@mgoin #28410)
- [Feature] Allow configuring FlashInfer workspace size (@maxyanghu #28269)
- Use FLASHINFER MLA backend when testing fp8_kv_scale_compile (@adabeyta #28491)
- [BugFix] Graceful handling of torch symm mem errors. (@ilmarkov #27671)
- [Frontend] Change CompilationMode to a proper Enum (@gmagogsfm #28165)
- [Performance] Cache loaded custom logitsprocs to avoid overheads (@Isotr0py #28462)
- [V0 deprecation] Remove VLLM_USE_V1 env (@wangxiyuan #28204)
- [CPU] Refactor CPU attention backend (@bigPYJ1151 #27954)
- `VLLM_USE_TRITON_FLASH_ATTN` V0 variable deprecation (@AndreasKaratzas #27611)
- [Model][Qwen3VL] Simplify `get_mrope_input_positions` using numpy (@lgeiger #28302)
- [Core] Encoder separation for Encode-Prefill-Decode Disaggregation (@fake0fan #25233)
- [BugFix] Add fallback path in `apply_rotary_pos_emb_flashattn` for non-cuda platforms (@faaany #28447)
- [Benchmark] Add retry support to fix workload bias in multi-turn benchmark (@ai-jz #28493)
- [Core] Cache `vllm_is_batch_invariant` (@lgeiger #28304)
- [CI/Build] Fix crash due to removed VLLM_USE_V1 attribute in EPD (@fake0fan #28521)
- [CI] Introduce autorun_on_main feature (@hl475 #27836)
- [BugFix]: --enable-lora with model granite-4.0-micro crash (@yyzxw #27733)
- [Model] fix glm4_moe_mtp load weights with GLM-4.6 checkpoint. (@wuyaoxuehun #27597)
- [XPU]Fix crash due to removed VLLM_USE_V1 attribute (@chaojun-zhang #28520)
- [KVConnector] Enable get_block_ids_with_load_errors() in LMCache connector (@ziruiliu #27978)
- add cpu option for p/d in nixl_connector (@ZhengHongming888 #28356)
- [ROCm] [Bugfix] Fix `fused_qknorm_rope_kernel` rocm compatibility (@tjtanaa #28500)
- [Bugfix] Fix gpt_oss packed_modules_mapping (@jeejeelee #28536)
- [V0 deprecation] Deprecate use_v1 parameter (@wangxiyuan #28112)
- Fix pre-commit (and XPU) on `main` (@hmellor #28556)
- [Performance][Hopper] Avoid M dim padding to 4x for most cases (due to cuda graphs paddings) (@alexm-redhat #28492)
- [Refactor] Remove redundant TP gather/split in split_qkv in QwenVL (@gcanlin #28271)
- [Misc] Refactor Attention kv transfer methods into decorator (@NickLucche #27816)
- Remove deprecated fields from `CompilationConfig` (@hmellor #27593)
- [Perf] Refactor cudagraph_support to enable full CUDA graphs for spec decoding with FlashInfer (@benchislett #28479)
- Implement ARC KV cache eviction policy (@albertoperdomo2 #27039)
- [EPLB][ROCm]: support EPBL for ROCm backend (@PerryZhang01 #27731)
- [Model] [Config] Correctly identify granite-4.0-micro as non-hybrid model (@tdoublep #28563)
- [CI] Skip "Multi-Modal Models Test (Extended) 3" test that's broken in current Transformers (@hmellor #28559)
- [KV connector][WIP] KV cache proxy based on LMCache multi-process mode (@ApostaC #27902)
- [BugFix] Priority scheduling and spec tokens preemption (@andylolu2 #28558)
- [Misc]Fix typo in llm_engine.py (@frank-wei #28584)
- [Performance][B200] Fix deepgemm prologue (@varun-sundar-rabindranath #27897)
- [ROCM] Fix ROCm warnings, environment flag access, and GEMM kernel naming for consistency in `_aiter_ops.py` (@vllmellm #28464)
- [TPU] Support GCS path in VLLM_TORCH_PROFILER_DIR (@QiliangCui #28487)
- [Bugfix] Adjust Marlin CUDA arch selection to 8.0+PTX;9.0+PTX (@mgoin #28294)
- [Core][AMD] Migrate fully transparent sleep mode to ROCm platform (@HollowMan6 #12695)
- [MoE][Kernel][Perf] Improve Shared Expert Stream Overlap (@alexm-redhat #28406)
- Skip models that cannot currently init on Transformers v5 (@hmellor #28471)
- [Docs] Update meetups.md description (@mgoin #28583)
- [ROCm][Bugfix] Revert removing setuptools version restriction (@gshtras #28592)
- [platform] Move get_cu_count to utils (@wangxiyuan #27005)
- [Bugfix] Fix SM100 gpt-oss regression due to faulty attn sink support (@mgoin #28561)
- [BugFix] Fix `mm_encoder_attn_backend` arg type checking (@njhill #28599)
- [Docs] Add some details about what the MoE block needs for the Transformers backend (@hmellor #28588)
- Rename clashing method names for vLLM model protocol (@hmellor #27583)
- [n-gen] DO NOT repeatedly return finished child requests (@Jialin #28591)
- [Frontend] split append tool output (@qandrew #28333)
- [Frontend][responsesAPI][1/n] convert responses API tool input to chat completions tool format (@qandrew #28231)
- [BugFix][ROCm] Fix `get_cu_count` missing variable error (@ganyi1996ppo #28608)
- [XPU] Support Triton path for LoRA operations on XPU (@faaany #28511)
- Support DeepEP for Kimi-k2-thinking through enabling gemm selection for compressed-tensor marlin wna16 (@luccafong #28574)
- [build][cmake]: Bundle static ACL and torch libgomp for CPU extension builds (@Radu2k #28059)
- [ROCm][BugFix] Remove the usage of `device_info` from aiter (@ganyi1996ppo #28383)
- [Bugfix] Prevent crash on empty grammar string (@tjandy98 #28210)
- Use official xformers-0.0.33 built for PT 2.9 (@huydhn #28600)
- Add NUMA node validation for CPU thread binding (@usberkeley #28555)
- [Bugfix] fix kimi-linear crash (@ZJY0516 #28445)
- [Frontend] supports interleaved thinking (@chaunceyjiang #28531)
- Support all interleaved layer types (@sarckk #28485)
- Fix: Correctly filter special tokens in benchmark_prefix_caching (@dw2761 #28615)
- [BugFix] Fix type error when assign a trition kernel tensor to a torch.nn.Parameter (@liuzijing2014 #28603)
- Fix io processor pooling #28273 (@baonudesifeizhai #28484)
- [XPU] add sym params to IPEXConfig (@zufangzhu #28611)
- [Bugfix] Fix FPS value type for Qwen2.5-Omni video processing (@faaany #28630)
- [Hardware][PowerPC] Fix fp16 compilation error for Power in cpu attention backend and bump oneDNN version (@Akashcodes732 #28535)
- [ROCm][BugFix] Fix `get_cu_count` in rocm_aiter_fa.py (@ganyi1996ppo #28618)
- [CI/Build] Install uv for AMD MI300: Language Models Tests (Hybrid) %N (@amdfaa #28142)
- [CI Failure] Fix backend selection for encoder-only models (@hl475 #28534)
- [BugFix] DeepSeek-OCR: apply NoRepeatNGramLogitsProcessor to greedy path (@YuanpingSong #28617)
- Fix `get_num_experts` when config sets it explicitly to `None` (@hmellor #28652)
- [Misc] Turn off encoder torch compile by default (@ywang96 #28634)
- Rewrite C++ meta funcs to Python (@janeyx99 #28595)
- [BugFix] Ensure `EngineArgs.create_engine_config` is idempotent (@njhill #28515)
- [TPU] patch TPU wheel build script to resolve metadata issue (@jcyang43 #27279)
- [Performance][B200] silu_mul_quant: pack scales in int32 (@varun-sundar-rabindranath #28358)
- [Bugfix] Fix validate model input for decoder models (@yannicks1 #27099)
- [Attention][Bugfix] Fix FA sink support (@MatthewBonanni #28660)
- [Perf] Support stream interval for reducing host overhead (@elvischenv #27869)
- [bugfix] correct local_chunk_len for DCP in reorg_kvcache with long context (@pisceskkk #28526)
- [Bugfix] Eliminate tuple inputs to submodules in graph partitioning (@gmagogsfm #28533)
- [Bugfix] [CPU] bump torch to 2.9.0 for Darwin to fix segmentation fault (@kebe7jun #27791)
- [Misc] Update CODEOWNERS for simon-mo and comaniac (@simon-mo #28675)
- [CI] Bug: Fix ci entrypoint pooling (@yewentao256 #28684)
- [KV Connector] Test async mode in scheduler tests (@markmc #28550)
- Mirrored test group definitions for AMD (2025-11-11) (@Alexei-V-Ivanov-AMD #28573)
- [quantization][config] enable override existing quant_config (@ILikeIneine #28510)
- [ROCm] Bump up the version of amd-smi to 6.4.3 (@SageMoore #28680)
- [CPU][Bugfix] Fix Apple Silicon M1 compilation failure (@mgoin #28681)
- [ci][amd] fix basic models extra init test (@bradleyhd #28676)
- [Misc] Remove `warn_for_unimplemented_methods` (@DarkLight1337 #28613)
- [XPU][CI]disable lm cache uts (@jikunshang #28696)
- [Misc] Update xformers to 0.33.0.post1 (@ywang96 #28678)
- [Misc] add ignore mapper for quark quantization (@haoyangli-amd #28275)
- [Bugfix][CI/Test][Spec Decode] Fix illegal memory access in offline_inference/spec_decode.py (Issue 27619) (@rasmith #28432)
- [BugFix][CI/Build][ROCM] Fix import error and apply assert in appropriate case in test_struct_output_generate (@rasmith #28311)
- use default CCL_ZE_IPC_EXCHANGE (@yma11 #28700)
- [Bugfix] fix dots.ocr pp support (@ZJY0516 #28705)
- [BugFix] Fix multi-modal async scheduling race condition (@njhill #28706)
- Add output token counting to gsm8k eval (@mgoin #28594)
- [Minor] avoid register new custom and just import silly_attn (@BoyuanFeng #28578)
- [Misc] fix comment in test_envs (@xingliu14 #28529)
- [feat]: log number of preempted requests (@610lyn #28522)
- [Frontend] Added chat-style multimodal support to /classify. (@WorldExplored #27516)
- [Model][MM] Extract conv layer as CustomOp (@shen-shanshan #28455)
- [DCP] Support Decode Context Parallel (DCP) for GQA with Flashinfer (@gjc0824 #25438)
- Fix KV sharing fast prefill with cudagraph enabled (@sarckk #28537)
- [BugFix] Fix FA3 IMA with FULL_AND_PIECEWISE and cascade attention (default) (@LucasWilkinson #28702)
- [Doc] Fix macOS installation dependency resolution issue (@shahfasal #26721)
- [Model] Fix bailing_moe accuracy problem (@zhaozx-cn #28277)
- [Bugfix][Nixl] Fix kernel physical<>logical block_size issue (@NickLucche #28677)
- [Config] Clean up SchedulerConfig initialization (@DarkLight1337 #28665)
- [Kernels] Enable FlashInfer FP8 Blockscale on SM90 (for TEP DSR1) (@djmmoss #27134)
- [Fix] improve aspect ratio in dummy image generation and add common VLM tests for PaddleOCR-VL (@dongbo910220 #28711)
- [Docs] Update the name of `Transformers backend` -> `Transformers modeling backend` (@hmellor #28725)
- [CI][CPU] Smoke test for Apple Silicon using GHA MacOS runner (@mgoin #28688)
- [DisaggEverything] Tokens in<>out `/generate` endpoint (@NickLucche #24261)
- [Attention] Bump FA for removed method (@MatthewBonanni #28429)
- Fix typo in comment: existance -> existence (@OthmanMohammad #28737)
- Remove audio optional dependency for mistral-common (@juliendenize #28722)
- [kernel] Improve FP8 PTPC on Hopper for larger shapes (@czhu-cohere #28692)
- docs(lora_resolvers): clarify multi-resolver order and storage path requirement (@wangchen615 #28153)
- LLaMA4 LoRA Adapter Enablement (@kfhfar #28602)
- [Bugfix] [ROCm] [AITER]: Fix aiter block quant not compatible with torch compile dynamo (@tjtanaa #28716)
- [Docs] Enable some more markdown lint rules for the docs (@hmellor #28731)
- [Chore] Rename `SchedulerConfig.chunked_prefill_enabled` (@DarkLight1337 #28735)
- [Bugfix] resolve Qwen3-VL GPTQModel quantized model loading failure (@GuanH #28663)
- [BugFix] Fix misprint introduced by modular_kernel refactoring. (@halyavin #28728)
- [ROCm][Bugfix] Fix compilation errors with fused_qknorm_rope_kernel.cu (@SageMoore #28682)
- [CI] Fix macos smoke test uv cache issue (@mgoin #28736)
- [Bugfix] TypeError: 'NoneType' object is not callable (@mostrowskix #27410)
- [ROCm][CI/Build] Change install location of uv (@gshtras #28741)
- Avoid bytecode hook and simplify TorchCompileWrapperWithCustomDipatch (@laithsakka #25110)
- [Bugfix] Fix incorrect use of hidden_states for shared_experts due to do_naive_dispatch_combine (@alexm-redhat #28740)
- [Bugfix] Fix ChunkedLocalAttention CUDA Graph setting (@benchislett #28739)
- [Hybrid] [Kernel] Fix chunk scan kernel when BLOCK_SIZE_DSTATE > 128 (@tdoublep #28295)
- [Log] Save profiler results to file instead of stdout (@rasmith #28144)
- [ROCm][CI/Build] Upgrade to ROCm 7.1 and AITER main (@gshtras #28753)
- [Test] Rework e2e async scheduling tests (@njhill #28744)
- [Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization (@Jialin #26368)
- [TPU] Fix import error in tpu launch (@QiliangCui #28758)
- [Model][Qwen3VL] Use `mm_position` to compute mrope positions (@lgeiger #28730)
- [Bugfix] Build hadacore kernels on >SM90 (@mgoin #28748)
- Revert "[Core] Performance: Use list[np.ndarray] instead of list[list… (@njhill #28773)
- Fix IntermediateTensors initialization and add type hints (@OthmanMohammad #28743)
- [NIXL] heterogeneous block_size support (@xuechendi #26759)
- [Performance][DeepGEMM] Estimate expected_m (@varun-sundar-rabindranath #28694)
- [Redo] #26368 (@DarkLight1337 #28771)
- [RL] [V1] Remove unused device argument from reset_kv_cache (@zhuohan123 #28766)
- Use narrow over indexing in `hadacore_transform` to prep for ABI stable (@janeyx99 #28756)
- [Kernel][Moe Configs] llama4 maverick fp8 moe config tp8 on mi325 (@zhewenl #28709)
- [Misc] Make `SchedulerConfig.max_model_len` init-only (@DarkLight1337 #28733)
- [PERF] Remove TRTLLM Gen attn kernel limitation `max_seq_len <= 131072` (@vadiklyutiy #28755)
- [compile] Enable sequence parallelism matching w/o custom ops enabled (@angelayi #27126)
- Allow Gemma3 to take image embeddings (@tingtingtangmeta #28483)
- [Doc] Fix failing doc build (@DarkLight1337 #28772)
- [Model] Fix lmhead init bug of bailing_moe (@hwhaokun #28777)
- Add support for Eagle with separate lm-head and embed_tokens layers (@eldarkurtic #28549)
- [CI] Fix broken pipeline (@njhill #28781)
- [Model][Qwen3VL] Cache positional embedding indices (@lgeiger #28475)
- [Doc]: fix typos in various files (@didier-durand #28567)
- [BugFix] Fix `AssertionError: DCP not support reorder_batch_threshold > 1 now.` (@LucasWilkinson #28751)
- Adding a benchmark for batch invariance (@bwasti #28161)
- [Benchmark] Fix client seed synchronization in multi-turn benchmark (@ai-jz #28512)
- [Model] Allow users to control skip reading cache per request. (@noooop #28194)
- [V1] Support MP Executor for multi node distributed inference (@luccafong #23691)
- Fixed gpt-oss _load_weights_other() parameter position bug (@River12 #28715)
- [Bugfix] Fix host and port join for ipv6 in bench serve (@scottzh8 #28679)
- Fix gpt oss weight loading with EP + bf16 (@ashors1 #28765)
- [Doc]: fix typos in various files (@didier-durand #28811)
- fix comment typo (@andyxning #28802)
- [Model][QwenVL] Optimize `Qwen2_5_VisionAttention` q,k preparation (@lgeiger #28769)
- Feature: Support Relu2 in FusedMoE fp8 cutlass path (@amirkl94 #27261)
- [BugFix] Fix async scheduling + chunked prefill + preemption (@njhill #28787)
- [Performance][Fix] update nvfp4 code to support renorm routing (@jiahanc #28569)
- [NIXL][XPU] update install script of NIXL (@zhenwei-intel #28778)
- [ROCm][Qwen3-32B] Fix AITER MHA accuracy issue cause by #25763 (@sammysun0711 #28670)
- [Bugfix][Model] Prevent special token leakage in KimiK2ToolParser streaming mode (@jscaldwell55 #28543)
- [Doc] Add llama4 LoRA tag (@jeejeelee #28825)
- [CPU][Bugfix] Fix _to_list in CPU model runner (@bigPYJ1151 #28824)
- [BugFix] Fix glm4_moe_mtp load weights bug (@wuyaoxuehun #28805)
- [Metrics] Fix KV cache usage percent metric multiproc (@jaywonchung #28792)
- [XPU] work around for sp, avoid custom op import error (@jikunshang #28822)
- [BugFix] Temporary fix for IMA with MTP = 2 and full-cg (@LucasWilkinson #28315)
- [Bugfix][Perf] Revert applying HF processor on text-only inputs for multimodal models (@ywang96 #28858)
- Cast return value to int64_t for cache size (@tiehexue #28814)
- [Bugfix] Fix GPT-OSS on AMD after #28603 (@zhewenl #28816)
- [Core] Async Scheduling X Spec Decoding Compatibility (@Ronald1995 #24799)
- [BugFix] Fix PP performance and PP kv connector output regression (@njhill #28768)
- [Quantization] [Eagle] Add complete quantization support to the draft model in Eagle (@shreyas269 #28435)
- [Test] Batch Invariant: Rename and organize tests (@yewentao256 #27421)
- [Model] Add Afmoe architecture implementation (@pranav4501 #28332)
- [BugFix] Corner case that could cause out-of-sync with external launcher mode and dp >1 (@bangshengtang #28774)
- [Misc] Fix wrong comment in scheduler (@zhuohan123 #28880)
- [Bugfix] Fix Kimi-K2 tool parser concatenated tool calls parsing (@bbartels #28831)
- Run macos smoke test workflow on main commit (@mgoin #28752)
- [ROCm][Quantization] add apply_vllm_mapper in quark config for models like gpt-oss (@xuebwang-amd #28638)
- [Refactor] Remove Unused Func in Batch Invariant (@yewentao256 #28881)
- [Bugfix] Fix wrong CLI defaults for dynamic `SchedulerConfig` fields (@DarkLight1337 #28872)
- [Doc]: fix typos in various files (@didier-durand #28863)
- [Misc] Remove unnecessary parentheses from log statements (@andyxning #28897)
- [CI] Fix async scheduling + spec decoding test flake (@njhill #28902)
- [MISC] Remove format.sh (@KuntaiDu #28906)
- [CI/Build] Replace wikipedia url with local server ones (@Isotr0py #28908)
- [BugFix] Fix PP/async scheduling with pooling models (@njhill #28899)
New Contributors
- @bwasti first commit is #25603
- @Renovamen first commit is #25796
- @patrick-toulme first commit is #25084
- @kingsmad first commit is #25825
- @yingjun-mou first commit is #25827
- @zhoukezi first commit is #25854
- @leejnau first commit is #25706
- @adabeyta first commit is #25513
- @acisseJZhong first commit is #25912
- @a120092009 first commit is #25942
- @Anionex first commit is #25354
- @DrStone1971 first commit is #25843
- @certainly-param first commit is #25935
- @natoscott first commit is #26007
- @kmaehashi first commit is #26005
- @leo-pony first commit is #25470
- @huijjj first commit is #24947
- @levunet first commit is #24768
- @Egor-Krivov first commit is #25668
- @sixiang-google first commit is #25992
- @astralord first commit is #26027
- @jasl first commit is #26098
- @nrghosh first commit is #26148
- @southfreebird first commit is #25974
- @soldni first commit is #26054
- @yuafng first commit is #26219
- @ILikeIneine first commit is #25823
- @jasonlizhengjian first commit is #25998
- @elieserr first commit is #26177
- @orangeng first commit is #26266
- @ymoslem first commit is #26258
- @abhisheksheth28 first commit is #25521
- @seven-mile first commit is #26231
- @cfRod first commit is #26289
- @atalhens first commit is #26265
- @gholmes829 first commit is #25164
- @dcampora first commit is #25945
- @antrec first commit is #26340
- @plliao first commit is #26325
- @morrison-turnansky first commit is #26113
- @isharif168 first commit is #26347
- @Barry-Delaney first commit is #25931
- @utkarshsharma1 first commit is #26279
- @Aydin-ab first commit is #25283
- @therealnaveenkamal first commit is #25103
- @QierLi first commit is #24926
- @zhiyuan1i first commit is #24486
- @iwzbi first commit is #16601
- @roikoren755 first commit is #25947
- @luis5tb first commit is #25593
- @wangxiongts first commit is #25550
- @sangho-vision first commit is #26563
- @muzian666 first commit is #26562
- @HsChen-sys first commit is #22100
- @FENP first commit is #26574
- @gjgjos first commit is #26339
- @andycandy first commit is #26629
- @aitsvet first commit is #26713
- @cyb70289 first commit is #26698
- @kfhfar first commit is #26538
- @n1ck-guo first commit is #24024
- @ryanli first commit is #26758
- @VladOS95-cyber first commit is #26726
- @zklapow first commit is #26818
- @HDCharles first commit is #26820
- @Dhruvilbhatt first commit is #26837
- @madongfly first commit is #26853
- @li2haipeng first commit is #26319
- @pdasigi first commit is #26143
- @cern1710 first commit is #26637
- @inc-jeong first commit is #26225
- @bogdanminko first commit is #27008
- @mandy-li first commit is #26883
- @kimbochen first commit is #26943
- @staghado first commit is #26916
- @rkarhila-amd first commit is #25586
- @hyongtao-code first commit is #27101
- @jianyuh first commit is #27159
- @uyzhang first commit is #27012
- @shivampr first commit is #26268
- @helunwencser first commit is #26832
- @dagrayvid first commit is #27196
- @ExtReMLapin first commit is #27253
- @ReinForce-II first commit is #26789
- @LiuLi1998 first commit is #22627
- @sagiahrac first commit is #27211
- @fangpings first commit is #27133
- @jonathanc-n first commit is #27372
- @bradleyhd first commit is #27124
- @Navya1707 first commit is #27156
- @piood first commit is #27324
- @xxxxyu first commit is #26092
- @usberkeley first commit is #27419
- @strinczer first commit is #26706
- @hjh0119 first commit is #27469
- @wpc first commit is #27328
- @yeshsurya first commit is #27188
- @rogeryoungh first commit is #27535
- @dcmaddix first commit is #27291
- @tingtingtangmeta first commit is #27538
- @minatoaquaMK2 first commit is #27323
- @wangln19 first commit is #27565
- @junpuf first commit is #27596
- @sammshen first commit is #27600
- @mpashkovskii first commit is #26886
- @KevinCheung2259 first commit is #27670
- @sammysun0711 first commit is #27623
- @dumb0002 first commit is #24176
- @sairampillai first commit is #25775
- @FlamingoPg first commit is #27794
- @SumanthRH first commit is #27789
- @PaulZhang12 first commit is #27660
- @jakub-sochacki first commit is #26919
- @RobMulla first commit is #27824
- @yugong333 first commit is #27818
- @ai-jz first commit is #27850
- @xiaohajiayou first commit is #26779
- @biswapanda first commit is #27728
- @efimki first commit is #24905
- @zhang-prog first commit is #27758
- @xiangze-arm first commit is #27240
- @yt0428 first commit is #27521
- @ganyi1996ppo first commit is #25763
- @nadavkluger first commit is #28048
- @toulzx first commit is #27740
- @frost-intel first commit is #28004
- @jjzhang first commit is #28127
- @walterbm first commit is #28075
- @dayeol first commit is #22496
- @cmpute first commit is #27780
- @seungduk-yanolja first commit is #27946
- @aditew01 first commit is #28130
- @milpuz01 first commit is #26018
- @StanHatko first commit is #27953
- @vicoooo26 first commit is #27792
- @HanFa first commit is #27497
- @amacaskill first commit is #28079
- @smitkadvani first commit is #28024
- @xiaohongchen1991 first commit is #21068
- @hammmmy first commit is #28308
- @ashahba first commit is #28026
- @zhangsicheng5 first commit is #26696
- @evberrypi first commit is #28328
- @ColeMurray first commit is #28337
- @bo-ke first commit is #28374
- @caozuoba first commit is #28280
- @zhaozuy first commit is #27892
- @maryamtahhan first commit is #28461
- @the-codeboy first commit is #28474
- @xuebwang-amd first commit is #24239
- @Livinfly first commit is #28389
- @AndreasKaratzas first commit is #27611
- @wuyaoxuehun first commit is #27597
- @ziruiliu first commit is #27978
- @ZhengHongming888 first commit is #28356
- @albertoperdomo2 first commit is #27039
- @PerryZhang01 first commit is #27731
- @Radu2k first commit is #28059
- @tjandy98 first commit is #28210
- @dw2761 first commit is #28615
- @zufangzhu first commit is #28611
- @amdfaa first commit is #28142
- @YuanpingSong first commit is #28617
- @janeyx99 first commit is #28595
- @xingliu14 first commit is #28529
- @610lyn first commit is #28522
- @WorldExplored first commit is #27516
- @gjc0824 first commit is #25438
- @shahfasal first commit is #26721
- @zhaozx-cn first commit is #28277
- @OthmanMohammad first commit is #28737
- @GuanH first commit is #28663
- @halyavin first commit is #28728
- @mostrowskix first commit is #27410
- @laithsakka first commit is #25110
- @hwhaokun first commit is #28777
- @River12 first commit is #28715
- @scottzh8 first commit is #28679
- @ashors1 first commit is #28765
- @jscaldwell55 first commit is #28543
- @tiehexue first commit is #28814
- @Ronald1995 first commit is #24799
- @shreyas269 first commit is #28435
- @pranav4501 first commit is #28332
Full Changelog: v0.11.0...v0.11.1