1011 commits
4ae77df
[Frontend][1/n] Make pooling entrypoints request schema consensus | C…
noooop Jan 16, 2026
14ce524
[CI] Breakup h200 tests (#30499)
LucasWilkinson Jan 16, 2026
03da3b5
[Bugfix] Refactor to support DP parallel in R3 (#32306)
xhx1022 Jan 16, 2026
b66b0d6
fix(rocm): Enable non-gated MoE (is_act_and_mul=False) support on ROC…
rabi Jan 16, 2026
b84c426
[ROCm][CI] Skip Qwen3-30B-A3B-MXFP4A16 Eval Test On Non-CUDA Platform…
micah-wil Jan 16, 2026
180e981
[Chore] Replace swish with silu (#32459)
DarkLight1337 Jan 16, 2026
6ca4f40
[CI][AMD] Skip test_permute_cols since the kernel is not used and not…
rasmith Jan 16, 2026
c9a5330
[EPLB][BugFix]Possible deadlock fix (#32418)
ilmarkov Jan 16, 2026
9fd918e
[CI] Update deepgemm to newer version (#32479)
yewentao256 Jan 16, 2026
7a10304
Atomics Reduce Counting Optimization for SplitK Skinny GEMMs. (#29843)
amd-hhashemi Jan 16, 2026
a884bc6
[LoRA] Update LoRA expand kernel heuristic (#32425)
xyang16 Jan 16, 2026
4c82b6f
[responsesAPI] allow tuning include_stop_str_in_output (#32383)
qandrew Jan 16, 2026
ca21288
[CI] Fix OOM in Hopper Fusion E2E Tests (H100) (#32489)
LucasWilkinson Jan 16, 2026
484e22b
[TPU][Core] Enable Pipeline Parallelism on TPU backend (#28506)
Chenyaaang Jan 16, 2026
5a3050a
[Docs][Governance] Add @robertshaw2-redhat to lead maintainers group …
simon-mo Jan 17, 2026
037a648
apply _validate_input to MistralTokenizer token-id chat prompts (#32448)
vanshilshah97 Jan 17, 2026
2e7c89e
Revert "[Attention][MLA] Make `FLASHINFER_MLA` the default MLA backen…
MatthewBonanni Jan 17, 2026
8e61425
[CI] Implement uploading to PyPI and GitHub in the release pipeline, …
Harry-Chen Jan 17, 2026
d3317bb
[Models] Lfm2Moe: minor name changes for resolving lora conflicts (#2…
paulpak58 Jan 17, 2026
1646fea
[Model] Molmo2: Enable quantized weight mapping for vision backbone (…
George-Polya Jan 17, 2026
2b99f21
[Misc] Fix typo: seperator -> separator in flashmla_sparse.py (#32411)
T1mn Jan 17, 2026
9e078d0
[CI/Build][Docker] Add centralized version manifest for Docker builds…
mritunjaysharma394 Jan 17, 2026
965765a
[build] fix cu130 related release pipeline steps and publish as night…
Harry-Chen Jan 18, 2026
3055232
[Feature] Add FIPS 140-3 compliant hash algorithm option for multimod…
karanb192 Jan 18, 2026
4147910
[Model Runner V2] Move mrope_positions buffer to MRopeState (#32532)
WoosukKwon Jan 18, 2026
4a6af88
[MoE Refactor] Move Test Impl into Test Dirs (#32129)
robertgshaw2-redhat Jan 18, 2026
8cc26ac
[Performance] Improve Triton prefill attention kernel's performance …
Isotr0py Jan 18, 2026
963dc0b
[Model Runner V2] Minor optimization for eagle input processing (#32535)
WoosukKwon Jan 18, 2026
fe36bf5
[Model] Remove the unnecessary dtype conversion in MiniCPM (#32523)
gcanlin Jan 18, 2026
c826c72
[Model] Support Step1 Model (#32511)
randzero Jan 18, 2026
38bf2ff
[Bugfix] Fix GLM-ASR audio encoder RoPE dim (#32540)
Isotr0py Jan 18, 2026
2f03035
"refactor: refactor_repeated_interfaces" (#32486)
tom-zju Jan 18, 2026
327a02d
[MoE Refactor] Separate Router into OO Classes (#30623)
bnellnm Jan 18, 2026
afc3622
[CI] Move Distributed Tests from H200 -> H100 (#32555)
robertgshaw2-redhat Jan 18, 2026
ba29ab4
Use the same memory for workspace13 and fused_output. (#31531)
halyavin Jan 18, 2026
5480c6b
[Doc] Correct comment for _jobs dict in OffloadingConnectorWorker (#3…
DemingCheng Jan 18, 2026
16de822
[Refactor] Remove unused file `pallas_kv_cache_update.py` (#32433)
yewentao256 Jan 18, 2026
eebc58d
[Refactor] Remove unused cutlass moe problem size function (#32047)
yewentao256 Jan 18, 2026
f5d1740
[Bugfix] Add OOT backend option (#32471)
iboiko-habana Jan 18, 2026
6101a26
[BUGFIX] Fix degenerate strides in TRTLLM query tensors for FlashInf…
vadiklyutiy Jan 19, 2026
bb1848c
[Model Runner V2] Support VLM (#32546)
WoosukKwon Jan 19, 2026
9a1f16d
[Model Runner V2] Refactor `update_states` (#32562)
WoosukKwon Jan 19, 2026
976af2f
[BugFix] Fix embed_input_ids argument error of QwenVLForConditionalGe…
honglyua-il Jan 19, 2026
7518a3d
[CI/Build] Use Common Event Map Fixture in Harmony / MCP Server Tests…
alex-jw-brooks Jan 19, 2026
3c8740a
[Frontend] Add render endpoints for prompt preprocessing (#32473)
hyeongyun0916 Jan 19, 2026
11bbf86
[CI][Hardware][AMD] Fix test_rotary_embedding_mla_cache_fused (#32408)
mawong-amd Jan 19, 2026
71832ba
[GLM-4.7] GLM Model support for GLM-Lite (#31386)
zRzRzRzRzRzRzR Jan 19, 2026
c0a350c
[ROCm][CI] Add ROCm attention backend support for EAGLE DP tests (#32…
AndreasKaratzas Jan 19, 2026
74c583b
[Core] Whisper support `torch.compile` (#30385)
NickLucche Jan 19, 2026
cdd03d2
[CI/Build] Fix dependency conflict between model-hosting-container-st…
DanielMe Jan 19, 2026
758df5a
[NIXL][Metrics] Track `nixl_num_kv_expired_reqs` metric in Prometheus…
NickLucche Jan 19, 2026
c88860d
[Frontend] Score entrypoint support data_1 & data_2 and queries & doc…
noooop Jan 19, 2026
aa7f37c
Add support for LoRA adapters in Nemotron-H models (#30802)
danisereb Jan 19, 2026
2636d76
[Misc] Remove unused ModelKeys (#32608)
jeejeelee Jan 19, 2026
cd3ac5b
support dynamic resolution image encoding for Nemotron Nano VL (#32121)
netanel-haber Jan 19, 2026
a0490be
[CI][amd] Revert NIXL connector change to avoid crash (#32570)
qli88 Jan 19, 2026
0727cc9
[BUGFIX] Fix `test_mla_backends.py`. Scale MLA projection weights to …
vadiklyutiy Jan 19, 2026
9d1e611
[CI] Add Helion as an optional dependency (#32482)
gmagogsfm Jan 19, 2026
7350331
[BugFix] Fix TRT-LLM NVFP4 DP/EP (#32349)
jiahanc Jan 19, 2026
73f2a81
docs: prefix caching seems quite outdated (#28784)
longregen Jan 19, 2026
4a5299c
feat: spec decode with draft models (#24322)
tomasruizt Jan 19, 2026
43fada5
[Model Runner V2] Refactor `dummy_run` (#32533)
WoosukKwon Jan 19, 2026
1a1fc3b
[Attention][MLA] Make FLASHINFER_MLA the default MLA backend on Black…
MatthewBonanni Jan 19, 2026
05dc4bf
[Model Runner V2] Initialized communication buffer for DP (#32624)
WoosukKwon Jan 20, 2026
12dab78
[Feat] allow inplace loading lora (#31326)
Jackmin801 Jan 20, 2026
7b7cdce
[Model Runner V2] Refactor get_cudagraph_and_dp_padding (#32625)
WoosukKwon Jan 20, 2026
6c01ffb
[Model Runner V2] Decouple temperature from penalties (#32629)
WoosukKwon Jan 20, 2026
4753f3b
[Model] Use context managers for encoder- and LM-only mode (#32605)
DarkLight1337 Jan 20, 2026
b75e85d
[1/N] Initialize MM components in context managers (A-D) (#32632)
DarkLight1337 Jan 20, 2026
e9c83cd
[Model Runner V2] Skip kernel launch for penalties & logit_bias (#32634)
WoosukKwon Jan 20, 2026
148117e
[Refactor] Make FP8 Linear Ops use kernel abstraction (#27814)
vllmellm Jan 20, 2026
e1a34c3
[2/N] Initialize MM components in context managers (E-H) (#32641)
DarkLight1337 Jan 20, 2026
8be263c
[Core] Cleanup shm based object store on engine shutdown (#32429)
walterbm Jan 20, 2026
7f1bcd1
[3/N] Initialize MM components in context managers (I-L) (#32650)
DarkLight1337 Jan 20, 2026
c4e5bdf
[Bugfix] Fix the fp8_mqa_logits dim mismatch (#32652)
chaunceyjiang Jan 20, 2026
bb91720
[Metrics] Complete removal of deprecated vllm:time_per_output_token_s…
carlory Jan 20, 2026
fda3f03
[4/N] Initialize MM components in context managers (M-P) (#32663)
DarkLight1337 Jan 20, 2026
13f6630
[XPU]Support AgRsAll2AllManager on XPU device (#32654)
ys950902 Jan 20, 2026
7901109
[Bugfix] Fix Off-by-one error in _num_tokens_to_min_blocks calculatio…
lingebeng Jan 20, 2026
4ca62a0
[PluggableLayer][1/N] Define PluggableLayer (#32331)
whx-sjtu Jan 20, 2026
6c97b9b
[Perf] Only clone when needed for `moe_permute` (#32273)
yewentao256 Jan 20, 2026
c025263
[Doc] [ROCm] Update ROCm getting started doc (#32580)
tjtanaa Jan 20, 2026
04a9e06
[Bugfix] fix the ima issue of qwen-vit (#32687)
JJJYmmm Jan 20, 2026
9ab4388
[Model Runner V2] Support FLASHINFER_MLA backend (#32709)
WoosukKwon Jan 20, 2026
09194b9
[Doc] Update docs for MM model development with context usage (#32691)
DarkLight1337 Jan 20, 2026
f0feb1c
Test: added acceptance length tests (#32030)
rahul-tuli Jan 20, 2026
193069d
[5/N] Initialize MM components in context managers (Q-Z) (#32695)
DarkLight1337 Jan 20, 2026
7c5dedc
[AOT compilation] support torch.compile inductor artifacts in VllmCom…
dolpm Jan 20, 2026
86c69dc
[Bugfix] Fix byte fallback handling when using outlines (#31391)
Alnusjaponica Jan 20, 2026
2261340
[Misc] Remove pad_for_cudagraphs from config (#30143)
LucasWilkinson Jan 20, 2026
9b67338
[Bugfix] Suppress log on non-ROCm platform (#32703)
tjtanaa Jan 20, 2026
22375f8
[ROCm][CI] Remove DS async eplb accuracy test from AMD CI (#32717)
micah-wil Jan 20, 2026
d2389c1
fp8 online quant: split out Fp8OnlineLinearMethod (#32189)
vkuzo Jan 20, 2026
c78ee24
Revert "[PluggableLayer][1/N] Define PluggableLayer" (#32725)
robertgshaw2-redhat Jan 21, 2026
7013e9a
OffloadingConnector: Prevent redundant loads (#29087)
orozery Jan 21, 2026
27b81e0
[Bugfix] Fix Granite Vision / Don't use Siglip Pooling Head Nested Mo…
alex-jw-brooks Jan 21, 2026
6f067b1
[Cleanup] Remove unused `KVConnectorModelRunnerMixin` methods (#32077)
njhill Jan 21, 2026
0900ced
Enable Eagle3 speculative decoding for Pixtral (LlavaForConditionalGe…
gopalsarda Jan 21, 2026
7ab80a8
Added qwen3 vision language moe support for speculative decoding (#32…
shanjiaz Jan 21, 2026
b4f64e5
Update FlashMLA (#32491)
LucasWilkinson Jan 21, 2026
27ca95b
[Bugfix] Fix Nemotron-Nano-v2-vlm static resolution (#32682)
netanel-haber Jan 21, 2026
360aa93
[Docs] Fix GitHub handle in governance process (#32582)
pacoxu Jan 21, 2026
f23fb5a
[Bugfix] Support HF sharded weights for Mistral3/Pixtral models (#32673)
ricky-chaoju Jan 21, 2026
c80f92c
[Documentation] Fix typo in `docs/design/torch_compile_multimodal.md`…
Lucaskabela Jan 21, 2026
6bb2bc7
[Bugfix] Force using spawn multiprocess method when it's the WSL plat…
jasonyanwenl Jan 21, 2026
7727ce3
[Model] Add Eagle2.5-8B Vision-Language Model support (#32456)
George-Polya Jan 21, 2026
e14467b
[bugfix] Aria model (#32727)
divakar-amd Jan 21, 2026
42135d6
[MoE Refactor] Oracle Select FP8+NVFP4 Kernels In Priority (#32414)
robertgshaw2-redhat Jan 21, 2026
cea3c75
[Quantization][Deprecation] Remove `DeepSpeedFp8` (#32679)
robertgshaw2-redhat Jan 21, 2026
85f55c9
[Quantization][Deprecation] Deprecate HQQ (#32681)
robertgshaw2-redhat Jan 21, 2026
6c20e89
[ROCm][Deepseekv3.2] Refactor Sparse Indexer as CustomOp (#29287)
ganyi1996ppo Jan 21, 2026
4e31b7f
[Quantization][Deprecation] Remove RTN (#32697)
robertgshaw2-redhat Jan 21, 2026
1861ae8
[PluggableLayer][1/N] Define PluggableLayer (Fix ci) (#32744)
whx-sjtu Jan 21, 2026
808d6fd
Bump Flashinfer to v0.6.1 (#30993)
elvischenv Jan 21, 2026
9b693d0
[Misc] Omit "disable NCCL for DP sync" startup log when not applicabl…
njhill Jan 21, 2026
e1da249
[Model Runner V2] Minor refactor for `compute_slot_mappings` (#32794)
WoosukKwon Jan 21, 2026
f999539
Add missing import of fused_topk to benchmark_moe (#32784)
danisereb Jan 21, 2026
180fba6
[ROCm] fix import for on_gfx9 (#32783)
divakar-amd Jan 21, 2026
24dc30f
[ModelRunner V2] Don't pin reused flashinfer tensors (#32799)
njhill Jan 21, 2026
e675dda
[Misc] Add Helion version check to collect_env (#32797)
gmagogsfm Jan 21, 2026
63227ac
[Kernel] Add topk_sigmoid kernel (#31246)
xyang16 Jan 21, 2026
408195e
[Model Runner V2] Refactor Prompt Logprobs (#32811)
WoosukKwon Jan 21, 2026
5e00b56
[Model Runner V2] Do not error on attention backends (#32820)
WoosukKwon Jan 22, 2026
6437ff1
[Deprecation] Remove deprecated environment variables (#32812)
yewentao256 Jan 22, 2026
c5487e2
[Bugfix] Fix potential EAGLE spec decode segfault during graph captur…
mawong-amd Jan 22, 2026
378385b
[EC Connector] Optimize remote cache check in scheduler (#32585)
knlnguyen1802 Jan 22, 2026
24a163e
Cleanup some huggingface_hub-related stuff (#32788)
Wauplin Jan 22, 2026
a1d8246
[Docs] Remove outdated async_scheduling limitation with speculative d…
ikaadil Jan 22, 2026
49d9653
[ROCm][CI] fix get_valid_backends (#32787)
divakar-amd Jan 22, 2026
889722f
[FlashMLA] Update FlashMLA to expose new arguments (#32810)
LucasWilkinson Jan 22, 2026
1579c9b
[Llama.py -> mistral.py] Extract mistral-only relevant code into sepa…
patrickvonplaten Jan 22, 2026
f5fdec8
Upgrade transformers-4.57.5 (#32287)
huydhn Jan 22, 2026
019e2c3
[ROCm][CI] Lower Acceptance Len Threshold For test_draft_model_quanti…
micah-wil Jan 22, 2026
eb1629d
[ROCm][CI] Fix AITER test flakiness by using explicit attention backe…
AndreasKaratzas Jan 22, 2026
a810299
[ROCm][CI][Docs] Add comment explaining TRITON_ATTN fallback for ROCm…
AndreasKaratzas Jan 22, 2026
1bf1a34
[bench] add start_times field to vllm bench serve json result (#32667)
kebe7jun Jan 22, 2026
2b8a38b
[Model] Extend `collect_children` and `no_init_weights` contexts (#32…
DarkLight1337 Jan 22, 2026
49a1262
[AMD][ROCm] MoRI EP: a high-performance all2all backend (#28664)
alexsun07 Jan 22, 2026
8ebf271
[Misc] Replace urllib's `urlparse` with urllib3's `parse_url` (#32746)
Isotr0py Jan 22, 2026
098b2d6
[Benchmark] Don't default to `temperature==0` in `vllm bench serve` (…
njhill Jan 22, 2026
64e3d67
Enable Cross layers KV cache layout at NIXL Connector (#30207)
liranschour Jan 22, 2026
328cbb2
[Frontend][2/n] Make pooling entrypoints request schema consensus | C…
noooop Jan 22, 2026
ea6102b
[Bugfix] Fix Whisper/encoder-decoder GPU memory leak (#32789)
NickLucche Jan 22, 2026
1752262
[CI] refactor release pipeline config into groups (#32833)
Harry-Chen Jan 22, 2026
841d53a
[Frontend] add prompt_cache_key for openresponses (#32824)
chaunceyjiang Jan 22, 2026
421012b
OffloadingConnector: Support kernel_block_size != block_size (#30692)
orozery Jan 22, 2026
d117a4d
[Frontend] Introduce Renderer for processing chat messages (using `Mo…
DarkLight1337 Jan 22, 2026
15e302d
[Misc][BE] Turn on strict type coverage for vllm/compilation (#31756)
Lucaskabela Jan 22, 2026
654a71f
[torch.compile] Improve Cold Start for MoEs (#32805)
zou3519 Jan 22, 2026
bc14663
[Cleanup] Move scheduler `get_routed_experts` logic to separate metho…
njhill Jan 22, 2026
444e2e7
[Misc] Bump opencv-python dependecy version to 4.13 (#32668)
Isotr0py Jan 22, 2026
ff365ee
Support bge-m3 sparse embeddings and colbert embeddings (#14526)
maxdebayser Jan 22, 2026
fc37187
[Bugfix] ModelScope is supported when downloading LORA models. (#32844)
AuYang261 Jan 22, 2026
c517d8c
[Hardware][AMD][CI][Bugfix] Fix regressions from deprecated env vars …
mawong-amd Jan 22, 2026
70917b1
[MISC] Add .cursor to .gitignore (#32868)
vadiklyutiy Jan 22, 2026
803e3f3
[UX] Default api_server_count to dp_size if not specified (#32525)
tlrmchlsmth Jan 22, 2026
3a63be0
Support custom URI schemes and trace handlers for profiler (#32393)
diviramon Jan 22, 2026
69d09fd
[Feature] Add --ssl-ciphers CLI argument for TLS cipher control (#30937)
ricky-chaoju Jan 22, 2026
300622e
[CI][Attention] Add more CI dependencies for attention tests (#32487)
MatthewBonanni Jan 22, 2026
744ef30
[CPU Backend] [Perf] Accelerate tensor-parallel/data-parallel inferen…
fadara01 Jan 22, 2026
955b43a
[Bugfix][Attention] Explicitly report support for kv_cache_dtype bflo…
MatthewBonanni Jan 22, 2026
44f08af
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head) (…
eldarkurtic Jan 22, 2026
f744810
[Refactor] Remove unused tpu files (#32610)
yewentao256 Jan 22, 2026
d08b356
[Perf] Create TMA-aligned input scale tensor for DeepGemm on Hopper (…
xyang16 Jan 22, 2026
fc56f4a
[BugFix] Fix invalid flashinfer_fused_moe_blockscale_fp8 op registrat…
fadara01 Jan 22, 2026
dc917cc
[MoE Refactor] Move `select_experts` from `FusedMoEQuantMethod` -> `F…
bnellnm Jan 22, 2026
7fe2558
[Misc] Log vLLM logo when starting server (#32796)
njhill Jan 23, 2026
f61c9da
[BugFix] deepseek_v32_encoding: Replace asserts with proper exception…
RishabhSaini Jan 23, 2026
5e4e0e5
[torch.compile] Compile `CustomOp.forward_native` for `SiluAndMul` an…
ProExpertProg Jan 23, 2026
7ef5873
[CI] Fix mypy for `vllm/v1/structured_output` (#32722)
yewentao256 Jan 23, 2026
fa6e599
[Bugfix] Fix _CPU_MOE_ACT AssertionError when vLLM config not set (#3…
karanb192 Jan 23, 2026
a8eb118
[CI][Models] Add VLM Support for Sequence Classification Conversion (…
AndreasKaratzas Jan 23, 2026
160c6fa
[Misc] Add `get_name` to missing AttentionBackends (#32698)
NickLucche Jan 23, 2026
5da4c7d
[CI/Build][CPU] Fix failed pooling tests and macos smoke test (#32907)
bigPYJ1151 Jan 23, 2026
3f3f895
[Voxtral] Add new streaming arch (#32861)
patrickvonplaten Jan 23, 2026
05f3d71
[Frontend][3/n] Make pooling entrypoints request schema consensus | E…
noooop Jan 23, 2026
aac0b81
[CPU Backend][BugFix] Fix failing CPU MoE test (#32876)
fadara01 Jan 23, 2026
243e78c
[Benchmark][Bugfix] Fix race condtion when starting server for sweep …
Isotr0py Jan 23, 2026
10e94c8
[CPU][Feat] Update PyTorch to v2.10 for CPU Backend (#32869)
fadara01 Jan 23, 2026
13d8746
[Feature]: Remove DtoH Copy for lfm2_vl On Default Stream (#32815)
tianshu-Michael-yu Jan 23, 2026
d95d650
[Bugfix] Fix getting vision features in Transformer Multimodal backen…
zucchini-nlp Jan 23, 2026
90c2007
[Bugfix] Disable tma_aligned_scales in test_fusions_e2e (#32916)
xyang16 Jan 23, 2026
7e22309
[Misc] Postpone torch_profiler deprecation (#32867)
NickLucche Jan 23, 2026
1fb648b
[Bugfix] Fix FP8 MoE EP Weight Loading for ModelOpt Llama4 (#32886)
baonudesifeizhai Jan 23, 2026
1cb4341
[ROCm][PD] Remove unused moriio connector proxy code (#32939)
markmc Jan 23, 2026
305e53a
[Hardware][AMD][CI][Bugfix] Fix Kernels Attention Cache test (#32904)
mawong-amd Jan 23, 2026
9b77bb7
[Frontend] add logprob, compression_rate to 'verbose_json' features (…
sangbumlikeagod Jan 23, 2026
bbbd696
[torch.compile][CI] Add back attn fusion on hopper/ada (#32940)
ProExpertProg Jan 23, 2026
fec9da0
[Model] Enable LoRA support for internvl2 (#32397)
MatteoFari Jan 23, 2026
5206e5e
[V1][Hybrid] Mamba Prefix Caching with align mode (#30877)
peakcrosser7 Jan 23, 2026
68b0a6c
[CI][torch nightlies] Use main Dockerfile with flags for nightly torc…
orionr Jan 23, 2026
2d6b537
[Bugfix][CI] Fix pre-commit (#32956)
MatthewBonanni Jan 23, 2026
8518b30
[Model Runner V2] Add KV Connector support (#32742)
njhill Jan 23, 2026
3a41459
[cudagraphs] Refactor cudagraph capture loop (#32946)
LucasWilkinson Jan 23, 2026
586a57a
fix: Add glm4_moe_lite to MLA detection (#32614)
marksverdhei Jan 23, 2026
dfab5f3
[Bug] Fix benchmark script `moe_permute_unpermute` (#32949)
yewentao256 Jan 23, 2026
6cc6d92
[CI][AMD][BugFix] Update wvSplitK (and other skinny_gemm wrappers) to…
rasmith Jan 23, 2026
4561f13
[Refactor] Rename `gptq_marlin` to `marlin` to match MoE (#32952)
mgoin Jan 23, 2026
37c9859
[Refactor] Clean up unused variables & func (#32692)
yewentao256 Jan 23, 2026
ebd0a17
[Bugfix] Fix missing is_layer_skipped check for FusedMoE in AWQConfig…
joninco Jan 23, 2026
136c499
[CI] fix version comparsion and exclusion patterns in upload-release-…
Harry-Chen Jan 23, 2026
0118cdc
[fix] add VLLM_OBJECT_STORAGE_SHM_BUFFER_NAME to compile factors (#32…
dolpm Jan 23, 2026
a28b94e
[Performance] Split FlashAttn attention and cache update (#25954)
ElizaWszola Jan 24, 2026
7e1f10d
[Core][Bugfix] allow graceful worker termination (#32965)
joerunde Jan 24, 2026
ecc3dd6
[Bugfix] Fix FusedMoE LoRA kernel offs_token out of bound value (#32279)
xyang16 Jan 24, 2026
97ef11d
[ROCm][ViT] Enable Flash Attention Triton backend on RDNA3/RDNA4 (#32…
monajafi-amd Jan 24, 2026
c0d8204
Auth_token added in documentation as it is required (#32988)
ruizcrp Jan 24, 2026
d0cbac5
[Dev UX] Add auto-detection for VLLM_PRECOMPILED_WHEEL_VARIANT during…
mgoin Jan 24, 2026
14d03b8
[Perf] Cache xpu_get_mem_info() result to avoid duplicate calls (#32983)
sjhddh Jan 24, 2026
0b9a735
[Tests] Clarify pytest skip reasons with actionable context (#32981)
sjhddh Jan 24, 2026
0ccecf8
[Tests] Standardize RNG seed utility across test files (#32982)
sjhddh Jan 24, 2026
5c86a89
[docs] Update governance process links (#32995)
esmeetu Jan 24, 2026
8edaf38
[Models] Add `SharedFusedMoE` support to Qwen3MoE (#32082)
Isotr0py Jan 24, 2026
81c2a88
[Doc] Ignore typo check on doc (#32999)
ywang96 Jan 24, 2026
06b557e
feat(benchmark): add encoder forward pass benchmarking to mm-processo…
reaganjlee Jan 24, 2026
51931c5
[UX] Deduplicate sampling parameter startup logs (#32953)
DarkLight1337 Jan 24, 2026
0f19427
[Perf] Cache exc.errors() result in validation exception handler (#32…
sjhddh Jan 24, 2026
6450b53
[Bugfix] Fix E2E latency calculation and add warmup support in mm_pro…
HirokenOvo Jan 24, 2026
9ad7f89
[Models]: Make Multimodal config implicit in ViT implementation (#31972)
Isotr0py Jan 24, 2026
bc0d291
feat: Complete LoRA support for MiniMaxM2 Fixes #32736 (#32763)
Chenhao-Guan Jan 24, 2026
5fa0f6e
[EncoderCacheManager] Remove unnecessary copy (#32800)
lgeiger Jan 24, 2026
1209b78
[Bugfix]: resolve torch.compile cache conflict between mm_encoder_tp_…
HirokenOvo Jan 24, 2026
719ac59
Update CPU doc according to feedback (#32963)
louie-tsai Jan 24, 2026
da5e7b1
[MLA] Fuse cat and qaunt for fp8 kv-cache (#32950)
LucasWilkinson Jan 24, 2026
cd775bd
[Tests] Replace flaky sleep with polling in test_background_cancel (#…
sjhddh Jan 24, 2026
17ab54d
[CPU Backend][BugFix] Fix failing Darwin pipelines (#33002)
fadara01 Jan 24, 2026
203d0bc
[CPU] Improve CPU Docker build (#30953)
maryamtahhan Jan 24, 2026
d4dbb7a
Using max_loras + 1 to construct grid in fused_moe_lora (#32277)
yugong333 Jan 24, 2026
91601ff
[Feature] add session based streaming input support to v1 (#28973)
joshuadeng Jan 24, 2026
1ebdff4
[DOC] [ROCm] Update doc for v0.14.1 (#32998)
tjtanaa Jan 25, 2026
fcb9df9
[Perf][Kernel] Optimize FP4 quantization kernels (SM100F) (#32520)
LopezCastroRoberto Jan 25, 2026
ff6c1da
[Docs] Fix Apple silicon include path in CPU installation docs (#32977)
sjhddh Jan 25, 2026
7e67df5
[Bugfix] fix encoder cache hang in Qwen3VL (#32684)
JJJYmmm Jan 25, 2026
73b2434
[BugFix] Add env variable to control PDL in LoRA (#32836)
jeejeelee Jan 25, 2026
151e545
[Doc] Add Qwen2.5 models to batch invariance tested models (#33016)
ZhanqiuHu Jan 25, 2026
a698e8e
[Model] Use mm_position to compute mrope positions for Qwen2.5-Omni (…
Etelis Jan 25, 2026
22aeb43
[Bugfix][VLM] Fix transformers backend embed_multimodal for Qwen2.5-V…
AndreasKaratzas Jan 26, 2026
edf927b
[Model Runner V2] Fix slot_mapping after #25954 (#33046)
WoosukKwon Jan 26, 2026
2f0d3ba
[Model Runner V2] Minor simplification for finish_requests (#33048)
WoosukKwon Jan 26, 2026
566cdb6
[CI] Fix MHA attention test failure (AttributeError when model_config…
LucasWilkinson Jan 26, 2026
105d104
[StepVL] support close img patch (#32923)
ltd0924 Jan 26, 2026
254db42
[Tests] Remove Duplicates (#33032)
robertgshaw2-redhat Jan 26, 2026
a9b53dd
[Model Runner V2] Add LoRAState to consolidate lora logic (#33062)
WoosukKwon Jan 26, 2026
ee484b3
Set splitk=1 for fused-moe-lora expand kernel (#32882)
dcmaddix Jan 26, 2026
11b5568
[Refactor] Use data parser for matching data items to multi-modal UUI…
DarkLight1337 Jan 26, 2026
cf1167e
[Bugfix] Fix Dtypes for Pynccl Wrapper (#33030)
robertgshaw2-redhat Jan 26, 2026
4676ee6
Revert "Expose ASM PA control and enable by default (#843)"
kliuae Jan 28, 2026
c8e6d5b
sync upstream
kliuae Jan 29, 2026
1dfbe1b
fix ptpcfp8 linear
kliuae Jan 30, 2026
9abc435
fix rmsnorm matcher
kliuae Jan 30, 2026
5 changes: 5 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small-rocm.txt
@@ -0,0 +1,5 @@
Qwen2.5-1.5B-Instruct.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Qwen2.5-VL-3B-Instruct-FP8-dynamic.yaml
Qwen1.5-MoE-W4A16-compressed-tensors.yaml
@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on chartqa for vllm.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.9
+# pip install "lm-eval[api]>=0.4.9.2"

 usage() {
 echo``
@@ -2,7 +2,7 @@
 # We can use this script to compute baseline accuracy on GSM for transformers.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"

 usage() {
 echo``
@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"

 usage() {
 echo``
@@ -3,7 +3,7 @@
 # We use this for fp8, which HF does not support.
 #
 # Make sure you have lm-eval-harness installed:
-# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]
+# pip install "lm-eval[api]>=0.4.9.2"

 usage() {
 echo``
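All four baseline scripts now install the harness from PyPI with the `api` extra instead of a pinned git commit. As a rough illustration only (not taken from this diff), the API-backed evaluation these scripts drive from the shell can also be invoked through lm-eval's Python entry point; the model name, endpoint URL, concurrency, and task below are placeholder assumptions:

```python
# Hedged sketch: an lm-eval "local-completions" run against a served vLLM
# OpenAI-compatible endpoint, mirroring what the scripts above do via the CLI.
# The "local-completions" HTTP backend is what the lm-eval[api] extra provides.
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "model=meta-llama/Llama-3.1-8B-Instruct,"       # placeholder model
        "base_url=http://localhost:8000/v1/completions,"  # assumed local server
        "num_concurrent=32"
    ),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```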
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -60,6 +60,7 @@ def launch_lm_eval(eval_config, tp_size):
 f"add_bos_token=true,"
 f"trust_remote_code={trust_remote_code},"
 f"max_model_len={max_model_len},"
+"allow_deprecated_quantization=True,"
 )

 env_vars = eval_config.get("env_vars", None)
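For context, a minimal sketch of how the `model_args` string around this one-line addition plausibly fits together — the surrounding variables and the `simple_evaluate` hand-off are assumptions inferred from the visible f-string fragment, not the full upstream function:

```python
# Hedged reconstruction of launch_lm_eval's model_args assembly; only the
# flag added in this diff (allow_deprecated_quantization) is taken verbatim.
import lm_eval

def launch_lm_eval(eval_config: dict, tp_size: int):
    trust_remote_code = eval_config.get("trust_remote_code", False)
    max_model_len = eval_config.get("max_model_len", 4096)
    model_args = (
        f"pretrained={eval_config['model_name']},"
        f"tensor_parallel_size={tp_size},"
        f"add_bos_token=true,"
        f"trust_remote_code={trust_remote_code},"
        f"max_model_len={max_model_len},"
        # Newly added flag: opt back in to quantization schemes that vLLM
        # now marks deprecated, so CI configs exercising them keep running.
        "allow_deprecated_quantization=True,"
    )
    return lm_eval.simple_evaluate(
        model="vllm",
        model_args=model_args,
        tasks=[t["name"] for t in eval_config["tasks"]],
        batch_size="auto",
    )
```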
26 changes: 7 additions & 19 deletions .buildkite/performance-benchmarks/README.md
@@ -7,7 +7,7 @@ vLLM also maintains a continuous performance benchmark under [perf.vllm.ai](http

 ## Performance benchmark quick overview

-**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors and Intel® Gaudi® 3 Accelerators with different models.
+**Benchmarking Coverage**: latency, throughput and fix-qps serving on B200, A100, H100, Intel® Xeon® Processors, Intel® Gaudi® 3 Accelerators and Arm® Neoverse™ with different models.

 **Benchmarking Duration**: about 1hr.

@@ -23,7 +23,7 @@ bash .buildkite/performance-benchmarks/scripts/run-performance-benchmarks.sh

 Runtime environment variables:

-- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
+- `ON_CPU`: set the value to '1' on Intel® Xeon® and Arm® Neoverse™ Processors. Default value is 0.
 - `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
 - `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
 - `THROUGHPUT_JSON`: JSON file to use for the throughput tests. Default value is empty string (use default file).
@@ -34,8 +34,9 @@ Runtime environment variables:

 See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
 > NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
-For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
+>
+> For Intel® Gaudi® 3 Accelerators, use `tests/latency-tests-hpu.json`, `tests/throughput-tests-hpu.json`, `tests/serving-tests-hpu.json` instead.
+> For Arm® Neoverse™, use `tests/latency-tests-arm64-cpu.json`, `tests/throughput-tests-arm64-cpu.json`, `tests/serving-tests-arm64-cpu.json` instead.

 ### Latency test

 Here is an example of one test inside `latency-tests.json`:
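The example itself is not shown in this excerpt; as a hedged stand-in, one such entry takes roughly the following shape (field names and values are assumptions, expressed as a Python literal rather than copied from the current `tests/latency-tests.json`):

```python
# Hedged sketch of a single latency test entry; keys inferred from the
# documentation above, values are illustrative placeholders.
latency_test = {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}
```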
@@ -175,19 +176,6 @@ If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.

-The `compare-json-results.py` helps to compare benchmark results JSON files converted using `convert-results-json-to-markdown.py`.
-When run, benchmark script generates results under `benchmark/results` folder, along with the `benchmark_results.md` and `benchmark_results.json`.
-`compare-json-results.py` compares two `benchmark_results.json` files and provides performance ratio e.g. for Output Tput, Median TTFT and Median TPOT.
-If only one benchmark_results.json is passed, `compare-json-results.py` compares different TP and PP configurations in the benchmark_results.json instead.
-
-Here is an example using the script to compare result_a and result_b with Model, Dataset name, input/output length, max concurrency and qps.
-`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`
-
-| | Model | Dataset Name | Input Len | Output Len | # of max concurrency | qps | results_a/benchmark_results.json | results_b/benchmark_results.json | perf_ratio |
-|----|---------------------------------------|--------|-----|-----|------|-----|-----------|----------|----------|
-| 0 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | 1 | 142.633982 | 156.526018 | 1.097396 |
-| 1 | meta-llama/Meta-Llama-3.1-8B-Instruct | random | 128 | 128 | 1000 | inf| 241.620334 | 294.018783 | 1.216863 |
+#### Performance Results Comparison

-A comparison diagram will be generated below the table.
-Here is an example to compare between 96c/results_gnr_96c_091_tp2pp3 and 128c/results_gnr_128c_091_tp2pp3
-<img width="1886" height="828" alt="image" src="https://github.com/user-attachments/assets/c02a43ef-25d0-4fd6-90e5-2169a28682dd" />
+Follow the instructions in [performance results comparison](https://docs.vllm.ai/en/latest/benchmarking/dashboard/#performance-results-comparison) to analyze performance results and the sizing guide.