Closed
Changes from all commits
443 commits
f1b286b
[TPU] Update ptxla nightly version to 20250724 (#21555)
Jul 26, 2025
7ae75fa
[Feature] Add support for MoE models in the calibration-free RTN-base…
sakogan Jul 26, 2025
62965de
[Model] Ultravox: Support Llama 4 and Gemma 3 backends (#17818)
farzadab Jul 26, 2025
97349fe
[Docs] add offline serving multi-modal video input expamle Qwen2.5-VL…
david6666666 Jul 26, 2025
a55c950
Correctly kill vLLM processes after finishing serving benchmarks (#21…
huydhn Jul 26, 2025
2f6e6b3
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds…
Mitix-EPI Jul 26, 2025
7728dd7
[TPU][Test] Divide TPU v1 Test into 2 parts. (#21431)
QiliangCui Jul 26, 2025
875af38
Support Intern-S1 (#21628)
lvhan028 Jul 26, 2025
05c1126
[Misc] remove unused try-except in pooling config check (#21618)
reidliu41 Jul 26, 2025
e98def4
[Take 2] Correctly kill vLLM processes after benchmarks (#21646)
huydhn Jul 26, 2025
9d19728
Migrate AriaImagePixelInputs to TensorSchema for shape validation (#2…
bbeckca Jul 26, 2025
de10ff0
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validatio…
bbeckca Jul 26, 2025
f27fdfc
[Bugfix] Investigate Qwen2-VL failing test (#21527)
Isotr0py Jul 26, 2025
1cd6eab
Support encoder-only models without KV-Cache (#21270)
maxdebayser Jul 26, 2025
c215f5c
[Bug] Fix `has_flashinfer_moe` Import Error when it is not installed …
yewentao256 Jul 26, 2025
a40a850
[Misc] Improve memory profiling debug message (#21429)
yeqcharlotte Jul 26, 2025
97d6c30
[BugFix] Fix shared storage connector load kv only load attention lay…
david6666666 Jul 26, 2025
56e544f
[Refactor] Remove `moe_align_block_size_triton` (#21335)
yewentao256 Jul 26, 2025
9094d11
[Bugfix][Apple Silicon] fix missing symbols when build from source on…
zhouyeju Jul 26, 2025
e7c4f9e
[CI/Build][Doc] Move existing benchmark scripts in CI/document/exampl…
yeqcharlotte Jul 26, 2025
de509ae
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscal…
kaixih Jul 26, 2025
6c66f28
Remove xformers requirement for Mistral-format Pixtral and Mistral3 (…
wenchen76 Jul 26, 2025
c657369
support `torch.compile` for bailing moe (#21664)
jinzhen-lin Jul 26, 2025
ccf27cc
Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to Tensor…
bbeckca Jul 27, 2025
0b8caf9
Migrate DeepseekVL2ImageInputs to TensorSchema (#21658)
bbeckca Jul 27, 2025
3339cba
Migrate FuyuImagePatchInputs to TensorSchema (#21662)
bbeckca Jul 27, 2025
20950b2
Migrate ChameleonImagePixelInputs to TensorSchema (#21657)
bbeckca Jul 27, 2025
eed2f46
[VLM] Support HF format Phi-4-MM model (#17121)
Isotr0py Jul 27, 2025
971948b
Handle non-serializable objects in vllm bench (#21665)
huydhn Jul 27, 2025
01a395e
[CI/Build][Doc] Clean up more docs that point to old bench scripts (#…
yeqcharlotte Jul 27, 2025
a8936e5
Refactor: Remove numpy dependency from LoggingStatLogger (#20529)
skyloevil Jul 27, 2025
1cbf951
[Misc] add default value for file pattern arg (#21659)
andyxning Jul 27, 2025
5f8c9a4
Migrate Florence2ImagePixelInputs to TensorSchema (#21663)
bbeckca Jul 27, 2025
3d847a3
[VLM] Add video support for Intern-S1 (#21671)
Isotr0py Jul 27, 2025
bda9d05
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor…
yewentao256 Jul 27, 2025
57c22e5
Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934)
CalebDu Jul 27, 2025
a9b2a1d
[Misc] Refactor vllm config str (#21666)
andyxning Jul 27, 2025
8f605ee
[Attention] Make CutlassMLA the default backend for SM100 (blackwell)…
alexm-redhat Jul 27, 2025
86ae693
[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (…
DarkLight1337 Jul 28, 2025
82acf21
Fix typo for limit-mm-per-prompt in docs (#21697)
joa-stdn Jul 28, 2025
93269bb
Fix GLM tool parser (#21668)
zRzRzRzRzRzRzR Jul 28, 2025
04ff4be
[Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 …
jeejeelee Jul 28, 2025
15a72ac
[V1] Exception Handling when Loading KV Cache from Remote Store (#21534)
liuyumoye Jul 28, 2025
c7ffe93
[Model] Support TP/PP/mamba2 kernel for PLaMo2 (#19674)
Alnusjaponica Jul 28, 2025
e626d28
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel (#21242)
tjtanaa Jul 28, 2025
d8937de
Migrate Gemma3ImagePixelInputs to TensorSchema (#21676)
bbeckca Jul 28, 2025
88e46c7
Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema (#21678)
bbeckca Jul 28, 2025
304dcdf
Migrate GLMVImagePixelInputs to TensorSchema (#21679)
bbeckca Jul 28, 2025
75856bc
Migrate GraniteSpeechAudioInputs to TensorSchema (#21682)
bbeckca Jul 28, 2025
3ea57a5
Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to …
bbeckca Jul 28, 2025
7656cf4
[Bugfix] [issue-21565] Fix the incompatibility issue with stream and …
hsliuustc0106 Jul 28, 2025
18cc33d
[bugfix] fix profile impact benchmark results (#21507)
lengrongfu Jul 28, 2025
139a97e
[Bugfix] Fix shape checking for Fuyu (#21709)
DarkLight1337 Jul 28, 2025
150d9e6
[Bugfix] fix max-file-size type from str to int (#21675)
andyxning Jul 28, 2025
139a7f0
[BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabl…
LucasWilkinson Jul 28, 2025
a6c0502
[v1][mamba] Added mamba_type into MambaSpec (#21715)
Josephasafg Jul 28, 2025
d128d0d
Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema (#21686)
bbeckca Jul 28, 2025
a4ed731
[Model] Prioritize Transformers fallback over suffix matching (#21719)
DarkLight1337 Jul 28, 2025
2cc5711
[feature] add log non default args in LLM (#21680)
lengrongfu Jul 28, 2025
1b769dc
[Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts (#21717)
jeejeelee Jul 28, 2025
65e8466
[Bugfix] Fix environment variable setting in CPU Dockerfile (#21730)
bigPYJ1151 Jul 28, 2025
0ae970e
[Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme (#21744)
Isotr0py Jul 28, 2025
63fe3a7
[PD] let p2p nccl toy proxy handle /chat/completions (#21734)
chaunceyjiang Jul 28, 2025
656c24f
[`Ernie 4.5`] Name Change for Base 0.3B Model (#21735)
vasqu Jul 28, 2025
9ace2ea
[Bugfix] Improve JSON extraction in LlamaToolParser (#19024)
key4ng Jul 28, 2025
1395dd9
[Docs] Add revision date to rendered docs (#21752)
hmellor Jul 28, 2025
bccc43c
[Bugfix]check health for engine core process exiting unexpectedly (#2…
wuhang2014 Jul 28, 2025
31084b3
[Bugfix][CI/Build] Update peft version in test requirement (#21729)
Isotr0py Jul 28, 2025
34a20c4
[Logs] Change flashinfer sampler logs to once (#21759)
mgoin Jul 28, 2025
0e18a5d
[Misc] Reduce logs for model resolution (#21765)
DarkLight1337 Jul 28, 2025
25708d3
[Bugfix] Mistral crashes on tool with no description (#21167)
HugoMichard Jul 28, 2025
04fe61a
[CI/Build] Fix plugin tests (#21758)
DarkLight1337 Jul 28, 2025
ec261b0
[XPU] IPEX-optimized Punica Wrapper on XPU (#21703)
chaojun-zhang Jul 28, 2025
e17a4d3
[Bugfix] Fix granite speech shape validation (#21762)
DarkLight1337 Jul 28, 2025
7d44c69
[P/D] Log warnings related to prefill KV expiry (#21753)
njhill Jul 28, 2025
94b71ae
Use `metavar` to list the choices for a CLI arg when custom values ar…
hmellor Jul 28, 2025
01c753e
update flashinfer to v0.2.9rc2 (#21701)
weireweire Jul 28, 2025
b361f14
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes …
rasmith Jul 28, 2025
e0e58f9
[Bug] Enforce contiguous input for `dynamic_scaled_fp8_quant` and `st…
yewentao256 Jul 28, 2025
9ba1c88
[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol e…
houseroad Jul 28, 2025
b18b417
Revert "[V1] Exception Handling when Loading KV Cache from Remote Sto…
KuntaiDu Jul 28, 2025
c6f36cf
[Bugfix] DeepGEMM is not enabled on B200 due to `_lazy_init()` (#21472)
smarterclayton Jul 28, 2025
89ac266
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels (#17…
nikhil-arm Jul 28, 2025
8aa1485
[Perf] Disable chunked local attention by default with llama4 (#21761)
LucasWilkinson Jul 28, 2025
c6c9122
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuni…
LyrisZhong Jul 28, 2025
947e982
[Docs] Minimize spacing for supported_hardware.md table (#21779)
mgoin Jul 29, 2025
48b763d
[Refactor] Merge Compressed Tensor FP8 `CompressedTensorsW8A8Fp8MoEMe…
yewentao256 Jul 29, 2025
afa2607
[CI] Parallelize Kernels MoE Test (#21764)
mgoin Jul 29, 2025
e18f085
skip fusedmoe layer for start_load_kv (#21378)
calvin0327 Jul 29, 2025
12a223e
[AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef RO…
gshtras Jul 29, 2025
f1e2c09
Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema (…
bbeckca Jul 29, 2025
7234fe2
[Misc] Rework process titles (#21780)
njhill Jul 29, 2025
a248025
[Doc] Link to RFC for pooling optimizations (#21806)
DarkLight1337 Jul 29, 2025
a4528f0
[Model]: Fused MoE for nomic-embed-text-v2-moe (#18321)
Isotr0py Jul 29, 2025
37efc63
[V0 deprecation] Guided decoding (#21347)
rzabarazesh Jul 29, 2025
61a6905
[Model] Refactor JambaForCausalLM (#21394)
jeejeelee Jul 29, 2025
2470419
[Docs] Fix the outdated URL for installing from vLLM binaries (#21523)
yankay Jul 29, 2025
755fa8b
[KVCache] Make KVCacheSpec hashable (#21791)
heheda12345 Jul 29, 2025
ab71413
[Doc] Update compatibility matrix for pooling and multimodal models (…
DarkLight1337 Jul 29, 2025
04e3850
[Bugfix] VLLM_V1 supports passing other compilation levels (#19340)
zou3519 Jul 29, 2025
f693b06
[Docs] Merge design docs for a V1 only future (#21832)
hmellor Jul 29, 2025
759b87e
[TPU] Add an optimization doc on TPU (#21155)
bvrockwell Jul 29, 2025
ad341c5
[Bugfix]fix mixed bits and visual language model quantization in Auto…
wenhuach21 Jul 29, 2025
58b11b2
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backen…
elvischenv Jul 29, 2025
37f86d9
[Docs] use `uv` in GPU installation docs (#20277)
davidxia Jul 29, 2025
f03e9cf
[Doc] Add FusedMoE Modular Kernel Documentation (#21623)
varun-sundar-rabindranath Jul 29, 2025
7b49cb1
[Doc] update Contributing page's testing section (#18272)
davidxia Jul 29, 2025
a33ea28
Add `flashinfer_python` to CUDA wheel requirements (#21389)
mgoin Jul 29, 2025
a1873db
docker: docker-aware precompiled wheel support (#21127)
dougbtv Jul 29, 2025
176bbce
Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of …
gshtras Jul 29, 2025
9266d98
[BugFix] Fix interleaved sliding window not set for Gemma3n (#21863)
sarckk Jul 29, 2025
0d0cc9e
[ci] add b200 test placeholder (#21866)
simon-mo Jul 30, 2025
452b2a3
[ci] mark blackwell test optional for now (#21878)
simon-mo Jul 30, 2025
0e36abf
[Bugfix] Correct max tokens for non-contiguous embeds (#21798)
milesial Jul 30, 2025
555e722
[v1][attention] Support Hybrid Allocator + FlashInfer (#21412)
heheda12345 Jul 30, 2025
ba5c5e5
[Docs] Switch to better markdown linting pre-commit hook (#21851)
hmellor Jul 30, 2025
76080cf
[DOC] Fix path of v1 related figures (#21868)
heheda12345 Jul 30, 2025
fb58e3a
[Docs] Update docker.md with HF_TOKEN, new model, and podman fix (#21…
mgoin Jul 30, 2025
b917da4
Expose PyTorch profiler configuration to environment variables (#21803)
Csrayz Jul 30, 2025
fdde182
[Bugfix] Fix shape mismatch assertion error when loading Gemma3n mode…
sydarb Jul 30, 2025
b7b23da
[Bugfix] Fix comment typo of get_num_common_prefix_blocks() (#21827)
MingzhenHan Jul 30, 2025
44bc46d
[Bugfix] Actually disable processing cache when API server is scaled …
DarkLight1337 Jul 30, 2025
1b0a155
[Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_gr…
yewentao256 Jul 30, 2025
65f311c
[Frontend] Add LLM.reward specific to reward models (#21720)
noooop Jul 30, 2025
05cbbe2
[XPU] use `ZE_AFFINITY_MASK` for device select on xpu (#21815)
jikunshang Jul 30, 2025
e3bc17c
Add @sighingnow as maintainer of qwen's related files. (#21895)
sighingnow Jul 30, 2025
16f3250
[CI/Build] Fix pre-commit failure in docs (#21897)
DarkLight1337 Jul 30, 2025
4cd7fe6
[Docs] Expand introduction to Ray in Multi-node deployment section (#…
crypdick Jul 30, 2025
6f8d261
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release (#21486)
louie-tsai Jul 30, 2025
2ca5f82
[Misc] Remove redundant config definitions (#21891)
DarkLight1337 Jul 30, 2025
02f82fe
[Doc] Update Intern-S1 info (#21908)
jeejeelee Jul 30, 2025
30ef30e
[CI] rollback lint-and-deploy pipeline using amd machine (#21912)
kebe7jun Jul 30, 2025
5477952
[Tests] Fixing bug inside MultiModalProfiler. (#21842)
shenoyvvarun Jul 30, 2025
fc91da5
[Model] Remove DSV2 unused code (#21903)
jeejeelee Jul 30, 2025
533db09
[benchmark] add max-concurrency in result table (#21095)
panpan0000 Jul 30, 2025
5bbaf49
[Doc] Update partial support (#21916)
DarkLight1337 Jul 30, 2025
5c8fe38
[Docs] Fix the example code of streaming chat completions in reasonin…
hsliuustc0106 Jul 30, 2025
1398636
Add @patrickvonplaten as maintainer of mistral's related files. (#21928)
patrickvonplaten Jul 30, 2025
b876860
[Hardware][CPU] Build fix for ARM without BF16 (#21848)
ericcurtin Jul 30, 2025
d979dd6
[Feature][EPLB] Add eplb support for Qwen3 (#20815)
aladerran Jul 30, 2025
fcfd1eb
[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)
DarkLight1337 Jul 30, 2025
da3e0bd
[Bugfix] we should use metavar is not choices (#21902)
lengrongfu Jul 30, 2025
bf668b5
[Feature] Support multiple api keys in server (#18548)
Yanpas Jul 30, 2025
e91d3c9
[misc] skip p2p check by default (#21904)
youkaichao Jul 30, 2025
0271c2f
[Test] Add Benchmark and Unit Test for `per_token_group_quant` (#21860)
yewentao256 Jul 30, 2025
0e40b26
[CI/Build] Only run markdownlint in CI (#21892)
DarkLight1337 Jul 30, 2025
36ede45
Reduce time wasted in GitHub Actions using `concurrency` (#21919)
hmellor Jul 30, 2025
8f4a1c9
[Misc] Improve code readability of KVCacheManager (#21673)
tanruixiang Jul 30, 2025
ff08e51
[NVIDIA] Fix Llama4 Scout FP4 functionality issues (#21499)
nvpohanh Jul 30, 2025
88edf59
[Docs] Reduce the size of the built docs (#21920)
hmellor Jul 30, 2025
6e599ee
[Bugfix] Fix OOM tests in initialization test (#21921)
Isotr0py Jul 30, 2025
366f6b3
[Bugfix] Fix multi-api server not working for text models (#21933)
DarkLight1337 Jul 30, 2025
ad51030
Override attention metadata for fast prefill in some KV sharing setup…
sarckk Jul 30, 2025
5c765ae
[Bugfix] Fix TypeError in scheduler when comparing mixed request_id t…
chi2liu Jul 30, 2025
004203e
[CI/Build] Fix registry tests (#21934)
DarkLight1337 Jul 30, 2025
4904e53
[Bugfix] SharedStorage Connector for V1 PD multimodal (#21611)
fake0fan Jul 30, 2025
f413523
feat(distributed): add `get_required_kvcache_layout` class method to …
wxsms Jul 30, 2025
8f0d516
[TPU] Support Pathways in vLLM (#21417)
wenxindongwork Jul 30, 2025
56bd537
[Misc] Support more collective_rpc return types (#21845)
njhill Jul 30, 2025
b9b753e
For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted…
dougbtv Jul 30, 2025
f12d925
[Misc] Use dracut on CentOS and skip clone if repo exists for EP kern…
minosfuture Jul 30, 2025
287f527
[Feature] Add async tensor parallelism for scaled mm (#20155)
cascade812 Jul 30, 2025
601f856
[Bugfix] Fix None value handling in trace span creation for cancelled…
br4mm Jul 30, 2025
ca9e2be
[Core] Move EngineCoreRequest to Request conversion out of EngineCore…
linzebing Jul 30, 2025
9cb497b
[Example] Add `async_llm_streaming.py` example for AsyncLLM streaming…
mgoin Jul 31, 2025
ec02e53
[Bugfix] Relax lang pin for voxtral (#21833)
sanchit-gandhi Jul 31, 2025
6144545
[UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (#21966)
mgoin Jul 31, 2025
0f7919f
[Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels …
jeejeelee Jul 31, 2025
055bd39
[CI Bugfix] Fix CI OOM for `test_shared_storage_connector_hashes` (#2…
mgoin Jul 31, 2025
3e36fcb
[Bugfix]: fix metadata file copy in test_sharded_state_loader (#21830)
andyxning Jul 31, 2025
9532a6d
[Deprecation] Remove deprecated args and methods (#21907)
DarkLight1337 Jul 31, 2025
d2aab33
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES (#21599)
dtrifiro Jul 31, 2025
2836dd7
[Model][CI] Let more pooling models support v1 (#21747)
noooop Jul 31, 2025
5daffe7
[BugFix] Fix case where `collective_rpc` returns `None` (#22006)
njhill Jul 31, 2025
207b750
[NVIDIA] Add SM100 Flashinfer MoE per tensor scale fp8 backend (#21458)
amirkl94 Jul 31, 2025
9484641
[Model] Add step3 vl (#21998)
Oliver-ss Jul 31, 2025
7349d52
[ez] Remove a trailing space from compilation/decorators.py (#22028)
zhxchen17 Jul 31, 2025
58bb902
fix(setup): improve precompiled wheel setup for Docker builds (#22025)
dougbtv Jul 31, 2025
0780bb5
Removing amdproduction Tests (#22027)
Alexei-V-Ivanov-AMD Jul 31, 2025
53c21e4
Update torch_xla pin to 20250730 (#21956)
vanbasten23 Jul 31, 2025
9e0726e
[Meta] Official Eagle mm support, first enablement on llama4 (#20788)
morgendave Jul 31, 2025
71470bc
[Misc] Add unit tests for chunked local attention (#21692)
sarckk Jul 31, 2025
2dff2e2
[Bugfix] Fix MTP weight loading (#21941)
benchislett Jul 31, 2025
6e672da
Add FlashInfer allreduce RMSNorm Quant fusion (#21069)
ilmarkov Jul 31, 2025
c3e0e93
[Feature] Add Flashinfer MoE Support for Compressed Tensor NVFP4 (#21…
yewentao256 Jul 31, 2025
e360316
Add DeepGEMM to Dockerfile in vllm-base image (#21533)
MatthewBonanni Aug 1, 2025
0bd409c
Move flashinfer-python to optional extra `vllm[flashinfer]` (#21959)
mgoin Aug 1, 2025
3700642
[Refactor] Remove Duplicate `per_block_cast_to_fp8`, Remove Dependenc…
yewentao256 Aug 1, 2025
ad57f23
[Bugfix] Fix: Fix multi loras with tp >=2 and LRU cache (#20873)
charent Aug 1, 2025
82de9b9
[Misc] Automatically resolve HF processor init kwargs (#22005)
DarkLight1337 Aug 1, 2025
e1a7fe4
[BugFix] fix: aot passes kvcache dtype information (#19750)
mickaelseznec Aug 1, 2025
0f46a78
[Model] [Quantization] Support quantization for Gemma3n (#21974)
kylesayrs Aug 1, 2025
61dcc28
[Doc] Add Voxtral to Supported Models page (#22059)
DarkLight1337 Aug 1, 2025
53d7c39
Update sampling_metadata.py (#21937)
Aviadr-neureality Aug 1, 2025
79731a7
[Doc] Fix a syntax error of example code in structured_outputs.md (#2…
hsliuustc0106 Aug 1, 2025
b4e081c
[Bugfix] Disable multi-modal preprocessor cache for DP (#21896)
DarkLight1337 Aug 1, 2025
e0f63e4
[Core] Avoid repeated len(block_token_ids) check in hash_request_toke…
linzebing Aug 1, 2025
98df153
[Frontend] Align tool_choice="required" behavior with OpenAI when too…
n0gu-furiosa Aug 1, 2025
da31f6a
Revert precompile wheel changes (#22055)
simon-mo Aug 1, 2025
27a145e
[Doc] Add example for Step3-VL (#22061)
Aug 1, 2025
e6680f9
[Bugfix] Add log prefix in non-dp mode engine core (#21889)
wuhang2014 Aug 1, 2025
0f81b31
[Misc] Remove upper bound in openai package version (#22060)
WoosukKwon Aug 1, 2025
4931486
[Doc] Added warning of speculating with draft model (#22047)
david6666666 Aug 1, 2025
28b18cc
[Quantization] Enable BNB support for InternS1 (#21953)
jeejeelee Aug 1, 2025
87c94bc
Revert "Update sampling_metadata.py (#21937)" (#22088)
hmellor Aug 1, 2025
dfbc1f8
[Speculative Decoding] Add `speculators` config support (#21345)
dsikka Aug 1, 2025
26b5f7b
[BUG] [ROCm] Fix import bug on ROCm (#22083)
tjtanaa Aug 1, 2025
fb0e0d4
Fix `get_kwargs` for case where type hint is `list[Union[str, type]]`…
hmellor Aug 1, 2025
f81c1bb
[Bugfix] Check NVIDIA artifactory is accessible before using flashinf…
mgoin Aug 1, 2025
0a6d305
feat(multimodal): Add customizable background color for RGBA to RGB c…
ahengljh Aug 1, 2025
5c54d97
[Bugfix][PD] set max_completion_tokens=1 if req has this value (#21841)
Abirdcfly Aug 1, 2025
a59cd9d
[Refactor] Fix Compile Warning #1444-D (#21462)
yewentao256 Aug 1, 2025
8026a33
[BugFix] Update AttnFusionPass cache key (#21947)
zou3519 Aug 1, 2025
3146519
[BugFix] Don't change title of top-level process (#22032)
njhill Aug 1, 2025
97608dc
[Docs] use `uv` in CPU installation docs (#22089)
davidxia Aug 1, 2025
2d7b09b
Deprecate `--disable-log-requests` and replace with `--enable-log-req…
hmellor Aug 1, 2025
326a1b0
Improve documentation of `ModelConfig.try_get_generation_config` to p…
hmellor Aug 1, 2025
3f8e952
[Bugfix] Fix glm4.1v video inference issue (#22067)
Isotr0py Aug 1, 2025
b879ecd
[Bugfix] fix when skip tokenizer init (#21922)
lengrongfu Aug 1, 2025
d666466
security policy: take 1 (#21119)
sidhpurwala-huzaifa Aug 1, 2025
ac45c44
[Bugfix] [Performance] DeepEPHighThroughput + DeepSeek : Quant before…
varun-sundar-rabindranath Aug 1, 2025
38c8bce
Enable headless models for pooling in the Transformers backend (#21767)
hmellor Aug 1, 2025
8d70599
[Misc] Minor enhancement of benchmark_moe (#22068)
jeejeelee Aug 1, 2025
3277e8f
Fix pre-commit failure for SECURTIY.md (#22102)
mgoin Aug 1, 2025
9659bc7
[compile][startup] Disable C++ compilation of symbolic shapes (#20836)
anijain2305 Aug 1, 2025
d331759
Introduce RayPPCommunicator for ray-based PP (#21660)
ruisearch42 Aug 1, 2025
d84b97a
Add lora test for tp>1 case for TPU. (#21970)
vanbasten23 Aug 1, 2025
881e1af
[BugFix] Harden distributed DP startup (#21538)
njhill Aug 1, 2025
88faa46
[CI] Initial tests for SM100 Blackwell runner (#21877)
mgoin Aug 1, 2025
eefbf4a
[Perf] Optimize `reshape_and_cache_flash` CUDA Kernel (#22036)
yewentao256 Aug 1, 2025
3654847
feat: Add Support GPTQ Quantization MOE on ROCM vllm serve (#21733)
JartX Aug 2, 2025
2332243
[V1][CUDA] Full cudagraph support for FlashInfer (#21367)
fhl2000 Aug 2, 2025
ee2eb6e
[Model] Qwen2.5 VL SiLU-and-Mul (#22066)
vllmellm Aug 2, 2025
5739371
[Misc] `VLLM_TARGET_DEVICE.lower()` (#22101)
NickLucche Aug 2, 2025
a65f46b
[Misc] DeepGemmExperts : Avoid JIT generation in the hot-path (#21955)
varun-sundar-rabindranath Aug 2, 2025
9f9c38c
[Speculators][Speculative Decoding] Add Qwen Eagle3 Support (#21835)
dsikka Aug 2, 2025
8d524ce
[BugFix] Improve internal DP load balancing (#21617)
njhill Aug 2, 2025
6e8d8c4
[Test] Add Unit Test for Batched DeepGEMM (#21559)
yewentao256 Aug 2, 2025
0edaf75
[Attention][DBO] Add support for "splitting" the CommonAttentionMetad…
SageMoore Aug 2, 2025
d3a6f21
[FEAT][ROCm] Enable running Flash Attention as ViT attn backend for Q…
vllmellm Aug 2, 2025
4ac8437
[Misc] Getting and passing ray runtime_env to workers (#22040)
ruisearch42 Aug 2, 2025
8564dc9
Fix test_kv_sharing_fast_prefill flakiness (#22038)
sarckk Aug 2, 2025
c64861d
[Bugfix] Mamba2 remove bugged initial state condition in chunk scan (…
cyang49 Aug 2, 2025
067c34a
docs: remove deprecated disable-log-requests flag (#22113)
Aug 2, 2025
58eee5f
[PERF] Use faster way of decode in tokenizer: avoid useless list-to-l…
vadiklyutiy Aug 2, 2025
25373b6
for glm-4.1V update (#22000)
zRzRzRzRzRzRzR Aug 2, 2025
b690e34
[Model] Mamba2 preallocate SSM output tensor to avoid d2d copy overhe…
cyang49 Aug 2, 2025
f5d0f47
[Frontend] Improve error message for too many mm items (#22114)
DarkLight1337 Aug 2, 2025
4abfd87
[V1] [Hybrid] Validate compatibility of attention backend batch reord…
tdoublep Aug 2, 2025
fa7cef7
feat(compilation): add VLLM_COMPILE_DEPYF env var to control decompil…
vincentzed Jul 31, 2025
16 changes: 10 additions & 6 deletions .buildkite/nightly-benchmarks/README.md
@@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
## Trigger the benchmark

Performance benchmark will be triggered when:

- A PR is merged into vLLM.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

@@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```

Runtime environment variables:

- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
@@ -46,12 +48,14 @@ Runtime environment variables:
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.

Nightly benchmark will be triggered when:

- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.

### Latency test

Here is an example of one test inside `latency-tests.json`:
@@ -74,21 +78,21 @@ Here is an example of one test inside `latency-tests.json`:
In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
- The `parameters` attribute controls the command line arguments used for `vllm bench latency`. Please use an underscore `_` instead of a dash `-` when specifying the command line arguments; `run-performance-benchmarks.sh` will convert the underscores to dashes when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script saves JSON results by itself, so please do not set the `--output-json` parameter in the JSON file.
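
Reconstructed from the argument list above (the example JSON itself is collapsed in this diff view), a single entry in `latency-tests.json` might look like the following sketch; the `test_name` value here is illustrative:

```json
[
    {
        "test_name": "latency_llama8B_tp1",
        "parameters": {
            "model": "meta-llama/Meta-Llama-3-8B",
            "tensor_parallel_size": 1,
            "load_format": "dummy",
            "num_iters_warmup": 5,
            "num_iters": 15
        }
    }
]
```

`run-performance-benchmarks.sh` converts each underscored key to the corresponding dashed CLI flag when invoking `vllm bench latency`.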

### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are forwarded to `vllm bench throughput`.

The number from this test is also stable; even a slight change in this number can indicate a large change in performance.

### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
@@ -100,7 +104,6 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
"tensor_parallel_size": 1,
"swap_space": 16,
"disable_log_stats": "",
"disable_log_requests": "",
"load_format": "dummy"
},
"client_parameters": {
@@ -118,8 +121,8 @@ Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for vLLM server.
- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
- The `client-parameters` includes the command line arguments for `vllm bench serve`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`

The number from this test is less stable than those of the latency and throughput benchmarks (due to randomized ShareGPT dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still indicates a real difference.

@@ -149,6 +152,7 @@ Here is an example using the script to compare result_a and result_b without det

Here is an example using the script to compare result_a and result_b with detailed test names.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
21 changes: 11 additions & 10 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -1,3 +1,4 @@
# Nightly benchmark annotation

## Description

@@ -13,15 +14,15 @@ Please download the visualization scripts in the post

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.
34 changes: 17 additions & 17 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -13,25 +13,25 @@ Latest reproduction guide: [github issue link](https://github.com/vllm-project/
## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- 8x Nvidia A100 GPUs
- Workload:
  - Dataset
    - ShareGPT dataset
    - Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
    - Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
    - Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
  - Models: llama-3 8B, llama-3 70B.
    - We do not use llama 3.1 as it is incompatible with trt-llm r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
  - Average QPS (queries per second): 2, 4, 8, 16, 32 and inf.
    - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, all with a fixed random seed.
  - Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).

## Known issues

@@ -1,3 +1,4 @@
# Performance benchmarks descriptions

## Latency tests

@@ -44,6 +44,7 @@
"test_name": "Test name",
"gpu_type": "GPU",
"completed": "# of req.",
"max_concurrency": "Max concurrency",
"request_throughput": "Tput (req/s)",
"total_token_throughput": "Total Token Tput (tok/s)",
"output_throughput": "Output Tput (tok/s)",
@@ -100,7 +101,7 @@ def get_size_with_unit(bytes, suffix="B"):
raw_result = json.loads(f.read())

if "serving" in str(test_file):
# this result is generated via the `vllm bench serve` command

# attach the benchmarking command to raw_result
try:
@@ -120,7 +121,7 @@ def get_size_with_unit(bytes, suffix="B"):
continue

elif "latency" in f.name:
# this result is generated via the `vllm bench latency` command

# attach the benchmarking command to raw_result
try:
@@ -148,7 +149,7 @@ def get_size_with_unit(bytes, suffix="B"):
continue

elif "throughput" in f.name:
# this result is generated via the `vllm bench throughput` command

# attach the benchmarking command to raw_result
try:
40 changes: 21 additions & 19 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -73,7 +73,7 @@ get_current_llm_serving_engine() {
echo "Container: vllm"
# move to a completely irrelevant directory, to avoid import vllm from current folder
export CURRENT_LLM_SERVING_ENGINE=vllm

return
fi
}
Expand All @@ -95,12 +95,14 @@ json2args() {
}
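The script's `json2args` helper (its jq body is elided in this diff) turns a JSON object of test parameters into command-line flags. A rough Python equivalent of the idea, for illustration only — the real helper is implemented with `jq` and its exact flag formatting may differ:

```python
import json


def json_to_args(params_json: str) -> str:
    """Convert a JSON object of parameters into CLI flags.

    Underscores in keys become dashes, mirroring the typical
    convention for benchmark CLI options.
    """
    params = json.loads(params_json)
    parts = []
    for key, value in params.items():
        parts.append(f"--{key.replace('_', '-')} {value}")
    return " ".join(parts)
```

For example, `json_to_args('{"max_model_len": 4096}')` yields `--max-model-len 4096`.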

kill_gpu_processes() {
pkill -f '[p]ython'
pkill -f '[p]ython3'
pkill -f '[t]ritonserver'
pkill -f '[p]t_main_thread'
pkill -f '[t]ext-generation'
pkill -f '[l]mdeploy'
# vLLM now names its processes with a VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pkill -f '[V]LLM'

while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1
@@ -125,7 +127,7 @@ ensure_installed() {
}

run_serving_tests() {
# run serving tests using the `vllm bench serve` command
# $1: a json file specifying serving test cases

local serving_test_file
@@ -225,7 +227,7 @@ run_serving_tests() {

if [[ "$dataset_name" = "sharegpt" ]]; then

client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@@ -246,7 +248,7 @@
sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')

client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
Expand All @@ -265,13 +267,13 @@ run_serving_tests() {
$client_args"

else

echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
exit 1

fi



echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
@@ -302,7 +304,7 @@ run_serving_tests() {
}

run_genai_perf_tests() {
# run genai-perf tests

# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
@@ -311,14 +313,14 @@ run_genai_perf_tests() {
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')

# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi

# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

@@ -369,10 +371,10 @@ run_genai_perf_tests() {
qps=$num_prompts
echo "now qps is $qps"
fi

new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE

if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
@@ -413,7 +415,7 @@ prepare_dataset() {
do
cat sonnet.txt >> sonnet_4x.txt
done

}

main() {