232 commits
9fb66ff
feat: bf16 x mxfp4 cutlass fused moe for hopper
djmmoss Aug 18, 2025
07e376d
[CI][Entrypoints]: add filter to generation to filter out invalid too…
wseaton Aug 14, 2025
3f7d7ac
[CI] Fix `tests/distributed/test_ca_buffer_sharing.py` (#22849)
ilmarkov Aug 14, 2025
eb4cfac
[CI] remove flaky v0 test (#22864)
robertgshaw2-redhat Aug 14, 2025
4e78f74
vLLM Benchmark suite improvement (#22119)
louie-tsai Aug 14, 2025
62df10f
[Bugfix] Fix `PixtralHFImagePixelInputs` dynamic shape check (#22827)
Isotr0py Aug 14, 2025
275a334
[BugFix] Threadsafe close async zmq sockets (#22877)
njhill Aug 14, 2025
95ba0df
Remove Phi 4 Flash configuration workaround (#22723)
hmellor Aug 14, 2025
70bcc4b
[Bugfix] Add reset prefix cache for online serving (#22726)
iAmir97 Aug 14, 2025
52729d5
[Doc] fix dead link (#22898)
dtrifiro Aug 14, 2025
e396d2f
[CI] Re-enable transcriptions `test_long_audio_request` (#22890)
NickLucche Aug 14, 2025
16ef143
[Perf] Dont create unnecessary pooling params (#22876)
LucasWilkinson Aug 14, 2025
f611b92
[Model] Modify the gate implementation of glm4_moe (#22832)
jeejeelee Aug 14, 2025
a219618
[Bugfix] Replace custom Encoding class with BatchEncoding in MistralT…
ZJY0516 Aug 14, 2025
65ad494
[Bugfix] Fix parsing of `--disable-mm-preprocessor-cache` (#22909)
DarkLight1337 Aug 14, 2025
72fda97
[CI] [Hybrid] Bump min transformers version for Bamba and Jamba (#22…
tdoublep Aug 14, 2025
c446fb4
[Kernel] [Quantization] Add MXFP4 and bias support for marlin kernel …
jinzhen-lin Aug 14, 2025
adb7678
docs: update fastsafetensors usage instructions (#22891)
NirLevy98 Aug 14, 2025
d24c6bb
[CI] Temporarily disable flaky test (#22930)
LucasWilkinson Aug 14, 2025
79e4f5a
[Kernel] Add nvfp4 gemm flashinfer backends (#22346)
nvjullin Aug 14, 2025
3d2b6c3
[Quantization]: Support compressed-tensors mixed-precision model load…
dsikka Aug 14, 2025
ecf34ca
[Core] Return final response for aborted requests from `AsyncLLM.gene…
njhill Aug 14, 2025
e27479b
[BugFix] Fix initial DP request load imbalance (#22910)
njhill Aug 14, 2025
b1ae1e2
[Bugfix] use flash attn on sm90 (#22933)
zyongye Aug 14, 2025
b17cb00
[Kernel] Add cuda kernel for gpt_oss activation (#22538)
jeejeelee Aug 15, 2025
3c2693d
Revert "[Kernel] Add cuda kernel for gpt_oss activation" (#22948)
simon-mo Aug 15, 2025
496e3fe
[BugFix][KVConn] Fix use of `get_required_kvcache_layout` (#22734)
njhill Aug 15, 2025
465686c
[BugFix] Fix port lookup in internal DP LB tests (#22252)
njhill Aug 15, 2025
6caa9f2
[CI Perf] Prune tests in `tests/kernels/quantization/` (#22942)
mgoin Aug 15, 2025
5fd03f5
[CI Perf] Prune tests in `tests/kernels/moe/` (#22939)
mgoin Aug 15, 2025
b725016
[CI Perf] Prune tests in `tests/kernels/attention/` (#22936)
mgoin Aug 15, 2025
04e9109
refactor: Change scaling factors calculation for flashinfer FusedMoE …
amirkl94 Aug 15, 2025
2e72687
[Feature] Full Cuda Graph Support for Cutlass MLA and 6% E2E Throughp…
yewentao256 Aug 15, 2025
93ff0c3
[Mamba] - refactor: Renamed mamba_attn to mamba2_attn (#22818)
Josephasafg Aug 15, 2025
0bfdd80
Revert "[ROCm][AITER] Support AITER Rope ops in RotaryEmbedding Modul…
tjtanaa Aug 15, 2025
581e0c0
[P/D]Provide bucket algorithm rate limiter for proxy_server (#22643)
frankie-ys Aug 15, 2025
041fa23
[CI] Pooling models mteb test uses enforce_eager (#22878)
noooop Aug 15, 2025
aa2eb6a
[V1] - Split Prefill and Decode for Mamba1 models (#22653)
amirai21 Aug 15, 2025
38c8f87
[Bugfix] Unquote file uri before reading image (#22912)
sayandipdutta Aug 15, 2025
0ce4673
[Bugfix] fix cuda 12.6 and 11.8 build (#22952)
jinzhen-lin Aug 15, 2025
8120bd7
[MM] Allow skipping memory profiling for multimodal models. (#22950)
Aug 15, 2025
11652f2
Improve multimodal hasher performance for re-used Image prompts (#22825)
p88h Aug 15, 2025
a857d8d
[V1] [Hybrid] Support using float32 for state in Hybrid Models (Mamba…
tdoublep Aug 15, 2025
76c2fa8
[Misc] Ignore ep_kernels_workspace (#22807)
jeejeelee Aug 15, 2025
af610df
[CI] Remove duplicated docs build from buildkite (#22924)
hmellor Aug 15, 2025
5904082
[Frontend] Expose do_log_stats interval to env (#22905)
Csrayz Aug 15, 2025
5f48728
[Core] Allow full cudagraph with separate attention routines and orth…
fhl2000 Aug 15, 2025
50c1a08
[V0 Deprecation] Remove advance_step (#22969)
WoosukKwon Aug 15, 2025
8db45eb
[BugFix] Skip the Q component for QKVParallelLinear in the case of QK…
sstamenk Aug 15, 2025
51b5895
[FIXBUG] Correctly Apply Grammar Bitmask in Mixed Batches (#22896)
JartX Aug 15, 2025
6c4c5ea
[Benchmarks] Include image data when ShareGPT4V dataset is used. (#22…
huachenheli Aug 15, 2025
04c52c0
[Structured Output] Make the output of structured output example more…
shen-shanshan Aug 15, 2025
ba6499c
[Kernels] Clean up FusedMoeMethodBase and modular kernel setup. Remo…
bnellnm Aug 15, 2025
ab544cd
[Model] Granite-4 support loading quantized checkpoint (#22925)
cyang49 Aug 15, 2025
47d4185
[Log] Debug Once for Randomizing dummy data for DP Rank (#22860)
yewentao256 Aug 15, 2025
612eab5
[Core] direct indexing on self.block_table_np in compute_slot_mapping…
linzebing Aug 15, 2025
627c147
[Bugfix] Added more env vars to hash (#22449)
nvjullin Aug 15, 2025
a22c39f
Use regex in convert-results-json-to-markdown.py (#22989)
mgoin Aug 15, 2025
a57f6d2
[CI] Speed up Whisper tests by reusing server (#22859)
mgoin Aug 15, 2025
eec4da9
[Fix] enable swap_ab for pplx problem size computation (#22991)
shixianc Aug 15, 2025
af8ffba
Add PrefixRepetitionRandomDataset to `vllm bench serve` datasets (#20…
eicherseiji Aug 15, 2025
4ab6bd4
minor: zero workspace buffer init for flashinfer trtllm-gen attn (#22…
yyihuang Aug 15, 2025
9b6683f
[Attention] FA3 Attention Sinks Perf Boost (#22478)
LucasWilkinson Aug 15, 2025
8d808ce
[BugFix] Fix regression caused by mamba state dtype PR (#22998)
tdoublep Aug 15, 2025
456c8cf
ci: Add CUDA + arm64 release builds (#21201)
seemethere Aug 15, 2025
02200dc
[Structured Outputs] [Bug] Fix misalignment in apply_grammar_bitmask …
rishitdholakia13 Aug 15, 2025
6c4a7f2
[BugFix] Handle case where async utility call is cancelled (#22996)
njhill Aug 15, 2025
21ead32
[v1] Move block_hashes from KVCacheManager to Request.block_hashes (#…
orozery Aug 15, 2025
4654781
Support multiple attention groups for KV sharing (#22672)
sarckk Aug 15, 2025
e67c504
[BugFix] Make `run_once` thread-safe (#22978)
oraluben Aug 15, 2025
f2a8c6f
[Misc] Support passing multiple request ids at once to `AsyncLLM.abor…
njhill Aug 16, 2025
f1e219c
[Kernel] Simplify `get_kv_cache_layout` and cache `use_trtllm_attenti…
NickLucche Aug 16, 2025
98b4d43
[Bugfix] Fix DeepSeek MTP (#22934)
benchislett Aug 16, 2025
a60d0c7
[Frontend] Avoid list copies in `serving_chat.py` (#22947)
njhill Aug 16, 2025
d854c2a
[V1] support min_tokens for detokener (#22014)
calvin0327 Aug 16, 2025
5b5d22e
[misc] nsys profile output kernel classifier and visualizer (#22971)
gracehonv Aug 16, 2025
3cee7a3
[XPU]avoid circular import during XPU init (#23017)
jikunshang Aug 16, 2025
2886073
[Build] Env var to disable sccache (#22968)
LucasWilkinson Aug 16, 2025
1935c34
[BugFix] Add support for loading prompt embeds tensors serialized on …
qthequartermasterman Aug 16, 2025
8c8aaf1
[Misc] Add --save-dir option to benchmark_moe (#23020)
jeejeelee Aug 16, 2025
dc88091
[Multimodal] Update Tensor schema test to cover arbitrary shape mm in…
Isotr0py Aug 16, 2025
28234fe
[Core] Make cudagraph check cuda platform only (#23005)
yaochengji Aug 16, 2025
cbc33c1
[CI][Bugfix] Skip Ovis2 generation test because of broken remote code…
Isotr0py Aug 16, 2025
efd10c3
Add docs for PrefixRepetitionDataset + enable usage with `vllm bench …
eicherseiji Aug 16, 2025
a611c4b
[Refactor] Allow optional MultiModalKwargsItem in IPC (#23022)
DarkLight1337 Aug 16, 2025
0025ac6
[New Model]mBART model (#22883)
princepride Aug 16, 2025
10bd3f2
Fix handling of `max_num_batched_tokens` for pooling tasks (#23004)
maxdebayser Aug 16, 2025
0691dba
[Frontend] Added support for HermesToolParser for models without spec…
minpeter Aug 16, 2025
e45076e
[Bugfix gpt-oss] Fix float32 convert for flashinfer sink support (#23…
mgoin Aug 16, 2025
40a0d51
[Flaky CI] Increase timeout tolerance for test_mp_crash_detection+tes…
mgoin Aug 16, 2025
ed52e53
[Kernel/Quant] Remove AQLM (#22943)
mgoin Aug 16, 2025
9931ad7
[V1] Logits processors extensibility (#19912)
afeldman-nm Aug 16, 2025
1330105
[Bugfix] fix qwen3 moe fp8 accuracy issue (#23031)
jinzhen-lin Aug 17, 2025
0c1d8f7
[UX] Separate marlin moe config logic from triton moe (#23006)
mgoin Aug 17, 2025
b1a3260
[Refactor] Defer tensor data construction in MultiModalKwargs (#23030)
DarkLight1337 Aug 17, 2025
049cef9
[Misc] method name typo fix (#23042)
andyxning Aug 17, 2025
9690747
[Kernel] Add cuda kernel for gpt_oss activation (#22951)
jeejeelee Aug 17, 2025
1e8a902
[Bugfix] should use stack instead of concat (#22972)
947132885 Aug 17, 2025
db8f535
[Misc] fix typo in the multimodal doc (#23051)
KevinZeng08 Aug 17, 2025
e6bc394
[BugFix] Fix for IMA in FA3 varlen combine (#22967)
LucasWilkinson Aug 17, 2025
f924f5a
[Misc] Remove dead return (#23061)
WoosukKwon Aug 17, 2025
72d3950
[Misc] Convert use_structured_output property into constant (#23060)
WoosukKwon Aug 17, 2025
9310d15
[XPU] fix xpu to set cudagraph batch sizes (#23044)
calvin0327 Aug 17, 2025
071fdbf
fix: gptq marlin weight loading failure (#23066)
simon-mo Aug 17, 2025
625926c
[Misc] Minor code cleanup for _get_prompt_logprobs_dict (#23064)
WoosukKwon Aug 18, 2025
cf0a037
[Misc] enhance static type hint (#23059)
andyxning Aug 18, 2025
0a3d765
[Bugfix] fix Qwen2.5-Omni processor output mapping (#23058)
DoubleVII Aug 18, 2025
d117d48
[Bugfix][CI] Machete kernels: deterministic ordering for more cache h…
andylolu2 Aug 18, 2025
d623acb
[Misc] refactor function name (#23029)
andyxning Aug 18, 2025
3f9a589
[Misc] Fix backward compatibility from #23030 (#23070)
ywang96 Aug 18, 2025
f562f66
[XPU] Fix compile size for xpu (#23069)
jikunshang Aug 18, 2025
2c46786
[XPU][CI]add xpu env vars in CI scripts (#22946)
jikunshang Aug 18, 2025
445e353
[Refactor] Define MultiModalKwargsItems separate from MultiModalKwarg…
DarkLight1337 Aug 18, 2025
43de8bc
[Bugfix] fix IntermediateTensors equal method (#23027)
andyxning Aug 18, 2025
bbaa94c
[Refactor] Get prompt updates earlier (#23097)
DarkLight1337 Aug 18, 2025
beecdf8
chore: remove unnecessary patch_padding_side for the chatglm model (#…
carlory Aug 18, 2025
a08fb18
[Bugfix] Support compile for Transformers multimodal (#23095)
zucchini-nlp Aug 18, 2025
0bf52cb
[CI Bugfix] Pin `openai<1.100` to unblock CI (#23118)
mgoin Aug 18, 2025
9b1f185
feat: add support for cutlass fused moe for gpt-oss on sm90
djmmoss Aug 21, 2025
4d2db6f
fix: OpenAI SDK compat (ResponseTextConfig) (#23126)
h-brenoskuk Aug 18, 2025
481be6d
Use Blackwell FlashInfer MXFP4 MoE by default if available (#23008)
mgoin Aug 18, 2025
9ef6864
Install tpu_info==0.4.0 to fix core dump for TPU (#23135)
xiangxu-google Aug 18, 2025
eeeb87d
[Misc] Minor refactoring for prepare_inputs (#23116)
WoosukKwon Aug 18, 2025
a9b22c0
[Spec Decode] Make `propose_draft_token_ids` non-blocking for lower T…
WoosukKwon Aug 19, 2025
5f6b0a1
[Misc] Add @tdoublep as a maintainer of hybrid model and Triton-atten…
tdoublep Aug 19, 2025
c1b8173
[CI][V0 Deprecation] Removed V0 Only Chunked Prefill and Prefix Cachi…
robertgshaw2-redhat Aug 19, 2025
d9939e5
[V0 Deprecation] Remove V0 FlashInfer attention backend (#22776)
WoosukKwon Aug 19, 2025
e9c1adb
chore: disable enable_cpp_symbolic_shape_guards (#23048)
xiszishu Aug 19, 2025
f7ffaa3
[TPU] make ptxla not imported when using tpu_commons (#23081)
yaochengji Aug 19, 2025
9a8c210
[Hardware][IBM Z]Enable v1 for s390x and s390x dockerfile fixes (#22725)
nikheal2 Aug 19, 2025
8d737da
Migrate InternVLImagePixelInputs (in nemotron_vl.py) to TensorSchema …
bbeckca Aug 19, 2025
4c46375
[Log] Warning Once for Cutlass MLA (#23137)
yewentao256 Aug 19, 2025
5fee7b8
[Model] Support Pipeline Parallelism for moonshotai/Kimi-VL-A3B-Think…
ZJY0516 Aug 19, 2025
2030ddb
[misc] split engine_model into json file for nsys profile tool (#23117)
gracehonv Aug 19, 2025
28f315d
[Benchmark] Add flag --served-model-name to benchmark_serving_multi_t…
pliops-daniels Aug 19, 2025
4858a70
Fix GLM-4.5V-FP8 numerical issue (#22949)
zixi-qi Aug 19, 2025
bb34190
[Misc] Add request_id into benchmark_serve.py (#23065)
hustxiayang Aug 19, 2025
1cd3c15
[Bugfix] Fix broken Minimax-01-VL model (#22116)
Isotr0py Aug 19, 2025
647d69b
[bug fix] Fix llama4 spec decoding (#22691)
zixi-qi Aug 19, 2025
321dcd6
[Misc] Avoid accessing req_ids inside a loop (#23159)
WoosukKwon Aug 19, 2025
299b096
[Doc] use power of 2 (#23172)
Tialo Aug 19, 2025
32cb16c
[Misc] Fix seq_lens for graph capture (#23175)
WoosukKwon Aug 19, 2025
56fa841
[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel (#21…
elvischenv Aug 19, 2025
f41be30
[Model] Add multi_label_classification support (#23173)
noooop Aug 19, 2025
4862644
[Model] support new model ovis2.5 (#23084)
myselvess Aug 19, 2025
f189ec0
[Bugfix] Fix benchmark_moe.py (#23177)
jeejeelee Aug 19, 2025
cbaba9d
[FEAT] [Performance] Enable DP for ViT in Qwen2.5VL (#22742)
tjtanaa Aug 19, 2025
f4a7919
[Model] Removes redundant all-reduce operation in Qwen3MoeSparseMoeBl…
yiz-liu Aug 19, 2025
c1a3d12
Add return_token_ids parameter to OpenAI API endpoints (#22587)
ultmaster Aug 19, 2025
3b7e373
Migrate LlavaOnevisionMultiInputs to TensorSchema (#21844)
bbeckca Aug 19, 2025
6697ac7
[CI/Build] Update transformers to v4.55.2 (#23093)
Isotr0py Aug 19, 2025
c8a53e5
[Misc] Fix the benchmark's README and improve the error messages for …
tanruixiang Aug 19, 2025
b6f3b11
[Frontend] Add `/collective_rpc` API endpoint (#23075)
22quinn Aug 19, 2025
2d3c47f
[Misc] Enable yapf for FlashInfer backend (#23193)
WoosukKwon Aug 19, 2025
ef1fa1d
[Bugfix] Fix accuracy issue when using flashinfer cutlass moe, TP=1 a…
bnellnm Aug 19, 2025
40a6d44
fix: use cache_salt for gpt-oss (#23186)
dr75 Aug 19, 2025
09a6735
[Misc] Minor refactoring for FlashInfer backend (#23147)
WoosukKwon Aug 19, 2025
b51df6a
[CI/Build] Add support for Python 3.13 (#13164)
mgoin Aug 19, 2025
0f68e55
[NVIDIA] Add SM100 Flashinfer Cutlass MoE fp8 backend (#22357)
amirkl94 Aug 19, 2025
9a62d10
[CI/Build] Replace lm-eval gsm8k tests with faster implementation (#2…
mgoin Aug 19, 2025
5870cad
[BugFix] fix CUTLASS MLA full cudagraph (#23200)
LucasWilkinson Aug 19, 2025
d63fd65
[Benchmarks] Add video inputs to ShareGPTDataset. (#23199)
huachenheli Aug 19, 2025
ec89a52
[Quantization] Bump Compressed Tensors Version (#23202)
kylesayrs Aug 20, 2025
82061bc
[Core] Optimize scheduler request removal for single completions (#21…
chi2liu Aug 20, 2025
a1cb9fb
[CI Perf] Only test bfloat16 for tests/compile/test_fusion_all_reduce…
mgoin Aug 20, 2025
c153756
[Core] Add torch profiler CPU traces for AsyncLLM. (#21794)
huachenheli Aug 20, 2025
e71f229
[Doc] Update V1 status of various pooling models (#23189)
DarkLight1337 Aug 20, 2025
dd532ae
[Attention] Optimize make_local_attention_virtual_batches for Flash A…
linzebing Aug 20, 2025
b515118
Fix a performance comparison issue in Benchmark Suite (#23047)
louie-tsai Aug 20, 2025
648cdaf
chore: support pytorch format in lora (#22790)
KilJaeeun Aug 20, 2025
e85b346
[CI/Build] Also check DP in benchmarks throughput script (#23038)
zhewenl Aug 20, 2025
c43ca52
[CI/Build] Sync multimodal tests (#23181)
DarkLight1337 Aug 20, 2025
44862d8
[BugFix] Fix stuck stats/metrics after requests are aborted (#22995)
njhill Aug 20, 2025
536b4a2
fix cuda graph (#22721)
fsx950223 Aug 20, 2025
b2fd7cc
[Model] use autoWeightsLoader for gptoss (#22446)
calvin0327 Aug 20, 2025
3cfcd13
Fix missing quotes (#23242)
wzshiming Aug 20, 2025
3e32704
[Model] Support deepseek with eagle (#21086)
xyang16 Aug 20, 2025
bd65d52
[Bugfix] Ensure correctness of Cohere2Vision processing (#23245)
DarkLight1337 Aug 20, 2025
ee0dd04
Update to flashinfer-python==0.2.12 and disable AOT compile for non-r…
mgoin Aug 20, 2025
843e77b
[Model][V1] Support Ernie MTP (#22169)
xyxinyang Aug 20, 2025
b96ca94
[Model] Improve olmo and olmo2 (#23228)
jeejeelee Aug 20, 2025
29f58a0
[Fix] fix offline env use local mode path (#22526)
lengrongfu Aug 20, 2025
7bc51c2
[Bugfix] Ensure correctness of HCXVision processing (#23254)
DarkLight1337 Aug 20, 2025
b15629b
[Kernel] CUTLASS MoE FP8: Integrate cuda moe permute/unpermute (#23045)
shixianc Aug 20, 2025
2de3c7b
[CLI][Doc] Formalize `--mm-encoder-tp-mode` (#23190)
DarkLight1337 Aug 20, 2025
240e099
[Misc] Add max_seq_len to CommonAttentionMetadata (#23216)
WoosukKwon Aug 20, 2025
35b1c74
[FIXBUG ] Allow disabling rocm_aiter_fa backend for ROCm GPUs not com…
JartX Aug 20, 2025
285cd2b
[torch.compile] Support conditional torch.compile per module (#22269)
sarckk Aug 20, 2025
c826d11
Migrate Mistral3ImagePixelInputs to TensorSchema (#21945)
bbeckca Aug 20, 2025
58afbd2
Limit HTTP header count and size (#23267)
russellb Aug 20, 2025
582c727
Small fix for Command-A-Vision (#23268)
dongluw Aug 20, 2025
c0eb3d7
[Kernel/Quant] Remove the original marlin format and qqq (#23204)
mgoin Aug 20, 2025
c68cadb
[Fix] correct tool_id for kimi-k2 when use tool_choice=required (#21259)
MoyanZitto Aug 20, 2025
fa40ad3
[Frontend] improve error logging of chat completion (#22957)
heheda12345 Aug 20, 2025
8396597
[Perf] Speed up function `_convert_tokens_to_string_with_added_encode…
misrasaurabh1 Aug 20, 2025
410423e
Do not use eval() to convert unknown types (#23266)
russellb Aug 20, 2025
e0be5ba
[Feature] use --eplb_config to set eplb param (#20562)
lengrongfu Aug 20, 2025
649fcea
[misc] fix multiple arch wheels for the nightly index (#23110)
youkaichao Aug 20, 2025
f54d68b
Remove chunked_prefill_enabled flag in V1 MLA (#23183)
MatthewBonanni Aug 20, 2025
a473c5b
Feature/mla tests (#23195)
MatthewBonanni Aug 20, 2025
b1602a8
[Fix] remove is_marlin param in benchmark_moe (#23286)
shixianc Aug 20, 2025
8c26b47
[EP] Add logging for experts map (#22685)
22quinn Aug 20, 2025
97b7516
Remove duplicate entry in vllm.attention.__all__ (#23296)
russellb Aug 21, 2025
c300639
[CI Bugfix] Fix CI by fully removing --enable-prompt-adapter (#23284)
mgoin Aug 21, 2025
24c8bb6
[Optimization] Make new_block_ids None if empty (#23262)
WoosukKwon Aug 21, 2025
a0c60ea
[CPU] Refactor CPU W8A8 scaled_mm (#23071)
bigPYJ1151 Aug 21, 2025
0382521
[CI/Build] Split out mm processor tests (#23260)
DarkLight1337 Aug 21, 2025
93c5489
[V1][Mamba1] - Full CUDA and Piecewise CUDA Graphs Support (#23035)
Josephasafg Aug 21, 2025
83fb982
[Compile] Fix Compile Warning SM100 Cutlass MLA (#23287)
yewentao256 Aug 21, 2025
048330f
[Model][VLM] Support R-4B Model (#23246)
yannqi Aug 21, 2025
6ae6cf1
[CI] Delete images older than 24h. (#23291)
QiliangCui Aug 21, 2025
453d898
[CI] Block the cu126 wheel build while broken (#23285)
mgoin Aug 21, 2025
5833876
[Sampler] Support returning final logprobs (#22387)
22quinn Aug 21, 2025
352d13e
[Bugfix] Fix extra whitespace in strings caused by newline (#23272)
DarkLight1337 Aug 21, 2025
857da6c
[BugFix] Fix Python 3.9 Support (#23306)
jaredoconnell Aug 21, 2025
56dd418
[Model] Add LFM2 architecture (#22845)
paulpak58 Aug 21, 2025
b4d2a4a
[Refactor] Simplify code for MM budget (#23310)
DarkLight1337 Aug 21, 2025
66a8d24
[Doc] Fix batch-level DP example (#23325)
DarkLight1337 Aug 21, 2025
1dc73ba
[Performance] V1 Pooling Models E2E Performance Optimization (#23162)
noooop Aug 21, 2025
ac51913
[V1] Remove unnecessary check for main thread (#23298)
robertgshaw2-redhat Aug 21, 2025
c5e2aee
[Bugfix] set system_message in phi4mini chat template (#23309)
zhuangqh Aug 21, 2025
15195dc
[Multimodal] Always enable hashing mm data (#23308)
ywang96 Aug 21, 2025
3a0ee9f
[ci/build] Fix abi tag for aarch64 (#23329)
youkaichao Aug 21, 2025
c16f981
Migrate MolmoImageInputs to TensorSchema (#22022)
bbeckca Aug 21, 2025
1542ce3
Fix nvfp4 swizzling (#23140)
yiliu30 Aug 21, 2025
a86eaa5
add tg-mxfp4-moe-test (#22540)
IwakuraRein Aug 21, 2025
d7d87dc
[Bug] Fix R1 Accuracy 0 Bug (#23294)
yewentao256 Aug 21, 2025
3d3c649
[Bugfix] Fix port conflict by obtaining a list of open ports upfront …
minosfuture Aug 21, 2025
bd792f6
[Misc] Misc code cleanup/simplification (#23304)
njhill Aug 21, 2025
f329657
[BugFix][gpt-oss] Fix Chat Completion with Multiple Output Message (#…
heheda12345 Aug 21, 2025
c98c1db
typo
djmmoss Aug 21, 2025
98fa266
updates
djmmoss Aug 21, 2025
145 changes: 141 additions & 4 deletions vllm/model_executor/layers/quantization/mxfp4.py
@@ -39,11 +39,11 @@
return envs.VLLM_USE_FLASHINFER_MOE_MXFP4_BF16

# Enable by default on SM100 and SM90 if MXFP8 is not explicitly enabled
if (current_platform.is_device_capability(100) and has_flashinfer()
if ((current_platform.is_device_capability(100)
or current_platform.is_device_capability(90)) and has_flashinfer()
and not envs.is_set("VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8")):
logger.info_once(
"Enabling FlashInfer MXFP4 BF16 backend by default for Blackwell. "
"Enabling FlashInfer MXFP4 BF16 backend by default for Blackwell and Hopper. "
"For faster performance, consider setting "

"VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1, "
"though this may impact accuracy.")
return True
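Note on the capability check above: Python's `and` binds tighter than `or`, so the added SM90 clause needs explicit parentheses to keep the `has_flashinfer()` and env-var guards applied on both architectures. A minimal sketch of the pitfall, with plain booleans standing in for the platform and flashinfer checks:

sm100, sm90, has_fi = True, False, False   # Blackwell GPU, flashinfer not installed
# Ungrouped: parses as sm100 or (sm90 and has_fi) -> True, backend enabled anyway
assert (sm100 or sm90 and has_fi) is True
# Grouped: (sm100 or sm90) and has_fi -> False, correctly falls through
assert ((sm100 or sm90) and has_fi) is False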
@@ -113,6 +113,7 @@
self.topk_indices_dtype = None
self.moe = moe
self.use_marlin = self._should_use_marlin()
self.flashinfer_autotune = True

if current_platform.is_device_capability(100) and not has_flashinfer():
logger.warning_once(
@@ -171,14 +172,17 @@
layer.hidden_size = hidden_size
layer.intermediate_size_per_partition = \
intermediate_size_per_partition_after_pad
elif should_use_flashinfer_mxfp4():
elif should_use_flashinfer_mxfp4() and current_platform.is_device_capability(100):
# pad the intermediate size to be a multiple of 2 * mxfp4_block
# to hold non-uniformly sharded tensors as well as for swizzling;
# the extra padding also increases performance
intermediate_size_per_partition_after_pad = round_up(
intermediate_size_per_partition, 256)
hidden_size = round_up(hidden_size, 256)
elif _should_use_flashinfer_mxfp4_bf16() and current_platform.is_device_capability(90):
intermediate_size_per_partition_after_pad = round_up(
intermediate_size_per_partition, 128)
elif current_platform.is_rocm():

intermediate_size_per_partition_after_pad = round_up(
intermediate_size_per_partition, 128)
else:
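For reference, the padding above rounds the per-partition intermediate size up to the next multiple of 256 on the SM100 trtllm path and 128 on the SM90 and ROCm paths. A minimal sketch, assuming the usual semantics of vLLM's `round_up` helper and an illustrative intermediate size:

def round_up(x: int, multiple: int) -> int:
    # round x up to the nearest multiple (mirrors vllm.utils.round_up)
    return ((x + multiple - 1) // multiple) * multiple

assert round_up(2880, 256) == 3072   # SM100 trtllm path
assert round_up(2880, 128) == 2944   # SM90 cutlass path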
@@ -384,6 +388,96 @@
layer.w2_bias = Parameter(torch.stack(gemm2_bias_shuffled).reshape(
self.num_experts, -1),
requires_grad=False)
elif (_should_use_flashinfer_mxfp4_bf16()
and current_platform.is_device_capability(90)):
assert layer.w13_weight.dtype == torch.uint8, (
f"layer.w13_weight.dtype: {layer.w13_weight.dtype}, "
f"expected: {torch.uint8}")
assert layer.w2_weight.dtype == torch.uint8, (
f"layer.w2_weight.dtype: {layer.w2_weight.dtype}, "
f"expected: {torch.uint8}")
assert layer.w13_weight_scale.dtype == torch.uint8, (
f"layer.w13_weight_scale.dtype: {layer.w13_weight_scale.dtype}, "
f"expected: {torch.uint8}")
assert layer.w2_weight_scale.dtype == torch.uint8, (
f"layer.w2_weight_scale.dtype: {layer.w2_weight_scale.dtype}, "
f"expected: {torch.uint8}")
assert layer.w13_bias.dtype == torch.bfloat16, (
f"layer.w13_bias.dtype: {layer.w13_bias.dtype}, "
f"expected: {torch.bfloat16}")
assert layer.w2_bias.dtype == torch.bfloat16, (
f"layer.w2_bias.dtype: {layer.w2_bias.dtype}, "
f"expected: {torch.bfloat16}")
layer.gemm1_alpha = Parameter(torch.tensor(
[1.702] * self.num_experts, dtype=torch.float32).cuda(),
requires_grad=False)
layer.gemm1_beta = Parameter(torch.tensor(
[1.0] * self.num_experts, dtype=torch.float32).cuda(),
requires_grad=False)
layer.gemm1_clamp_limit = Parameter(torch.tensor(
[7.0] * self.num_experts, dtype=torch.float32).cuda(),
requires_grad=False)
sf_block_size = 32 # mxfp4 block size

assert (layer.w13_weight.dim() == 3
and layer.w13_weight.shape[0] == self.num_experts
and layer.w13_weight.shape[1] == self.intermediate_size * 2
and layer.w13_weight.shape[2] == self.hidden_size // 2)
assert (layer.w13_weight_scale.dim() == 3
and layer.w13_weight_scale.shape[0] == self.num_experts
and layer.w13_weight_scale.shape[1]
== self.intermediate_size * 2
and layer.w13_weight_scale.shape[2]
== self.hidden_size // sf_block_size)
assert (layer.w2_weight.dim() == 3
and layer.w2_weight.shape[0] == self.num_experts
and layer.w2_weight.shape[1] == self.hidden_size and
layer.w2_weight.shape[2] == self.intermediate_size // 2)
assert (layer.w2_weight_scale.dim() == 3
and layer.w2_weight_scale.shape[1] == self.hidden_size
and layer.w2_weight_scale.shape[2]
== self.intermediate_size // sf_block_size)
assert (layer.w13_bias.dim() == 2
and layer.w13_bias.shape[0] == self.num_experts
and layer.w13_bias.shape[1] == self.intermediate_size * 2)
assert (layer.w2_bias.dim() == 2
and layer.w2_bias.shape[0] == self.num_experts
and layer.w2_bias.shape[1] == self.hidden_size)



Check failure on line 436 in vllm/model_executor/layers/quantization/mxfp4.py

View workflow job for this annotation

GitHub Actions / pre-commit

Ruff (E501)

vllm/model_executor/layers/quantization/mxfp4.py:436:81: E501 Line too long (83 > 80)
# De-interleave weights, scales, and biases for gate and up projections
w13_weight_data = layer.w13_weight.data
gate_w, up_w = w13_weight_data[:, ::2, :], w13_weight_data[:, 1::2, :]
deinterleaved_w13_weight = torch.cat([gate_w, up_w], dim=1)
w1_weight, w3_weight = torch.chunk(deinterleaved_w13_weight, 2, dim=1)
layer.w13_weight = torch.nn.Parameter(torch.cat([w3_weight, w1_weight], dim=1).cuda(), requires_grad=False)

w13_bias_data = layer.w13_bias.data.to(torch.float32)
gate_b, up_b = w13_bias_data[:, ::2], w13_bias_data[:, 1::2]
deinterleaved_w13_bias = torch.cat([gate_b, up_b], dim=1)
b1, b3 = torch.chunk(deinterleaved_w13_bias, 2, dim=-1)
b = torch.cat([b3, b1], dim=-1)
layer.w13_bias = torch.nn.Parameter(b.to(torch.bfloat16).cuda(), requires_grad=False)

# Scale
w13_scale_data = layer.w13_weight_scale.data
gate_s, up_s = w13_scale_data[:, ::2, :], w13_scale_data[:, 1::2, :]
deinterleaved_w13_scale = torch.cat([gate_s, up_s], dim=1)
w1_weight_scale, w3_weight_scale = torch.chunk(deinterleaved_w13_scale, 2, dim=1)
all_w31_scales = torch.cat([w3_weight_scale, w1_weight_scale], dim=1)

w31_scales = all_w31_scales.to(torch.uint8).view(torch.uint8)
w31_s_shape = w31_scales.shape
w31_scales_interleaved = w31_scales.reshape(
w31_s_shape[0], w31_s_shape[1],
(w31_s_shape[2] // 4), 4)
w31_scales_interleaved = w31_scales_interleaved.permute(0, 2, 1, 3)
w31_scales_interleaved = w31_scales_interleaved.reshape(
w31_s_shape[0], w31_s_shape[2] // 4, w31_s_shape[1] * 4)

layer.w13_weight_scale = torch.nn.Parameter(w31_scales_interleaved.cuda(), requires_grad=False)

w2_weight_scale = layer.w2_weight_scale.data
w2_scales = w2_weight_scale.to(torch.uint8).view(torch.uint8)
w2_s_shape = w2_scales.shape
w2_scales_interleaved = w2_scales.reshape(
w2_s_shape[0], w2_s_shape[1],
(w2_s_shape[2] // 4), 4)
w2_scales_interleaved = w2_scales_interleaved.permute(0, 2, 1, 3)
w2_scales_interleaved = w2_scales_interleaved.reshape(
w2_s_shape[0], w2_s_shape[2] // 4, w2_s_shape[1] * 4)

layer.w2_weight_scale = torch.nn.Parameter(w2_scales_interleaved.cuda(), requires_grad=False)

else:
from triton_kernels.matmul_ogs import FlexCtx, PrecisionConfig

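The 4-wide scale interleaving in this hunk only reshapes and permutes the last two axes of the block-scale tensors ([E, N, K/32] -> [E, K/128, N*4]). A standalone sketch with dummy shapes, not the real expert tensors:

import torch

E, N, K = 2, 8, 128                # experts, rows, hidden size (illustrative)
S = K // 32                        # one uint8 scale per 32-element mxfp4 block
scales = torch.arange(E * N * S, dtype=torch.uint8).reshape(E, N, S)
out = (scales.reshape(E, N, S // 4, 4)
             .permute(0, 2, 1, 3)
             .reshape(E, S // 4, N * 4))
print(out.shape)                   # torch.Size([2, 1, 32])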
@@ -510,7 +604,7 @@
logical_replica_count), (
"MXFP4 are not supported with this configuration.")

if should_use_flashinfer_mxfp4():
if should_use_flashinfer_mxfp4() and current_platform.is_device_capability(100):
from flashinfer import mxfp8_quantize, trtllm_fp4_block_scale_moe
assert not self.moe.use_ep, (
"EP is not supported for flashinfer mxfp4 moe backend yet.")
@@ -551,6 +645,49 @@
True, # do finalize
)[0]
return trtllm_gen_output
elif (_should_use_flashinfer_mxfp4_bf16()
and current_platform.is_device_capability(90)):
# NOTE: assumed import locations for flashinfer's cutlass MoE path; the
# original revision left `autotune` undefined (Ruff F821).
from flashinfer.autotuner import autotune
from flashinfer.fused_moe import cutlass_fused_moe

assert x.dtype == torch.bfloat16

quant_scales = [
layer.w13_weight_scale,
layer.w2_weight_scale,
]

topk_weights, topk_ids = FusedMoE.select_experts(
hidden_states=x,
router_logits=router_logits,
use_grouped_topk=use_grouped_topk,
top_k=top_k,
renormalize=renormalize,
topk_group=topk_group,
num_expert_group=num_expert_group,
custom_routing_function=custom_routing_function,
scoring_func=scoring_func,
e_score_correction_bias=e_score_correction_bias,
)

output = torch.zeros_like(x)

with torch.inference_mode(), autotune(self.flashinfer_autotune):
_ = cutlass_fused_moe(
input=x,
token_selected_experts=topk_ids,
token_final_scales=topk_weights,
fc1_expert_weights=layer.w13_weight,
fc2_expert_weights=layer.w2_weight,
output_dtype=torch.bfloat16,
quant_scales=quant_scales,
fc1_expert_biases=layer.w13_bias,
fc2_expert_biases=layer.w2_bias,
swiglu_alpha=layer.gemm1_alpha,

swiglu_beta=layer.gemm1_beta,
swiglu_limit=layer.gemm1_clamp_limit,
use_w4_group_scaling=True,
output=output,
)
self.flashinfer_autotune = False
return output
else:
return triton_kernel_moe_forward(
hidden_states=x,
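A minimal end-to-end sketch of exercising the new SM90 path (assumes a Hopper GPU with flashinfer installed and an MXFP4 gpt-oss checkpoint; the model name is illustrative, and the env var is the one referenced at the top of this diff):

import os
os.environ["VLLM_USE_FLASHINFER_MOE_MXFP4_BF16"] = "1"   # opt into the bf16 x mxfp4 backend

from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")                     # MXFP4 MoE checkpoint (illustrative)
outputs = llm.generate(["Hello from Hopper"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)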