Draft
Changes from all commits
518 commits
61d1b35
[BugFix] Register expert_map as named buffer for wake_up and sleep (#…
wuxibin89 Sep 23, 2025
f05a4f0
[P/D] Support NIXL connector to disconnect during a clean shutdown (#…
chaunceyjiang Sep 23, 2025
da5e7e4
[Docs] NixlConnector quickstart guide (#24249)
panpan0000 Sep 23, 2025
4c966e4
[XPU] Fix MOE DP accuracy issue on XPU (#25465)
faaany Sep 23, 2025
2c58742
[UX] Change kv-cache-memory log level to debug (#25479)
mgoin Sep 23, 2025
a903669
[V1] Remove V0 code paths for Hybrid models (#25400)
tdoublep Sep 23, 2025
cc1dc7e
[Core/DBO][2/N] Dual-Batch Overlap add DeepEP High Throughput support…
LucasWilkinson Sep 23, 2025
875d6de
Add backward compatibility for `GuidedDecodingParams` (#25422)
hmellor Sep 23, 2025
f11e3c5
[Kernels] Support blocked fp8 quantization for compressed tensors MoE…
bnellnm Sep 23, 2025
2357480
[BugFix] Fix UB in per_token_group_quant.cu (#24913)
rivos-shreeasish Sep 23, 2025
846197f
[Log] Optimize kv cache memory log from Bytes to GiB (#25204)
yewentao256 Sep 23, 2025
527821d
Use macro guard CUDA functions for back compatibility in grouped_topk…
minosfuture Sep 23, 2025
100b630
[V1][Kernel] Add triton implementation for `reshape_and_cache_flash` …
bringlein Sep 23, 2025
24e8222
[Misc] Reduce initialization time of auto_tune (#23682)
wdhongtw Sep 23, 2025
867ecdd
[Spec Decode][CI] Add e2e test for `examples/spec_decode.py` and prev…
ekagra-ranjan Sep 23, 2025
5abb117
[Core] Ensure LoRA linear respect the base_layer's tp_size and tp_ran…
jeejeelee Sep 23, 2025
a3a7828
[ROCm] Add skinny gemm bias support for dtypes fp16,bf16,fp8 (#24988)
amd-hhashemi Sep 23, 2025
8c1c81a
[core] add nccl symmetric memory for all reduce (#24532)
Amir-19 Sep 23, 2025
6340025
[Performance] Move apply_w8a8_block_fp8_linear to an op class (#24666)
ElizaWszola Sep 23, 2025
24fab45
[Perf] Change default CUDAGraphMode from PIECEWISE to FULL_AND_PIECEW…
mgoin Sep 23, 2025
d5944d5
[Speculators][Speculative Decoding] Fix gpt-oss eagle3 accuracy issue…
jiahanc Sep 23, 2025
a8ffc4f
[Bugfix] Lower gpt-oss max cudagraph size to 992 to be compatible wit…
mgoin Sep 23, 2025
8bdd8b5
Enable symmetric memory all reduce by default only enabling for TP (#…
ilmarkov Sep 23, 2025
8b8a8af
[CI] Fix Pre-commit Issue (#25497)
yewentao256 Sep 23, 2025
c828d1b
[Bugfix] gpt-oss container tool output bug (#25485)
alecsolder Sep 23, 2025
08275ec
[Build] Update Xgrammar to 0.1.25 (#25467)
chaunceyjiang Sep 23, 2025
690f948
[Bugfix] Fix for the import error from #24588 (#25481)
gshtras Sep 23, 2025
ae00292
[CI/Build] Fix and re-enable v1 PP test on CI (#25496)
Isotr0py Sep 23, 2025
4f8c4b8
[Core] Use KVCacheBlock as much as possible instead of dict[block_id,…
Jialin Sep 23, 2025
969b4da
[V0 Deprecation] Remove placeholder attn (#25510)
tdoublep Sep 23, 2025
eca7be9
Add VLLM_ENABLE_INDUCTOR_MAX_AUTOTUNE & VLLM_ENABLE_INDUCTOR_COORDINA…
rouchenzi Sep 23, 2025
4f2954f
Fix triton_reshape_and_cache_flash.py triton import (#25522)
mgoin Sep 23, 2025
95bc60e
[gpt-oss][bugfix] remove logic to require resp_ in ResponseAPI (#25428)
qandrew Sep 23, 2025
7361ab3
Remove redundant mutates_args and dispatch_key for direct_register_cu…
mgoin Sep 23, 2025
abad204
[BugFix] Fix OOM in vLLM replicas by ensuring consistent NCCL memory …
kouroshHakha Sep 23, 2025
c85d75c
Add `VLLM_NVTX_SCOPES_FOR_PROFILING=1` to enable `nvtx.annotate` scop…
coreylowman Sep 23, 2025
5e25b12
[Kernel] [Mamba] Remove BLOCK_H=1 from list of tuneable configuration…
tdoublep Sep 23, 2025
bde2a1a
[ROCm] Small functional changes for gptoss (#25201)
jpvillam-amd Sep 23, 2025
e0b24ea
[Perf] Increase default max splits for FA3 full cudagraphs (#25495)
LucasWilkinson Sep 23, 2025
1210e4d
[Bugfix] [B200] cutlass_mla - ensure kv_split == 1 for batch size > 1…
alexm-redhat Sep 23, 2025
dc464a3
[BugFix] AssertionError: Do not capture num_reqs > max_num_reqs for u…
LucasWilkinson Sep 24, 2025
7ad5e50
Improve output when failing json.loads() on structured output test (#…
dougbtv Sep 24, 2025
0d235b8
Add CUTLASS FP8 MOE benchmark scripts and kernel config (#25302)
chenxi-yang Sep 24, 2025
88d7bdb
[Bug] Fix AttributeError: 'FusedMoE' object has no attribute 'w13_wei…
yewentao256 Sep 24, 2025
c8bde93
[BUG] Allows for RunAI Streamer and Torch.compile cache to be used to…
ahao-anyscale Sep 24, 2025
be0bb56
[Model] Support SeedOss Reason Parser (#24263)
LuYanFCP Sep 24, 2025
d06b5a9
[V1][Metrics] Add per-request TPOT histogram (#24015)
baxingpiaochong Sep 24, 2025
1983609
[Bugfix] Use a separate FlashInfer workspace buffer for trtllm-gen (#…
benchislett Sep 24, 2025
de94289
[Core] Support weight_loader_v2 for `UnquantizedLinearMethod` (#23036)
kylesayrs Sep 24, 2025
bf68fd7
[Compile] Fix AMD Compile Error (#25518)
yewentao256 Sep 24, 2025
9df8da5
[BugFix] Fix MLA assert with CUTLASS MLA (#25478)
LucasWilkinson Sep 24, 2025
359d293
[fix]: add Arm 4bit fused moe support (#23809)
nikhil-arm Sep 24, 2025
77d9069
[KV sharing] Re-land Gemma3n model changes from #22628 (#24357)
sarckk Sep 24, 2025
c30b405
[Spec Decode] Enable FlashInfer Spec Decoding (#25196)
benchislett Sep 24, 2025
d747c2e
[Perf] Fix jit compiles at runtime of fla gated delta rule (#25432)
coreylowman Sep 24, 2025
5caaeb7
[Bugfix] [Frontend] Cleanup gpt-oss non-streaming chat tool calls (#2…
bbrowning Sep 24, 2025
190c45a
[TPU][Bugfix] fix the missing apply_model in tpu worker (#25526)
yaochengji Sep 24, 2025
fed8a9b
[Misc] Retry HF processing if "Already borrowed" error occurs (#25535)
DarkLight1337 Sep 24, 2025
1cbcfb9
[Bugfix][CPU] Skip unsupported custom op register on CPU (#25534)
bigPYJ1151 Sep 24, 2025
27ec3c7
[CI/Build] Fix v1 OOT registration test (#25547)
Isotr0py Sep 24, 2025
6488f34
[Misc]] Move processing context to multimodal directory (#25548)
DarkLight1337 Sep 24, 2025
77a7fce
[CI/Build] add nightly prime-rl integration tests (#25207)
Jackmin801 Sep 24, 2025
2e19a84
[V0 Deprecation] Remove max_seq_len_to_capture (#25543)
WoosukKwon Sep 24, 2025
2338daf
[BugFix] Potential Fix for FA3 full-cudagraph IMA (#25490)
LucasWilkinson Sep 24, 2025
b67dece
[misc] update the warning message (#25566)
youkaichao Sep 24, 2025
42488da
[Bugfix] Fix dummy video number of frames calculation (#25553)
ywang96 Sep 24, 2025
58c360d
[Bug] fix import and unit test (#25558)
jmkuebler Sep 24, 2025
1642995
[Benchmark] Fix regression in structured output benchmark (#25500)
russellb Sep 24, 2025
b106890
[docs] fix nixl kv_connector_extra_config.backends key (#25565)
panpan0000 Sep 24, 2025
e18b714
[Bugfix] Fix DeepSeekV31ToolParser to correctly parse multiple tools …
taohui Sep 24, 2025
8938774
Move `DeviceConfig`, `ObservabilityConfig`, `SpeechToTextConfig` to t…
hmellor Sep 24, 2025
9313be5
[Misc] Improve type annotations for jsontree (#25577)
DarkLight1337 Sep 24, 2025
487745f
[ROCm][Bugfix] Only enable +rms_norm based on aiter if not explicitly…
gshtras Sep 24, 2025
302eb94
[ROCm][Build][Bugfix] Fix ROCm base docker whls installation order (#…
gshtras Sep 24, 2025
d83f3f7
Fixes and updates to bench_per_token_quant_fp8 (#25591)
mgoin Sep 24, 2025
2dda3e3
[Bugfix] add cache model when from object storage get model (#24764)
lengrongfu Sep 24, 2025
54e42b7
Support mnnvl all2allv from Flashinfer (#21003)
wenscarl Sep 24, 2025
f84a472
Suppress benign cuBLAS warning when capturing cudagraphs with DBO (#2…
SageMoore Sep 24, 2025
8c85305
[Docs] Enable `fail_on_warning` for the docs build in CI (#25580)
hmellor Sep 24, 2025
e6750d0
[V0 Deprecation] Remove unused classes in attention (#25541)
WoosukKwon Sep 24, 2025
fea8006
[Logging] Improve log for when DeepEP HT disables CUDA Graphs (#25531)
tlrmchlsmth Sep 24, 2025
6160ba4
feat: BF16 FlashInfer Fused Cutlass MOE for Hopper and Blackwell Expe…
djmmoss Sep 24, 2025
1f29141
[Refactor] Use DeepGEMM Col Major TMA Aligned Tensor (#25517)
yewentao256 Sep 24, 2025
e7f27ea
Improve `--help` for enhanced user experience (#24903)
hmellor Sep 24, 2025
5c1e496
[MISC] replace c10::optional with std::optional (#25602)
842974287 Sep 24, 2025
52d0cb8
[Model] Improve DotsOCRForCausalLM (#25466)
jeejeelee Sep 24, 2025
05c1948
[Kernel] Support DCP for Triton backend (#25132)
frank-wei Sep 25, 2025
4492e3a
[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_function` ca…
yewentao256 Sep 25, 2025
90b139c
Enable Fbgemm NVFP4 on Dense models (#25609)
samanamp Sep 25, 2025
845adb3
[Model] Add LongCat-Flash (#23991)
OftenDream Sep 25, 2025
c85be1f
optimize: eliminate duplicate split_enc_dec_inputs calls (#25573)
nicole-lihui Sep 25, 2025
a676e66
[Bugfix] fix apply_temperature to avoid nan in probs (#24734)
courage17340 Sep 25, 2025
755ed7b
[Misc] Simplify PoolerOutput and move to `v1/outputs` (#25629)
DarkLight1337 Sep 25, 2025
bc092ea
Map CwmForCausalLM to llama and LlamaForCausalLM (#25611)
jacobkahn Sep 25, 2025
af4ee63
typo: remove duplicate `is` (#25641)
nicole-lihui Sep 25, 2025
1260180
Revert "[Performance] Move apply_w8a8_block_fp8_linear to an op class…
tlrmchlsmth Sep 25, 2025
393de22
[fix] Update torch version in cpu-build.txt for AArch64/ppc64le and D…
fadara01 Sep 25, 2025
7be9ffc
[Misc] Fix Qwen3-VL `video_grid_thw` typing (#25646)
ywang96 Sep 25, 2025
3c2b2cc
[Bugfix] Add triton.language.tensor placeholder (#25649)
adobrzyn Sep 25, 2025
17b4c66
[Bugfix] Fix Qwen3-VL max_num_video_tokens calculation for video prof…
Isotr0py Sep 25, 2025
12c1287
[mypy] Further improve MM type annotations (#25654)
DarkLight1337 Sep 25, 2025
eaeca3c
[Bugfix] Parse SpeculativeConfig Error (#25142)
yyzxw Sep 25, 2025
7f570f1
[V0 deprecation] Remove unreachable model_config.supported_tasks (#25…
noooop Sep 25, 2025
70fbdb2
Add backward compatibility for `guided_...` API (#25615)
hmellor Sep 25, 2025
0bcc3a1
[CI/Build] Fix flaky entrypoints test (#25663)
DarkLight1337 Sep 25, 2025
d2af674
[XPU][Triton]add xpu config in triton_reshape_and_cache_flash (#25643)
jikunshang Sep 25, 2025
1e9a77e
[Hardware][RISC-V] Add riscv64 support for vLLM with scalar (#22112)
langc23 Sep 25, 2025
2f17117
[mypy] Fix wrong type annotations related to tuple (#25660)
DarkLight1337 Sep 25, 2025
6c340da
[misc] log info messages by default for hanging / busy / idle (#25627)
youkaichao Sep 25, 2025
69a8c8e
[torch.compile] Make Query Quantization Fusable (#24914)
jmkuebler Sep 25, 2025
eb32335
[CPU] update torch 2.8 and fix missing fields in TorchSDPAMetadata (#…
bigPYJ1151 Sep 25, 2025
532a6cf
[ux] Switch a warning to debug about a pytorch fallback (#23750)
russellb Sep 25, 2025
03858e6
[Bugfix] Fix InternS1 video processing after Transformers v4.56 (#25644)
Isotr0py Sep 25, 2025
0754ac4
[Misc] Remove cruft file in repo (#25678)
NickLucche Sep 25, 2025
2e5df88
[Logging] Remove TORCH_NCCL_AVOID_RECORD_STREAMS to squash a warning …
tlrmchlsmth Sep 25, 2025
e04a1b6
[BUGFIX] Fix crash in Eagle Speculative Decoding models when exceedin…
AlonKejzman Sep 25, 2025
916bd92
Revert "[Bug] Dynamo Unsupported due to `BasevLLMParameter.torch_func…
mgoin Sep 25, 2025
13cc7f5
[BugFix] Fix DBO hang (#25625)
LucasWilkinson Sep 25, 2025
b8d9e4a
[Model] Add optional parameter to reasoning parser constructor (#25554)
taohui Sep 25, 2025
0ea80c8
[Model] Define `merge_by_field_config` MM interface (#25676)
DarkLight1337 Sep 25, 2025
71b25b0
[V0 deprecation] Clean up V0 fallback in compilation config (#25675)
Isotr0py Sep 25, 2025
3468f17
[V0 deprecation] Remove _VLLM_V1 suffixes from attention backend name…
MatthewBonanni Sep 25, 2025
0fa673a
[V0 deprecation] Clean up LoRA (#25686)
jeejeelee Sep 25, 2025
6b0fcbb
[Misc] Simplify `test_argsort_mm_positions` (#25690)
DarkLight1337 Sep 25, 2025
3d54bdc
[Optimization] Streamline `InputPreprocessor` (#25702)
DarkLight1337 Sep 25, 2025
89fa54e
[Optimization] Use a cheaper cache key in `get_model_architecture` (#…
DarkLight1337 Sep 25, 2025
e71b8e2
[Spec Decode] Add Batch Parallel Ngram. Upto 8x lower overhead. (#24986)
ekagra-ranjan Sep 25, 2025
8c435c9
[Core] Enable command line logging for LLMEngine (#25610)
zhuohan123 Sep 25, 2025
57329a8
[Model] rename NemotronH_Nano_VL -> NemotronH_Nano_VL_V2 (#25708)
tomeras91 Sep 25, 2025
081b559
Fix routing_bias dtype (#25711)
wenscarl Sep 25, 2025
9fe4c2b
[Refactor] Remove DeepGEMM OP Register (#25710)
yewentao256 Sep 26, 2025
8b77328
[Misc] Don't log shm dequeue delay warning on worker side (#25720)
njhill Sep 26, 2025
53a3084
Llamas 3.1 405B fp4 changes upstreaming from 355_wip (#25135)
maleksan85 Sep 26, 2025
13dd93c
[Core] Force PIECEWISE CUDAGraph mode for encoder-decoder (#25701)
russellb Sep 26, 2025
983056e
[Misc] Remove unnecessary memoryviews in shm_broadcast.py (#25721)
njhill Sep 26, 2025
392edee
EVS Support (Video tokens pruning) (#22980)
BloodAxe Sep 26, 2025
3edf87d
[CI/Build] fix doc build warning: Failed to get 'name: description' p…
yitingdc Sep 26, 2025
e84e073
fix: revert cast to cpu in `MsgpackEncoder._encode_tensor` to avoid h…
qthequartermasterman Sep 26, 2025
d48f4d6
perf: Avoid copying inputs_embeds tensors to GPU unless prompt_embeds…
qthequartermasterman Sep 26, 2025
52621c8
[Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI300…
xaguilar-amd Sep 26, 2025
6e30010
fix: print outputt offline_inference/base/chat.py example (#25744)
Iceber Sep 26, 2025
99b3a50
[Qwen3-Next][GDN] fixes cuda graph capturing bug in GDN metadata and …
sighingnow Sep 26, 2025
dd70437
Remove cuda hard-code in compute_causal_conv1d_metadata (#25555)
wxsIcey Sep 26, 2025
19f76ee
[misc] refactor speculative config (#25657)
yyzxw Sep 26, 2025
dfb9af2
[Bugfix] Fix Shared Expert/Zero expert code in FusedMoE.process_chunk…
SageMoore Sep 26, 2025
b03b1b9
Support LongCat-Flash-Chat tool call (#24083)
Xu-Wenqing Sep 26, 2025
633f943
[Doc] Update Batch-level DP docs (#25757)
DarkLight1337 Sep 26, 2025
2b6b1d7
[Model] Mamba2 varlen refactor (#21467)
cyang49 Sep 26, 2025
2827b3f
[CI] Fix test_shared_storage_connector_hashes (#25748)
chaunceyjiang Sep 26, 2025
fe6b19c
[Bugfix] Properly abort pooling request. (#25734)
noooop Sep 26, 2025
bc9d7b5
[CI/Build] Split up Distributed Tests (#25572)
DarkLight1337 Sep 26, 2025
db1e42f
[CI/Build] Fix some V1 tests not being run (#25569)
DarkLight1337 Sep 26, 2025
d4d9899
[Quantization] Add field to skip unquantized modules for GPTQ config …
Isotr0py Sep 26, 2025
984d184
[BugFix] Fix using `dbo_decode_token_threshold` always (and ignoring …
LucasWilkinson Sep 26, 2025
8d52f2b
[ray][metrics] Replace ':' with '_' for OpenTelemetry compatibility i…
eicherseiji Sep 26, 2025
56aafa8
[Misc] fix unique_filepath (#25732)
ZJY0516 Sep 26, 2025
33f6aaf
Eagle3 that supports the Minicpm3 model (#24243)
LDLINGLINGLING Sep 26, 2025
b761df9
[Doc]: improve CPU(x86) build-wheel-from-source section (#25617)
brokedba Sep 26, 2025
11aafd9
[Bugfix] Improve GLM4 MoE Reasoning Parser's is_reasoning_end Conditi…
frankwang28 Sep 26, 2025
0002b7f
[Docs] Add Toronto Meetup (#25773)
mgoin Sep 26, 2025
f708bd4
[CI] Add E2E Blackwell Quantized MoE Test (#25723)
mgoin Sep 26, 2025
f075693
[V1] address post issues related to #20059 (part 1) (#23046)
fhl2000 Sep 26, 2025
cf89202
[CI] Fix FlashInfer AOT in release docker image (#25730)
mgoin Sep 26, 2025
c70ac4b
[spec decode] Consolidate speculative decode method name for MTP (#25…
zixi-qi Sep 26, 2025
4778b42
Reduce the Cuda Graph memory footprint when running with DBO (#25779)
SageMoore Sep 26, 2025
dc48ba0
Kernel-override Determinism [1/n] (#25603)
bwasti Sep 26, 2025
4e33a7e
[Bugfix] Optimize CpuGpuBuffer initialization (#25447)
namanlalitnyu Sep 27, 2025
6f5c093
[Spec decode] automatically disable mm for text-only draft models (#2…
jmkuebler Sep 27, 2025
8bf8f45
[Core] Don't count preempted tokens in prefix cache hit rate (#25787)
zhuohan123 Sep 27, 2025
3958b96
Add option to restrict media domains (#25783)
russellb Sep 27, 2025
92da847
Add flashinfer-build.sh and register precompiled cu128 wheel in Docke…
mgoin Sep 27, 2025
f1d53d1
[Multimodal][Speculative Decoding]Eagle Eagle3 mm support, enablement…
david6666666 Sep 27, 2025
c242c98
[Bugfix] Allow Only SDPA Backend for ViT on B200 for Qwen3-VL (#25788)
yewentao256 Sep 27, 2025
d346ec6
[CI/Build] Consolidate model loader tests and requirements (#25765)
DarkLight1337 Sep 27, 2025
b3613e3
[CI/Build] Add timing to Model Executor Test (#25799)
22quinn Sep 27, 2025
cd87bfb
[CI/Build] Reorganize root-level V1 tests (#25767)
DarkLight1337 Sep 27, 2025
3939152
[Misc] Fix codeowners override for v1 sample and attention (#25037)
22quinn Sep 27, 2025
23b8ee6
[Misc] Update openai client example file for multimodal (#25795)
ywang96 Sep 27, 2025
1761739
[Bugfix] Add missing `image_size` for phi4_multimodal (#25796)
Renovamen Sep 27, 2025
27d7638
[Bugfix] Merge MM embeddings by index instead of token IDs (#16229)
DarkLight1337 Sep 27, 2025
3f5d902
Validate API tokens in constant time (#25781)
russellb Sep 27, 2025
7977e50
Add filtering for chat template kwargs (#25794)
russellb Sep 27, 2025
ec152c8
Fix GPTQ model loading in Transformers backend (#25770)
hmellor Sep 27, 2025
f9df8b4
[Bugfix] Fix triton import precommit failure (#25803)
tlrmchlsmth Sep 27, 2025
a5354b3
[Bugfix][WideEP] Apply TP Attn + EP MoE fix to other models (#24982)
tlrmchlsmth Sep 27, 2025
ecb37e2
[docs] transcriptions API audio upload (#25446)
yyzxw Sep 27, 2025
49996cd
[env] default nixl side port conflicts with kv-event zmq port (#25056)
panpan0000 Sep 27, 2025
b65e56b
[Core] Refactor self.model() to call a helper for subclassing. (#25084)
patrick-toulme Sep 27, 2025
c0ec818
[torch.compile]: Add VLLM_DEBUG_DUMP_PATH environment variable (#25651)
ZJY0516 Sep 27, 2025
5546acb
[Bug]: Set LD_LIBRARY_PATH to include the 'standard' CUDA location (#…
smarterclayton Sep 27, 2025
c216119
[Core] GC Debug callback (#24829)
Jialin Sep 27, 2025
da63274
[Bugfix][NIXL] Fix Async Scheduler timeout issue (#25808)
NickLucche Sep 27, 2025
6931144
[MM] Optimize memory profiling for scattered multimodal embeddings (#…
ywang96 Sep 28, 2025
6144754
[Bugfix] Fix Qwen3-VL regression from #24982 (#25814)
ywang96 Sep 28, 2025
0efd540
[VLM] Update Qwen3-VL max_num_video_tokens calculation for configurab…
Isotr0py Sep 28, 2025
f4e4088
Fix random dataset mismatched token length with config. (#24937)
weireweire Sep 28, 2025
b1ded11
Update GLM-4.5 Doc transformers version (#25830)
zRzRzRzRzRzRzR Sep 28, 2025
471997a
[Bugfix] fix Qwen3VLMoe load when pp > 1 (#25838)
JJJYmmm Sep 28, 2025
0307428
Remove redundant cudagraph dispatcher warning (#25841)
mgoin Sep 28, 2025
a3ae45a
[Misc] fix tests failure by using current_platform (#25825)
kingsmad Sep 29, 2025
9b44a7d
[P/D] NIXL Updates (#25844)
robertgshaw2-redhat Sep 29, 2025
219cfbe
Add Phi4FlashForCausalLM to _PREVIOUSLY_SUPPORTED_MODELS (#25832)
tdoublep Sep 29, 2025
143844f
[XPU]Fix xpu spec decoding UTs, avoid using cuda graph (#25847)
jikunshang Sep 29, 2025
65ecb4f
[Bugfix] Fallback ViT attn backend to SDPA for blackwell (#25851)
ywang96 Sep 29, 2025
bd51f78
[V0 Deprecation][Models] Remove all V0 condition for mm embeddings me…
Isotr0py Sep 29, 2025
1b67b04
[Misc] Remove more `get_input_embeddings_v0` (#25857)
DarkLight1337 Sep 29, 2025
9360d34
update to latest deepgemm for dsv3.2 (#25871)
youkaichao Sep 29, 2025
edbaadd
[Bugfix] Fix requirements paths in install instructions (#25827)
yingjun-mou Sep 29, 2025
8616300
[Model][Bugfix] Fix issues in MiDashengLM implementation for quantize…
zhoukezi Sep 29, 2025
4322723
[torch.compile] serialize cudagraph_mode as its enum name instead of …
ZJY0516 Sep 29, 2025
d0d138b
[Nixl][P/D] Add cuda2cpu support (HD->DH transfer) (#24690)
chenxi-yang Sep 29, 2025
145ac73
[Bugfix][Speculative Decoding] Fix Eagle3 quantization config issue (…
rahul-tuli Sep 29, 2025
0899ba5
[CI/Build] Include Transformers backend test in nightly transformers …
Isotr0py Sep 29, 2025
e61eb5e
[Model] Remove MotifForCausalLM (#25866)
jeejeelee Sep 29, 2025
d5ab285
[Bugfix] Use correct key "ignore" for config.json non-quantized layer…
leejnau Sep 29, 2025
c42ff4f
[BugFix][torch.compile] KV scale calculation issues with FP8 quantiza…
adabeyta Sep 29, 2025
9bedac9
[Doc] Add documentation for vLLM continuous benchmarking and profilin…
namanlalitnyu Sep 29, 2025
61a3431
[Bugfix][ROCm] Fixing trying to import non-existent symbols from libn…
gshtras Sep 29, 2025
fea3e47
[Kernel] Chunk-aligned mamba2 (#24683)
tdoublep Sep 29, 2025
8eb0a1d
[Doc] Polish example for torchrun dp (#25899)
zhuohan123 Sep 29, 2025
2e4fe48
[NIXL] Increase default KV block eviction timeout on P (#25897)
NickLucche Sep 29, 2025
6a113d9
[V0 Deprecation] Remove `vllm.worker` and update according imports (#…
aarnphm Sep 29, 2025
78a47f8
Test Prompt Embeds/LoRA compatibility and Enable LoRA Support for OPT…
qthequartermasterman Sep 30, 2025
89e4050
[Bug] Fix Weight Loading for Block FP8 Cutlass SM90 (#25909)
yewentao256 Sep 30, 2025
d3bd171
[Benchmark] Support benchmark throughput for external launcher DP (#2…
zhuohan123 Sep 30, 2025
61aedb5
Move`VllmConfig` from `config/__init__.py` to `config/vllm.py` (#25271)
hmellor Sep 30, 2025
23194d8
[BugFix] Fix DP/EP hang (#25906)
LucasWilkinson Sep 30, 2025
e47433b
[BugFix] Pass config_format via try_get_generation_config (#25912)
acisseJZhong Sep 30, 2025
2e1b8bc
[Model][Bugfix] Fix MiDashengLM audio encoder mask by removing incorr…
zhoukezi Sep 30, 2025
e23cacd
[Bugfix]: Clean up chunked prefill logging when using whisper (#25075)
simondanielsson Sep 30, 2025
fa7e254
[New Model] DeepSeek-V3.2 (Rebased to Main) (#25896)
zyongye Sep 30, 2025
8d0afa9
[Doc] Add Cambricon MLU support (#25942)
a120092009 Sep 30, 2025
1ad3aca
Updated TRL integration docs (#25684)
sergiopaniego Sep 30, 2025
ef6e0e7
[Bugfix][Model]fix ernie45 moe gate&bias dtype to float32 (#25936)
CSWYF3634076 Sep 30, 2025
d7e34b4
[Model] Move `vision_feature_select_strategy` into `resolve_visual_en…
DarkLight1337 Sep 30, 2025
e184c9c
[perf] Use CPU tensor to reduce GPU->CPU sync (#25884)
lhtin Sep 30, 2025
80608ba
[NIXL] Add support for MLA caches with different latent dim (#25902)
NickLucche Sep 30, 2025
bc546f7
[CI] Move applicable tests to CPU (#24080)
rzabarazesh Sep 30, 2025
bb6d430
[Fix] Improve CPU backend compatibility for RISC-V (#25816)
ihb2032 Sep 30, 2025
35fe398
[Kernel][Moe Configs] Add more tuned triton configs for ExpertsInt8 a…
Josephasafg Sep 30, 2025
099aaee
Add Hugging Face Inference Endpoints guide to Deployment docs (#25886)
sergiopaniego Sep 30, 2025
f4db5e6
[Bugfix][Model] Fix inference for Hunyuan dense models (#25354)
Anionex Sep 30, 2025
ef28354
[Bugfix] Fix accuracy issue of TRTLLM FP8 MOE and improve logging (#2…
pavanimajety Sep 30, 2025
9f1c4ec
[Bugfix] Token type and position embeddings fail to be applied to `in…
DarkLight1337 Sep 30, 2025
a2e6fa7
[bugfix][deepseek] fix flashmla kernel selection (#25956)
youkaichao Sep 30, 2025
2682bb7
Squashed commit of nm/lwilkinson/dbo-alt-schedules changes relative t…
LucasWilkinson Sep 30, 2025
fcd015c
get deep ll to run
LucasWilkinson Sep 30, 2025
6dc3492
fix
LucasWilkinson Sep 30, 2025
2d7ca92
cleanup
LucasWilkinson Sep 30, 2025
4effe25
cleanup
LucasWilkinson Sep 30, 2025
2 changes: 1 addition & 1 deletion .buildkite/release-pipeline.yaml
@@ -76,7 +76,7 @@ steps:
queue: arm64_cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.9.1 --build-arg FLASHINFER_AOT_COMPILE=true --build-arg torch_cuda_arch_list='8.7 9.0 10.0+PTX 12.0' --build-arg INSTALL_KV_CONNECTORS=true --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m) --target vllm-openai --progress plain -f docker/Dockerfile ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT-$(uname -m)"

# Add job to create multi-arch manifest
10 changes: 0 additions & 10 deletions .buildkite/scripts/hardware_ci/run-amd-test.sh
@@ -86,10 +86,6 @@ if [[ $commands == *"pytest -v -s models/test_registry.py"* ]]; then
commands=${commands//"pytest -v -s models/test_registry.py"/"pytest -v -s models/test_registry.py -k 'not BambaForCausalLM and not GritLM and not Mamba2ForCausalLM and not Zamba2ForCausalLM'"}
fi

if [[ $commands == *"VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4 and not plamo2'"* ]]; then
commands=${commands//"VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4 and not plamo2'"/"VLLM_USE_V1=0 pytest -v -s models/test_initialization.py -k 'not llama4 and not plamo2 and not BambaForCausalLM and not Gemma2ForCausalLM and not Grok1ModelForCausalLM and not Zamba2ForCausalLM and not Gemma2Model and not GritLM'"}
fi

if [[ $commands == *"pytest -v -s compile/test_basic_correctness.py"* ]]; then
commands=${commands//"pytest -v -s compile/test_basic_correctness.py"/"VLLM_USE_TRITON_FLASH_ATTN=0 pytest -v -s compile/test_basic_correctness.py"}
fi
@@ -167,12 +163,6 @@ if [[ $commands == *" entrypoints/llm "* ]]; then
--ignore=entrypoints/llm/test_prompt_validation.py "}
fi

#Obsolete currently
##ignore certain Entrypoints/llm tests
#if [[ $commands == *" && pytest -v -s entrypoints/llm/test_guided_generate.py"* ]]; then
# commands=${commands//" && pytest -v -s entrypoints/llm/test_guided_generate.py"/" "}
#fi

# --ignore=entrypoints/openai/test_encoder_decoder.py \
# --ignore=entrypoints/openai/test_embedding.py \
# --ignore=entrypoints/openai/test_oot_registration.py
7 changes: 2 additions & 5 deletions .buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -58,11 +58,8 @@ function cpu_tests() {
# pytest -x -v -s tests/kernels/attention/test_cache.py -m cpu_model
# pytest -x -v -s tests/kernels/attention/test_mla_decode_cpu.py -m cpu_model

# Note: disable Bart until supports V1
pytest -x -v -s tests/models/language/generation -m cpu_model \
--ignore=tests/models/language/generation/test_bart.py
VLLM_CPU_SGL_KERNEL=1 pytest -x -v -s tests/models/language/generation -m cpu_model \
--ignore=tests/models/language/generation/test_bart.py
pytest -x -v -s tests/models/language/generation -m cpu_model
VLLM_CPU_SGL_KERNEL=1 pytest -x -v -s tests/models/language/generation -m cpu_model

pytest -x -v -s tests/models/language/pooling -m cpu_model
pytest -x -v -s tests/models/multimodal/generation \
2 changes: 1 addition & 1 deletion .buildkite/scripts/hardware_ci/run-tpu-v1-test-part2.sh
@@ -62,7 +62,7 @@ echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
&& python3 -m pip install --progress-bar off hf-transfer
&& python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"
export VLLM_USE_V1=1
export VLLM_XLA_CHECK_RECOMPILATION=1
2 changes: 1 addition & 1 deletion .buildkite/scripts/hardware_ci/run-tpu-v1-test.sh
@@ -62,7 +62,7 @@ echo "--- Installing Python dependencies ---"
python3 -m pip install --progress-bar off git+https://github.com/thuml/depyf.git \
&& python3 -m pip install --progress-bar off pytest pytest-asyncio tpu-info \
&& python3 -m pip install --progress-bar off "lm-eval @ git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d" \
&& python3 -m pip install --progress-bar off hf-transfer
&& python3 -m pip install --progress-bar off hf-transfer tblib==3.1.0
echo "--- Python dependencies installed ---"
export VLLM_USE_V1=1
export VLLM_XLA_CHECK_RECOMPILATION=1
7 changes: 3 additions & 4 deletions .buildkite/scripts/hardware_ci/run-xpu-test.sh
@@ -35,16 +35,15 @@ docker run \
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 -O3 -O.cudagraph_mode=NONE
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend ray
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager -tp 2 --distributed-executor-backend mp
VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
VLLM_ATTENTION_BACKEND=TRITON_ATTN python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --block-size 64 --enforce-eager
cd tests
pytest -v -s v1/core
pytest -v -s v1/engine
pytest -v -s v1/sample --ignore=v1/sample/test_logprobs.py --ignore=v1/sample/test_logprobs_e2e.py
pytest -v -s v1/worker --ignore=v1/worker/test_gpu_model_runner.py
pytest -v -s v1/structured_output
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_eagle.py --ignore=v1/spec_decode/test_tree_attention.py
pytest -v -s v1/spec_decode --ignore=v1/spec_decode/test_max_len.py --ignore=v1/spec_decode/test_tree_attention.py
pytest -v -s v1/kv_connector/unit --ignore=v1/kv_connector/unit/test_multi_connector.py --ignore=v1/kv_connector/unit/test_nixl_connector.py --ignore=v1/kv_connector/unit/test_shared_storage_connector.py
pytest -v -s v1/test_metrics
pytest -v -s v1/test_serial_utils.py
pytest -v -s v1/test_utils.py
pytest -v -s v1/test_metrics_reader.py
'
59 changes: 59 additions & 0 deletions .buildkite/scripts/run-prime-rl-test.sh
@@ -0,0 +1,59 @@
#!/bin/bash
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

# Setup script for Prime-RL integration tests
# This script prepares the environment for running Prime-RL tests with nightly vLLM

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
PRIME_RL_REPO="https://github.com/PrimeIntellect-ai/prime-rl.git"
PRIME_RL_DIR="${REPO_ROOT}/prime-rl"

echo "Setting up Prime-RL integration test environment..."

# Clean up any existing Prime-RL directory
if [ -d "${PRIME_RL_DIR}" ]; then
echo "Removing existing Prime-RL directory..."
rm -rf "${PRIME_RL_DIR}"
fi

# Install UV if not available
if ! command -v uv &> /dev/null; then
echo "Installing UV package manager..."
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
fi

# Clone Prime-RL repository at specific branch for reproducible tests
PRIME_RL_BRANCH="integ-vllm-main"
echo "Cloning Prime-RL repository at branch: ${PRIME_RL_BRANCH}..."
git clone --branch "${PRIME_RL_BRANCH}" --single-branch "${PRIME_RL_REPO}" "${PRIME_RL_DIR}"
cd "${PRIME_RL_DIR}"

echo "Setting up UV project environment..."
export UV_PROJECT_ENVIRONMENT=/usr/local
ln -s /usr/bin/python3 /usr/local/bin/python

# Remove vllm pin from pyproject.toml
echo "Removing vllm pin from pyproject.toml..."
sed -i '/vllm==/d' pyproject.toml

# Sync Prime-RL dependencies
echo "Installing Prime-RL dependencies..."
uv sync --inexact && uv sync --inexact --all-extras

# Verify installation
echo "Verifying installations..."
uv run python -c "import vllm; print(f'vLLM version: {vllm.__version__}')"
uv run python -c "import prime_rl; print('Prime-RL imported successfully')"

echo "Prime-RL integration test environment setup complete!"

echo "Running Prime-RL integration tests..."
export WANDB_MODE=offline # this makes this test not require a WANDB_API_KEY
uv run pytest -vs tests/integration/test_rl.py -m gpu

echo "Prime-RL integration tests completed!"