Merged
1684 commits
1e57b1e
[Misc] Remove unnecessary decode call (#12833)
DarkLight1337 Feb 6, 2025
85ac82d
[Kernel] Make rotary_embedding ops more flexible with input shape (#1…
Isotr0py Feb 6, 2025
09b95e3
[torch.compile] PyTorch 2.6 and nightly compatibility (#12393)
youkaichao Feb 6, 2025
afe74f7
[Doc] double quote cmake package in build.inc.md (#12840)
jitseklomp Feb 6, 2025
8108ac8
[Bugfix] Fix unsupported FA version check for Turing GPU (#12828)
Isotr0py Feb 6, 2025
467a96a
[V1] LoRA Support (#10957)
varun-sundar-rabindranath Feb 6, 2025
aff4045
Add Bamba Model (#10909)
fabianlim Feb 6, 2025
741429a
[MISC] Check space in the file names in the pre commit checks (#12804)
houseroad Feb 6, 2025
b260782
[misc] Revert # 12833 (#12857)
khluu Feb 7, 2025
ef533d2
[Bugfix] FA2 illegal memory access (#12848)
LucasWilkinson Feb 7, 2025
433c4a4
Make vllm compatible with verl (#12824)
ZSL98 Feb 7, 2025
aa375dc
[Bugfix] Missing quant_config in deepseek embedding layer (#12836)
SzymonOzog Feb 7, 2025
6e1fc61
Prevent unecessary requests to huggingface hub (#12837)
maxdebayser Feb 7, 2025
1918aa1
[MISC][EASY] Break check file names into entry and args in the pre-co…
houseroad Feb 7, 2025
ce26b16
[Misc] Remove unnecessary detokenization in multimodal processing (#1…
DarkLight1337 Feb 7, 2025
538fab9
PR #12718 (#12718)
garg-amit Feb 7, 2025
0630d45
[V1] Logprobs and prompt logprobs support (#9880)
afeldman-nm Feb 7, 2025
eaa92d4
[ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activa…
tjtanaa Feb 7, 2025
932c6b7
[V1] LM Eval With Streaming Integration Tests (#11590)
robertgshaw2-redhat Feb 7, 2025
45cbc49
[Bugfix] Fix disagg hang caused by the prefill and decode communicati…
houseroad Feb 8, 2025
b21f0f9
[V1][Minor] Remove outdated comment (#12928)
WoosukKwon Feb 8, 2025
3243158
[V1] Move KV block hashes from Request to KVCacheManager (#12922)
WoosukKwon Feb 8, 2025
306923d
[Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mappin…
jeejeelee Feb 8, 2025
cc01223
[Misc] Fix typo in the example file (#12896)
DK-DARKmatter Feb 8, 2025
d01f66b
[Bugfix] Fix multi-round chat error when mistral tokenizer is used (#…
zifeitong Feb 8, 2025
91dd8f7
[bugfix] respect distributed_executor_backend in world_size=1 (#12934)
youkaichao Feb 8, 2025
e31498b
[Misc] Add offline test for disaggregated prefill (#12418)
Shaoting-Feng Feb 8, 2025
4ea48fb
[V1][Minor] Move cascade attn logic outside _prepare_inputs (#12943)
WoosukKwon Feb 8, 2025
407b553
[Build] Make pypi install work on CPU platform (#12874)
wangxiyuan Feb 8, 2025
2880e21
[Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel…
scsudhak-intel Feb 8, 2025
7e18376
[misc] Add LoRA to benchmark_serving (#12898)
varun-sundar-rabindranath Feb 8, 2025
011e612
[Misc] Log time consumption on weight downloading (#12926)
waltforme Feb 8, 2025
c45d398
[CI] Resolve transformers-neuronx version conflict (#12925)
liangfu Feb 8, 2025
256a2d2
[Doc] Correct HF repository for TeleChat2 models (#12949)
waltforme Feb 8, 2025
4c8dd12
[Misc] Add qwen2.5-vl BNB support (#12944)
Isotr0py Feb 8, 2025
8a69e0e
[CI/Build] Auto-fix Markdown files (#12941)
DarkLight1337 Feb 8, 2025
913df14
[Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU …
ShangmingCai Feb 8, 2025
fe743b7
[bugfix] fix early import of flash attention (#12959)
youkaichao Feb 8, 2025
86222a3
[VLM] Merged multi-modal processor for GLM4V (#12449)
jeejeelee Feb 8, 2025
870c374
[V1][Minor] Remove outdated comment (#12968)
WoosukKwon Feb 8, 2025
d366ccc
[RFC] [Mistral] FP8 format (#10130)
patrickvonplaten Feb 8, 2025
24700c3
[V1] Cache `uses_mrope` in GPUModelRunner (#12969)
WoosukKwon Feb 8, 2025
cf797aa
[core] port pynvml into vllm codebase (#12963)
youkaichao Feb 9, 2025
29f1d47
[MISC] Always import version library first in the vllm package (#12979)
houseroad Feb 9, 2025
59fff4a
[core] improve error handling when wake up from sleep mode (#12981)
youkaichao Feb 10, 2025
aa0ca5e
[core][rlhf] add colocate example for RLHF (#12984)
youkaichao Feb 10, 2025
67c4637
[V1] Use msgpack for core request serialization (#12918)
njhill Feb 10, 2025
44607e0
Check if selected backend is None in get_attn_backend_cls() (#12975)
terrytangyuan Feb 10, 2025
b2496bb
[core] fix sleep mode and pytorch checkpoint compatibility (#13001)
youkaichao Feb 10, 2025
2431371
[Doc] Add link to tool_choice tracking issue in tool_calling.md (#13003)
terrytangyuan Feb 10, 2025
fde7126
[misc] Add retries with exponential backoff for HF file existence che…
khluu Feb 10, 2025
51f0b5f
[Bugfix] Clean up and fix multi-modal processors (#13012)
DarkLight1337 Feb 10, 2025
2ae8890
Fix seed parameter behavior in vLLM (#13007)
SmartManoj Feb 10, 2025
08b2d84
[Model] Ultravox Model: Support v0.5 Release (#12912)
farzadab Feb 10, 2025
91e8767
[misc] Fix setup.py condition to avoid AMD from being mistaken with C…
khluu Feb 11, 2025
2ff4857
[V1][Minor] Move scheduler outputs to a separate file (#13062)
WoosukKwon Feb 11, 2025
2c0f582
[Docs] Annouce Meta Meetup (#13065)
simon-mo Feb 11, 2025
cb080f3
[Bugfix] Support missing tool parameters in mistral tokenizer (#12884)
fgreinacher Feb 11, 2025
58047c6
[Benchmark] Add BurstGPT to benchmark_serving (#13063)
WoosukKwon Feb 11, 2025
c320ca8
[Core] Don't do platform detection at import time (#12933)
russellb Feb 11, 2025
78a141d
[Misc] LoRA - Refactor Punica ops tests (#12970)
varun-sundar-rabindranath Feb 11, 2025
fc6485d
[Bugfix]: Reasoning output bug according to the chat template change …
gaocegege Feb 11, 2025
41c5dd4
[V1][Metrics] Add GPU prefix cache hit rate % gauge (#12592)
comaniac Feb 11, 2025
9cf4759
[executor] init `local_rank` as device index (#13027)
MengqingCao Feb 11, 2025
7539bbc
[ROCm] Using a more precise memory profiling (#12624)
gshtras Feb 11, 2025
da31719
[Build] Fix cuda link target of cumem_allocator in CPU env (#12863)
guoyuhong Feb 11, 2025
2e3b969
[Platform] add pre_register_and_update function (#12432)
wangxiyuan Feb 11, 2025
110f59a
[Bugfix] fix flaky test (#13089)
SmartManoj Feb 11, 2025
75e6e14
[V1][Metrics] Add several request timing histograms (#12644)
markmc Feb 11, 2025
ad97763
Set `torch_dtype` in `TransformersModel` (#13088)
hmellor Feb 11, 2025
bf3e052
[Misc] Fix typo at comments at metrics.py (#13024)
je1lee Feb 11, 2025
21f5d50
[Bugfix] Do not use resource module on Windows (#12858) (#13029)
MoonRide303 Feb 11, 2025
6c4dbe2
[BugFix] Pop instead of del CUDA_VISIBLE_DEVICES (#12962)
HollowMan6 Feb 11, 2025
2b25b7d
Fix initializing GGUF weights for ColumnParallelLinear when using ten…
SzymonOzog Feb 11, 2025
565c1ef
[CI/Build][Bugfix] Fix CPU backend default threads num (#13077)
bigPYJ1151 Feb 11, 2025
deb6c1c
[Doc] Improve OpenVINO installation doc (#13102)
hmellor Feb 11, 2025
14ecab5
[Bugfix] Guided decoding falls back to outlines when fails to import …
terrytangyuan Feb 11, 2025
72c2b68
[Misc] Move pre-commit suggestion back to the end (#13114)
russellb Feb 11, 2025
3ee696a
[RFC][vllm-API] Support tokenizer registry for customized tokenizer i…
youngkent Feb 12, 2025
974dfd4
[Model] IBM/NASA Prithvi Geospatial model (#12830)
christian-pinto Feb 12, 2025
842b0fd
[ci] Add more source file dependencies for some tests (#13123)
khluu Feb 12, 2025
e92694b
[Neuron][Kernel] Support Longer Sequences in NKI-based Flash PagedAtt…
lingfanyu Feb 12, 2025
a0597c6
Bump helm/kind-action from 1.10.0 to 1.12.0 (#11612)
dependabot[bot] Feb 12, 2025
dd3b4a0
Bump actions/stale from 9.0.0 to 9.1.0 (#12462)
dependabot[bot] Feb 12, 2025
0c7d9ef
Bump helm/chart-testing-action from 2.6.1 to 2.7.0 (#12463)
dependabot[bot] Feb 12, 2025
d59def4
Bump actions/setup-python from 5.3.0 to 5.4.0 (#12672)
dependabot[bot] Feb 12, 2025
7c4033a
Further reduce the HTTP calls to huggingface.co (#13107)
maxdebayser Feb 12, 2025
f1042e8
[Misc] AMD Build Improvements (#12923)
842974287 Feb 12, 2025
f4d97e4
[Bug] [V1] Try fetching stop_reason from EngineOutput before checking…
bnellnm Feb 12, 2025
985b4a2
[Bugfix] Fix num video tokens calculation for Qwen2-VL (#13148)
DarkLight1337 Feb 12, 2025
314cfad
[Frontend] Generate valid tool call IDs when using `tokenizer-mode=mi…
rafvasq Feb 12, 2025
82cabf5
[Misc] Delete unused LoRA modules (#13151)
jeejeelee Feb 12, 2025
042c341
Introduce VLLM_CUDART_SO_PATH to allow users specify the .so path (#1…
houseroad Feb 12, 2025
2c2b560
[CI/Build] Use mypy matcher for pre-commit CI job (#13162)
russellb Feb 12, 2025
36a0863
[CORE] [QUANT] Support for GPTQModel's `dynamic` quantization per mod…
Qubitium Feb 12, 2025
09972e7
[Bugfix] Allow fallback to AWQ from AWQMarlin at per-layer granularit…
mgoin Feb 12, 2025
14b7899
[CI] Fix failing FP8 cpu offload test (#13170)
mgoin Feb 12, 2025
4c0d93f
[V1][Bugfix] Copy encoder input ids to fix set iteration issue during…
andoorve Feb 12, 2025
8eafe5e
[CI/Build] Ignore ruff warning up007 (#13182)
russellb Feb 13, 2025
9f9704d
[perf-benchmark] cleanup unused Docker images and volumes in H100 ben…
khluu Feb 13, 2025
4fc5c23
[NVIDIA] Support nvfp4 quantization (#12784)
kaixih Feb 13, 2025
d88c866
[Bugfix][Example] Fix GCed profiling server for TPU (#12792)
mgoin Feb 13, 2025
bc55d13
[VLM] Implement merged multimodal processor for Mllama (#11427)
Isotr0py Feb 13, 2025
009439c
Simplify logic of locating CUDART so file path (#13203)
houseroad Feb 13, 2025
60c68df
[Build] Automatically use the wheel of the base commit with Python-on…
comaniac Feb 13, 2025
04f50ad
[Bugfix] deepseek_r1_reasoning_parser put reason content in wrong fie…
LikeSundayLikeRain Feb 13, 2025
d46d490
[Frontend] Move CLI code into vllm.cmd package (#12971)
russellb Feb 13, 2025
cb944d5
Allow Unsloth Dynamic 4bit BnB quants to work (#12974)
danielhanchen Feb 13, 2025
0ccd876
[CI/Build] Allow ruff to auto-fix some issues (#13180)
russellb Feb 13, 2025
9605c12
[V1][core] Implement pipeline parallel on Ray (#12996)
ruisearch42 Feb 13, 2025
fa253f1
[VLM] Remove input processor from clip and siglip (#13165)
Isotr0py Feb 13, 2025
578087e
[Frontend] Pass pre-created socket to uvicorn (#13113)
russellb Feb 13, 2025
fdcf64d
[V1] Clarify input processing and multimodal feature caching logic (#…
ywang96 Feb 13, 2025
c9d3ecf
[VLM] Merged multi-modal processor for Molmo (#12966)
DarkLight1337 Feb 13, 2025
2092a6f
[V1][Core] Add worker_base for v1 worker (#12816)
AoyuQC Feb 13, 2025
02ed8a1
[Misc] Qwen2.5-VL Optimization (#13155)
wulipc Feb 13, 2025
1bc3b5e
[VLM] Separate text-only and vision variants of the same model archit…
DarkLight1337 Feb 13, 2025
37dfa60
[Bugfix] Missing Content Type returns 500 Internal Server Error (#13193)
vaibhavjainwiz Feb 13, 2025
d84cef7
[Frontend] Add `/v1/audio/transcriptions` OpenAI API endpoint (#12909)
NickLucche Feb 13, 2025
bffddd9
Add label if pre-commit passes (#12527)
hmellor Feb 13, 2025
2344192
Optimize moe_align_block_size for deepseek_v3 (#12850)
mgoin Feb 13, 2025
c1e37bf
[Kernel][Bugfix] Refactor and Fix CUTLASS 2:4 Sparse Kernels (#13198)
tlrmchlsmth Feb 14, 2025
e38be64
Revert "Add label if pre-commit passes" (#13242)
hmellor Feb 14, 2025
4108869
[ROCm] Avoid using the default stream on ROCm (#13238)
gshtras Feb 14, 2025
8c32b08
[Kernel] Fix awq error when n is not divisable by 128 (#13227)
jinzhen-lin Feb 14, 2025
dd5ede4
[V1] Consolidate MM cache size to vllm.envs (#13239)
ywang96 Feb 14, 2025
09545c0
[Bugfix/CI] Turn test_compressed_tensors_2of4_sparse back on (#13250)
tlrmchlsmth Feb 14, 2025
0676782
[Bugfix][CI] Inherit codespell settings from pyproject.toml in the pr…
tlrmchlsmth Feb 14, 2025
84683fa
[Bugfix] Offline example of disaggregated prefill (#13214)
XiaobingSuper Feb 14, 2025
40932d7
[Misc] Remove redundant statements in scheduler.py (#13229)
WrRan Feb 14, 2025
f2b20fe
Consolidate Llama model usage in tests (#13094)
hmellor Feb 14, 2025
f0b2da7
Expand MLA to support most types of quantization (#13181)
mgoin Feb 14, 2025
cbc4012
[V1] LoRA - Enable Serving Usecase (#12883)
varun-sundar-rabindranath Feb 14, 2025
ba59b78
[ROCm][V1] Add intial ROCm support to V1 (#12790)
SageMoore Feb 14, 2025
b0ccfc5
[Bugfix][V1] GPUModelRunner._update_states should return True when th…
imkero Feb 14, 2025
45f90bc
[WIP] TPU V1 Support Refactored (#13049)
alexm-redhat Feb 14, 2025
185cc19
[Frontend] Optionally remove memory buffer used for uploading to URLs…
pooyadavoodi Feb 14, 2025
83481ce
[Bugfix] Fix missing parentheses (#13263)
xu-song Feb 14, 2025
556ef7f
[Misc] Log time consumption of sleep and wake-up (#13115)
waltforme Feb 14, 2025
4da1f66
[VLM] Keep track of whether prompt replacements have been applied (#1…
DarkLight1337 Feb 14, 2025
085b7b2
[V1] Simplify GPUModelRunner._update_states check (#13265)
njhill Feb 14, 2025
6224a9f
Support logit_bias in v1 Sampler (#13079)
houseroad Feb 14, 2025
7734e9a
[Core] choice-based structured output with xgrammar (#12632)
russellb Feb 14, 2025
c9e2d64
[Hardware][Gaudi][Bugfix] Fix error for guided decoding (#12317)
Feb 14, 2025
5e5c8e0
[Quant][Perf] Use moe_wna16 kernel by default for MoEs with many expe…
mgoin Feb 14, 2025
3bcb8c7
[Core] Reduce TTFT with concurrent partial prefills (#10235)
joerunde Feb 14, 2025
a12934d
[V1][Core] min_p sampling support (#13191)
AoyuQC Feb 14, 2025
e7eea5a
[V1][CI] Fix failed v1-test because of min_p (#13316)
WoosukKwon Feb 15, 2025
6a854c7
[V1][Sampler] Don't apply temp for greedy-only (#13311)
njhill Feb 15, 2025
0c73026
[V1][PP] Fix memory profiling in PP (#13315)
WoosukKwon Feb 15, 2025
c9f9d5b
[Bugfix][AMD] Update torch_bindings so that scaled_fp4_quant isn't bu…
SageMoore Feb 15, 2025
579d7a6
[Bugfix][Docs] Fix offline Whisper (#13274)
NickLucche Feb 15, 2025
97a3d6d
[Bugfix] Massage MLA's usage of flash attn for RoCM (#13310)
tlrmchlsmth Feb 15, 2025
9076325
[BugFix] Don't scan entire cache dir when loading model (#13302)
njhill Feb 15, 2025
067fa22
[Bugfix]Fix search start_index of stop_checker (#13280)
xu-song Feb 15, 2025
7fdaaf4
[Bugfix] Fix qwen2.5-vl image processor (#13286)
Isotr0py Feb 15, 2025
2ad1bc7
[V1][Metrics] Add iteration_tokens_total histogram from V0 (#13288)
markmc Feb 15, 2025
ed0de3e
[AMD] [Model] DeepSeek tunings (#13199)
rasmith Feb 15, 2025
9206b3d
[V1][PP] Run engine busy loop with batch queue (#13064)
comaniac Feb 15, 2025
54ed913
[ci/build] update flashinfer (#13323)
youkaichao Feb 15, 2025
367cb8c
[Doc] [2/N] Add Fuyu E2E example for multimodal processor (#13331)
DarkLight1337 Feb 15, 2025
80f63a3
[V1][Spec Decode] Ngram Spec Decode (#12193)
LiuXiaoxuanPKU Feb 16, 2025
12913d1
[Quant] Add `SupportsQuant` to phi3 and clip (#13104)
kylesayrs Feb 16, 2025
d3d547e
[Bugfix] Pin xgrammar to 0.1.11 (#13338)
mgoin Feb 16, 2025
dc0f7cc
[BugFix] Enhance test_pos_encoding to support execution on multi-devi…
wchen61 Feb 16, 2025
b7d3098
[V1] Update doc and examples for H2O-VL (#13349)
ywang96 Feb 16, 2025
124776e
[ci] skip failed tests for flashinfer (#13352)
youkaichao Feb 16, 2025
a0231b7
[platform] add base class for communicators (#13208)
youkaichao Feb 16, 2025
5d2965b
[Bugfix] Fix 2 Node and Spec Decode tests (#13341)
DarkLight1337 Feb 16, 2025
da833b0
[Docs] Change myenv to vllm. Update python_env_setup.inc.md (#13325)
arkylin Feb 16, 2025
7b89386
[V1][BugFix] Add __init__.py to v1/spec_decode/ (#13359)
WoosukKwon Feb 16, 2025
e18227b
[V1][PP] Cache Intermediate Tensors (#13353)
WoosukKwon Feb 16, 2025
d67cc21
[Bugfix][Platform][CPU] Fix cuda platform detection on CPU backend ed…
Isotr0py Feb 16, 2025
69e1d23
[V1][BugFix] Clean up rejection sampler & Fix warning msg (#13362)
WoosukKwon Feb 16, 2025
2010f04
[V1][Misc] Avoid unnecessary log output (#13289)
jeejeelee Feb 17, 2025
46cdd59
[Feature][Spec Decode] Simplify the use of Eagle Spec Decode (#12304)
ShangmingCai Feb 17, 2025
f857311
Fix spelling error in index.md (#13369)
yankooo Feb 17, 2025
4518683
Run v1 benchmark and integrate with PyTorch OSS benchmark database (#…
huydhn Feb 17, 2025
238dfc8
[MISC] tiny fixes (#13378)
MengqingCao Feb 17, 2025
7b623fc
[VLM] Check required fields before initializing field config in `Dict…
DarkLight1337 Feb 17, 2025
1f69c4a
[Model] Support Mamba2 (Codestral Mamba) (#9292)
tlrmchlsmth Feb 17, 2025
30513d1
[Bugfix] fix xpu communicator (#13368)
yma11 Feb 17, 2025
ce77eb9
[Bugfix] Fix VLLM_USE_MODELSCOPE issue (#13384)
r4ntix Feb 17, 2025
4c21ce9
[V1] Get input tokens from scheduler (#13339)
WoosukKwon Feb 17, 2025
6ac485a
[V1][PP] Fix intermediate tensor values (#13417)
comaniac Feb 17, 2025
cd4a72a
[V1][Spec decode] Move drafter to model runner (#13363)
WoosukKwon Feb 17, 2025
b3942e1
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallb…
tlrmchlsmth Feb 18, 2025
efbe854
[Misc] Remove dangling references to `SamplingType.BEAM` (#13402)
hmellor Feb 18, 2025
67ef8f6
[Model] Enable quantization support for `transformers` backend (#12960)
Isotr0py Feb 18, 2025
7c7adf8
[ROCm] fix get_device_name for rocm (#13438)
divakar-amd Feb 18, 2025
932b51c
[v1] fix parallel config rank (#13445)
youkaichao Feb 18, 2025
88787bc
[Quant] Molmo SupportsQuant (#13336)
kylesayrs Feb 18, 2025
00294e1
[Quant] Arctic SupportsQuant (#13366)
kylesayrs Feb 18, 2025
a1074b3
[Bugfix] Only print out chat template when supplied (#13444)
terrytangyuan Feb 18, 2025
ac19b51
[core] fix sleep mode in pytorch 2.6 (#13456)
youkaichao Feb 18, 2025
d1b649f
[Quant] Aria SupportsQuant (#13416)
kylesayrs Feb 18, 2025
9915912
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt (#13436)
WoosukKwon Feb 18, 2025
b53d799
Add outlines fallback when JSON schema has enum (#13449)
mgoin Feb 18, 2025
e2603fe
[Bugfix] Ensure LoRA path from the request can be included in err msg…
terrytangyuan Feb 18, 2025
8cf97f8
[Bugfix] Fix failing transformers dynamic module resolving with spawn…
Isotr0py Feb 18, 2025
2358ca5
[Doc]: Improve feature tables (#13224)
hmellor Feb 18, 2025
29fc577
[Bugfix] Remove noisy error logging during local model loading (#13458)
Isotr0py Feb 18, 2025
435b502
[ROCm] Make amdsmi import optional for other platforms (#13460)
DarkLight1337 Feb 18, 2025
d3231cb
[Bugfix] Handle content type with optional parameters (#13383)
zifeitong Feb 18, 2025
3809458
[Bugfix] Fix invalid rotary embedding unit test (#13431)
liangfu Feb 18, 2025
a02c86b
[CI/Build] migrate static project metadata from setup.py to pyproject…
dtrifiro Feb 18, 2025
4fb8142
[V1][PP] Enable true PP with Ray executor (#13472)
WoosukKwon Feb 18, 2025
7b203b7
[misc] fix debugging code (#13487)
youkaichao Feb 18, 2025
a4d577b
[V1][Tests] Adding additional testing for multimodal models to V1 (#1…
andoorve Feb 18, 2025
30172b4
[V1] Optimize handling of sampling metadata and req_ids list (#13244)
njhill Feb 18, 2025
c8d70e2
Pin Ray version to 2.40.0 (#13490)
WoosukKwon Feb 18, 2025
4c82229
[V1][Spec Decode] Optimize N-gram matching with Numba (#13365)
WoosukKwon Feb 18, 2025
00b69c2
[Misc] Remove dangling references to `--use-v2-block-manager` (#13492)
hmellor Feb 19, 2025
d0a7a27
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch (#12139)
Feb 19, 2025
9aa95b0
[perf-benchmark] Allow premerge ECR (#13509)
khluu Feb 19, 2025
8aada19
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe (#13503)
divakar-amd Feb 19, 2025
fd84857
[Doc] Add clarification note regarding paligemma (#13511)
ywang96 Feb 19, 2025
d5d214a
[1/n][CI] Load models in CI from S3 instead of HF (#13205)
khluu Feb 19, 2025
3b05cd4
[perf-benchmark] Fix ECR path for premerge benchmark (#13512)
khluu Feb 19, 2025
fdc5df6
use device param in load_model method (#13037)
Zzhiter Feb 19, 2025
983a40a
[Bugfix] Fix Positive Feature Layers in Llava Models (#13514)
alex-jw-brooks Feb 19, 2025
f525c0b
[Model][Speculative Decoding] DeepSeek MTP spec decode (#12755)
luccafong Feb 19, 2025
caf7ff4
[V1][Core] Generic mechanism for handling engine utility (#13060)
njhill Feb 19, 2025
4233302
[Feature] Pluggable platform-specific scheduler (#13161)
yannicks1 Feb 19, 2025
81dabf2
[CI/Build] force writing version file (#13544)
dtrifiro Feb 19, 2025
52ce14d
[doc] clarify profiling is only for developers (#13554)
youkaichao Feb 19, 2025
377d10b
[VLM][Bugfix] Pass processor kwargs properly on init (#13516)
DarkLight1337 Feb 19, 2025
5ae9f26
[Bugfix] Fix device ordinal for multi-node spec decode (#13269)
ShangmingCai Feb 19, 2025
ad5a35c
[doc] clarify multi-node serving doc (#13558)
youkaichao Feb 19, 2025
01c184b
Fix copyright year to auto get current year (#13561)
wilsonwu Feb 19, 2025
fbbe1fb
[MISC] Logging the message about Ray teardown (#13502)
comaniac Feb 19, 2025
550d97e
[Misc] Avoid calling unnecessary `hf_list_repo_files` for local model…
Isotr0py Feb 19, 2025
a4c402a
[BugFix] Avoid error traceback in logs when V1 `LLM` terminates (#13565)
njhill Feb 20, 2025
473f51c
[3/n][CI] Load Quantization test models with S3 (#13570)
khluu Feb 20, 2025
512368e
[Misc] Qwen2.5 VL support LoRA (#13261)
jeejeelee Feb 20, 2025
88f6ba3
[ci] Add AWS creds for AMD (#13572)
khluu Feb 20, 2025
0d243f2
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS (#13577)
divakar-amd Feb 20, 2025
ba81163
[core] add sleep and wake up endpoint and v1 support (#12987)
youkaichao Feb 20, 2025
8c755c3
[bugfix] spec decode worker get tp group only when initialized (#13578)
simon-mo Feb 20, 2025
9621667
[Misc] Warn if the vLLM version can't be retrieved (#13501)
alex-jw-brooks Feb 20, 2025
041e294
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL (#13533)
wulipc Feb 20, 2025
0023cd2
[ROCm] MI300A compile targets deprecation (#13560)
gshtras Feb 20, 2025
3738e6f
[API Server] Add port number range validation (#13506)
terrytangyuan Feb 20, 2025
497bc83
[CI/Build] Use uv in the Dockerfile (#13566)
mgoin Feb 20, 2025
aa1e62d
[ci] Fix spec decode test (#13600)
khluu Feb 20, 2025
a64a844
[2/n][ci] S3: Use full model path (#13564)
khluu Feb 20, 2025
b69692a
[Kernel] LoRA - Refactor sgmv kernels (#13110)
varun-sundar-rabindranath Feb 20, 2025
992e5c3
Merge similar examples in `offline_inference` into single `basic` exa…
hmellor Feb 20, 2025
ed6e907
[Bugfix] Fix deepseekv3 grouped topk error (#13474)
Chen-XiaoBing Feb 20, 2025
9b6c2be
Merge branch 'tg/update_vllm' into gm/validation
tamazgadaev Feb 25, 2025
9 changes: 7 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -1,9 +1,14 @@
# SPDX-License-Identifier: Apache-2.0

import os
import sys
import zipfile

# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 250 MB
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 250))
# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
# Note that we have 400 MiB quota, please use it wisely.
# See https://github.com/pypi/support/issues/3792 .
# Please also sync the value with the one in Dockerfile.
VLLM_MAX_SIZE_MB = int(os.environ.get('VLLM_MAX_SIZE_MB', 400))


def print_top_10_largest_files(zip_file):
26 changes: 26 additions & 0 deletions .buildkite/generate_index.py
@@ -0,0 +1,26 @@
# SPDX-License-Identifier: Apache-2.0

import argparse
import os

template = """<!DOCTYPE html>
<html>
<body>
<h1>Links for vLLM</h1>
<a href="../{wheel_html_escaped}">{wheel}</a><br/>
</body>
</html>
"""

parser = argparse.ArgumentParser()
parser.add_argument("--wheel", help="The wheel path.", required=True)
args = parser.parse_args()

filename = os.path.basename(args.wheel)

with open("index.html", "w") as f:
    print(f"Generated index.html for {args.wheel}")
    # cloudfront requires escaping the '+' character
    f.write(
        template.format(wheel=filename,
                        wheel_html_escaped=filename.replace("+", "%2B")))
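The one non-obvious step above is the `+` escaping: the CDN will not resolve the wheel link unless the `+` in a local-version filename is percent-encoded in the href. A minimal reproduction of that step (the wheel filename is illustrative):

```python
# A wheel built from a non-release commit carries a local version with '+'.
# The filename below is illustrative, not taken from an actual build.
filename = "vllm-0.7.3+cu121-cp38-abi3-linux_x86_64.whl"

# Same transformation as in generate_index.py: escape '+' for the URL.
wheel_html_escaped = filename.replace("+", "%2B")
print(wheel_html_escaped)
# vllm-0.7.3%2Bcu121-cp38-abi3-linux_x86_64.whl
```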
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8 -b "auto" -l 1000 -f 5 -t 1
model_name: "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.356
- name: "exact_match,flexible-extract"
value: 0.358
limit: 1000
num_fewshot: 5
@@ -0,0 +1,11 @@
# bash ./run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM -b "auto" -t 2
model_name: "nm-testing/SparseLlama-3.1-8B-gsm8k-pruned.2of4-chnl_wts_per_tok_dyn_act_fp8-BitM"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.6353
- name: "exact_match,flexible-extract"
value: 0.637
limit: null
num_fewshot: null
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,6 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3.2-1B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
@@ -41,6 +41,6 @@ while getopts "m:b:l:f:" OPT; do
done

lm_eval --model hf \
--model_args pretrained=$MODEL,parallelize=True \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,parallelize=True" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
6 changes: 3 additions & 3 deletions .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh
@@ -46,6 +46,6 @@ while getopts "m:b:l:f:t:" OPT; do
done

lm_eval --model vllm \
--model_args pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend="ray",trust_remote_code=true,max_model_len=4096 \
--tasks gsm8k --num_fewshot $FEWSHOT --limit $LIMIT \
--batch_size $BATCH_SIZE
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,distributed_executor_backend=ray,trust_remote_code=true,max_model_len=4096" \
--tasks gsm8k --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size "$BATCH_SIZE"
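The quoting change above matters because unquoted shell expansions undergo word splitting: a model path or argument value containing whitespace would reach `lm_eval` as several separate arguments. A quick demonstration with illustrative values:

```shell
# Hypothetical model path containing a space (illustrative only).
MODEL="my org/my model"

# Unquoted: the shell splits the value into two words.
printf '%s\n' $MODEL | wc -l     # 2 lines, i.e. 2 arguments

# Quoted, as in the patched script: one argument, as intended.
printf '%s\n' "$MODEL" | wc -l   # 1 line, i.e. 1 argument
```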
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-tests.sh
@@ -30,7 +30,7 @@ while getopts "c:t:" OPT; do
done

# Parse list of configs.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < $CONFIG
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG"

for MODEL_CONFIG in "${MODEL_CONFIGS[@]}"
do
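The patched `read` line can be tried in isolation. This sketch builds a sample config list and loads it the same way; the `|| true` is only needed here because `read -d ''` exits nonzero when it reaches end-of-file:

```shell
# Create a throwaway config file with two entries (sample names only).
CONFIG=$(mktemp)
printf 'Meta-Llama-3-8B-Instruct.yaml\nanother-model.yaml\n' > "$CONFIG"

# Same pattern as run-tests.sh: one array element per line, filename quoted.
# read -d '' returns nonzero at EOF, hence the || true guard.
IFS=$'\n' read -d '' -r -a MODEL_CONFIGS < "$CONFIG" || true

echo "${#MODEL_CONFIGS[@]} configs loaded"   # prints "2 configs loaded"
rm -f "$CONFIG"
```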
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -1,3 +1,4 @@
# SPDX-License-Identifier: Apache-2.0
"""
LM eval harness on model to compare vs HF baseline computed offline.
Configs are found in configs/$MODEL.yaml
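The test body is collapsed in the diff, so for context here is a hedged sketch of what a correctness check over the configs above could look like. This is not the actual test; the helper name and tolerance value are assumptions, and the config is shown already parsed rather than loaded from YAML:

```python
RTOL = 0.05  # assumed relative tolerance; the real value is in the hidden test


def check_metrics(config, measured):
    """Assert every baseline metric in the config matches the measurement."""
    for task in config["tasks"]:
        for metric in task["metrics"]:
            name, expected = metric["name"], metric["value"]
            got = measured[task["name"]][name]
            assert abs(got - expected) <= RTOL * expected, (name, got, expected)


# Shaped like the configs/$MODEL.yaml files shown above, after parsing.
config = {
    "model_name": "neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
    "tasks": [{"name": "gsm8k",
               "metrics": [{"name": "exact_match,strict-match",
                            "value": 0.356}]}],
}
check_metrics(config, {"gsm8k": {"exact_match,strict-match": 0.36}})  # passes
```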
46 changes: 18 additions & 28 deletions .buildkite/nightly-benchmarks/README.md
@@ -1,15 +1,13 @@
# vLLM benchmark suite


## Introduction

This directory contains two sets of benchmarks for vLLM:

- Performance benchmark: benchmarks vLLM's performance under various workloads, so that **developers** can tell whether their PR improves or degrades vLLM's performance.
- Nightly benchmark: compares vLLM's performance against alternatives (TGI, TRT-LLM, and lmdeploy), so that **the public** knows when to choose vLLM.



See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.

## Performance benchmark quick overview

@@ -19,17 +17,14 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performan

**For benchmarking developers**: please try your best to constrain the benchmarking duration to about 1 hr so that it won't take forever to run.


## Nightly benchmark quick overview

**Benchmarking Coverage**: fixed-QPS serving on A100 (support for FP8 benchmarks on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.

**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.

**Benchmarking Duration**: about 3.5hrs.



## Trigger the benchmark

Performance benchmark will be triggered when:
@@ -39,16 +34,11 @@ Performance benchmark will be triggered when:
Nightly benchmark will be triggered when:
- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.




## Performance benchmark details


See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.


#### Latency test
### Latency test

Here is an example of one test inside `latency-tests.json`:

@@ -68,23 +58,25 @@ Here is an example of one test inside `latency-tests.json`:
```

In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute controls the command-line arguments used for `benchmark_latency.py`. Note that you should use an underscore `_` instead of a dash `-` when specifying the arguments; `run-performance-benchmarks.sh` converts the underscores to dashes when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
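The underscore-to-dash conversion described above can be illustrated in Python. The real conversion happens inside `run-performance-benchmarks.sh`; this is only a sketch of the same mapping:

```python
def to_cli_args(parameters):
    """Turn {"num_iters_warmup": 5} into ["--num-iters-warmup", "5"]."""
    args = []
    for key, value in parameters.items():
        # Each JSON key becomes a flag with underscores replaced by dashes.
        args.extend(["--" + key.replace("_", "-"), str(value)])
    return args


# The parameters from the latency-tests.json example above.
params = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}
print(" ".join(to_cli_args(params)))
# --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15
```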

### Throughput test

#### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.

The number from this test is also stable: a slight change in the parameter values might vary the performance numbers by a lot.

#### Serving test
### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```
```json
[
{
"test_name": "serving_llama8B_tp1_sharegpt",
@@ -109,6 +101,7 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
```

Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for vLLM server.
- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
@@ -118,36 +111,33 @@ The number of this test is less stable compared to the delay and latency benchma

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

#### Visualizing the results
### Visualizing the results

The `convert-results-json-to-markdown.py` script puts the benchmarking results into a markdown table by formatting [descriptions.md](tests/descriptions.md) with the real benchmarking results.
You can find the results presented as a table on the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (as json files) are in the `Artifacts` tab of the benchmarking job.
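Conceptually, the markdown conversion can be sketched as follows. The helper name and row layout here are assumptions; the actual script formats `descriptions.md` with the results rather than building a table from scratch:

```python
def results_to_markdown(results):
    """Render a list of dicts (all with the same keys) as a markdown table."""
    headers = list(results[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in results:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Field names and values below are illustrative, not real benchmark output.
print(results_to_markdown([
    {"test_name": "latency_llama8B_tp1", "mean_latency_s": 1.23},
    {"test_name": "latency_llama70B_tp4", "mean_latency_s": 4.56},
]))
```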



## Nightly test details

See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workloads, models, and Docker containers used for benchmarking other LLM engines.

### Workflow

#### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which probes the serving engine of the current container.
- `run-nightly-suite.sh` redirects the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload them to Buildkite.

#### Nightly tests
### Nightly tests

In [nightly-tests.json](tests/nightly-tests.json), we include the command-line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to the performance benchmark.

#### Docker containers
### Docker containers

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the Docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The versions need to be hard-coded because there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.

WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
