Release v0.5.5 · vllm-project/vllm

Highlights

Performance Update

We introduced a new mode that schedules multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial results show a 20% improvement in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine. We are working on expanding coverage to the LLM class and aim to turn it on by default.
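As a rough illustration of the flag mentioned above, the sketch below assembles a `vllm serve` command line with `--num-scheduler-steps 8`. The helper name `build_serve_command` and the model name are placeholders chosen for this example and are not part of vLLM; only `vllm serve` and `--num-scheduler-steps` come from the notes above.

```python
import shlex
from typing import List


def build_serve_command(model: str, num_scheduler_steps: int = 8) -> List[str]:
    """Return the argv for `vllm serve` with multi-step scheduling enabled.

    This is a hypothetical helper for illustration; it only builds the
    command and does not start a server.
    """
    if num_scheduler_steps < 1:
        raise ValueError("num_scheduler_steps must be >= 1")
    return [
        "vllm", "serve", model,
        "--num-scheduler-steps", str(num_scheduler_steps),
    ]


# Placeholder model name (an assumption); substitute the model you serve.
command = build_serve_command("meta-llama/Meta-Llama-3.1-8B-Instruct")

# Print a shell-ready command line that can be pasted into a terminal.
print(shlex.join(command))
```

Building the argument list separately from launching keeps the example runnable without a GPU; to actually start the server, the list could be passed to `subprocess.run`.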
Various enhancements:
- Use flashinfer sampling kernel when available, leading to 7% decoding throughput speedup (#7137)
- Reduce Python allocations, leading to 24% throughput speedup (#7162, #7364)
- Improvements to the zeromq based decoupled frontend (#7570, #7716, #7484)

Model Support

Support Jamba 1.5 (#7415, #7601, #6739)
Support for the first audio model UltravoxModel (#7615, #7446)
Improvements to vision models:
- Support image embeddings as input (#6613)
- Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
Support loading GGUF models (#5191) with tensor parallelism (#7520)
Progress in encoder-decoder models: support for serving encoder-decoder models (#7258), and architecture for cross-attention (#4942)

Hardware Support

AMD: Add fp8 Linear Layer for rocm (#7210)
Enhancements to TPU support: load-time W8A16 quantization (#7005), optimized RoPE (#7636), and multi-host inference support (#7457).
Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

Optimize prefix caching performance (#7193)
Speculative decoding
- Use target model max length as default for draft model (#7706)
- EAGLE Implementation with Top-1 proposer (#6830)
Entrypoints
- A new chat method in the LLM class (#5049)
- Support embeddings in the run_batch API (#7132)
- Support prompt_logprobs in Chat Completion (#7453)
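For the run_batch embeddings support listed above, a minimal sketch of preparing an input file might look like the following. It assumes the OpenAI-style batch line format (`custom_id`, `method`, `url`, `body`) with the `/v1/embeddings` endpoint; the helper name `embedding_batch_line`, the request IDs, and the model name are placeholders for this example.

```python
import json


def embedding_batch_line(custom_id: str, model: str, text: str) -> str:
    """Serialize one embeddings request as a single JSONL line.

    Hypothetical helper; the field layout follows the OpenAI batch
    format, which is assumed here to be what run_batch consumes.
    """
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/embeddings",
        "body": {"model": model, "input": text},
    }
    return json.dumps(request)


# Placeholder embedding model name (an assumption).
MODEL = "intfloat/e5-mistral-7b-instruct"
texts = ["vLLM release notes", "multi-step scheduling"]

# One request per line, numbered from 1.
lines = [
    embedding_batch_line(f"request-{i}", MODEL, text)
    for i, text in enumerate(texts, start=1)
]
batch_jsonl = "\n".join(lines) + "\n"
print(batch_jsonl, end="")
```

The printed lines can be saved to a file and passed to the batch runner, for example `python -m vllm.entrypoints.openai.run_batch -i input.jsonl -o results.jsonl --model <model>`; check the flags against the documentation for your installed version.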
Quantizations
- Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
- Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
torch.compile: register custom ops for kernels (#7591, #7594, #7536)

What's Changed

[ci][frontend] deduplicate tests by @youkaichao in #7101
[Doc] [SpecDecode] Update MLPSpeculator documentation by @tdoublep in #7100
[Bugfix] Specify device when loading LoRA and embedding tensors by @jischein in #7129
[MISC] Use non-blocking transfer in prepare_input by @comaniac in #7172
[Core] Support loading GGUF model by @Isotr0py in #5191
[Build] Add initial conditional testing spec by @simon-mo in #6841
[LoRA] Relax LoRA condition by @jeejeelee in #7146
[Model] Support SigLIP encoder and alternative decoders for LLaVA models by @DarkLight1337 in #7153
[BugFix] Fix DeepSeek remote code by @dsikka in #7178
[ BugFix ] Fix ZMQ when VLLM_PORT is set by @robertgshaw2-neuralmagic in #7205
[Bugfix] add gguf dependency by @kpapis in #7198
[SpecDecode] [Minor] Fix spec decode sampler tests by @LiuXiaoxuanPKU in #7183
[Kernel] Add per-tensor and per-token AZP epilogues by @ProExpertProg in #5941
[Core] Optimize evictor-v2 performance by @xiaobochen123 in #7193
[Core] Subclass ModelRunner to support cross-attention & encoder sequences (towards eventual encoder/decoder model support) by @afeldman-nm in #4942
[Bugfix] Fix GPTQ and GPTQ Marlin CPU Offloading by @mgoin in #7225
[BugFix] Overhaul async request cancellation by @njhill in #7111
[Doc] Mock new dependencies for documentation by @ywang96 in #7245
[BUGFIX]: top_k is expected to be an integer. by @Atllkks10 in #7227
[Frontend] Gracefully handle missing chat template and fix CI failure by @DarkLight1337 in #7238
[distributed][misc] add specialized method for cuda platform by @youkaichao in #7249
[Misc] Refactor linear layer weight loading; introduce BasevLLMParameter and weight_loader_v2 by @dsikka in #5874
[ BugFix ] Move zmq frontend to IPC instead of TCP by @robertgshaw2-neuralmagic in #7222
Fixes typo in function name by @rafvasq in #7275
[Bugfix] Fix input processor for InternVL2 model by @Isotr0py in #7164
[OpenVINO] migrate to latest dependencies versions by @ilya-lavrenov in #7251
[Doc] add online speculative decoding example by @stas00 in #7243
[BugFix] Fix frontend multiprocessing hang by @maxdebayser in #7217
[Bugfix][FP8] Fix dynamic FP8 Marlin quantization by @mgoin in #7219
[ci] Make building wheels per commit optional by @khluu in #7278
[Bugfix] Fix gptq failure on T4s by @LucasWilkinson in #7264
[FrontEnd] Make merge_async_iterators is_cancelled arg optional by @njhill in #7282
[Doc] Update supported_hardware.rst by @mgoin in #7276
[Kernel] Fix Flashinfer Correctness by @LiuXiaoxuanPKU in #7284
[Misc] Fix typos in scheduler.py by @ruisearch42 in #7285
[Frontend] remove max_num_batched_tokens limit for lora by @NiuBlibing in #7288
[Bugfix] Fix LoRA with PP by @andoorve in #7292
[Model] Rename MiniCPMVQwen2 to MiniCPMV2.6 by @jeejeelee in #7273
[Bugfix][Kernel] Increased atol to fix failing tests by @ProExpertProg in #7305
[Frontend] Kill the server on engine death by @joerunde in #6594
[Bugfix][fast] Fix the get_num_blocks_touched logic by @zachzzc in #6849
[Doc] Put collect_env issue output in a block by @mgoin in #7310
[CI/Build] Dockerfile.cpu improvements by @dtrifiro in #7298
[Bugfix] Fix new Llama3.1 GGUF model loading by @Isotr0py in #7269
[Misc] Temporarily resolve the error of BitAndBytes by @jeejeelee in #7308
Add Skywork AI as Sponsor by @simon-mo in #7314
[TPU] Add Load-time W8A16 quantization for TPU Backend by @lsy323 in #7005
[Core] Support serving encoder/decoder models by @DarkLight1337 in #7258
[TPU] Fix dockerfile.tpu by @WoosukKwon in #7331
[Performance] Optimize e2e overheads: Reduce python allocations by @alexm-neuralmagic in #7162
[Bugfix] Fix speculative decoding with MLPSpeculator with padded vocabulary by @tjohnson31415 in #7218
[Speculative decoding] [Multi-Step] decouple should_modify_greedy_probs_inplace by @SolitaryThinker in #6971
[Core] Streamline stream termination in AsyncLLMEngine by @njhill in #7336
[Model][Jamba] Mamba cache single buffer by @mzusman in #6739
[VLM][Doc] Add stop_token_ids to InternVL example by @Isotr0py in #7354
[Performance] e2e overheads reduction: Small followup diff by @alexm-neuralmagic in #7364
[Bugfix] Fix reinit procedure in ModelInputForGPUBuilder by @alexm-neuralmagic in #7360
[Frontend] Support embeddings in the run_batch API by @pooyadavoodi in #7132
[Bugfix] Fix ITL recording in serving benchmark by @ywang96 in #7372
[Core] Add span metrics for model_forward, scheduler and sampler time by @sfc-gh-mkeralapura in #7089
[Bugfix] Fix PerTensorScaleParameter weight loading for fused models by @dsikka in #7376
[Misc] Add numpy implementation of compute_slot_mapping by @Yard1 in #7377
[Core] Fix edge case in chunked prefill + block manager v2 by @cadedaniel in #7380
[Bugfix] Fix phi3v batch inference when images have different aspect ratio by @Isotr0py in #7392
[TPU] Use mark_dynamic to reduce compilation time by @WoosukKwon in #7340
Updating LM Format Enforcer version to v0.10.6 by @noamgat in #7189
[core] [2/N] refactor worker_base input preparation for multi-step by @SolitaryThinker in #7387
[CI/Build] build on empty device for better dev experience by @tomeras91 in #4773
[Doc] add instructions about building vLLM with VLLM_TARGET_DEVICE=empty by @tomeras91 in #7403
[misc] add commit id in collect env by @youkaichao in #7405
[Docs] Update readme by @simon-mo in #7316
[CI/Build] Minor refactoring for vLLM assets by @ywang96 in #7407
[Kernel] Flashinfer correctness fix for v0.1.3 by @LiuXiaoxuanPKU in #7319
[Core][VLM] Support image embeddings as input by @ywang96 in #6613
[Frontend] Disallow passing model as both argument and option by @DarkLight1337 in #7347
[CI/Build] bump Dockerfile.neuron image base, use public ECR by @dtrifiro in #6832
[Bugfix] Fix logit soft cap in flash-attn backend by @WoosukKwon in #7425
[ci] Entrypoints run upon changes in vllm/ by @khluu in #7423
[ci] Cancel fastcheck run when PR is marked ready by @khluu in #7427
[ci] Cancel fastcheck when PR is ready by @khluu in #7433
[Misc] Use scalar type to dispatch to different gptq_marlin kernels by @LucasWilkinson in #7323
[Core] Consolidate GB constant and enable float GB arguments by @DarkLight1337 in #7416
[Core/Bugfix] Add FP8 K/V Scale and dtype conversion for prefix/prefill Triton Kernel by @jon-chuang in #7208
[Bugfix] Handle PackageNotFoundError when checking for xpu version by @sasha0552 in #7398
[CI/Build] bump minimum cmake version by @dtrifiro in #6999
[Core] Shut down aDAG workers with clean async llm engine exit by @ruisearch42 in #7224
[mypy] Misc. typing improvements by @DarkLight1337 in #7417
[Misc] improve logits processors logging message by @aw632 in #7435
[ci] Remove fast check cancel workflow by @khluu in #7455
[Bugfix] Fix weight loading for Chameleon when TP>1 by @DarkLight1337 in #7410
[hardware] unify usage of is_tpu to current_platform.is_tpu() by @youkaichao in #7102
[TPU] Suppress import custom_ops warning by @WoosukKwon in #7458
Revert "[Doc] Update supported_hardware.rst (#7276)" by @WoosukKwon in #7467
[Frontend][Core] Add plumbing to support audio language models by @petersalas in #7446
[Misc] Update LM Eval Tolerance by @dsikka in #7473
[Misc] Update gptq_marlin to use new vLLMParameters by @dsikka in #7281
[Misc] Update Fused MoE weight loading by @dsikka in #7334
[Misc] Update awq and awq_marlin to use vLLMParameters by @dsikka in #7422
Announce NVIDIA Meetup by @simon-mo in #7483
[frontend] spawn engine process from api server process by @youkaichao in #7484
[Misc] compressed-tensors code reuse by @kylesayrs in #7277
[misc][plugin] add plugin system implementation by @youkaichao in #7426
[TPU] Support multi-host inference by @WoosukKwon in #7457
[Bugfix][CI] Import ray under guard by @WoosukKwon in #7486
[CI/Build]Reduce the time consumption for LoRA tests by @jeejeelee in #7396
[misc][ci] fix cpu test with plugins by @youkaichao in #7489
[Bugfix][Docs] Update list of mock imports by @DarkLight1337 in #7493
[doc] update test script to include cudagraph by @youkaichao in #7501
Fix empty output when temp is too low by @CatherineSue in #2937
[ci] fix model tests by @youkaichao in #7507
[Bugfix][Frontend] Disable embedding API for chat models by @QwertyJack in #7504
[Misc] Deprecation Warning when setting --engine-use-ray by @wallashss in #7424
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt by @DarkLight1337 in #7126
[core] [3/N] multi-step args and sequence.py by @SolitaryThinker in #7452
[TPU] Set per-rank XLA cache by @WoosukKwon in #7533
[Misc] Revert compressed-tensors code reuse by @kylesayrs in #7521
llama_index serving integration documentation by @pavanjava in #6973
[Bugfix][TPU] Correct env variable for XLA cache path by @WoosukKwon in #7544
[Bugfix] update neuron for version > 0.5.0 by @omrishiv in #7175
[Misc] Update dockerfile for CPU to cover protobuf installation by @PHILO-HE in #7182
[Bugfix] Fix default weight loading for scalars by @mgoin in #7534
[Bugfix][Harmless] Fix hardcoded float16 dtype for model_is_embedding by @mgoin in #7566
[Misc] Add quantization config support for speculative model. by @ShangmingCai in #7343
[Feature]: Add OpenAI server prompt_logprobs support #6508 by @gnpinkert in #7453
[ci/test] rearrange tests and make adag test soft fail by @youkaichao in #7572
Chat method for offline llm by @nunjunj in #5049
[CI] Move quantization cpu offload tests out of fastcheck by @mgoin in #7574
[Misc/Testing] Use torch.testing.assert_close by @jon-chuang in #7324
register custom op for flash attn and use from torch.ops by @youkaichao in #7536
[Core] Use uvloop with zmq-decoupled front-end by @njhill in #7570
[CI] Fix crashes of performance benchmark by @KuntaiDu in #7500
[Bugfix][Hardware][AMD][Frontend] add quantization param to embedding checking method by @gongdao123 in #7513
support tqdm in notebooks by @fzyzcjy in #7510
[Feature][Hardware][Amd] Add fp8 Linear Layer for Rocm by @charlifu in #7210
[Kernel] W8A16 Int8 inside FusedMoE by @mzusman in #7415
[Kernel] Add tuned triton configs for ExpertsInt8 by @mgoin in #7601
[spec decode] [4/N] Move update_flash_attn_metadata to attn backend by @SolitaryThinker in #7571
[Core] Fix tracking of model forward time to the span traces in case of PP>1 by @sfc-gh-mkeralapura in #7440
[Doc] Add docs for llmcompressor INT8 and FP8 checkpoints by @mgoin in #7444
[Doc] Update quantization supported hardware table by @mgoin in #7595
[Kernel] register punica functions as torch ops by @bnellnm in #7591
[Kernel][Misc] dynamo support for ScalarType by @bnellnm in #7594
[Kernel] fix types used in aqlm and ggml kernels to support dynamo by @bnellnm in #7596
[Model] Align nemotron config with final HF state and fix lm-eval-small by @mgoin in #7611
[Bugfix] Fix custom_ar support check by @bnellnm in #7617
.[Build/CI] Enabling passing AMD tests. by @Alexei-V-Ivanov-AMD in #7610
[Bugfix] Clear engine reference in AsyncEngineRPCServer by @ruisearch42 in #7618
[aDAG] Unflake aDAG + PP tests by @rkooo567 in #7600
[Bugfix] add >= 1.0 constraint for openai dependency by @metasyn in #7612
[misc] use nvml to get consistent device name by @youkaichao in #7582
[ci][test] fix engine/logger test by @youkaichao in #7621
[core][misc] update libcudart finding by @youkaichao in #7620
[Model] Pipeline parallel support for JAIS by @mrbesher in #7603
[ci][test] allow longer wait time for api server by @youkaichao in #7629
[Misc]Fix BitAndBytes exception messages by @jeejeelee in #7626
[VLM] Refactor MultiModalConfig initialization and profiling by @ywang96 in #7530
[TPU] Skip creating empty tensor by @WoosukKwon in #7630
[TPU] Use mark_dynamic only for dummy run by @WoosukKwon in #7634
[TPU] Optimize RoPE forward_native2 by @WoosukKwon in #7636
[ Bugfix ] Fix Prometheus Metrics With zeromq Frontend by @robertgshaw2-neuralmagic in #7279
[CI/Build] Add text-only test for Qwen models by @alex-jw-brooks in #7475
[Misc] Refactor Llama3 RoPE initialization by @WoosukKwon in #7637
[Core] Optimize SPMD architecture with delta + serialization optimization by @rkooo567 in #7109
[Core] Use flashinfer sampling kernel when available by @peng1999 in #7137
fix xpu build by @jikunshang in #7644
[Misc] Remove Gemma RoPE by @WoosukKwon in #7638
[MISC] Add prefix cache hit rate to metrics by @comaniac in #7606
[Bugfix] fix lora_dtype value type in arg_utils.py - part 2 by @c3-ali in #5428
[core] Multi Step Scheduling by @SolitaryThinker in #7000
[Core] Support tensor parallelism for GGUF quantization by @Isotr0py in #7520
[Bugfix] Don't disable existing loggers by @a-ys in #7664
[TPU] Fix redundant input tensor cloning by @WoosukKwon in #7660
[Bugfix] use StoreBoolean instead of type=bool for --disable-logprobs-during-spec-decoding by @tjohnson31415 in #7665
[doc] fix doc build error caused by msgspec by @youkaichao in #7659
[Speculative Decoding] Fixing hidden states handling in batch expansion by @abhigoyal1997 in #7508
[ci] Install Buildkite test suite analysis by @khluu in #7667
[Bugfix] support tie_word_embeddings for all models by @zijian-hu in #5724
[CI] Organizing performance benchmark files by @KuntaiDu in #7616
[misc] add nvidia related library in collect env by @youkaichao in #7674
[XPU] fallback to native implementation for xpu custom op by @jianyizh in #7670
[misc][cuda] add warning for pynvml user by @youkaichao in #7675
[Core] Refactor executor classes to make it easier to inherit GPUExecutor by @jikunshang in #7673
[Kernel] (1/N) Machete - Hopper Optimized Mixed Precision Linear Kernel by @LucasWilkinson in #7174
[OpenVINO] Updated documentation by @ilya-lavrenov in #7687
[VLM][Model] Add test for InternViT vision encoder by @Isotr0py in #7409
[Hardware] [Intel GPU] refactor xpu worker/executor by @jikunshang in #7686
[CI/Build] Pin OpenTelemetry versions and make availability errors clearer by @ronensc in #7266
[Misc] Add jinja2 as an explicit build requirement by @LucasWilkinson in #7695
[Core] Add AttentionState abstraction by @Yard1 in #7663
[Intel GPU] fix xpu not support punica kernel (which use torch.library.custom_op) by @jikunshang in #7685
[ci][test] adjust max wait time for cpu offloading test by @youkaichao in #7709
[Core] Pipe worker_class_fn argument in Executor by @Yard1 in #7707
[ci] try to log process using the port to debug the port usage by @youkaichao in #7711
[Model] Add AWQ quantization support for InternVL2 model by @Isotr0py in #7187
[Doc] Section for Multimodal Language Models by @ywang96 in #7719
[mypy] Enable following imports for entrypoints by @DarkLight1337 in #7248
[Bugfix] Mirror jinja2 in pyproject.toml by @sasha0552 in #7723
[BugFix] Avoid premature async generator exit and raise all exception variations by @njhill in #7698
[BUG] fix crash on flashinfer backend with cudagraph disabled, when attention group_size not in [1,2,4,8] by @learninmou in #7509
[Bugfix][Hardware][CPU] Fix mm_limits initialization for CPU backend by @Isotr0py in #7735
[Spec Decoding] Use target model max length as default for draft model by @njhill in #7706
[Bugfix] chat method add_generation_prompt param by @brian14708 in #7734
[Bugfix][Frontend] Fix Issues Under High Load With zeromq Frontend by @robertgshaw2-neuralmagic in #7394
[Bugfix] Pass PYTHONPATH from setup.py to CMake by @sasha0552 in #7730
[multi-step] Raise error if not using async engine by @SolitaryThinker in #7703
[Frontend] Improve Startup Failure UX by @robertgshaw2-neuralmagic in #7716
[misc] Add Torch profiler support by @SolitaryThinker in #7451
[Model] Add UltravoxModel and UltravoxConfig by @petersalas in #7615
[ci] [multi-step] narrow multi-step test dependency paths by @SolitaryThinker in #7760
[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel by @dsikka in #7527
[distributed][misc] error on same VLLM_HOST_IP setting by @youkaichao in #7756
[AMD][CI/Build] Disambiguation of the function call for ROCm 6.2 headers compatibility by @gshtras in #7477
[Kernel] Replaced blockReduce[...] functions with cub::BlockReduce by @ProExpertProg in #7233
[Model] Fix Phi-3.5-vision-instruct 'num_crops' issue by @zifeitong in #7710
[Bug][Frontend] Improve ZMQ client robustness by @joerunde in #7443
Revert "[Kernel] Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)" by @mgoin in #7764
[TPU] Avoid initializing TPU runtime in is_tpu by @WoosukKwon in #7763
[ci] refine dependency for distributed tests by @youkaichao in #7776
[Misc] Use torch.compile for GemmaRMSNorm by @WoosukKwon in #7642
[Speculative Decoding] EAGLE Implementation with Top-1 proposer by @abhigoyal1997 in #6830
Fix ShardedStateLoader for vllm fp8 quantization by @sfc-gh-zhwang in #7708
[Bugfix] Don't build machete on cuda <12.0 by @LucasWilkinson in #7757
[Misc] update fp8 to use vLLMParameter by @dsikka in #7437
[Bugfix] spec decode handle None entries in topk args in create_sequence_group_output by @tjohnson31415 in #7232
[Misc] Enhance prefix-caching benchmark tool by @Jeffwan in #6568
[Doc] Fix incorrect docs from #7615 by @petersalas in #7788
[Bugfix] Use LoadFormat values as choices for vllm serve --load-format by @mgoin in #7784
[ci] Cleanup & refactor Dockerfile to pass different Python versions and sccache bucket via build args by @khluu in #7705
[Misc] fix typo in triton import warning by @lsy323 in #7794
[Frontend] error suppression cleanup by @joerunde in #7786
[Ray backend] Better error when pg topology is bad. by @rkooo567 in #7584
[Hardware][Intel GPU] refactor xpu_model_runner, fix xpu tensor parallel by @jikunshang in #7712
[misc] Add Torch profiler support for CPU-only devices by @DamonFool in #7806
[BugFix] Fix server crash on empty prompt by @maxdebayser in #7746
[github][misc] promote asking llm first by @youkaichao in #7809
[Misc] Update marlin to use vLLMParameters by @dsikka in #7803
Bump version to v0.5.5 by @simon-mo in #7823

New Contributors

@jischein made their first contribution in #7129
@kpapis made their first contribution in #7198
@xiaobochen123 made their first contribution in #7193
@Atllkks10 made their first contribution in #7227
@stas00 made their first contribution in #7243
@maxdebayser made their first contribution in #7217
@NiuBlibing made their first contribution in #7288
@lsy323 made their first contribution in #7005
@pooyadavoodi made their first contribution in #7132
@sfc-gh-mkeralapura made their first contribution in #7089
@jon-chuang made their first contribution in #7208
@aw632 made their first contribution in #7435
@petersalas made their first contribution in #7446
@kylesayrs made their first contribution in #7277
@QwertyJack made their first contribution in #7504
@wallashss made their first contribution in #7424
@pavanjava made their first contribution in #6973
@PHILO-HE made their first contribution in #7182
@gnpinkert made their first contribution in #7453
@gongdao123 made their first contribution in #7513
@charlifu made their first contribution in #7210
@metasyn made their first contribution in #7612
@mrbesher made their first contribution in #7603
@alex-jw-brooks made their first contribution in #7475
@a-ys made their first contribution in #7664
@zijian-hu made their first contribution in #5724
@jianyizh made their first contribution in #7670
@learninmou made their first contribution in #7509
@brian14708 made their first contribution in #7734
@sfc-gh-zhwang made their first contribution in #7708

Full Changelog: v0.5.4...v0.5.5
