
v0.6.1

Released by @github-actions on 11 Sep 21:44 · commit 3fd2b0d

Highlights

Model Support

  • Added support for Pixtral (mistralai/Pixtral-12B-2409) (#8377, #8168); see the usage sketch after this list.
  • Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), and Qwen2-VL (#7905).
  • Added multi-image input support for LLaVA (#8238) and InternVL2 models (#8201).
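
As a quick illustration of the new multimodal support, here is a minimal offline-inference sketch, assuming Pixtral's Mistral-format checkpoint and vLLM's LLM.chat entry point; the image URL is a placeholder:

```python
# Minimal sketch: offline inference with Pixtral (#8377).
# Assumes the LLM.chat API; the image URL below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```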

Performance Enhancements

  • Memory optimization for the awq_gemm and awq_dequantize kernels, yielding 2x throughput (#8248).

Production Engine

  • Added support for loading and unloading LoRA adapters in the API server (#6566); see the sketch after this list.
  • Added progress reporting to the batch runner (#8060).
  • Added support for NVIDIA ModelOpt static scaling checkpoints (#6112).
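
Dynamic LoRA management is exposed through HTTP endpoints on the OpenAI-compatible server. Below is a hedged sketch of the request flow; the endpoint paths and payload fields follow the vLLM docs for runtime LoRA updating, while the server URL, adapter name, and adapter path are placeholders:

```python
# Hedged sketch: load/unload a LoRA adapter at runtime (#6566).
# The server must be started with LoRA support enabled; the URL,
# adapter name, and path below are placeholders.
import requests

BASE_URL = "http://localhost:8000"

# Register an adapter without restarting the server.
resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"},
)
resp.raise_for_status()

# Completion requests can now target the adapter by name via "model".

# Unload the adapter once it is no longer needed.
resp = requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
)
resp.raise_for_status()
```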

Others

  • Updated the Docker image to use Python 3.12 for a small performance bump (#8133).
  • Added CODE_OF_CONDUCT.md (#8161)

What's Changed

  • [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
  • [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
  • [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
  • [ci] Mark LoRA test as soft-fail by @khluu in #8160
  • [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
  • [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
  • Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
  • [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
  • Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
  • [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
  • [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
  • [BugFix] Fix Granite model configuration by @njhill in #8216
  • [Frontend] Add --logprobs argument to benchmark_serving.py by @afeldman-nm in #8191
  • [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
  • [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
  • [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
  • [Misc] Remove SqueezeLLM by @dsikka in #8220
  • [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
  • [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
  • [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
  • [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
  • Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
  • [tpu][misc] fix typo by @youkaichao in #8260
  • [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
  • [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
  • [Model][VLM] Decouple weight loading logic for Paligemma by @Isotr0py in #8269
  • ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
  • [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
  • [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
  • [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
  • [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
  • [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
  • [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
  • [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
  • Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
  • [Bugfix] Fix missing post_layernorm in CLIP by @DarkLight1337 in #8155
  • [CI/Build] enable ccache/scccache for HIP builds by @dtrifiro in #8327
  • [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
  • [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in #8130
  • Fix ppc64le buildkite job by @sumitd2 in #8309
  • [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
  • [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
  • [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
  • [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
  • [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
  • [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
  • [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
  • [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
  • [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
  • [model] Support for Llava-Next-Video model by @TKONIY in #7559
  • [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
  • [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
  • [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
  • [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
  • [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
  • [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
  • [Misc] Move device options to a single place by @akx in #8322
  • [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
  • Pixtral by @patrickvonplaten in #8377
  • Bump version to v0.6.1 by @simon-mo in #8379

New Contributors

Full Changelog: v0.6.0...v0.6.1