
v0.6.1

Released by @github-actions on 11 Sep 21:44 · commit 3fd2b0d

Highlights

Model Support

  • Added support for Pixtral (mistralai/Pixtral-12B-2409) (#8377, #8168); see the usage sketch after this list.
  • Added support for Llava-Next-Video (#7559), Qwen-VL (#8029), and Qwen2-VL (#7905).
  • Added multi-image input support for LLaVA (#8238) and InternVL2 models (#8201).
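
As a quick illustration of the new multimodal support, here is a minimal offline-inference sketch, assuming Pixtral's Mistral-format checkpoint and vLLM's LLM.chat entry point; the image URL is a placeholder:

```python
# Minimal sketch: offline inference with Pixtral (#8377).
# Assumes the LLM.chat API; the image URL below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```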

Performance Enhancements

  • Memory optimization for the awq_gemm and awq_dequantize kernels, yielding 2x throughput (#8248).

Production Engine

  • Added support for loading and unloading LoRA adapters in the API server (#6566); see the sketch after this list.
  • Added progress reporting to the batch runner (#8060).
  • Added support for NVIDIA ModelOpt static scaling checkpoints (#6112).
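
Dynamic LoRA management is exposed through HTTP endpoints on the OpenAI-compatible server. Below is a hedged sketch of the request flow; the endpoint paths and payload fields follow the vLLM docs for runtime LoRA updating, while the server URL, adapter name, and adapter path are placeholders:

```python
# Hedged sketch: load/unload a LoRA adapter at runtime (#6566).
# The server must be started with LoRA support enabled; the URL,
# adapter name, and path below are placeholders.
import requests

BASE_URL = "http://localhost:8000"

# Register an adapter without restarting the server.
resp = requests.post(
    f"{BASE_URL}/v1/load_lora_adapter",
    json={"lora_name": "sql_adapter", "lora_path": "/path/to/sql_adapter"},
)
resp.raise_for_status()

# Completion requests can now target the adapter by name via "model".

# Unload the adapter once it is no longer needed.
resp = requests.post(
    f"{BASE_URL}/v1/unload_lora_adapter",
    json={"lora_name": "sql_adapter"},
)
resp.raise_for_status()
```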

Others

  • Updated the Docker image to use Python 3.12 for a small performance bump (#8133).
  • Added CODE_OF_CONDUCT.md (#8161)

What's Changed

  • [Doc] [Misc] Create CODE_OF_CONDUCT.md by @mmcelaney in #8161
  • [bugfix] Upgrade minimum OpenAI version by @SolitaryThinker in #8169
  • [Misc] Clean up RoPE forward_native by @WoosukKwon in #8076
  • [ci] Mark LoRA test as soft-fail by @khluu in #8160
  • [Core/Bugfix] Add query dtype as per FlashInfer API requirements. by @elfiegg in #8173
  • [Doc] Add multi-image input example and update supported models by @DarkLight1337 in #8181
  • Inclusion of InternVLChatModel in PP_SUPPORTED_MODELS (Pipeline Parallelism) by @Manikandan-Thangaraj-ZS0321 in #7860
  • [MODEL] Qwen Multimodal Support (Qwen-VL / Qwen-VL-Chat) by @alex-jw-brooks in #8029
  • Move verify_marlin_supported to GPTQMarlinLinearMethod by @mgoin in #8165
  • [Documentation][Spec Decode] Add documentation about lossless guarantees in Speculative Decoding in vLLM by @sroy745 in #7962
  • [Core] Support load and unload LoRA in api server by @Jeffwan in #6566
  • [BugFix] Fix Granite model configuration by @njhill in #8216
  • [Frontend] Add --logprobs argument to benchmark_serving.py by @afeldman-nm in #8191
  • [Misc] Use ray[adag] dependency instead of cuda by @ruisearch42 in #7938
  • [CI/Build] Increasing timeout for multiproc worker tests by @alexeykondrat in #8203
  • [Kernel] [Triton] Memory optimization for awq_gemm and awq_dequantize, 2x throughput by @rasmith in #8248
  • [Misc] Remove SqueezeLLM by @dsikka in #8220
  • [Model] Allow loading from original Mistral format by @patrickvonplaten in #8168
  • [misc] [doc] [frontend] LLM torch profiler support by @SolitaryThinker in #7943
  • [Bugfix] Fix Hermes tool call chat template bug by @K-Mistele in #8256
  • [Model] Multi-input support for LLaVA and fix embedding inputs for multi-image models by @DarkLight1337 in #8238
  • Enable Random Prefix Caching in Serving Profiling Tool (benchmark_serving.py) by @wschin in #8241
  • [tpu][misc] fix typo by @youkaichao in #8260
  • [Bugfix] Fix broken OpenAI tensorizer test by @DarkLight1337 in #8258
  • [Model][VLM] Support multi-images inputs for InternVL2 models by @Isotr0py in #8201
  • [Model][VLM] Decouple weight loading logic for Paligemma by @Isotr0py in #8269
  • ppc64le: Dockerfile fixed, and a script for buildkite by @sumitd2 in #8026
  • [CI/Build] Use python 3.12 in cuda image by @joerunde in #8133
  • [Bugfix] Fix async postprocessor in case of preemption by @alexm-neuralmagic in #8267
  • [Bugfix] Streamed tool calls now more strictly follow OpenAI's format; ensures Vercel AI SDK compatibility by @K-Mistele in #8272
  • [Frontend] Add progress reporting to run_batch.py by @alugowski in #8060
  • [Bugfix] Correct adapter usage for cohere and jamba by @vladislavkruglikov in #8292
  • [Misc] GPTQ Activation Ordering by @kylesayrs in #8135
  • [Misc] Fused MoE Marlin support for GPTQ by @dsikka in #8217
  • Add NVIDIA Meetup slides, announce AMD meetup, and add contact info by @simon-mo in #8319
  • [Bugfix] Fix missing post_layernorm in CLIP by @DarkLight1337 in #8155
  • [CI/Build] enable ccache/scccache for HIP builds by @dtrifiro in #8327
  • [Frontend] Clean up type annotations for mistral tokenizer by @DarkLight1337 in #8314
  • [CI/Build] Enabling kernels tests for AMD, ignoring some of them that fail by @alexeykondrat in #8130
  • Fix ppc64le buildkite job by @sumitd2 in #8309
  • [Spec Decode] Move ops.advance_step to flash attn advance_step by @kevin314 in #8224
  • [Misc] remove peft as dependency for prompt models by @prashantgupta24 in #8162
  • [MISC] Keep chunked prefill enabled by default with long context when prefix caching is enabled by @comaniac in #8342
  • [Bugfix] Ensure multistep lookahead allocation is compatible with cuda graph max capture by @alexm-neuralmagic in #8340
  • [Core/Bugfix] pass VLLM_ATTENTION_BACKEND to ray workers by @SolitaryThinker in #8172
  • [CI/Build][Kernel] Update CUTLASS to 3.5.1 tag by @tlrmchlsmth in #8043
  • [Misc] Skip loading extra bias for Qwen2-MOE GPTQ models by @jeejeelee in #8329
  • [Bugfix] Fix InternVL2 vision embeddings process with pipeline parallel by @Isotr0py in #8299
  • [Hardware][NV] Add support for ModelOpt static scaling checkpoints. by @pavanimajety in #6112
  • [model] Support for Llava-Next-Video model by @TKONIY in #7559
  • [Frontend] Create ErrorResponse instead of raising exceptions in run_batch by @pooyadavoodi in #8347
  • [Model][VLM] Add Qwen2-VL model support by @fyabc in #7905
  • [Hardware][Intel] Support compressed-tensor W8A8 for CPU backend by @bigPYJ1151 in #7257
  • [CI/Build] Excluding test_moe.py from AMD Kernels tests for investigation by @alexeykondrat in #8373
  • [Bugfix] Add missing attributes in mistral tokenizer by @DarkLight1337 in #8364
  • [Kernel][Misc] Add meta functions for ops to prevent graph breaks by @bnellnm in #6917
  • [Misc] Move device options to a single place by @akx in #8322
  • [Speculative Decoding] Test refactor by @LiuXiaoxuanPKU in #8317
  • Pixtral by @patrickvonplaten in #8377
  • Bump version to v0.6.1 by @simon-mo in #8379

New Contributors

Full Changelog: v0.6.0...v0.6.1