Skip to content

v0.5.5

Compare
Choose a tag to compare
@github-actions github-actions released this 23 Aug 18:37
· 7783 commits to main since this release
09c7792

Highlights

Performance Update

  • We introduced a new mode that schedule multiple GPU steps in advance, reducing CPU overhead (#7000, #7387, #7452, #7703). Initial result shows 20% improvements in QPS for a single GPU running 8B and 30B models. You can set --num-scheduler-steps 8 as a parameter to the API server (via vllm serve) or AsyncLLMEngine. We are working on expanding the coverage to LLM class and aiming to turning it on by default
  • Various enhancements:
    • Use flashinfer sampling kernel when avaiable, leading to 7% decoding throughput speedup (#7137)
    • Reduce Python allocations, leading to 24% throughput speedup (#7162, 7364)
    • Improvements to the zeromq based decoupled frontend (#7570, #7716, #7484)

Model Support

  • Support Jamba 1.5 (#7415, #7601, #6739)
  • Support for the first audio model UltravoxModel (#7615, #7446)
  • Improvements to vision models:
    • Support image embeddings as input (#6613)
    • Support SigLIP encoder and alternative decoders for LLaVA models (#7153)
  • Support loading GGUF model (#5191) with tensor parallelism (#7520)
  • Progress in encoder decoder models: support for serving encoder/decoder models (#7258), and architecture for cross-attention (#4942)

Hardware Support

  • AMD: Add fp8 Linear Layer for rocm (#7210)
  • Enhancements to TPU support: load time W8A16 quantization (#7005), optimized rope (#7635), and support multi-host inference (#7457).
  • Intel: various refactoring for worker, executor, and model runner (#7686, #7712)

Others

  • Optimize prefix caching performance (#7193)
  • Speculative decoding
    • Use target model max length as default for draft model (#7706)
    • EAGLE Implementation with Top-1 proposer (#6830)
  • Entrypoints
    • A new chat method in the LLM class (#5049)
    • Support embeddings in the run_batch API (#7132)
    • Support prompt_logprobs in Chat Completion (#7453)
  • Quantizations
    • Expand MoE weight loading + Add Fused Marlin MoE Kernel (#7527)
    • Machete - Hopper Optimized Mixed Precision Linear Kernel (#7174)
  • torch.compile: register custom ops for kernels (#7591, #7594, #7536)

What's Changed

New Contributors

Full Changelog: v0.5.4...v0.5.5