Releases: vllm-project/vllm-gaudi

vLLM-Gaudi for vLLM-v0.16.0

06 Mar 13:27
v0.16.0
da2ce40

vLLM Gaudi Plugin v0.16.0 Release Notes

Overview

This release is based on vLLM v0.16.0 and supports Intel® Gaudi® Software v1.23.0.

Highlights

  • Added validated support for the following models: Qwen3-VL, DeepSeek OCR, MiniMax-M2, Ovis, Mistral-Large-3, and Hunyuan V1.
  • Improved performance through backported bug fixes, Mamba kernel improvements, and faster model weight loading.
  • Hardened INC quantization by forcing CPU weight loading, preventing out-of-memory (OOM) errors during model load.
  • Improved UBI/RHEL Docker images and server defaults, and addressed Coverity findings.

New Model Support and Updates

  • Change Qwen3-VL to use HPUMMEncoderAttention (#1060)
  • Enable caching for Qwen3 MoE op (#1068)
  • Fix Qwen3-VL MoE execution failure (#1028)
  • Enable DeepSeek OCR model (#954)
  • Add dotsocr and seedoss (#977)
  • Add MiniMax-M2 support (#964)
  • Add Ovis model support with default buckets (#846)
  • Enable Mistral-Large-3-675B-Instruct-2512 model (#871)
  • Add Hunyuan V1 model support (Dense & MoE bf16/FP8) (#875)

Performance

  • [GAUDISW-246429] hpu_mamba_chunk_scan_combined_varlen improvements (#1074)
  • Improve model weight loading speed (#807)
  • Fix warmup regression (#962)

Attention and KV Cache

  • Instead of changing KV cache shape, transpose state in conv1d (#1065)
  • [GAUDISW-245713] Remove bucket densification for long ctx; Edge buckets only for long ctx (#915)
  • Temporarily disable chunked attention (#981)
  • Multimodal model embedding fixes (#759)
  • [CT] Add FP8 GQA Support (#874)
  • [CT] Fix CT Config to honor fp8_inc KV cache dtype (#929)

Quantization

  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1055)
  • Fix INC patching _gate twice (#955)
  • [GAUDISW-246337] Added config with scale method: maxabs_pcs_pow2 for dynamic quant (#949)

Plugin Core

  • Source use_qk_norm parameter directly from config (#1084)
  • Fix last_chunk_indices calculations (#1023)
  • Fix mamba cumsum padded calculations (#1021)
  • Fix redundant transpose in HPUMambaMixer2 (#1015)
  • Fix HPUMambaMixer2 inheritance dependency (#1016)
  • Add _MAMBA_PAD_BLOCK_ID (#951)
  • Enable OffloadingConnector on HPU. (#827)
  • GPT OSS Integration Code (#887)
  • Fix async scheduler + unified attention failure on Qwen2.5-VL (#931)
  • Fix undefined behavior in copy_blocks when source and destination blocks overlap (#329)
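The overlapping-copy fix (#329) addresses a classic aliasing hazard: if a block is both a copy destination and a later copy source, a naive in-place loop can read already-overwritten data. A minimal illustrative sketch in plain Python (not the actual vLLM-Gaudi implementation, which operates on device tensors):

```python
def copy_blocks(cache, mapping):
    """Copy cache blocks per (src, dst) pairs, safely handling overlap.

    cache: list of blocks (each block a list); mapping: list of (src, dst).
    """
    # Stage every source block up front so an earlier copy cannot clobber
    # a block that a later copy still needs to read.
    staged = {src: list(cache[src]) for src, _ in mapping}
    for src, dst in mapping:
        cache[dst] = list(staged[src])
    return cache

# Overlapping mapping: block 0 -> 1 and block 1 -> 2. Without staging,
# block 2 would receive the freshly overwritten contents of block 1
# instead of its original values.
cache = [[1, 1], [2, 2], [3, 3]]
copy_blocks(cache, [(0, 1), (1, 2)])
# cache is now [[1, 1], [1, 1], [2, 2]]
```

The staging dictionary makes the result independent of the order in which the (src, dst) pairs are processed.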

Serving and Infrastructure

  • Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile (#1056)
  • Improve Docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) (#1041)
  • Add libfdt-devel to UBI Dockerfile (#974)
  • Fix device detection when ENABLE_CONSOLE=true (#963)

Fixes

  • Don't destroy server with logprobs (#1098)
  • Coverity fix including security, null-like values, duplicates and typos (#1094)
  • Fix param mismatch for compute_nixl_compatibility_hash() (#1087)
  • Fix Topk Calculation in GPTOSS (#970)
  • Fix reported version of vLLM (#811)
  • Fixing _compile_region for nested attributes (#956)
  • Fix sampler & TP>1 recompilations (#935)
  • Restore default temperature=0 for the server after #32723 (#1037)
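Since #1037 restores temperature=0 as the server default, clients that relied on a different default should pin the value explicitly in the request rather than depend on server-side behavior. A client-side sketch (the model name is a placeholder, not tied to this release):

```python
import json

# Build an OpenAI-compatible chat request that pins temperature explicitly
# instead of relying on the server default (restored to 0 in this release).
payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0,  # greedy decoding; matches the restored default
}
body = json.dumps(payload)
```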

Full Changelog

PR Title Author
#1098 Don't destroy server with logprobs @adobrzyn
#1094 Coverity fix including security, null-like values, duplicates and typos @adobrzyn
#1087 fix param mismatch for compute_nixl_compatibility_hash() @hsubramony
#1060 Change Qwen3VL to use HPUMMEncoderAttention @jiminha
#1068 Enable caching for qwen3 moe op @shepark
#1084 use_qk_norm parameter sourced directly from config @rsmyrek
#1056 Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile @PatrykWo
#1037 Back temperature=0 for server as default after #32723 @iboiko-habana
#1089 Change upstream last_good_commit 89a77b10846fd96273cce78d86d2556ea582d26e @iboiko-habana
#1041 Improve docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) @nngokhale
#1080 Port of #1050 for CI unblocking @iboiko-habana
#1074 hpu_mamba_chunk_scan_combined_varlen improvements @PatrykWilczewski
#1057 Add ci test for granite-4-h-small to v0.16.0 @microslaw
#1065 Instead of changing kv cache shape, transpose state in conv1d @jmamzax
#1023 Fix last_chunk_indices calculations @jbyczkow
#1021 Fix mamba cumsum padded calculations @jkaniecki
#999 Fix redundant transpose in HPUMambaMixer2 (#1015) @ksmusz
#1019 Fixes for #33559 and #34103 @iboiko-habana
#1055 Force CPU loading for INC quantization to prevent OOM during weight loading @agrabow
#1016 Fix HPUMambaMixer2 inheritance dependency @jbyczkow
#1028 Fix qwen3 vl moe execution failure @shepark
#1042 Adding ci_calibration_smoke_tests.sh into v0.16.0 @iboiko-habana
#971 UBI images improvements @ghandoura
#954 Enable deepseek ocr model @HeJunyan
#977 Add dotsocr and seedoss @tianyuan211
#975 Monkey-patch of Attention.forward @tzielinski-habana
#824 Adjust pre-merge workflow to support merge queue trigger event @bmyrcha
#970 Fix Topk Calculation in GPTOSS @SKRohit
#981 Temporarily disable chunked attention @adobrzyn
#982 adding FIX_FOR_VLLM_CUSTOM to CI @iboiko-habana
#974 Add libfdt-devel (new habanalabs-thunk dependency) to ubi dockerfile @mmuszynskihabana
#930 Fix for individual unit tests @tzielinski-habana
#969 CI cleanup 2 @microslaw
...

vLLM-Gaudi for vLLM-v0.15.1

27 Feb 13:23
v0.15.1
5d6a2db

vLLM Gaudi Plugin v0.15.1 Release Notes

Overview

This release is based on vLLM v0.15.1 and supports Intel® Gaudi® Software v1.23.0.


Highlights

  • Added validated support for Granite 4.0-h and Qwen3-VL (dense and MoE variants) on Intel Gaudi 3. Additionally, added significant Llama 4 stability fixes.
  • Introduced full chunked prefill attention support for HPU, enabling better memory utilization on long sequences (#821).
  • Integrated FlashAttention online merge in Unified Attention for improved prefill performance (#785).
  • Added KV cache sharing support for HPU, enabling more efficient multi-query scenarios (#834).
  • Introduced support for NVIDIA ModelOpt FP8 quantization format for dense models (#890).
  • Added HPU ops for Mamba mixer2, causal conv1d, and SSD combined kernels enabling hybrid SSM-Transformer models, such as Granite 4.0-h (#886, #897).
  • Added back-to-back matmul operation for improved Multi-Latent Attention (MLA) performance (#770).
  • Introduced prefill-side KV layout and block size support for heterogeneous (disaggregated) inference via NIXL (#867).

New Model Support

  • Add validated support for Qwen3-VL-32B-Instruct, Qwen3-VL-32B-Thinking, and Qwen3-VL-235B-A22B variants (Instruct, Thinking, FP8) on Gaudi 3 (#958)
  • Register the Qwen3VLMoeForConditionalGeneration model for Qwen3-VL MoE variants (#958)
  • Add IBM Granite 4.0-h small (hybrid SSM-Transformer) implementation for HPU (#897)

Performance

  • Add FlashAttention online merge in Unified Attention for faster prefill (#785)
  • Add back-to-back (b2b) matmul for improved MLA attention performance (#770)
  • Support loading q_scale and using fp8_fused_sdpa for MLA prefill (#909)
  • Remove bucket densification for long context; apply edge buckets only for long context scenarios (#980)
  • Implement bucket corrector for Mamba chunk size (#886)
  • Revert "skip HPU graphs for long prefills" to restore graph capture on long sequences (#850)
  • Port initialization profiling noop to reduce startup overhead (#979)

Attention & KV Cache

  • Add support for chunked attention on HPU (#821)
  • Add KV cache sharing for HPU (#834)
  • Enable support for prefill-side kv_layout and block_size update for heterogeneous runs (#867)
  • Add new VLLM_HPU_HETERO_KV_LAYOUT environment variable to control heterogeneous KV layout (#867)
  • Add heterogeneous HPU NIXL connector for disaggregated prefill/decode (#867)
  • Add hpu_attention ops module with attention operation implementations (#785)
  • Monkey-patch Attention.forward for HPU-specific behavior (#973)
  • Platform: declare support_hybrid_kv_cache capability (#834)
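The heterogeneous NIXL path above is driven by the new environment variable from #867. A hedged launch sketch; the layout value and serve flags below are placeholders, not documented defaults:

```shell
# Placeholder sketch: VLLM_HPU_HETERO_KV_LAYOUT comes from #867; the value
# and the connector configuration shown are illustrative only -- check the
# plugin documentation for the layouts actually supported.
export VLLM_HPU_HETERO_KV_LAYOUT=<kv-layout>
vllm serve <model> --kv-transfer-config '{"kv_connector": "NixlConnector"}'
```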

Quantization

  • Add support for ModelOpt FP8 quantization format for dense models (#890)
  • Add modelopt to platform supported quantization list (#890)
  • Add dynamic quantization configuration file example (#838)

Plugin Core

  • Register new ops: hpu_attention, hpu_grouped_topk_router, hpu_mamba_mixer2, and hpu_modelopt (#785, #897, #890)
  • Add ops_selector module for HPU operation routing (#897)
  • Add pytorch_implementation module with pure-PyTorch fallback ops (#897)
  • Add causal_conv1d_pytorch and ssd_combined ops for SSM/Mamba support (#897)
  • Add hpu_grouped_topk_router for MoE grouped top-k routing (#897)
  • Source use_qk_norm parameter directly from config (#1035)

Serving & Infrastructure

  • Add GitHub Actions action.yaml for PR detail workflows (#1030)
  • Add CI calibration smoke tests script (#853)
  • Rename and consolidate CI e2e discoverable tests (#840)
  • Fix Jenkins CI for Mistral model tests (#840)
  • Restore temperature=0 as server default after vLLM #32723 (#1038)
  • Backport RHEL/UBI Dockerfile improvements (#1049)

Fixes

  • Fix Llama 4 apply-patches flow, QK flatten positional encoding, and address performance drop (#942)
  • Fix Llama 4 shape mismatch for 32k+ context window (#842, #855)
  • Fix Qwen2.5-VL accuracy regression (#831)
  • Fix Qwen3-VL multimodal model embedding issues (#958)
  • Fix DeepSeek tensor device mismatch (#1029)
  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1005)
  • Fix INC patching _gate twice (#955, #1020)
  • Fix HPU model runner profile_run to work with dynamic kv-cache scales (#852)
  • Fix measurement config file generation in calibrate_model.sh scripts (#853)
  • Revert padding value change for block_list and slot list (#1007)
  • Fix multimodal budget divergence from upstream vLLM (#837)
  • Fix hourly KeyError: <PlatformEnum.OOT: 6> error (#968)
  • Fix torch.compile in data-parallel mode (#722)
  • Correct sliding window enabling logic (#805)
  • Interleaved sliding window fix (#805)
  • Fix Mamba cumsum padded calculations (#1022)
  • Fix redundant transpose in HPUMambaMixer2 (#999, #1014)
  • Fix Qwen3-VL MoE execution failure (#992)
  • Fix last_chunk_indices calculations (#1024)

Security

CVE-2025-69872 (diskcache 5.6.3): vLLM currently depends on diskcache version 5.6.3, which has been reported as affected by CVE-2025-69872. The vulnerability remains unresolved upstream as of the day of this release. According to initial analysis, the vLLM architecture does not expose the vulnerable code path, meaning vLLM is not impacted in practice, despite the dependency being formally flagged.


Deprecation & Breaking Changes

  • Remove tests/models/utils.py to clean up unused test utilities (#864)
  • VLLM_HPU_HETERO_KV_LAYOUT environment variable is now required for heterogeneous (disaggregated) prefill/decode with NIXL (#867)
  • Remove bucket densification for long context workloads; only edge buckets are applied (#980)

Full Changelog

PR Title Author
#805 Interleaved sliding window fix @rsmyrek
#722 DP: Fix for torch.compile @xuechendi
#770 Add b2b matmul @linoybu
#785 Add FlashAttention online merge in Unified Attention @kzawora-intel
#805 Correct sliding window enabling @jbyczkow
#821 Add support for chunked attention @kfojcik-intel
#831 Resolve qwen25 vl accuracy regression @tvoas
#834 KV cache sharing for HPU @jakub-sochacki
#837 Fix diverge from vllm in multiModalBudget @linoybu
#838 Add dynamic quantization configuration file example @dudilester
#840 Jenkins CI fix for Mistral @iboiko-habana
#850 Revert "skip HPU graphs for long prefills" @adobrzyn
#851 Fix for vLLM #32077 @iboiko-habana
#852 Fix HPU model runner profile_run to work with dynamic kv-cache scales @dudilester
#853 Fix measurement config file generation in calibrate_model.sh @nirda7
#864 Remove unused test utils @microslaw
#867 Enable support for prefill side kv_layout and block_size update @yeonsily
#876 Refactor for vLLM #30623 and small fix for #32238 @iboiko-habana
#886 Implement bucket corrector for Mamba chunk size @jbyczkow
#890 Support for modelopt FP8 quantization format for dense models @skavulya
#897 HPU Granite 4.0-h small implementation @jbyczkow
#905 CODEOWNERS update @kzawora-intel
#909 Support loading q_scale and using fp8_fused_sdpa for MLA prefill @lkk12014402
#917 Fix for hourly KeyError: PlatformEnum.OOT @tzielinski-habana
#920 Update compatibility matrix and refine installation instructions @PatrykWo
#942 Llama4 apply patches + QK flatten pos + perf drop fix @Luca-Calabria
#943 Update Dockerfiles and documentation for v0.15.1 release @PatrykWo
#958 Qwen3_VL - multimodal model embedding fixes @slokesha
#968 Fix for hourly KeyError: PlatformEnum.OOT: 6 @tzielinski-habana
#973 Monkey-patch Attention.forward @tzielinski-habana
#979 Port: Initialization profiling noop @adobrzyn
#980 Remove bucket densification for long ctx; Edge buckets only @kfojcik-intel
#1003 Remove duplicate path @adobrzyn
#1005 Force CPU loading for INC quantization to prevent OOM @kamil-kaczor
#1007 Revert padding value change for block_list and slot list @kamil-kaczor
#1020 Fix INC patching _gate twice @kamil-kaczor
#1029 Fix tensor device mismatch in deepseek @kamil-kaczor
#1030 Adding action.yaml @iboiko-habana
#992 Fix qwen3 vl moe execution failure @shepark
#1014 Fixing redundant transpose in HPUMambaMixer2 @ksmusz
#1022 Fix mamba cumsum padded calculations @jkaniecki
#1024 last_chunk_indices calculations fix @jbyczkow
#1035 use_qk_norm parameter sourced directly from config @rsmyrek
#1038 Back temperature=0 for server as default @iboiko-habana
#1049 Backport RHEL/UBI Dockerfile improvements @PatrykWo

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin! 🎉

vLLM-Gaudi for vLLM-v0.14.1

06 Feb 16:27
c4ecd71

Highlights

This version is based on vLLM 0.14.1 and supports Intel® Gaudi® v1.23.0.

The release enables support for Qwen3-VL and adds initial support for Granite 4.0-h.

What's Changed

New Contributors

Full Changelog: v0.13.0.post1...v0.14.1

vLLM-Gaudi v0.13.0.post1

23 Jan 13:32
906abe3

This version is a hotfix release on top of vLLM-Gaudi for vLLM-v0.13.0.

What's Changed

Full Changelog: v0.13.0...v0.13.0.post1

vLLM-Gaudi for vLLM-v0.13.0

09 Jan 13:13
de3d735

Highlights

This version is based on vLLM 0.13.0 and supports Intel® Gaudi® v1.23.0.

The release includes experimental dynamic quantization for MatMul and KV‑cache operations. This feature improves performance, with minimal expected impact on accuracy. To enable the feature, see the Dynamic Quantization for MatMul and KV‑cache Operations section.
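The enablement flow lives in the section referenced above; as a rough sketch, INC-based quantization on Gaudi is configured through a JSON file pointed to by the QUANT_CONFIG environment variable. The field names below follow common Intel Neural Compressor FP8 config conventions and should be treated as assumptions, not this release's exact schema:

```shell
# Hedged sketch of enabling INC-based quantization via a config file; the
# JSON fields mirror typical Intel Neural Compressor FP8 configs and may
# not match this release exactly -- see the Dynamic Quantization section.
cat > dynamic_quant.json <<'EOF'
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_pcs_pow2",
    "dump_stats_path": "./inc_output/measure"
}
EOF
export QUANT_CONFIG=dynamic_quant.json
vllm serve <model> --quantization inc
```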

This release also introduces support for the following models on Gaudi 3:

Additionally, the following models were successfully validated:

For the list of all supported models, see Validated Models.

Known Bugs

  • At long contexts (≥32k), Llama‑4 (MoE; Scout/Maverick) intermittently hits RuntimeError: shape mismatch in attention/KV cache paths at the prefill→decode boundary; see #680 and #684. A fix will be delivered in a future release.

What's Changed

vLLM-Gaudi for vLLM-v0.11.2

03 Dec 10:52
f9b6446

Highlights

This version is based on vLLM 0.11.2 and supports Intel® Gaudi® v1.22.2 and Intel® Gaudi® v1.23.0.

This release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. It enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators. The plugin is an alternative to the vLLM fork, which reaches end of life with this release and will be deprecated in v1.24.0, remaining functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin.

The plugin provides feature parity with the fork, including mature, production-ready implementations of Automatic Prefix Caching (APC) and async scheduler. Two legacy features - multi-step scheduling and delayed sampling - have been discontinued, as their functionality is now covered by the async scheduler.

For more details on the plugin's implementation, see Plugin System.

To start using the plugin, follow the Basic Quick Start Guide and explore the rest of this documentation.
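For reference, a minimal installation sketch: the clone URL is the project repository, while the editable-install step is an assumption about the plugin's packaging. The Basic Quick Start Guide remains the authoritative procedure:

```shell
# Hedged sketch, not the authoritative procedure -- consult the Basic Quick
# Start Guide. Assumes a Gaudi-enabled environment with vLLM installed.
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
pip install -e .   # editable install of the plugin (assumed packaging)
```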

What's Changed

vLLM-Gaudi for vLLM-v0.10.1

18 Aug 14:24
ab65f9b

Add t.compile config (#62)

Signed-off-by: Kacper Pietkun <kpietkun@habana.ai>

vLLM-Gaudi for vLLM-v0.10.0

01 Aug 18:19

update README to use v0.10.0 vllm

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>