Releases: vllm-project/vllm-gaudi
vLLM-Gaudi for vLLM-v0.16.0
vLLM Gaudi Plugin v0.16.0 Release Notes
Overview
This release is based on vLLM v0.16.0 and supports Intel® Gaudi® Software v1.23.0.
Highlights
- Added validated support for the following models: Qwen3-VL, DeepSeek OCR, MiniMax-M2, Ovis, Mistral-Large-3, and Hunyuan V1.
- Improved performance through backported bug fixes, Mamba kernel improvements, and faster model weight loading.
- Forced CPU loading for INC quantization to prevent out-of-memory (OOM) errors during weight loading.
- Introduced various improvements to UBI/RHEL Docker images and server defaults, and applied Coverity fixes.
New Model Support and Updates
- Change Qwen3-VL to use HPUMMEncoderAttention (#1060)
- Enable caching for Qwen3 MoE op (#1068)
- Fix Qwen3-VL MoE execution failure (#1028)
- Enable DeepSeek OCR model (#954)
- Add dotsocr and seedoss (#977)
- Add MiniMax-M2 support (#964)
- Add Ovis model support with default buckets (#846)
- Enable Mistral-Large-3-675B-Instruct-2512 model (#871)
- Add Hunyuan V1 model support (Dense & MoE bf16/FP8) (#875)
Performance
- [GAUDISW-246429] hpu_mamba_chunk_scan_combined_varlen improvements (#1074)
- Improve model weight loading speed (#807)
- Fix warmup regression (#962)
Attention and KV Cache
- Instead of changing KV cache shape, transpose state in conv1d (#1065)
- [GAUDISW-245713] Remove bucket densification for long ctx; Edge buckets only for long ctx (#915)
- Temporarily disable chunked attention (#981)
- Multimodal model embedding fixes (#759)
- [CT] Add FP8 GQA Support (#874)
- [CT] Fix CT Config to honor `fp8_inc` KV cache dtype (#929)
Quantization
- Force CPU loading for INC quantization to prevent OOM during weight loading (#1055)
- Fix INC patching `_gate` twice (#955)
- [GAUDISW-246337] Added config with scale method `maxabs_pcs_pow2` for dynamic quant (#949)
Plugin Core
- Source `use_qk_norm` parameter directly from config (#1084)
- Fix `last_chunk_indices` calculations (#1023)
- Fix mamba cumsum padded calculations (#1021)
- Fix redundant transpose in HPUMambaMixer2 (#1015)
- Fix HPUMambaMixer2 inheritance dependency (#1016)
- Add _MAMBA_PAD_BLOCK_ID (#951)
- Enable OffloadingConnector on HPU (#827)
- GPT OSS Integration Code (#887)
- Fix async scheduler + unified attention failure on Qwen2.5-VL (#931)
- Fix undefined behavior in copy_blocks when source and destination blocks overlap (#329)
Serving and Infrastructure
- Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile (#1056)
- Improve Docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) (#1041)
- Add `libfdt-devel` to UBI Dockerfile (#974)
- Fix device detection when `ENABLE_CONSOLE=true` (#963)
Fixes
- Don't destroy server with logprobs (#1098)
- Coverity fix including security, null-like values, duplicates and typos (#1094)
- Fix param mismatch for `compute_nixl_compatibility_hash()` (#1087)
- Fix Topk Calculation in GPTOSS (#970)
- Fix reported version of vLLM (#811)
- Fixing _compile_region for nested attributes (#956)
- Fix sampler & TP>1 recompilations (#935)
- Restore default `temperature=0` for the server after #32723 (#1037)
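Because the server-side sampling default changed upstream (#32723) before being restored here, clients that depend on greedy decoding may want to pin `temperature` explicitly rather than rely on the server default. A minimal illustrative sketch; the helper, model id, and field values are hypothetical, not part of this release:

```python
# Hypothetical sketch: build an OpenAI-style /v1/completions request body
# with an explicit temperature so decoding behavior does not silently
# change if the server default flips between releases.
# "my-model" is a placeholder model id; no request is actually sent here.

def make_completion_payload(prompt: str, temperature: float = 0.0) -> dict:
    """Return a request body with an explicit temperature
    (0.0 = greedy decoding, matching the restored server default)."""
    return {
        "model": "my-model",         # placeholder model id
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": temperature,  # explicit, not relying on the server default
    }

payload = make_completion_payload("Hello, Gaudi!")
```

Pinning the value client-side also keeps behavior stable across server versions that predate this fix.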
Full Changelog
| PR | Title | Author |
|---|---|---|
| #1098 | Don't destroy server with logprobs | @adobrzyn |
| #1094 | Coverity fix including security, null-like values, duplicates and typos | @adobrzyn |
| #1087 | fix param mismatch for compute_nixl_compatibility_hash() | @hsubramony |
| #1060 | Change Qwen3VL to use HPUMMEncoderAttention | @jiminha |
| #1068 | Enable caching for qwen3 moe op | @shepark |
| #1084 | use_qk_norm parameter sourced directly from config | @rsmyrek |
| #1056 | Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile | @PatrykWo |
| #1037 | Back temperature=0 for server as default after #32723 | @iboiko-habana |
| #1089 | Change upstream last_good_commit 89a77b10846fd96273cce78d86d2556ea582d26e | @iboiko-habana |
| #1041 | Improve docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) | @nngokhale |
| #1080 | Port of #1050 for CI unblocking | @iboiko-habana |
| #1074 | hpu_mamba_chunk_scan_combined_varlen improvements | @PatrykWilczewski |
| #1057 | Add ci test for granite-4-h-small to v0.16.0 | @microslaw |
| #1065 | Instead of changing kv cache shape, transpose state in conv1d | @jmamzax |
| #1023 | Fix last_chunk_indices calculations | @jbyczkow |
| #1021 | Fix mamba cumsum padded calculations | @jkaniecki |
| #999 | Fix redundant transpose in HPUMambaMixer2 (#1015) | @ksmusz |
| #1019 | Fixes for #33559 and #34103 | @iboiko-habana |
| #1055 | Force CPU loading for INC quantization to prevent OOM during weight loading | @agrabow |
| #1016 | Fix HPUMambaMixer2 inheritance dependency | @jbyczkow |
| #1028 | Fix qwen3 vl moe execution failure | @shepark |
| #1042 | Adding ci_calibration_smoke_tests.sh into v0.16.0 | @iboiko-habana |
| #971 | UBI images improvements | @ghandoura |
| #954 | Enable deepseek ocr model | @HeJunyan |
| #977 | Add dotsocr and seedoss | @tianyuan211 |
| #975 | Monkey-patch of Attention.forward | @tzielinski-habana |
| #824 | Adjust pre-merge workflow to support merge queue trigger event | @bmyrcha |
| #970 | Fix Topk Calculation in GPTOSS | @SKRohit |
| #981 | Temporarily disable chunked attention | @adobrzyn |
| #982 | adding FIX_FOR_VLLM_CUSTOM to CI | @iboiko-habana |
| #974 | Add libfdt-devel (new habanalabs-thunk dependency) to ubi dockerfile | @mmuszynskihabana |
| #930 | Fix for individual unit tests | @tzielinski-habana |
| #969 | CI cleanup 2 | @microslaw |
| ... |
vLLM-Gaudi for vLLM-v0.15.1
vLLM Gaudi Plugin v0.15.1 Release Notes
Overview
This release is based on vLLM v0.15.1 and supports Intel® Gaudi® Software v1.23.0.
Highlights
- Added validated support for Granite 4.0-h and Qwen3-VL (dense and MoE variants) on Intel Gaudi 3. Additionally, added significant Llama 4 stability fixes.
- Introduced full chunked prefill attention support for HPU, enabling better memory utilization on long sequences (#821).
- Integrated FlashAttention online merge in Unified Attention for improved prefill performance (#785).
- Added KV cache sharing support for HPU, enabling more efficient multi-query scenarios (#834).
- Introduced support for NVIDIA ModelOpt FP8 quantization format for dense models (#890).
- Added HPU ops for Mamba mixer2, causal conv1d, and SSD combined kernels enabling hybrid SSM-Transformer models, such as Granite 4.0-h (#886, #897).
- Added back-to-back matmul operation for improved Multi-Latent Attention (MLA) performance (#770).
- Introduced prefill-side KV layout and block size support for heterogeneous (disaggregated) inference via NIXL (#867).
New Model Support
- Add validated support for Qwen3-VL-32B-Instruct, Qwen3-VL-32B-Thinking, and Qwen3-VL-235B-A22B variants (Instruct, Thinking, FP8) on Gaudi 3 (#958)
- Register the `Qwen3VLMoeForConditionalGeneration` model for Qwen3-VL MoE variants (#958)
- Add IBM Granite 4.0-h small (hybrid SSM-Transformer) implementation for HPU (#897)
Performance
- Add FlashAttention online merge in Unified Attention for faster prefill (#785)
- Add back-to-back (b2b) matmul for improved MLA attention performance (#770)
- Support loading `q_scale` and using `fp8_fused_sdpa` for MLA prefill (#909)
- Remove bucket densification for long context; apply edge buckets only for long context scenarios (#980)
- Implement bucket corrector for Mamba chunk size (#886)
- Revert "skip HPU graphs for long prefills" to restore graph capture on long sequences (#850)
- Port initialization profiling noop to reduce startup overhead (#979)
Attention & KV Cache
- Add support for chunked attention on HPU (#821)
- Add KV cache sharing for HPU (#834)
- Enable support for prefill-side `kv_layout` and `block_size` update for heterogeneous runs (#867)
- Add new `VLLM_HPU_HETERO_KV_LAYOUT` environment variable to control heterogeneous KV layout (#867)
- Add heterogeneous HPU NIXL connector for disaggregated prefill/decode (#867)
- Add `hpu_attention` ops module with attention operation implementations (#785)
- Monkey-patch `Attention.forward` for HPU-specific behavior (#973)
- Platform: declare `support_hybrid_kv_cache` capability (#834)
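The heterogeneous KV layout introduced in #867 is controlled through the `VLLM_HPU_HETERO_KV_LAYOUT` environment variable, which must be set before the engine process initializes the platform plugin. A hedged sketch; the accepted values are not listed in these notes, so a placeholder is used:

```python
import os

# Hypothetical sketch: environment variables consumed by the plugin, such as
# VLLM_HPU_HETERO_KV_LAYOUT, must be exported before engine startup.
# "<layout-placeholder>" is NOT a real value -- consult the plugin
# documentation for PR #867 for the supported layout strings.
os.environ.setdefault("VLLM_HPU_HETERO_KV_LAYOUT", "<layout-placeholder>")
```

In a disaggregated prefill/decode deployment, the same variable would typically be exported in the launch environment of each worker rather than set from Python.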
Quantization
- Add support for ModelOpt FP8 quantization format for dense models (#890)
- Add `modelopt` to platform supported quantization list (#890)
- Add dynamic quantization configuration file example (#838)
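With `modelopt` registered in the platform's supported quantization list (#890), a ModelOpt-FP8 checkpoint would be selected through vLLM's usual `quantization` engine argument. A minimal sketch under that assumption; the checkpoint path is a placeholder and no engine is constructed here:

```python
# Hypothetical sketch: engine arguments one might pass to select the
# ModelOpt FP8 quantization method added in #890. The model path is a
# placeholder; passing these kwargs to an actual vLLM engine is not done here.
engine_args = {
    "model": "/path/to/modelopt-fp8-checkpoint",  # placeholder path
    "quantization": "modelopt",                   # method added to the supported list
}
```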
Plugin Core
- Register new ops: `hpu_attention`, `hpu_grouped_topk_router`, `hpu_mamba_mixer2`, and `hpu_modelopt` (#785, #897, #890)
- Add `ops_selector` module for HPU operation routing (#897)
- Add `pytorch_implementation` module with pure-PyTorch fallback ops (#897)
- Add `causal_conv1d_pytorch` and `ssd_combined` ops for SSM/Mamba support (#897)
- Add `hpu_grouped_topk_router` for MoE grouped top-k routing (#897)
- Source `use_qk_norm` parameter directly from config (#1035)
Serving & Infrastructure
- Add GitHub Actions `action.yaml` for PR detail workflows (#1030)
- Add CI calibration smoke tests script (#853)
- Rename and consolidate CI e2e discoverable tests (#840)
- Fix Jenkins CI for Mistral model tests (#840)
- Restore `temperature=0` as server default after vLLM #32723 (#1038)
- Backport RHEL/UBI Dockerfile improvements (#1049)
Fixes
- Fix Llama 4 apply-patches flow, QK flatten positional encoding, and address performance drop (#942)
- Fix Llama 4 shape mismatch for 32k+ context window (#842, #855)
- Fix Qwen2.5-VL accuracy regression (#831)
- Fix Qwen3-VL multimodal model embedding issues (#958)
- Fix DeepSeek tensor device mismatch (#1029)
- Force CPU loading for INC quantization to prevent OOM during weight loading (#1005)
- Fix INC patching `_gate` twice (#955, #1020)
- Fix HPU model runner `profile_run` to work with dynamic kv-cache scales (#852)
- Fix measurement config file generation in `calibrate_model.sh` scripts (#853)
- Revert padding value change for `block_list` and slot list (#1007)
- Fix multimodal budget divergence from upstream vLLM (#837)
- Fix hourly `KeyError: <PlatformEnum.OOT: 6>` error (#968)
- Fix `torch.compile` in data-parallel mode (#722)
- Correct sliding window enabling logic (#805)
- Interleaved sliding window fix (#805)
- Fix Mamba cumsum padded calculations (#1022)
- Fix redundant transpose in HPUMambaMixer2 (#999, #1014)
- Fix Qwen3-VL MoE execution failure (#992)
- Fix `last_chunk_indices` calculations (#1024)
Security
CVE-2025-69872 (diskcache 5.6.3): vLLM currently depends on diskcache version 5.6.3, which has been reported as affected by CVE-2025-69872. The vulnerability remains unresolved upstream as of the day of this release. According to initial analysis, the vLLM architecture does not expose the vulnerable code path, meaning vLLM is not impacted in practice, despite the dependency being formally flagged.
Deprecation & Breaking Changes
- Remove `tests/models/utils.py` to clean up unused test utilities (#864)
- `VLLM_HPU_HETERO_KV_LAYOUT` environment variable is now required for heterogeneous (disaggregated) prefill/decode with NIXL (#867)
- Remove bucket densification for long context workloads; only edge buckets are applied (#980)
Full Changelog
| PR | Title | Author |
|---|---|---|
| #805 | Interleaved sliding window fix | @rsmyrek |
| #722 | DP: Fix for torch.compile | @xuechendi |
| #770 | Add b2b matmul | @linoybu |
| #785 | Add FlashAttention online merge in Unified Attention | @kzawora-intel |
| #805 | Correct sliding window enabling | @jbyczkow |
| #821 | Add support for chunked attention | @kfojcik-intel |
| #831 | Resolve qwen25 vl accuracy regression | @tvoas |
| #834 | KV cache sharing for HPU | @jakub-sochacki |
| #837 | Fix diverge from vllm in multiModalBudget | @linoybu |
| #838 | Add dynamic quantization configuration file example | @dudilester |
| #840 | Jenkins CI fix for Mistral | @iboiko-habana |
| #850 | Revert "skip HPU graphs for long prefills" | @adobrzyn |
| #851 | Fix for vLLM #32077 | @iboiko-habana |
| #852 | Fix HPU model runner profile_run to work with dynamic kv-cache scales | @dudilester |
| #853 | Fix measurement config file generation in calibrate_model.sh | @nirda7 |
| #864 | Remove unused test utils | @microslaw |
| #867 | Enable support for prefill side kv_layout and block_size update | @yeonsily |
| #876 | Refactor for vLLM #30623 and small fix for #32238 | @iboiko-habana |
| #886 | Implement bucket corrector for Mamba chunk size | @jbyczkow |
| #890 | Support for modelopt FP8 quantization format for dense models | @skavulya |
| #897 | HPU Granite 4.0-h small implementation | @jbyczkow |
| #905 | CODEOWNERS update | @kzawora-intel |
| #909 | Support loading q_scale and using fp8_fused_sdpa for MLA prefill | @lkk12014402 |
| #917 | Fix for hourly KeyError: PlatformEnum.OOT | @tzielinski-habana |
| #920 | Update compatibility matrix and refine installation instructions | @PatrykWo |
| #942 | Llama4 apply patches + QK flatten pos + perf drop fix | @Luca-Calabria |
| #943 | Update Dockerfiles and documentation for v0.15.1 release | @PatrykWo |
| #958 | Qwen3_VL - multimodal model embedding fixes | @slokesha |
| #968 | Fix for hourly KeyError: PlatformEnum.OOT: 6 | @tzielinski-habana |
| #973 | Monkey-patch Attention.forward | @tzielinski-habana |
| #979 | Port: Initialization profiling noop | @adobrzyn |
| #980 | Remove bucket densification for long ctx; Edge buckets only | @kfojcik-intel |
| #1003 | Remove duplicate path | @adobrzyn |
| #1005 | Force CPU loading for INC quantization to prevent OOM | @kamil-kaczor |
| #1007 | Revert padding value change for block_list and slot list | @kamil-kaczor |
| #1020 | Fix INC patching _gate twice | @kamil-kaczor |
| #1029 | Fix tensor device mismatch in deepseek | @kamil-kaczor |
| #1030 | Adding action.yaml | @iboiko-habana |
| #992 | Fix qwen3 vl moe execution failure | @shepark |
| #1014 | Fixing redundant transpose in HPUMambaMixer2 | @ksmusz |
| #1022 | Fix mamba cumsum padded calculations | @jkaniecki |
| #1024 | last_chunk_indices calculations fix | @jbyczkow |
| #1035 | use_qk_norm parameter sourced directly from config | @rsmyrek |
| #1038 | Back temperature=0 for server as default | @iboiko-habana |
| #1049 | Backport RHEL/UBI Dockerfile improvements | @PatrykWo |
New Contributors
Welcome to the following first-time contributors to vLLM Gaudi Plugin! 🎉
- @linoybu — b2b matmul and multimodal budget fix (#770, #837)
- @microslaw — Test utilities cleanup (#864)
- @nirda7 — Calibration script fixes (#853)
- @tzielinski-habana — Platform stability fixes and Attention.forward monkey-patch (#917, #968, #973)
- @yeonsily — Heterogeneous KV layout support (#867)
- @jkaniecki — Mamba cumsum padded calculations fix (#1022)
- @shepark — Qwen3-VL MoE execution fix (#992)
vLLM-Gaudi for vLLM-v0.14.1
Highlights
This version is based on vLLM 0.14.1 and supports Intel® Gaudi® Software v1.23.0.
The release enables support for Qwen3-VL and initial support for Granite 4.0-h.
What's Changed
- Update action to change CODEOWNERS for new release branch by @PatrykWo in #745
- Apply hw aligned scale by @lkk12014402 in #734
- [Attention Metadata Overhaul 1/N] Extract metadata update to HPUAttentionMetadataProcessor by @kzawora-intel in #526
- [FIX_FOR_VLLM_LATEST] Quick fix for PR30684 by @iboiko-habana in #742
- Change neural version by @adobrzyn in #754
- Fix for PR30684 by @iboiko-habana in #757
- [GAUDISW-243560] Monkey-patching _get_attn_scale for the Llama4Attention layer by @rsmyrek in #758
- [FIX_FOR_VLLM_LATEST] tokenizer fix for #31285 by @iboiko-habana in #764
- Fix async_scheduling + batched prefill by @tianmu-li in #740
- Documentation: Fix missing back navigation arrow on mobile devices by @mhelf-intel in #766
- [FIX_FOR_VLLM_LATEST] Fix structured_output after use_async_scheduling default usage in #27614 by @iboiko-habana in #768
- [GAUDISW-244336] Add missing long ctx prompt buckets by @kfojcik-intel in #739
- Fix repetition penalty crash in decode phase by @pawel-olejniczak in #769
- Update lmcache examples by @hsubramony in #748
- [Bugfix] Handle spec decode optionals in unified batch by @kzawora-intel in #782
- Load KV scales for FP8 MLA by @yiliu30 in #763
- [FIX_FOR_VLLM_LATEST] Fix block_size used in eagle by @pawel-olejniczak in #773
- WA shared bias in UA by @adobrzyn in #727
- skip HPU graphs for long prefills by @yangulei in #780
- create HPUConv3D class, which replaces unfold with view. by @skaulintel in #786
- Fix the docker image path by @mhelf-intel in #691
- Fix for Llama4 static quantization by @vidyasiv in #707
- Exponential max number in range not over bmax by @adobrzyn in #795
- Fix Mixtral 8x22B benchmark error, Add EXTRA_BENCH_ARGS by @nngokhale in #796
- Add ucx test by @pi314ever in #711
- Unified Attention - multi-step low-level profiling by @kzawora-intel in #791
- Prefill batching logic to handle chunked prefill/prefix caching for HPU by @hlin99 in #753
- Update Dockerfiles and workflows for v1.23.0 release, including PyTor… by @PatrykWo in #802
- modify conv3d permute by @skaulintel in #794
- Add `MoeMatmul` to dynamic op support list by @yiliu30 in #817
- [FIX_FOR_VLLM_LATEST] fixes for #31747, #30519, #32003, #31916 and test cases disablement for #31998 and #32254 by @iboiko-habana in #797
- No num seqs over max in fallback buckets by @adobrzyn in #816
- Implement profile_run method in HPU model runner by @xwu-intel in #775
- Upgrade transformers>= 4.56.0, <5 by @iboiko-habana in #767
- Update CODEOWNERS by @PatrykWo in #808
- Use actual block count for bucketing in contiguous PA mode by @pawel-olejniczak in #792
- disable async scheduler when spec decode is on for hpu_model_runner by @iboiko-habana in #825
- fix ubi docker: use --nobest flag to resolve boost dependency conflic… by @PatrykWo in #810
- Doc updates cherry-picked from 0.13.0 by @mhelf-intel in #799
- Fix dummy_mm_item TypeError when warmup MM model by @jinyouzhi in #822
- Resolve crash when using caching with mm models by @tvoas in #823
- Fix INC patch for new version by @yiliu30 in #829
- Enable HPU Fused SDPA for Qwen3-VL vision attention using attention masks by @slokesha in #787
- Exponential max decode blocks fix for non-contiguous pa scenario by @adobrzyn in #818
- Add conditional runner selection based on PR title for discover_runne… by @PatrykWo in #841
- fix empty buckets issue for enforce eager mode by @yangulei in #761
- fix calibration for fp8 MoE models by @yangulei in #832
- Update configurations for Bielik-4.5B model integration by @PatrykWo in #804
- [GAUDISW-244752] add dynamic scale for V-Cache on Hidden dim by @dudilester in #749
- Added Qwen3 Test by @slokesha in #736
- Port: Resolve qwen25 vl accuracy regression #831 by @adobrzyn in #869
- port remove gather and scatter to v0.14.0 release by @skaulintel in #858
- Correct sliding window enabling by @jbyczkow in #854
- Implement bucket corrector for Mamba chunk size - v0.14.1 by @jbyczkow in #885
- Revert "skip HPU graphs for long prefills" (#850) by @adobrzyn in #888
- Cherry-picks to enable Llama4 Maverick by @rsmyrek in #882
- cherry-pick chunked attention from #821 + 32k+ context window fix from #855 by @Luca-Calabria in #881
- Fix a shape mismatch in mrope position slicing by @shepark in #894
- Hpu granite 4.0-h small implementation by @jbyczkow in #883
- Fix MultiModalBudget error by @adobrzyn in #892
- Qwen3vl accuracy fixes by @libinta in #884
- Granite 4.0-h small - cleanup by @jbyczkow in #900
- Fix Mamba Metadata padding by @jbyczkow in #901
- Update qwen2_5_vl attention forward by @shepark in #908
- Fix warmup for granite40 by @michalkuligowski in #899
- Fix for Llama4 Maverick performance drop by @jkaniecki in #904
- Fix for coverity by @adobrzyn in #910
- Enable VLLM_USE_NAIVE_MAMBA_CACHE_SHARING by default by @jbyczkow in #922
- Initialization profiling noop by @michalkuligowski in #916
- Remove bucket densification for long ctx; Edge buckets only for long ctx by @kfojcik-intel in #918
- Update Dockerfiles and documentation for v0.14.1 release by @PatrykWo in #919
New Contributors
- @skaulintel made their first contribution in #786
- @vidyasiv made their first contribution in #707
- @pi314ever made their first contribution in #711
- @hlin99 made their first contribution in #753
- @jinyouzhi made their first contribution in #822
- @shepark made their first contribution in #894
Full Changelog: v0.13.0.post1...v0.14.1
vllm-Gaudi v0.13.0.post1
This version is a hotfix release on top of vLLM-Gaudi for vLLM-v0.13.0.
What's Changed
- Port of Add MoeMatmul to dynamic op support list #817 by @iboiko-habana in #819
- Add support for chunked attention (#597) by @kfojcik-intel in #809
- Port of #829: Fix INC patch for new version by @iboiko-habana in #839
- Fix Llama4 shape mismatch for 32k+ context window by @afierka-intel in #842
Full Changelog: v0.13.0...v0.13.0.post1
vLLM-Gaudi for vLLM-v0.13.0
Highlights
This version is based on vLLM 0.13.0 and supports Intel® Gaudi® v1.23.0.
The release includes experimental dynamic quantization for MatMul and KV‑cache operations. This feature improves performance, with minimal expected impact on accuracy. To enable the feature, see the Dynamic Quantization for MatMul and KV‑cache Operations section.
This release also introduces support for the following models supported on Gaudi 3:
- bielik-11b-v2.6-instruct
- bielik-1.5b-v3.0-instruct
- bielik-4.5b-v3.0-instruct
- deepseek-ai/DeepSeek-R1-Distill-Llama-70B
- Qwen/Qwen2.5-7B-Instruct
- Qwen/Qwen2.5-14B-Instruct
- Qwen/Qwen2.5-32B-Instruct
- Qwen/Qwen2.5-VL-7B-Instruct
- Qwen/Qwen3-0.6B
Additionally, the following models were successfully validated:
- meta-llama/Meta-Llama-3.1-8B
- meta-llama/Meta-Llama-3.1-70B
- meta-llama/Meta-Llama-3.1-70B-Instruct
- meta-llama/Meta-Llama-3.1-405B
- meta-llama/Meta-Llama-3.1-405B-Instruct
- meta-llama/Meta-Llama-3.3-70B
- mistralai/Mistral-7B-Instruct-v0.3
For the list of all supported models, see Validated Models.
Known bugs
- At long contexts (≥32k), Llama‑4 (MoE; Scout/Maverick) intermittently hits `RuntimeError: shape mismatch` in attention/KV cache paths at the prefill→decode boundary, caused by issues #680 and #684. A fix will be provided in a newer version.
What's Changed
- add commit-id to distinguish image and container for each PR by @xuechendi in #85
- [Upstream fix] Fix after #23041 from upstream by @adobrzyn in #87
- Change warmup scenario for execute dummy scenario by @adobrzyn in #54
- remove enable_prompt_adapter in test to fix by @xuechendi in #91
- Fix jenkins - remove failed test and fix later / update API by @xuechendi in #79
- [Upstream fix] Fix after #23262 from upstream - Make new_block_ids None if empty by @adobrzyn in #93
- Enable multimodal support + qwen2.5-vl by @attafosu in #92
- Fix upstream PR 22668 that added additional arg to is_kv_cache_dtype_supported by @mswiniarsk in #96
- Port defragmentation support from vllm-fork PR #1568 by @madamczyk-intel in #94
- [Upstream fix] Fix after #22711 by @adobrzyn in #102
- Reduce number of compilations when dynamic shapes is used by @anko-intel in #90
- Warmup fix - for non contiguous PA runs, don't take more context blocks than possible by @adobrzyn in #97
- [UT] Fix test args for bucketing tests by @adobrzyn in #105
- [SW-236088] Add sampler unit tests by @kamil-kaczor in #99
- Avoid copying dynamic slice of sampling_metadata tensors by @mswiniarsk in #88
- Fix mm encoder inputs for mix-modalities in input batch by @attafosu in #103
- Fix decode profiling by @kamil-kaczor in #106
- fix upstream PR 23749 by @xuechendi in #108
- Fix the failing introduced by upstream 22685 by @xuechendi in #110
- fix an argument issue introduced by recent vllm upstream and add CI by @xuechendi in #111
- Port G2 scaling convert from vllm-fork #1505 by @xuechendi in #112
- Enable Spec Decode for HPU v1 - Part1(basic workflow + eagle) by @xuechendi in #81
- fix qwen3-30B-A3B-FP8 - The number of dims cannot be packed into CompleteArgumentSpec:65535 by @xuechendi in #113
- [FIX HOURLY Failure] transformer 4.56.0 is not compatible with INC by @xuechendi in #117
- Remove test_load_model_weights_inplace by @kzawora-intel in #48
- [BUG fix]Fix spec_decode introduced long graph compilation issue by @xuechendi in #127
- [Bugfix] Warmup with continuous PA by @adobrzyn in #126
- Disable warmup for defragmentator by @mswiniarsk in #132
- Merging vllm docker implementation to vllm-gaudi (v1) by @PatrykWo in #125
- Enable embedding feature by @slokesha in #120
- Revert "Enable embedding feature" by @adobrzyn in #140
- [Bugfix] Remove reqs without logits - merge prefill case by @adobrzyn in #137
- Update CODEOWNERS by @mgawarkiewicz-intel in #144
- Fix warmup break when max decode bucket bs > max num seq by @taran2210 in #107
- Add tests for custom op registration by @Kacper-Pietkun in #109
- Enable embedding feature by @slokesha in #141
- Update CODEOWNERS file by @vivekgoe in #143
- [Merged Prefill] Warmup for merged prefill by @adobrzyn in #104
- Experimental support for Unified Attention by @madamczyk-intel in #133
- Introducing sampler warmup as separate warmup step by @ksmusz in #131
- Add support for LoRA by @vivekgoe in #51
- Add data parallel support by @wuxun-zhang in #80
- Increase allowed line length to 120 + reformat accordingly by @kzawora-intel in #130
- [FIX HOURLY]Remove DP test from Hourly by @xuechendi in #147
- Update CODEOWNERS by @afierka-intel in #135
- Enable sampler compilation by @Kacper-Pietkun in #95
- Add DP into CI by @wuxun-zhang in #146
- Add TESTOWNERS by @kzawora-intel in #153
- Patch FusedMoE forward to avoid dynamo recompilations by @kdamaszk in #158
- [CI] Jenkins false positive bugfix by @kzawora-intel in #159
- Fix dummy decode input for DP by @wuxun-zhang in #151
- [Quick fix for CI]fix CI break on Qwen2.5-vl and update docker image by @xuechendi in #161
- initial port for nixl by @hsubramony in #100
- update nixl version in requirements by @hsubramony in #163
- Re-quantize FP8 model with INC by @yiliu30 in #114
- [Feature][SpecDecode][Part2] Eagle3,MTP enabling, accept_rate improvement by @xuechendi in #142
- [BUGFIX] qwen2.5-vl failed after PR24444, provide a temp solution by @xuechendi in #162
- Reenabling llama4 models by @afierka-intel in #128
- Allow building vllm-plugin docker with upstream torch by @mmuszynskihabana in #155
- [HOURLY FIX] For upstream PR-24548 changes by @xuechendi in #166
- [BUGFIX] warmup failed after PR104, propose fix in this PR by @xuechendi in #148
- TESTOWNERS update by @adobrzyn in http...
vLLM-Gaudi for vLLM-v0.11.2
Highlights
This version is based on vLLM 0.11.2 and supports Intel® Gaudi® v1.22.2 and Intel® Gaudi® v1.23.0.
This release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. It enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators. The plugin is an alternative to the vLLM fork, which reaches end of life with this release and will be deprecated in v1.24.0, remaining functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin.
The plugin provides feature parity with the fork, including mature, production-ready implementations of Automatic Prefix Caching (APC) and async scheduler. Two legacy features - multi-step scheduling and delayed sampling - have been discontinued, as their functionality is now covered by the async scheduler.
For more details on the plugin's implementation, see Plugin System.
To start using the plugin, follow the Basic Quick Start Guide and explore the rest of this documentation.
What's Changed
- add commit-id to distinguish image and container for each PR by @xuechendi in #85
- [Upstream fix] Fix after #23041 from upstream by @adobrzyn in #87
- Change warmup scenario for execute dummy scenario by @adobrzyn in #54
- remove enable_prompt_adapter in test to fix by @xuechendi in #91
- Fix jenkins - remove failed test and fix later / update API by @xuechendi in #79
- [Upstream fix] Fix after #23262 from upstream - Make new_block_ids None if empty by @adobrzyn in #93
- Enable multimodal support + qwen2.5-vl by @attafosu in #92
- Fix upstream PR 22668 that added additional arg to is_kv_cache_dtype_supported by @mswiniarsk in #96
- Port defragmentation support from vllm-fork PR #1568 by @madamczyk-intel in #94
- [Upstream fix] Fix after #22711 by @adobrzyn in #102
- Reduce number of compilations when dynamic shapes is used by @anko-intel in #90
- Warmup fix - for non contiguous PA runs, don't take more context blocks than possible by @adobrzyn in #97
- [UT] Fix test args for bucketing tests by @adobrzyn in #105
- [SW-236088] Add sampler unit tests by @kamil-kaczor in #99
- Avoid copying dynamic slice of sampling_metadata tensors by @mswiniarsk in #88
- Fix mm encoder inputs for mix-modalities in input batch by @attafosu in #103
- Fix decode profiling by @kamil-kaczor in #106
- fix upstream PR 23749 by @xuechendi in #108
- Fix the failing introduced by upstream 22685 by @xuechendi in #110
- fix an argument issue introduced by recent vllm upstream and add CI by @xuechendi in #111
- Port G2 scaling convert from vllm-fork #1505 by @xuechendi in #112
- Enable Spec Decode for HPU v1 - Part1(basic workflow + eagle) by @xuechendi in #81
- fix qwen3-30B-A3B-FP8 - The number of dims cannot be packed into CompleteArgumentSpec:65535 by @xuechendi in #113
- [FIX HOURLY Failure] transformer 4.56.0 is not compatible with INC by @xuechendi in #117
- Remove test_load_model_weights_inplace by @kzawora-intel in #48
- [BUG fix]Fix spec_decode introduced long graph compilation issue by @xuechendi in #127
- [Bugfix] Warmup with continuous PA by @adobrzyn in #126
- Disable warmup for defragmentator by @mswiniarsk in #132
- Merging vllm docker implementation to vllm-gaudi (v1) by @PatrykWo in #125
- Enable embedding feature by @slokesha in #120
- Revert "Enable embedding feature" by @adobrzyn in #140
- [Bugfix] Remove reqs without logits - merge prefill case by @adobrzyn in #137
- Update CODEOWNERS by @mgawarkiewicz-intel in #144
- Fix warmup break when max decode bucket bs > max num seq by @taran2210 in #107
- Add tests for custom op registration by @Kacper-Pietkun in #109
- Enable embedding feature by @slokesha in #141
- Update CODEOWNERS file by @vivekgoe in #143
- [Merged Prefill] Warmup for merged prefill by @adobrzyn in #104
- Experimental support for Unified Attention by @madamczyk-intel in #133
- Introducing sampler warmup as separate warmup step by @ksmusz in #131
- Add support for LoRA by @vivekgoe in #51
- Add data parallel support by @wuxun-zhang in #80
- Increase allowed line length to 120 + reformat accordingly by @kzawora-intel in #130
- [FIX HOURLY]Remove DP test from Hourly by @xuechendi in #147
- Update CODEOWNERS by @afierka-intel in #135
- Enable sampler compilation by @Kacper-Pietkun in #95
- Add DP into CI by @wuxun-zhang in #146
- Add TESTOWNERS by @kzawora-intel in #153
- Patch FusedMoE forward to avoid dynamo recompilations by @kdamaszk in #158
- [CI] Jenkins false positive bugfix by @kzawora-intel in #159
- Fix dummy decode input for DP by @wuxun-zhang in #151
- [Quick fix for CI]fix CI break on Qwen2.5-vl and update docker image by @xuechendi in #161
- initial port for nixl by @hsubramony in #100
- update nixl version in requirements by @hsubramony in #163
- Re-quantize FP8 model with INC by @yiliu30 in #114
- [Feature][SpecDecode][Part2] Eagle3,MTP enabling, accept_rate improvement by @xuechendi in #142
- [BUGFIX] qwen2.5-vl failed after PR24444, provide a temp solution by @xuechendi in #162
- Reenabling llama4 models by @afierka-intel in #128
- Allow building vllm-plugin docker with upstream torch by @mmuszynskihabana in #155
- [HOURLY FIX] For upstream PR-24548 changes by @xuechendi in #166
- [BUGFIX] warmup failed after PR104, propose fix in this PR by @xuechendi in #148
- TESTOWNERS update by @adobrzyn in #165
- [TEMP-WA] Skip Qwen3-30B-A3B in tests - Bug introduced in upstream #24772 by @attafosu in #168
- [CI FIX]Fix issue introduced by upstream PR #23974 by @xuechendi in #172
- [CI FIX] Fix issue introduced by upstream #24745 by @xuechendi in #174
- [BUG][Disable CI] Disable DP test due recent upstream change failed HPU DP by @xuechendi in #177
- Fully overlap model execution by @tianmu-li in #134
- Added fix for VLLM_WEIGHT_LOAD_FORCE_SYNC by @tianmu-li in #173
- Introduce VLLM_SCALE_ADJUSTMENT by @xinyu-intel in #164
- Support Ray distributed executor by @xinyu-intel in #169
- Bug f...
vLLM-Gaudi for vLLM-v0.10.1
Add t.compile config (#62) Signed-off-by: Kacper Pietkun <kpietkun@habana.ai>
vLLM-Gaudi for vLLM-v0.10.0
update README to use v0.10.0 vllm Signed-off-by: Chendi.Xue <chendi.xue@intel.com>