Releases: vllm-project/vllm-gaudi

vLLM-Gaudi for vLLM-v0.16.0

06 Mar 13:27
v0.16.0
da2ce40

vLLM Gaudi Plugin v0.16.0 Release Notes

Overview

This release is based on vLLM v0.16.0 and supports Intel® Gaudi® Software v1.23.0.

Highlights

  • Added validated support for the following models: Qwen3-VL, DeepSeek OCR, MiniMax-M2, Ovis, Mistral-Large-3, and Hunyuan V1.
  • Improved performance through backported bug fixes, Mamba kernel improvements, and faster model weight loading.
  • Hardened INC quantization by forcing CPU weight loading, preventing out-of-memory (OOM) errors during model load.
  • Improved UBI/RHEL Docker images and server defaults, and addressed Coverity findings.

New Model Support and Updates

  • Change Qwen3-VL to use HPUMMEncoderAttention (#1060)
  • Enable caching for Qwen3 MoE op (#1068)
  • Fix Qwen3-VL MoE execution failure (#1028)
  • Enable DeepSeek OCR model (#954)
  • Add dotsocr and seedoss (#977)
  • Add MiniMax-M2 support (#964)
  • Add Ovis model support with default buckets (#846)
  • Enable Mistral-Large-3-675B-Instruct-2512 model (#871)
  • Add Hunyuan V1 model support (Dense & MoE bf16/FP8) (#875)

Performance

  • [GAUDISW-246429] hpu_mamba_chunk_scan_combined_varlen improvements (#1074)
  • Improve model weight loading speed (#807)
  • Fix warmup regression (#962)

Attention and KV Cache

  • Instead of changing KV cache shape, transpose state in conv1d (#1065)
  • [GAUDISW-245713] Remove bucket densification for long ctx; Edge buckets only for long ctx (#915)
  • Temporarily disable chunked attention (#981)
  • Multimodal model embedding fixes (#759)
  • [CT] Add FP8 GQA Support (#874)
  • [CT] Fix CT Config to honor fp8_inc KV cache dtype (#929)

Quantization

  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1055)
  • Fix INC patching _gate twice (#955)
  • [GAUDISW-246337] Added config with scale method: maxabs_pcs_pow2 for dynamic quant (#949)

Plugin Core

  • Source use_qk_norm parameter directly from config (#1084)
  • Fix last_chunk_indices calculations (#1023)
  • Fix mamba cumsum padded calculations (#1021)
  • Fix redundant transpose in HPUMambaMixer2 (#1015)
  • Fix HPUMambaMixer2 inheritance dependency (#1016)
  • Add _MAMBA_PAD_BLOCK_ID (#951)
  • Enable OffloadingConnector on HPU. (#827)
  • GPT OSS Integration Code (#887)
  • Fix async scheduler + unified attention failure on Qwen2.5-VL (#931)
  • Fix undefined behavior in copy_blocks when source and destination blocks overlap (#329)
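The overlapping-copy fix (#329) addresses a classic aliasing hazard: if a block is both a copy destination and a later copy source, a naive in-place loop can read already-overwritten data. A minimal illustrative sketch in plain Python (not the actual vLLM-Gaudi implementation, which operates on device tensors):

```python
def copy_blocks(cache, mapping):
    """Copy cache blocks per (src, dst) pairs, safely handling overlap.

    cache: list of blocks (each block a list); mapping: list of (src, dst).
    """
    # Stage every source block up front so an earlier copy cannot clobber
    # a block that a later copy still needs to read.
    staged = {src: list(cache[src]) for src, _ in mapping}
    for src, dst in mapping:
        cache[dst] = list(staged[src])
    return cache

# Overlapping mapping: block 0 -> 1 and block 1 -> 2. Without staging,
# block 2 would receive the freshly overwritten contents of block 1
# instead of its original values.
cache = [[1, 1], [2, 2], [3, 3]]
copy_blocks(cache, [(0, 1), (1, 2)])
# cache is now [[1, 1], [1, 1], [2, 2]]
```

The staging dictionary makes the result independent of the order in which the (src, dst) pairs are processed.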

Serving and Infrastructure

  • Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile (#1056)
  • Improve Docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) (#1041)
  • Add libfdt-devel to UBI Dockerfile (#974)
  • Fix device detection when ENABLE_CONSOLE=true (#963)

Fixes

  • Don't destroy server with logprobs (#1098)
  • Coverity fix including security, null-like values, duplicates and typos (#1094)
  • Fix param mismatch for compute_nixl_compatibility_hash() (#1087)
  • Fix Topk Calculation in GPTOSS (#970)
  • Fix reported version of vLLM (#811)
  • Fixing _compile_region for nested attributes (#956)
  • Fix sampler & TP>1 recompilations (#935)
  • Restore default temperature=0 for the server after #32723 (#1037)
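Since #1037 restores temperature=0 as the server default, clients that relied on a different default should pin the value explicitly in the request rather than depend on server-side behavior. A client-side sketch (the model name is a placeholder, not tied to this release):

```python
import json

# Build an OpenAI-compatible chat request that pins temperature explicitly
# instead of relying on the server default (restored to 0 in this release).
payload = {
    "model": "my-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0,  # greedy decoding; matches the restored default
}
body = json.dumps(payload)
```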

Full Changelog

PR Title Author
#1098 Don't destroy server with logprobs @adobrzyn
#1094 Coverity fix including security, null-like values, duplicates and typos @adobrzyn
#1087 fix param mismatch for compute_nixl_compatibility_hash() @hsubramony
#1060 Change Qwen3VL to use HPUMMEncoderAttention @jiminha
#1068 Enable caching for qwen3 moe op @shepark
#1084 use_qk_norm parameter sourced directly from config @rsmyrek
#1056 Fix RHEL Dockerfile build order and remove obsolete TencentOS Dockerfile @PatrykWo
#1037 Back temperature=0 for server as default after #32723 @iboiko-habana
#1089 Change upstream last_good_commit 89a77b10846fd96273cce78d86d2556ea582d26e @iboiko-habana
#1041 Improve docker autocalc linear recipe for long contexts (cherry-pick to 0.16.0) @nngokhale
#1080 Port of #1050 for CI unblocking @iboiko-habana
#1074 hpu_mamba_chunk_scan_combined_varlen improvements @PatrykWilczewski
#1057 Add ci test for granite-4-h-small to v0.16.0 @microslaw
#1065 Instead of changing kv cache shape, transpose state in conv1d @jmamzax
#1023 Fix last_chunk_indices calculations @jbyczkow
#1021 Fix mamba cumsum padded calculations @jkaniecki
#999 Fix redundant transpose in HPUMambaMixer2 (#1015) @ksmusz
#1019 Fixes for #33559 and #34103 @iboiko-habana
#1055 Force CPU loading for INC quantization to prevent OOM during weight loading @agrabow
#1016 Fix HPUMambaMixer2 inheritance dependency @jbyczkow
#1028 Fix qwen3 vl moe execution failure @shepark
#1042 Adding ci_calibration_smoke_tests.sh into v0.16.0 @iboiko-habana
#971 UBI images improvements @ghandoura
#954 Enable deepseek ocr model @HeJunyan
#977 Add dotsocr and seedoss @tianyuan211
#975 Monkey-patch of Attention.forward @tzielinski-habana
#824 Adjust pre-merge workflow to support merge queue trigger event @bmyrcha
#970 Fix Topk Calculation in GPTOSS @SKRohit
#981 Temporarily disable chunked attention @adobrzyn
#982 adding FIX_FOR_VLLM_CUSTOM to CI @iboiko-habana
#974 Add libfdt-devel (new habanalabs-thunk dependency) to ubi dockerfile @mmuszynskihabana
#930 Fix for individual unit tests @tzielinski-habana
#969 CI cleanup 2 @microslaw
...

vLLM-Gaudi for vLLM-v0.15.1

27 Feb 13:23
v0.15.1
5d6a2db

vLLM Gaudi Plugin v0.15.1 Release Notes

Overview

This release is based on vLLM v0.15.1 and supports Intel® Gaudi® Software v1.23.0.


Highlights

  • Added validated support for Granite 4.0-h and Qwen3-VL (dense and MoE variants) on Intel Gaudi 3. Additionally, added significant Llama 4 stability fixes.
  • Introduced full chunked prefill attention support for HPU, enabling better memory utilization on long sequences (#821).
  • Integrated FlashAttention online merge in Unified Attention for improved prefill performance (#785).
  • Added KV cache sharing support for HPU, enabling more efficient multi-query scenarios (#834).
  • Introduced support for NVIDIA ModelOpt FP8 quantization format for dense models (#890).
  • Added HPU ops for Mamba mixer2, causal conv1d, and SSD combined kernels enabling hybrid SSM-Transformer models, such as Granite 4.0-h (#886, #897).
  • Added back-to-back matmul operation for improved Multi-Latent Attention (MLA) performance (#770).
  • Introduced prefill-side KV layout and block size support for heterogeneous (disaggregated) inference via NIXL (#867).

New Model Support

  • Add validated support for Qwen3-VL-32B-Instruct, Qwen3-VL-32B-Thinking, and Qwen3-VL-235B-A22B variants (Instruct, Thinking, FP8) on Gaudi 3 (#958)
  • Register the Qwen3VLMoeForConditionalGeneration model for Qwen3-VL MoE variants (#958)
  • Add IBM Granite 4.0-h small (hybrid SSM-Transformer) implementation for HPU (#897)

Performance

  • Add FlashAttention online merge in Unified Attention for faster prefill (#785)
  • Add back-to-back (b2b) matmul for improved MLA attention performance (#770)
  • Support loading q_scale and using fp8_fused_sdpa for MLA prefill (#909)
  • Remove bucket densification for long context; apply edge buckets only for long context scenarios (#980)
  • Implement bucket corrector for Mamba chunk size (#886)
  • Revert "skip HPU graphs for long prefills" to restore graph capture on long sequences (#850)
  • Port initialization profiling noop to reduce startup overhead (#979)

Attention & KV Cache

  • Add support for chunked attention on HPU (#821)
  • Add KV cache sharing for HPU (#834)
  • Enable support for prefill-side kv_layout and block_size update for heterogeneous runs (#867)
  • Add new VLLM_HPU_HETERO_KV_LAYOUT environment variable to control heterogeneous KV layout (#867)
  • Add heterogeneous HPU NIXL connector for disaggregated prefill/decode (#867)
  • Add hpu_attention ops module with attention operation implementations (#785)
  • Monkey-patch Attention.forward for HPU-specific behavior (#973)
  • Platform: declare support_hybrid_kv_cache capability (#834)
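The heterogeneous NIXL path above is driven by the new environment variable from #867. A hedged launch sketch; the layout value and serve flags below are placeholders, not documented defaults:

```shell
# Placeholder sketch: VLLM_HPU_HETERO_KV_LAYOUT comes from #867; the value
# and the connector configuration shown are illustrative only -- check the
# plugin documentation for the layouts actually supported.
export VLLM_HPU_HETERO_KV_LAYOUT=<kv-layout>
vllm serve <model> --kv-transfer-config '{"kv_connector": "NixlConnector"}'
```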

Quantization

  • Add support for ModelOpt FP8 quantization format for dense models (#890)
  • Add modelopt to platform supported quantization list (#890)
  • Add dynamic quantization configuration file example (#838)

Plugin Core

  • Register new ops: hpu_attention, hpu_grouped_topk_router, hpu_mamba_mixer2, and hpu_modelopt (#785, #897, #890)
  • Add ops_selector module for HPU operation routing (#897)
  • Add pytorch_implementation module with pure-PyTorch fallback ops (#897)
  • Add causal_conv1d_pytorch and ssd_combined ops for SSM/Mamba support (#897)
  • Add hpu_grouped_topk_router for MoE grouped top-k routing (#897)
  • Source use_qk_norm parameter directly from config (#1035)

Serving & Infrastructure

  • Add GitHub Actions action.yaml for PR detail workflows (#1030)
  • Add CI calibration smoke tests script (#853)
  • Rename and consolidate CI e2e discoverable tests (#840)
  • Fix Jenkins CI for Mistral model tests (#840)
  • Restore temperature=0 as server default after vLLM #32723 (#1038)
  • Backport RHEL/UBI Dockerfile improvements (#1049)

Fixes

  • Fix Llama 4 apply-patches flow, QK flatten positional encoding, and address performance drop (#942)
  • Fix Llama 4 shape mismatch for 32k+ context window (#842, #855)
  • Fix Qwen2.5-VL accuracy regression (#831)
  • Fix Qwen3-VL multimodal model embedding issues (#958)
  • Fix DeepSeek tensor device mismatch (#1029)
  • Force CPU loading for INC quantization to prevent OOM during weight loading (#1005)
  • Fix INC patching _gate twice (#955, #1020)
  • Fix HPU model runner profile_run to work with dynamic kv-cache scales (#852)
  • Fix measurement config file generation in calibrate_model.sh scripts (#853)
  • Revert padding value change for block_list and slot list (#1007)
  • Fix multimodal budget divergence from upstream vLLM (#837)
  • Fix hourly KeyError: <PlatformEnum.OOT: 6> error (#968)
  • Fix torch.compile in data-parallel mode (#722)
  • Correct sliding window enabling logic (#805)
  • Interleaved sliding window fix (#805)
  • Fix Mamba cumsum padded calculations (#1022)
  • Fix redundant transpose in HPUMambaMixer2 (#999, #1014)
  • Fix Qwen3-VL MoE execution failure (#992)
  • Fix last_chunk_indices calculations (#1024)

Security

CVE-2025-69872 (diskcache 5.6.3): vLLM currently depends on diskcache version 5.6.3, which has been reported as affected by CVE-2025-69872. The vulnerability remains unresolved upstream as of the day of this release. According to initial analysis, the vLLM architecture does not expose the vulnerable code path, meaning vLLM is not impacted in practice, despite the dependency being formally flagged.


Deprecation & Breaking Changes

  • Remove tests/models/utils.py to clean up unused test utilities (#864)
  • VLLM_HPU_HETERO_KV_LAYOUT environment variable is now required for heterogeneous (disaggregated) prefill/decode with NIXL (#867)
  • Remove bucket densification for long context workloads; only edge buckets are applied (#980)

Full Changelog

PR Title Author
#805 Interleaved sliding window fix @rsmyrek
#722 DP: Fix for torch.compile @xuechendi
#770 Add b2b matmul @linoybu
#785 Add FlashAttention online merge in Unified Attention @kzawora-intel
#805 Correct sliding window enabling @jbyczkow
#821 Add support for chunked attention @kfojcik-intel
#831 Resolve qwen25 vl accuracy regression @tvoas
#834 KV cache sharing for HPU @jakub-sochacki
#837 Fix diverge from vllm in multiModalBudget @linoybu
#838 Add dynamic quantization configuration file example @dudilester
#840 Jenkins CI fix for Mistral @iboiko-habana
#850 Revert "skip HPU graphs for long prefills" @adobrzyn
#851 Fix for vLLM #32077 @iboiko-habana
#852 Fix HPU model runner profile_run to work with dynamic kv-cache scales @dudilester
#853 Fix measurement config file generation in calibrate_model.sh @nirda7
#864 Remove unused test utils @microslaw
#867 Enable support for prefill side kv_layout and block_size update @yeonsily
#876 Refactor for vLLM #30623 and small fix for #32238 @iboiko-habana
#886 Implement bucket corrector for Mamba chunk size @jbyczkow
#890 Support for modelopt FP8 quantization format for dense models @skavulya
#897 HPU Granite 4.0-h small implementation @jbyczkow
#905 CODEOWNERS update @kzawora-intel
#909 Support loading q_scale and using fp8_fused_sdpa for MLA prefill @lkk12014402
#917 Fix for hourly KeyError: PlatformEnum.OOT @tzielinski-habana
#920 Update compatibility matrix and refine installation instructions @PatrykWo
#942 Llama4 apply patches + QK flatten pos + perf drop fix @Luca-Calabria
#943 Update Dockerfiles and documentation for v0.15.1 release @PatrykWo
#958 Qwen3_VL - multimodal model embedding fixes @slokesha
#968 Fix for hourly KeyError: PlatformEnum.OOT: 6 @tzielinski-habana
#973 Monkey-patch Attention.forward @tzielinski-habana
#979 Port: Initialization profiling noop @adobrzyn
#980 Remove bucket densification for long ctx; Edge buckets only @kfojcik-intel
#1003 Remove duplicate path @adobrzyn
#1005 Force CPU loading for INC quantization to prevent OOM @kamil-kaczor
#1007 Revert padding value change for block_list and slot list @kamil-kaczor
#1020 Fix INC patching _gate twice @kamil-kaczor
#1029 Fix tensor device mismatch in deepseek @kamil-kaczor
#1030 Adding action.yaml @iboiko-habana
#992 Fix qwen3 vl moe execution failure @shepark
#1014 Fixing redundant transpose in HPUMambaMixer2 @ksmusz
#1022 Fix mamba cumsum padded calculations @jkaniecki
#1024 last_chunk_indices calculations fix @jbyczkow
#1035 use_qk_norm parameter sourced directly from config @rsmyrek
#1038 Back temperature=0 for server as default @iboiko-habana
#1049 Backport RHEL/UBI Dockerfile improvements @PatrykWo

New Contributors

Welcome to the following first-time contributors to vLLM Gaudi Plugin! 🎉

vLLM-Gaudi for vLLM-v0.14.1

06 Feb 16:27
c4ecd71

Highlights

This version is based on vLLM 0.14.1 and supports Intel® Gaudi® v1.23.0.

The release enables support for Qwen3-VL and adds initial support for Granite 4.0-h.

What's Changed

New Contributors

Full Changelog: v0.13.0.post1...v0.14.1

vLLM-Gaudi v0.13.0.post1

23 Jan 13:32
906abe3

This version is a hotfix release on top of vLLM-Gaudi for vLLM-v0.13.0.

What's Changed

Full Changelog: v0.13.0...v0.13.0.post1

vLLM-Gaudi for vLLM-v0.13.0

09 Jan 13:13
de3d735

Highlights

This version is based on vLLM 0.13.0 and supports Intel® Gaudi® v1.23.0.

The release includes experimental dynamic quantization for MatMul and KV‑cache operations. This feature improves performance, with minimal expected impact on accuracy. To enable the feature, see the Dynamic Quantization for MatMul and KV‑cache Operations section.
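The enablement flow lives in the section referenced above; as a rough sketch, INC-based quantization on Gaudi is configured through a JSON file pointed to by the QUANT_CONFIG environment variable. The field names below follow common Intel Neural Compressor FP8 config conventions and should be treated as assumptions, not this release's exact schema:

```shell
# Hedged sketch of enabling INC-based quantization via a config file; the
# JSON fields mirror typical Intel Neural Compressor FP8 configs and may
# not match this release exactly -- see the Dynamic Quantization section.
cat > dynamic_quant.json <<'EOF'
{
    "method": "HOOKS",
    "mode": "QUANTIZE",
    "observer": "maxabs",
    "scale_method": "maxabs_pcs_pow2",
    "dump_stats_path": "./inc_output/measure"
}
EOF
export QUANT_CONFIG=dynamic_quant.json
vllm serve <model> --quantization inc
```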

This release also introduces support for the following models on Gaudi 3:

Additionally, the following models were successfully validated:

For the list of all supported models, see Validated Models.

Known Bugs

  • At long contexts (≥32k), Llama‑4 (MoE; Scout/Maverick) intermittently hits RuntimeError: shape mismatch in attention/KV cache paths at the prefill→decode boundary; see #680 and #684. A fix will be delivered in a future release.

What's Changed

vLLM-Gaudi for vLLM-v0.11.2

03 Dec 10:52
f9b6446

Highlights

This version is based on vLLM 0.11.2 and supports Intel® Gaudi® v1.22.2 and Intel® Gaudi® v1.23.0.

This release introduces the production-ready vLLM Hardware Plugin for Intel® Gaudi®, a community-driven integration layer based on the vLLM v1 architecture. It enables efficient, high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators. The plugin is an alternative to the vLLM fork, which reaches end of life with this release and will be deprecated in v1.24.0, remaining functional only for legacy use cases. We strongly encourage all fork users to begin planning their migration to the plugin.

The plugin provides feature parity with the fork, including mature, production-ready implementations of Automatic Prefix Caching (APC) and async scheduler. Two legacy features - multi-step scheduling and delayed sampling - have been discontinued, as their functionality is now covered by the async scheduler.

For more details on the plugin's implementation, see Plugin System.

To start using the plugin, follow the Basic Quick Start Guide and explore the rest of this documentation.
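For reference, a minimal installation sketch: the clone URL is the project repository, while the editable-install step is an assumption about the plugin's packaging. The Basic Quick Start Guide remains the authoritative procedure:

```shell
# Hedged sketch, not the authoritative procedure -- consult the Basic Quick
# Start Guide. Assumes a Gaudi-enabled environment with vLLM installed.
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
pip install -e .   # editable install of the plugin (assumed packaging)
```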

What's Changed

vLLM-Gaudi for vLLM-v0.10.1

18 Aug 14:24
ab65f9b

Add t.compile config (#62)

Signed-off-by: Kacper Pietkun <kpietkun@habana.ai>

vLLM-Gaudi for vLLM-v0.10.0

01 Aug 18:19

update README to use v0.10.0 vllm

Signed-off-by: Chendi.Xue <chendi.xue@intel.com>