[WIP] Add FP32 softmax support in unified attention#1040
afierka-intel wants to merge 2 commits into `main`.
Conversation
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
🚧 CI Blocked: The main CI workflow was not started for the following reason:
Pull request overview
Adds initial FP32-softmax enablement for the HPU unified attention path by promoting the QK logits computation to float32 under a feature flag and relaxing an existing backend restriction.
Changes:
- Add `fp32_softmax` handling to unified attention partial paths (causal/shared/unique), including optional `out=` buffers for the QK matmul.
- Insert graph breaks in fp32 paths to control compilation boundaries.
- Remove the `fp32 softmax` "unsupported feature" gate for `HPUUnifiedAttentionImpl`.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `vllm_gaudi/extension/unified.py` | Adds FP32-softmax branches for QK logits computation across partial attention routines (causal/shared/unique). |
| `vllm_gaudi/attention/backends/hpu_attn.py` | Removes the fp32 softmax unsupported-feature check for the unified attention backend implementation. |
```python
    torch._dynamo.graph_break()
else:
    attn = torch.matmul(query, key.transpose(-1, -2))
attn = attn.flatten(0, 1)
```
When fp32_softmax is enabled and the matmul output is float32, bias is still passed in the original dtype and then added to attn. Aligning bias to attn.dtype before the add avoids mixed-dtype adds and matches the established handling in extension/ops.py (which casts block/position bias when attn.dtype != bias.dtype).
Suggested change:
```python
attn = attn.flatten(0, 1)
if attn.dtype != bias.dtype:
    bias = bias.to(attn.dtype)
```
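The dtype-alignment pattern this comment suggests can be sketched in numpy (hypothetical shapes; not the vllm-gaudi code): QK logits are promoted to float32 while the bias is still in the reduced model dtype, so the bias is cast to the logits dtype before the add.

```python
import numpy as np

# Hypothetical shapes, numpy stand-in for the PyTorch pattern:
# fp32 QK logits plus a float16 bias, aligned before the add.
rng = np.random.default_rng(0)
query = rng.standard_normal((4, 8)).astype(np.float16)
key = rng.standard_normal((4, 8)).astype(np.float16)
bias = np.zeros((4, 4), dtype=np.float16)

# fp32_softmax path: compute the logits in float32.
attn = query.astype(np.float32) @ key.T.astype(np.float32)

# Align the bias dtype before the add, mirroring the suggested change.
if attn.dtype != bias.dtype:
    bias = bias.astype(attn.dtype)
attn = attn + bias
assert attn.dtype == np.float32
```

The explicit cast makes the intent visible even where the framework's implicit type promotion would produce the same result.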
```python
# TODO: add downcasting attn to original dtype
attn = torch.matmul(attn.unflatten(0, (kv_heads if not is_mla else num_heads, -1)), value).flatten(0, 1)
```
Same as the causal path: with fp32_softmax enabled, the attention weights/value product is currently performed in float32 (TODO mentions missing downcast). This will tend to make the merged output float32 as well. Consider downcasting the exp weights and/or the attn output back to the original dtype (typically value.dtype / model dtype) after computing local_sum/local_max in fp32.
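The downcast the comment asks for can be sketched in numpy (hypothetical shapes, a stand-in for the PyTorch code): keep the max/sum statistics in float32 for numerical stability, then cast the exponentiated weights back to the value dtype before the weighted-V matmul so the partial output stays in the model dtype.

```python
import numpy as np

# Hypothetical shapes; fp32 logits, float16 values (the model dtype).
rng = np.random.default_rng(0)
logits = rng.standard_normal((2, 4, 4)).astype(np.float32)
value = rng.standard_normal((2, 4, 8)).astype(np.float16)

local_max = logits.max(-1, keepdims=True)   # fp32, kept for stability
weights = np.exp(logits - local_max)        # fp32 exp weights
local_sum = weights.sum(-1)                 # fp32 normalizer

weights = weights.astype(value.dtype)       # downcast before attn @ value
out = weights @ value
assert out.dtype == np.float16
```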
```python
    torch._dynamo.graph_break()
else:
    attn = torch.matmul(query, key.transpose(-1, -2))
attn = attn + bias.unsqueeze(1).unsqueeze(1).unsqueeze(1)
```
If fp32_softmax is enabled, attn becomes float32 but bias is still in the original dtype when added here. To avoid mixed-dtype adds (and match the pattern used elsewhere in the codebase), cast this bias term to attn.dtype before adding.
Suggested change:
```python
attn = attn + bias.to(attn.dtype).unsqueeze(1).unsqueeze(1).unsqueeze(1)
```
```python
attn = torch.exp(attn - block_max.unsqueeze(-1))
# TODO: (afierka) add downcasting attn to original dtype
block_sum = attn.sum(-1)
attn = torch.matmul(attn, value)
```
The fp32_softmax path currently leaves the exp(attn) weights and the subsequent attn @ value in float32 (TODO). Besides the output dtype change, this also interacts with block2batch() later since block_mapping_2d is built in query.dtype; mixed-dtype matmul can be problematic on HPU. Consider downcasting back to the original dtype before the matmul with value / block2batch, and/or ensure block_mapping_2d is created in the same dtype as the tensor it multiplies.
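The block2batch dtype concern can be illustrated with a numpy sketch (hypothetical shapes and names; the real `block2batch()` lives in the vllm-gaudi extension): a one-hot block-to-batch mapping matrix merges per-block partial results, and building it in the partials' dtype avoids the mixed-dtype matmul the comment warns about.

```python
import numpy as np

# Hypothetical block2batch-style merge: 6 blocks mapped onto 2 batches.
num_blocks, num_batches, hidden = 6, 2, 4
block_to_batch = np.array([0, 0, 0, 1, 1, 1])
partials = np.ones((num_blocks, hidden), dtype=np.float32)  # fp32 partials

# Build the mapping in the SAME dtype as the tensor it multiplies.
mapping = np.zeros((num_batches, num_blocks), dtype=partials.dtype)
mapping[block_to_batch, np.arange(num_blocks)] = 1.0

merged = mapping @ partials  # no mixed-dtype matmul
assert merged.dtype == partials.dtype
assert merged.shape == (num_batches, hidden)
```

Each batch here sums three blocks of ones, so every entry of `merged` is 3.0; the point is only that `mapping` tracks `partials.dtype` rather than a fixed `query.dtype`.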
```diff
@@ -929,7 +929,6 @@ def __init__(
     'Sliding window': sliding_window is not None,
     'non-GQA attention': num_kv_heads is None,
     'Encoder attn': attn_type != AttentionType.DECODER,
```
This PR removes the fp32 softmax unsupported-feature gate for HPUUnifiedAttentionImpl, but HPUUnifiedMLAImpl still rejects get_config().fp32_softmax later in this same file. If the intention is to support fp32 softmax across unified attention (including MLA), that remaining gate will still raise NotImplementedError for MLA configurations; consider removing or appropriately gating it as well.
Suggested change:
```python
'Encoder attn': attn_type != AttentionType.DECODER,
'fp32 softmax': get_config().fp32_softmax,
```
```python
if get_config().fp32_softmax:
    s_attn = torch.empty(hpu_ops.matmul_shape(q, k.transpose(-1, -2)), dtype=torch.float32, device=q.device)
    s_attn = torch.matmul(q, k.transpose(-1, -2), out=s_attn)
s_attn = s_attn + b.unsqueeze(0).unsqueeze(0)
```
In the use_output_tensor_in_matmulqk path, when fp32_softmax is enabled s_attn becomes float32, but the bias slice b is still in the original dtype. On HPU we already explicitly cast biases in the analogous path in extension/ops.py; doing the same here avoids mixed-dtype adds (and potential kernel/type issues) and makes the intent explicit. Consider casting b.unsqueeze(0).unsqueeze(0) to s_attn.dtype (or float32 when fp32_softmax) before adding.
Suggested change:
```python
bias_term = b.unsqueeze(0).unsqueeze(0).to(s_attn.dtype)
s_attn = s_attn + bias_term
```
```python
# TODO: add downcasting attn to original dtype
```
With fp32_softmax enabled, the code leaves the exp(attn) weights / unnormalized weighted-V in float32 (see TODO). This changes the dtype of the partial outputs and will propagate to unified_attn/unified_mla outputs (e.g., division by a float32 sum yields float32), which can break downstream layers that expect the model dtype and also increases memory/compute. Add an explicit downcast back to the original dtype at an appropriate point (commonly: keep max/sum in fp32 for stability, but cast the exp weights and/or final attention output back to query/value.dtype).
Suggested change:
```python
# Keep max/sum in fp32 for stability when fp32_softmax is enabled,
# but cast the exponentiated attention weights back to the original
# value dtype before the attention matmul to avoid dtype propagation.
if get_config().fp32_softmax and s_attn.dtype == torch.float32:
    s_attn = s_attn.to(v.dtype)
```
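Putting the pieces of this review together, the overall fp32-softmax shape the comments converge on can be sketched end to end in numpy (hypothetical function and shapes; the actual implementation lives in `vllm_gaudi/extension/unified.py`): statistics in float32, output returned in the original model dtype.

```python
import numpy as np

def fp32_softmax_attn(q, k, v):
    """Sketch of the fp32-softmax pattern: fp32 logits and softmax
    statistics, output cast back to the model (value) dtype."""
    logits = q.astype(np.float32) @ k.T.astype(np.float32)
    logits -= logits.max(-1, keepdims=True)      # fp32 stability
    w = np.exp(logits)
    w /= w.sum(-1, keepdims=True)                # fp32 normalization
    return w.astype(v.dtype) @ v                 # back to model dtype

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8)).astype(np.float16)
k = rng.standard_normal((4, 8)).astype(np.float16)
v = rng.standard_normal((4, 8)).astype(np.float16)

out = fp32_softmax_attn(q, k, v)
assert out.dtype == np.float16
```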
No description provided.