Nothing to attend to after combining the causal mask and the padding mask in static batching #808
Replies: 3 comments
-
(Update, 2025/09/06) I finally found this interesting thread, but I'm still curious whether there's a recommended way to handle this problem in modern LLM architectures. Thanks!
-
Hi there, I haven't had a chance to read through the thread you linked yet, but let me share a few thoughts. I recently implemented a batched version with left-padding for Qwen3 here: https://github.com/rasbt/reasoning-from-scratch/blob/main/reasoning_from_scratch/qwen3_batched.py (I recommend looking at a file diff between qwen3.py and qwen3_batched.py to see the relevant lines more easily).

What I had to do there is implement a more numerically stable version of the masked softmax that uses a large negative value instead of `-inf`, so that fully masked rows don't produce `nan`s (and the renormalization below doesn't divide by zero):

```python
# More numerically stable attention
attn_scores = queries @ keys.transpose(2, 3)
# Use a large negative sentinel instead of -inf for a stable softmax when a row is fully masked
attn_scores = attn_scores.masked_fill(mask, -1e9)
attn_scores = attn_scores / (self.head_dim ** 0.5)
attn_weights = torch.softmax(attn_scores, dim=-1)
# Zero out masked positions post-softmax and renormalize so the remaining weights sum to ~1 where possible
attn_weights = attn_weights.masked_fill(mask, 0.0)
denom = attn_weights.sum(dim=-1, keepdim=True).clamp(min=1e-9)
attn_weights = attn_weights / denom
```

It seems to work relatively well, but I am not sure that's the best solution. I'd be happy to hear any suggestions or feedback.
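For reference, the boolean `mask` used above is expected to mark positions to ignore with `True`. Here is a minimal sketch of one way such a combined causal + left-padding mask could be built for a batch; this is not taken from the linked file, and the `attention_mask` convention (1 = real token, 0 = pad) is an assumption for illustration:

```python
import torch

def build_combined_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    # attention_mask: (batch, seq_len), 1 = real token, 0 = (left) padding
    _, seq_len = attention_mask.shape

    # Causal part: True above the diagonal (future key positions are masked)
    causal = torch.triu(
        torch.ones(seq_len, seq_len, dtype=torch.bool, device=attention_mask.device),
        diagonal=1,
    )

    # Padding part: True wherever the key position is a pad token
    padding = attention_mask == 0                      # (batch, seq_len)

    # Combine and add a head dimension: (batch, 1, seq_len, seq_len)
    mask = causal.unsqueeze(0) | padding[:, None, :]   # (batch, seq_len, seq_len)
    return mask.unsqueeze(1)

# Example matching the discussion below: one left-pad token, max length 3
attention_mask = torch.tensor([[0, 1, 1]])
print(build_combined_mask(attention_mask)[0, 0])
# tensor([[ True,  True,  True],    <- pad query row: everything is masked
#         [ True, False,  True],
#         [ True, False, False]])
```

With one left-pad token and max length 3, the first row of the resulting mask is fully `True`, which is exactly the case the stable softmax above guards against.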
-
Thanks @rasbt, that's good to know you are also using a small value; I was myself using
-
Hi all,
I’m implementing the Llama 3 model architecture and ran into a problem with the masking mechanism during the prefill phase.
Since I’m using static batching, all sequences in the batch are padded to the same length. For scaled dot-product attention, I apply both a padding mask and a causal mask.
Consider this example:
Sequence length = 2
Max sequence length in the current batch = 3
Padding mask:
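For this example (one left-pad token followed by two real tokens), the padding mask over the three key positions would be something like:

```
[1, 0, 0]   # 1 = pad position, 0 = real token (illustrative values)
```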
Using the attention bias trick, the combined bias looks like this:
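```
# combined causal + key-padding bias for the example above
# (illustrative reconstruction; pad token at position 0)
[[-inf, -inf, -inf],
 [-inf,    0, -inf],
 [-inf,    0,    0]]
```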
Here, the sequence starts with a single pad token in the first position, so there’s nothing valid to attend to at that step.
Then I add the bias:
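```
# attention scores after adding the bias
# (s_ij = raw scores that survive the mask; illustrative)
[[-inf, -inf, -inf],
 [-inf,  s11, -inf],
 [-inf,  s21,  s22]]
```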
This masks out invalid positions. The issue is that the first row becomes all `-inf`, which means after softmax the result is all `nan`. That propagates forward, making the hidden state at that position `nan` too. Passing this into the next decoder layer is clearly invalid.

My question: Is it reasonable to replace these `nan` outputs with zeros, or is there a standard approach/reference for handling this situation?

Thanks a lot.