Why doesn't the backward of the fused attention kernel account for the normalization constant in the softmax function? #4629
Unanswered
jeffwillette asked this question in Q&A
Replies: 0 comments
The fused attention tutorial (https://triton-lang.org/main/getting-started/tutorials/06-fused-attention.html#sphx-glr-getting-started-tutorials-06-fused-attention-py) shows a backward implementation.
In this implementation, when calculating the derivative with respect to $V$, we would expect to see the attention matrix $A$, since we are calculating $\frac{\partial}{\partial V} AV = A$. However, the implementation appears to ignore the normalization constant (the sum over the rows) of the attention matrix, and just computes $A$ as $\exp(QK^\top - \max(QK^\top, \text{dim}=1))$ instead of $\frac{\exp(QK^\top - \max(QK^\top, \text{dim}=1))}{\sum(\exp(QK^\top - \max(QK^\top, \text{dim}=1)), \text{dim}=1)}$.
triton/python/tutorials/06-fused-attention.py, lines 235 to 245 at commit 0e3cadd
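For concreteness, here is a minimal eager-mode sketch of the $dV$ term I have in mind (the function name and the exact scaling are my own, not taken from the tutorial): with $O = AV$ and $A$ the row-normalized softmax, the gradient with respect to $V$ is $A^\top \, dO$, and the row-wise normalization constant is part of $A$.

```python
import torch

def eager_attention_dv(q, k, do):
    # Hypothetical helper (not from the tutorial): compute the dV term of
    # eager attention, O = A V with A = softmax(Q K^T / sqrt(d)).
    d = q.shape[-1]
    scores = (q @ k.transpose(-2, -1)) / d**0.5
    scores = scores - scores.max(dim=-1, keepdim=True).values  # subtract row max
    p = scores.exp()                                           # unnormalized attention
    a = p / p.sum(dim=-1, keepdim=True)                        # row-wise normalization
    return a.transpose(-2, -1) @ do                            # dV = A^T @ dO
```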
Tests pass, and this appears to be equivalent to the eager attention backward. My question is: why? Is there a line I am missing that incorporates the normalization constant, or is it safely ignored because it doesn't change the output that much?
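As a quick sanity check (again my own sketch, not part of the tutorial's test suite), autograd on the eager formulation does produce a $dV$ that contains the normalized attention matrix:

```python
import torch

torch.manual_seed(0)
q, k = torch.randn(2, 4, 16, 64), torch.randn(2, 4, 16, 64)
v = torch.randn(2, 4, 16, 64, requires_grad=True)
do = torch.randn_like(v)

# Eager attention forward, then backprop an arbitrary upstream gradient dO.
a = torch.softmax((q @ k.transpose(-2, -1)) / 64**0.5, dim=-1)
out = a @ v
out.backward(do)

# v.grad matches A^T @ dO, where A is the *normalized* attention matrix.
assert torch.allclose(v.grad, a.transpose(-2, -1) @ do, atol=1e-5)
```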