-
I find it hard to understand how the gated delta rule is exactly computed. Specifically, where can I find the exact computation or formula for the alpha parameter in the gated delta rule? Is this described in any related paper?

(See LLMs-from-scratch/ch04/08_deltanet/README.md, lines 187 to 192 at 488bef7, where alpha seems to be called …)

Additionally, for Qwen3-Next it seems that alpha and beta have activation functions applied. Are these different from what is shown in the explanatory image?
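For reference, my current reading of the gated state update (this is my own assumption, pieced together from the Gated DeltaNet paper, and may well be part of what I am getting wrong) is that the scalar decay gate $\alpha_t$ simply rescales the previous state before the delta-rule write:

$$S_t = \alpha_t\, S_{t-1}\bigl(I - \beta_t\, k_t k_t^{\top}\bigr) + \beta_t\, v_t k_t^{\top}$$

with $\alpha_t \in (0, 1)$ and $\beta_t \in (0, 1)$. What I am missing is how exactly $\alpha_t$ itself is computed.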
-
You might like the original Gated DeltaNet code here (based on the LitGPT library I helped develop a few years ago): https://github.com/NVlabs/GatedDeltaNet/blob/main/lit_gpt/gated_delta_net.py
-
In the LitGPT code, I think they called it `gk` for "gate for step k" (whereas it is "alpha for step t" in the paper). But if you consider `gk.float().exp()` later, I think that corresponds to the paper's $\alpha_t$.

In my code I am calling it alpha:

```python
alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(
    self.W_alpha(x) + self.dt_bias
)
```

But this is more of a pre-alpha. The real alpha comes later in

```python
S = S * a_t.exp()
```

Maybe to make this clear, I could rename it as follows?

```python
alpha_log = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
alpha = alpha_log.exp()
```
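To illustrate the pre-alpha vs. alpha distinction outside the full module, here is a minimal, self-contained sketch. The tensor shapes and `a_proj` (standing in for `self.W_alpha(x)`) are assumptions for illustration, not the actual chapter code:

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration: batch b, sequence length T, num_heads heads.
b, T, num_heads = 2, 6, 4
a_proj = torch.randn(b, T, num_heads)   # stand-in for self.W_alpha(x)
A_log = torch.randn(num_heads)          # learned per-head log decay rate
dt_bias = torch.zeros(num_heads)        # learned bias on the time step

# "pre-alpha": always negative, since A_log.exp() and softplus(...) are both positive
alpha_log = -A_log.exp().view(1, 1, -1) * F.softplus(a_proj + dt_bias)

# the actual decay gate alpha_t: exp of a negative number, hence strictly in (0, 1)
alpha = alpha_log.exp()
assert (alpha > 0).all() and (alpha < 1).all()

# In the recurrence, alpha_t scales the previous state before the delta-rule write,
# i.e. schematically: S = alpha_t * S_prev, then the beta-weighted k/v update is applied.
```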
-
From my limited experience of having re-implemented Qwen3-Next and my understanding, you are right.

Edit: beta is indeed activated by a sigmoid, which isn't shown in the Qwen image. Alpha can be considered "activated" as well, since it is also squashed, just not by a traditional sigmoid. The image you linked from Songlin is just DeltaNet.

As for the alpha formula, it's derived from eq. 4 of this paper: https://arxiv.org/abs/2312.00752. It's basically what Sebastian wrote above.

```python
import torch


def compute_alpha_factor(log_A, a, dt_bias):
    """
    Calculates the state decay factor alpha following the Qwen3-Next/SSM-style formula.

    Alpha is the exponential decay factor applied to the previous state memory in the
    gated delta rule; it controls how much of the previous state memory is kept or forgotten.

    alpha = e^(-A * Δt) (can be seen as e^(-rate * time)) where A > 0 and Δt > 0:
    - A is learned as log_A and then exponentiated (e^log_A) to ensure positivity.
    - Δt is passed through a softplus to ensure positivity.
    Both positivity constraints ensure that alpha = e^(-A * Δt) is always in (0, 1).

    Δt is the result of the affine function a + dt_bias, with "a" playing the role of Wx
    (this makes Δt, and therefore the decay, dynamic per token).
    Δt represents how long the decay is applied (the time step).

    Args:
        log_A: (num_v_heads,) base (log) decay rate per value head (constant per head)
        a: (b, seq_len, num_v_heads) per-token projections onto the value heads (dynamic per token)
        dt_bias: (num_v_heads,) learnable bias for the time step Δt

    Returns:
        alpha: (b, seq_len, num_v_heads) final decay factor per token, in the range (0, 1)
    """
    A = torch.exp(log_A)  # retrieves the positive A from the learned logarithm
    delta_t = torch.nn.functional.softplus(a + dt_bias)  # Δt
    alpha = torch.exp(-A * delta_t)  # e^(-rate * time)
    return alpha
```
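A quick usage sketch to sanity-check the ranges. The shapes and the sigmoid used for beta are assumptions for illustration, following the discussion above, not code taken from Qwen3-Next itself:

```python
import torch

b, seq_len, num_v_heads = 2, 4, 8                # assumed sizes for illustration
log_A = torch.randn(num_v_heads)                 # learned per-head log decay rate
a = torch.randn(b, seq_len, num_v_heads)         # per-token projection
dt_bias = torch.zeros(num_v_heads)               # learned time-step bias

alpha = compute_alpha_factor(log_A, a, dt_bias)
print(alpha.min().item(), alpha.max().item())    # both strictly inside (0, 1)

# beta, in contrast, is a plain sigmoid of its own per-token projection
beta = torch.sigmoid(torch.randn(b, seq_len, num_v_heads))
```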
