-
I find it hard to understand how the gated delta rule is exactly computed. Specifically, where can I find the exact computation or formula for the alpha parameter in the gated delta rule? Is this described in any related paper?

(See LLMs-from-scratch/ch04/08_deltanet/README.md, lines 187 to 192 at 488bef7, where alpha seems to be called …)

Additionally, for Qwen3-Next it seems that alpha and beta have activation functions applied. Are these different from what is shown in the explanatory image?
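For reference, my current reading of the gated state update (this is my own assumption, pieced together from the Gated DeltaNet paper, and may well be part of what I am getting wrong) is that the scalar decay gate $\alpha_t$ simply rescales the previous state before the delta-rule write:

$$S_t = \alpha_t\, S_{t-1}\bigl(I - \beta_t\, k_t k_t^{\top}\bigr) + \beta_t\, v_t k_t^{\top}$$

with $\alpha_t \in (0, 1)$ and $\beta_t \in (0, 1)$. What I am missing is how exactly $\alpha_t$ itself is computed.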
-
You might like the original Gated DeltaNet code here (based on the LitGPT library I helped develop a few years ago): https://github.com/NVlabs/GatedDeltaNet/blob/main/lit_gpt/gated_delta_net.py
-
In the LitGPT code, I think they called it `gk` for "gate for step k" (whereas it is "alpha for step t" in the paper). But if you consider `gk.float().exp()` later, I think that corresponds to the paper's $\alpha_t$.

In my code I am calling it alpha:

```python
alpha = -self.A_log.exp().view(1, 1, -1) * F.softplus(
    self.W_alpha(x) + self.dt_bias
)
```

But this is more of a pre-alpha. The real alpha comes later in

```python
S = S * a_t.exp()
```

Maybe to make this clear, I could rename it as follows?

```python
alpha_log = -self.A_log.exp().view(1, 1, -1) * F.softplus(self.W_alpha(x) + self.dt_bias)
alpha = alpha_log.exp()
```
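To illustrate the pre-alpha vs. alpha distinction outside the full module, here is a minimal, self-contained sketch. The tensor shapes and `a_proj` (standing in for `self.W_alpha(x)`) are assumptions for illustration, not the actual chapter code:

```python
import torch
import torch.nn.functional as F

# Assumed shapes for illustration: batch b, sequence length T, num_heads heads.
b, T, num_heads = 2, 6, 4
a_proj = torch.randn(b, T, num_heads)   # stand-in for self.W_alpha(x)
A_log = torch.randn(num_heads)          # learned per-head log decay rate
dt_bias = torch.zeros(num_heads)        # learned bias on the time step

# "pre-alpha": always negative, since A_log.exp() and softplus(...) are both positive
alpha_log = -A_log.exp().view(1, 1, -1) * F.softplus(a_proj + dt_bias)

# the actual decay gate alpha_t: exp of a negative number, hence strictly in (0, 1)
alpha = alpha_log.exp()
assert (alpha > 0).all() and (alpha < 1).all()

# In the recurrence, alpha_t scales the previous state before the delta-rule write,
# i.e. schematically: S = alpha_t * S_prev, then the beta-weighted k/v update is applied.
```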
-
From my limited experience of having re-implemented Qwen3-Next and my understanding, you are right.

Edit: beta is indeed activated by a sigmoid, which isn't shown in the Qwen image. Alpha can be considered "activated" as well, since it is also squashed, just not by a traditional sigmoid. The image you linked from Songlin is just DeltaNet.

As for the alpha formula, it's derived from eq. 4 of this paper: https://arxiv.org/abs/2312.00752. It's basically what Sebastian wrote above.

```python
import torch


def compute_alpha_factor(log_A, a, dt_bias):
    """
    Calculates the state decay factor alpha following the Qwen3-Next/SSM-style formula.

    Alpha is the exponential decay factor applied to the previous state memory in the
    gated delta rule; it controls how much of the previous state memory is kept or forgotten.

    alpha = e^(-A * Δt) (can be seen as e^(-rate * time)) where A > 0 and Δt > 0:
    - A is learned as log_A and then exponentiated (e^log_A) to ensure positivity.
    - Δt is passed through a softplus to ensure positivity.
    Both positivity constraints ensure that alpha = e^(-A * Δt) is always in (0, 1).

    Δt is the result of the affine function a + dt_bias, with "a" playing the role of Wx
    (this makes Δt, and therefore the decay, dynamic per token).
    Δt represents how long the decay is applied (the time step).

    Args:
        log_A: (num_v_heads,) base (log) decay rate per value head (constant per head)
        a: (b, seq_len, num_v_heads) per-token projections onto the value heads (dynamic per token)
        dt_bias: (num_v_heads,) learnable bias for the time step Δt

    Returns:
        alpha: (b, seq_len, num_v_heads) final decay factor per token, in the range (0, 1)
    """
    A = torch.exp(log_A)  # retrieves the positive A from the learned logarithm
    delta_t = torch.nn.functional.softplus(a + dt_bias)  # Δt
    alpha = torch.exp(-A * delta_t)  # e^(-rate * time)
    return alpha
```
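A quick usage sketch to sanity-check the ranges. The shapes and the sigmoid used for beta are assumptions for illustration, following the discussion above, not code taken from Qwen3-Next itself:

```python
import torch

b, seq_len, num_v_heads = 2, 4, 8                # assumed sizes for illustration
log_A = torch.randn(num_v_heads)                 # learned per-head log decay rate
a = torch.randn(b, seq_len, num_v_heads)         # per-token projection
dt_bias = torch.zeros(num_v_heads)               # learned time-step bias

alpha = compute_alpha_factor(log_A, a, dt_bias)
print(alpha.min().item(), alpha.max().item())    # both strictly inside (0, 1)

# beta, in contrast, is a plain sigmoid of its own per-token projection
beta = torch.sigmoid(torch.randn(b, seq_len, num_v_heads))
```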
