
Commit f5ebc35

Fixes attn-bias order; passes window size
Corrects the parenthesization so the scaling by self.A is applied before the transpose when building the attention bias, matching the intended formula and giving the bias the correct broadcasting shape. Also passes the window size into the attention kernel so windowed masking is actually applied.
1 parent 071ab90 commit f5ebc35
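
The change in the first hunk is pure operator precedence, but it determines which axis self.A is broadcast against: in the old expression only F.softplus(dt_states) is transposed before the multiply, so A lines up with the sequence dimension rather than the head dimension. A minimal sketch of the difference, assuming illustrative shapes of [batch, seq_len, num_heads] for dt_states and [num_heads] for A (the actual shapes in modeling_doge.py may differ):

import torch
import torch.nn.functional as F

# Hypothetical shapes chosen for illustration only.
batch, seq_len, num_heads = 2, 5, 4
dt_states = torch.randn(batch, seq_len, num_heads)
A = torch.randn(num_heads)

# Fixed expression: A scales the head dimension first, then the result is
# transposed to [batch, num_heads, seq_len], ready for unsqueeze(-2).
fixed = (A * F.softplus(dt_states)).transpose(-1, -2)
print(fixed.shape)  # torch.Size([2, 4, 5])

# Old expression: only F.softplus(dt_states) is transposed before the multiply,
# so A ([num_heads]) is broadcast against the new last dim (seq_len) instead.
try:
    broken = A * F.softplus(dt_states).transpose(-1, -2)
except RuntimeError as err:
    print("broadcast mismatch:", err)  # raised whenever seq_len != num_heads

When seq_len happens to equal num_heads, the old expression does not error at all; it silently scales along the wrong axis, which is the harder failure mode to notice.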

File tree

1 file changed: +2 −1 lines changed

examples/modeling/modeling_doge.py

Lines changed: 2 additions & 1 deletion
@@ -218,7 +218,7 @@ def forward(
             value_states.transpose(1, 2).reshape(value_states.shape[0], value_states.shape[-2], -1)
         )
         # original formula is exp(A * softplus(delta V)), but for numerical stability, it is changed to A * softplus(delta V)
-        attn_bias = self.A * F.softplus(dt_states).transpose(-1, -2).unsqueeze(-2).to(hidden_states.dtype)
+        attn_bias = (self.A * F.softplus(dt_states)).transpose(-1, -2).unsqueeze(-2).to(hidden_states.dtype)

         attention_interface: Callable = flash_dynamic_mask_attention_forward

@@ -230,6 +230,7 @@ def forward(
             attention_mask=attention_mask,
             attention_bias=attn_bias,
             scale=self.scaling,
+            window_size=self.window_size,
         )

         attn_output = attn_output.reshape(*input_shape, -1).contiguous()
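
The in-diff comment notes that the original formula exp(A * softplus(delta V)) was changed to A * softplus(delta V) for numerical stability, i.e. the bias is kept in log space. That is equivalent as long as the kernel consumes attention_bias additively on the pre-softmax scores; this commit does not show how flash_dynamic_mask_attention_forward uses it, so treat that as an assumption. A toy check of the equivalence:

import torch

# Toy check: adding a log-space bias to the logits before softmax equals
# multiplying the unnormalized weights by exp(bias) and renormalizing,
# without ever computing exp() of the bias explicitly.
torch.manual_seed(0)
scores = torch.randn(1, 4, 1, 5)   # [batch, heads, q_len, kv_len] -- illustrative layout
bias = torch.randn(1, 4, 1, 5)     # layout attn_bias would have after unsqueeze(-2)

additive = torch.softmax(scores + bias, dim=-1)

gated = torch.exp(scores) * torch.exp(bias)
multiplicative = gated / gated.sum(dim=-1, keepdim=True)

print(torch.allclose(additive, multiplicative, atol=1e-6))  # True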
