Simplifies bias-gradient handling by deriving accumulation from the bias sequence-length condition, removing the redundant parameter and related plumbing.
Aligns zero-init of bias buffers with provided tensor options (no forced float), preventing mixed-precision dtype mismatches and improving correctness for MQA/GQA bias shapes.
Streamlines the backward API; no behavior changes are intended beyond the dtype fix.
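
As an illustration of deriving accumulation from the bias shape instead of passing it as a parameter, here is a minimal sketch. The commit does not state the exact condition, so this assumes accumulation is needed when the bias is broadcast along the query sequence dimension (bias `seqlen_q` of 1 while the query is longer). `BiasShape` and `needs_dbias_accumulation` are hypothetical names, not the project's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiasShape:
    # Hypothetical layout for illustration: (batch, heads, seqlen_q, seqlen_k)
    batch: int
    heads: int
    seqlen_q: int
    seqlen_k: int


def needs_dbias_accumulation(bias: BiasShape, seqlen_q: int) -> bool:
    """Decide from shapes alone whether bias gradients must be accumulated.

    Assumption (not confirmed by the commit): a bias with seqlen_q == 1 is
    broadcast over all query rows, so each row's gradient contributes to the
    same bias entry and the contributions have to be summed.
    """
    return bias.seqlen_q == 1 and seqlen_q > 1


# The caller no longer passes an explicit accumulation flag; it is derived.
broadcast_bias = BiasShape(batch=2, heads=4, seqlen_q=1, seqlen_k=128)
full_bias = BiasShape(batch=2, heads=4, seqlen_q=64, seqlen_k=128)
accumulate_broadcast = needs_dbias_accumulation(broadcast_bias, seqlen_q=64)
accumulate_full = needs_dbias_accumulation(full_bias, seqlen_q=64)
```

Deriving the flag this way keeps it from drifting out of sync with the tensor that is actually passed, which is presumably why the separate parameter was considered redundant.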