Memodo is a linear attention solution that combines the advantages of both RWKV and DeltaNet.
Just use memodo.MemodoLayer, which is a subclass of torch.nn.Module.
Memodo uses the General Delta Rule directly:
S -> S * diag(i) + S * a^T * b + c^T * d
return r * S
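As a rough illustration, one step of the update above can be written in dependency-free Python. This is only a sketch of the formula, not Memodo's actual kernel. It assumes that r, i, a, b, c, d are row vectors, that S is an m x n matrix with r and c of length m and i, a, b, d of length n, and that the output is read from the updated state. The function name delta_rule_step is made up for this example.

```python
def delta_rule_step(S, r, i, a, b, c, d):
    """One step of S <- S diag(i) + S a^T b + c^T d, followed by out = r S.

    Shapes are an assumption read off the formula: S is m x n,
    r and c have length m, and i, a, b, d have length n.
    """
    m, n = len(S), len(S[0])
    # S a^T is a column of length m: each row of S dotted with a.
    Sa = [sum(S[p][q] * a[q] for q in range(n)) for p in range(m)]
    new_S = [
        [
            S[p][q] * i[q]   # S diag(i): per-column decay
            + Sa[p] * b[q]   # S a^T b: state-dependent rank-1 correction
            + c[p] * d[q]    # c^T d: rank-1 write of new information
            for q in range(n)
        ]
        for p in range(m)
    ]
    # Read out with r (row vector times matrix).
    out = [sum(r[p] * new_S[p][q] for p in range(m)) for q in range(n)]
    return new_S, out


# Example: start from an empty 2 x 2 state and write one outer product.
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, y1 = delta_rule_step(
    S0, r=[1.0, 0.0], i=[0.5, 0.5], a=[0.0, 0.0], b=[0.0, 0.0],
    c=[1.0, 2.0], d=[3.0, 4.0],
)
```

With a and b set to zero the update reduces to S diag(i) + c^T d, which is plain decayed linear attention. The S a^T b term is what lets the state be edited based on its current content.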
With Dynamic Token Shift:
d[t] = sigmoid(silu(lerp(x[t], x[t - 1], w1) * w2) * w3)
x[t] = lerp(x[t], x[t - 1], d[t])
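The token shift can be sketched the same way. Several details here are assumptions, not taken from the Memodo source: lerp follows the torch.lerp convention (start + weight * (end - start)), w1 is a per-channel vector, w2 and w3 are projection matrices of shape D x h and h x D, x[t - 1] refers to the original (unshifted) previous token, and the token before the first one is all zeros. The helper names are invented for this example.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def silu(v):
    # silu(x) = x * sigmoid(x)
    return [x / (1.0 + math.exp(-x)) for x in v]


def lerp(start, end, weight):
    # Same convention as torch.lerp: start + weight * (end - start).
    return [s + w * (e - s) for s, e, w in zip(start, end, weight)]


def vecmat(v, M):
    # Row vector times matrix.
    return [sum(v[p] * M[p][q] for p in range(len(v))) for q in range(len(M[0]))]


def dynamic_token_shift(xs, w1, w2, w3):
    """Mix each token with its predecessor using a data-dependent weight d[t]."""
    prev = [0.0] * len(xs[0])  # assumption: zero vector before the first token
    shifted = []
    for x in xs:
        # d[t] = sigmoid(silu(lerp(x[t], x[t-1], w1) * w2) * w3)
        d = sigmoid(vecmat(silu(vecmat(lerp(x, prev, w1), w2)), w3))
        # x[t] = lerp(x[t], x[t-1], d[t])
        shifted.append(lerp(x, prev, d))
        prev = x  # the next step sees the original token, not the shifted one
    return shifted
```

When w3 is all zeros, d[t] is 0.5 everywhere and each output is the average of the current and previous token, which is a convenient sanity check.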
And a gated residual:
R -> R + Block(x) * sigmoid(silu(LayerNorm(R) * w1) * w2)
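A minimal sketch of the gated residual for a single token follows. It assumes that LayerNorm has no learned scale or bias and uses eps = 1e-5, that w1 and w2 are projection matrices of shape D x h and h x D, and that the gate multiplies the block output elementwise. The block output is passed in as block_out because the formula does not say how Block is computed. All names are hypothetical.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def silu(v):
    return [x / (1.0 + math.exp(-x)) for x in v]


def vecmat(v, M):
    # Row vector times matrix.
    return [sum(v[p] * M[p][q] for p in range(len(v))) for q in range(len(M[0]))]


def layer_norm(v, eps=1e-5):
    # Assumption: plain normalization without learned affine parameters.
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def gated_residual(R, block_out, w1, w2):
    """R <- R + Block(x) * sigmoid(silu(LayerNorm(R) * w1) * w2)."""
    gate = sigmoid(vecmat(silu(vecmat(layer_norm(R), w1)), w2))
    return [r + b * g for r, b, g in zip(R, block_out, gate)]
```

Because the gate is computed from the residual stream itself, each channel can decide how much of the block output to accept. With w2 all zeros the gate is 0.5 for every channel.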