Memodo: A linear attention solution

Memodo is a linear attention solution that combines the advantages of RWKV and DeltaNet.

Usage

Use memodo.MemodoLayer, which is a subclass of torch.nn.Module.

Memodo uses the General Delta Rule directly:

S -> S * diag(i) + S * a^T * b + c^T * d
return r * S
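
To make the shapes in the update above concrete, here is a minimal, dependency-free sketch of a single step. It is an illustration of the formula as written, not the code in memodo.py. It assumes all vectors are row vectors, that S has shape (n, m) with i, a, b, d of length m and c, r of length n, and that the output is read from the updated state. The function name delta_rule_step is hypothetical.

```python
def delta_rule_step(S, i, a, b, c, d, r):
    """One step of S -> S*diag(i) + S*a^T*b + c^T*d, then return r*S.

    S is an n x m matrix given as a list of rows. Assumed shapes (not taken
    from memodo.py): i, a, b, d have length m; c, r have length n.
    """
    n, m = len(S), len(S[0])
    # S * a^T is a column vector of length n.
    Sa = [sum(S[p][q] * a[q] for q in range(m)) for p in range(n)]
    # Column-wise decay, rank-1 correction (S a^T) b, and rank-1 write c^T d.
    S_new = [
        [S[p][q] * i[q] + Sa[p] * b[q] + c[p] * d[q] for q in range(m)]
        for p in range(n)
    ]
    # Read-out r * S, assumed here to use the updated state.
    out = [sum(r[p] * S_new[p][q] for p in range(n)) for q in range(m)]
    return S_new, out


S0 = [[1.0, 0.0], [0.0, 1.0]]
S1, y = delta_rule_step(
    S0,
    i=[0.5, 0.5], a=[1.0, 0.0], b=[0.0, 1.0],
    c=[1.0, 1.0], d=[1.0, 0.0], r=[1.0, 2.0],
)
print(S1)  # [[1.5, 1.0], [1.0, 0.5]]
print(y)   # [3.5, 2.0]
```

Running the step once per token and carrying S forward gives a cost that is linear in sequence length, since the state has a fixed size.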

With Dynamic Token Shift:

d[t] = sigmoid(silu(lerp(x[t], x[t - 1], w1) * w2) * w3)
x[t] = lerp(x[t], x[t - 1], d[t])
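
The two lines above can be sketched in plain Python as follows. This is only an illustration under several assumptions that the README does not state: lerp follows the torch.lerp convention (start + weight * (end - start)), w1, w2 and w3 are per-channel vectors applied elementwise (in the actual layer they may be matrices), the token before the first one is all zeros, and x[t - 1] refers to the original, unshifted input. The function name dynamic_token_shift is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def silu(z):
    return z * sigmoid(z)


def lerp(start, end, weight):
    # Same convention as torch.lerp: weight 0 gives start, weight 1 gives end.
    return start + weight * (end - start)


def dynamic_token_shift(xs, w1, w2, w3):
    """xs is a list of T token vectors, each a list of C floats.

    w1, w2, w3 are assumed to be per-channel lists of C floats.
    """
    channels = len(w1)
    prev = [0.0] * channels  # assumed zero padding before the first token
    shifted = []
    for x in xs:
        # A static mix of current and previous token feeds the gate.
        mixed = [lerp(x[k], prev[k], w1[k]) for k in range(channels)]
        # d[t]: data-dependent mixing coefficient in (0, 1).
        gate = [sigmoid(silu(mixed[k] * w2[k]) * w3[k]) for k in range(channels)]
        # The final shift uses the dynamic coefficient.
        shifted.append([lerp(x[k], prev[k], gate[k]) for k in range(channels)])
        prev = x  # the next step sees the original token, not the shifted one
    return shifted


xs = [[2.0, 4.0], [6.0, 0.0]]
# With w3 = 0 the gate is exactly 0.5, so each output is the midpoint.
print(dynamic_token_shift(xs, w1=[0.3, 0.7], w2=[1.0, 1.0], w3=[0.0, 0.0]))
# [[1.0, 2.0], [4.0, 2.0]]
```

Because the gate passes through a sigmoid, every output channel stays between the current and the previous token's value.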

And a gated residual:

R -> R + Block(x) * sigmoid(silu(LayerNorm(R) * w1) * w2)
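
As a rough sketch of the residual update above, the snippet below applies the gate to a precomputed block output for a single token. It assumes a LayerNorm without learned scale or bias and with eps = 1e-5, and it treats w1 and w2 as per-channel vectors applied elementwise; the actual layer may use matrices and an affine LayerNorm. The names gated_residual and block_out are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def silu(z):
    return z * sigmoid(z)


def layer_norm(v, eps=1e-5):
    # Normalize over channels; no learned scale or bias in this sketch.
    mean = sum(v) / len(v)
    var = sum((e - mean) ** 2 for e in v) / len(v)
    return [(e - mean) / math.sqrt(var + eps) for e in v]


def gated_residual(R, block_out, w1, w2):
    """R -> R + block_out * sigmoid(silu(LayerNorm(R) * w1) * w2).

    R and block_out are lists of C floats for one token; block_out stands in
    for Block(x). w1 and w2 are assumed per-channel lists of C floats.
    """
    normed = layer_norm(R)
    gate = [sigmoid(silu(n * a) * b) for n, a, b in zip(normed, w1, w2)]
    return [r + o * g for r, o, g in zip(R, block_out, gate)]


# With w2 = 0 the gate is exactly 0.5, so half of the block output is added.
print(gated_residual([1.0, 3.0], [2.0, -4.0], w1=[1.0, 1.0], w2=[0.0, 0.0]))
# [2.0, 1.0]
```

The gate depends only on the residual stream R, so each channel decides how much of the block's output to admit based on what it already holds.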
