In the attention, the positions of the tokens are encoded via RoPE (i.e. rotations of the hidden state). Since the RoPE rotations are additive in the position (rotating by the old position and then by the position delta is the same as rotating by the new position), we can "shift" cached keys by applying RoPE with the delta between the new and old positions. We don't apply it to the values (V) because RoPE is not applied to them in the first place.
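
Here is a minimal NumPy sketch of that additivity property. It is not the llama.cpp code: the `rope` helper, its half-split pairing of dimensions, and the example positions are made up for illustration. It just checks that rotating an already-RoPEd key by the position delta matches applying RoPE at the new position directly.

```python
# Sketch (assumptions: half-split RoPE pairing, base 10000, single head vector).
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply RoPE to one head vector x (even length) at position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per 2-D pair
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:]
    # rotate each (x1_i, x2_i) pair by theta_i
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
k = rng.standard_normal(64)          # un-RoPEd key for one head
old_pos, new_pos = 37, 12            # token moves from slot 37 to slot 12
delta = new_pos - old_pos            # can be negative

k_cached  = rope(k, old_pos)         # what sits in the KV cache
k_shifted = rope(k_cached, delta)    # "shift": RoPE by the position delta only
k_direct  = rope(k, new_pos)         # recompute at the new position

print(np.allclose(k_shifted, k_direct))  # True
```

Because the rotations compose, the shifted keys are exactly what a fresh forward pass would place at the new positions; the difference noted below comes from everything else in the context changing, not from the keys themselves.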

This operation is not mathematically equivalent to recomputing the new context from scratch, but it is much faster and, somewhat surprisingly, produces reasonable results in practice.
