Commit 7692c67
Update on "implement position encoding for shifted tokens"
AttentionSink uses tokens' positions in the KVCache rather than their positions in the original text. When tokens are shifted in the KVCache, the position embeddings of q and k need to be updated accordingly.
The original [implementation](https://github.com/mit-han-lab/streaming-llm) of AttentionSink with RoPE caches the pre-rotation q and k in the KVCache and applies the position embedding during inference.
This PR adds `RopeWithAttentionSink`. It assumes that q and k are already encoded with their original positions. When we shift tokens, we re-rotate them by the position delta (see the sketch after the list below). This has two benefits:
- It minimizes code changes, since the existing `llama_transformer` applies the RoPE embedding before the KVCache update.
- It avoids a performance regression when tokens are not shifted, because we don't need to reapply the position encoding in the KVCache for them.
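For illustration, here is a minimal, self-contained sketch of the re-rotation idea. It is not the actual `RopeWithAttentionSink` implementation; helper names such as `rerotate_by_delta` are hypothetical. The key observation is that RoPE rotations compose additively, so a key already rotated at its old position can be brought to its new position with a single extra rotation by `new_pos - old_pos`.

```python
# Sketch only: re-applying RoPE by a position delta. Assumes an interleaved-pair
# RoPE layout; names and shapes are illustrative, not the ExecuTorch API.
import torch


def rope_freqs(head_dim: int, positions: torch.Tensor, base: float = 10000.0):
    """Return cos/sin tables of shape (len(positions), head_dim // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]
    return torch.cos(angles), torch.sin(angles)


def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate each (even, odd) pair of the last dim by the given angles."""
    x0, x1 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x0 * cos - x1 * sin
    out[..., 1::2] = x0 * sin + x1 * cos
    return out


def rerotate_by_delta(k_rotated: torch.Tensor, old_pos: torch.Tensor, new_pos: torch.Tensor):
    """Take keys already rotated at old_pos and rotate them once more by
    (new_pos - old_pos), which equals rotating the raw keys at new_pos."""
    head_dim = k_rotated.shape[-1]
    cos, sin = rope_freqs(head_dim, new_pos - old_pos)
    return apply_rope(k_rotated, cos, sin)


# Tiny check: rotating by the delta matches rotating the raw key at new_pos.
head_dim, n_tokens = 8, 4
k_raw = torch.randn(n_tokens, head_dim)
old_pos = torch.arange(4, 4 + n_tokens)   # positions before the shift
new_pos = old_pos - 4                     # positions after evicting 4 tokens
cos_old, sin_old = rope_freqs(head_dim, old_pos)
cos_new, sin_new = rope_freqs(head_dim, new_pos)
k_cached = apply_rope(k_raw, cos_old, sin_old)        # what sits in the KVCache
k_expected = apply_rope(k_raw, cos_new, sin_new)      # what new_pos would produce
assert torch.allclose(rerotate_by_delta(k_cached, old_pos, new_pos), k_expected, atol=1e-5)
```

Only shifted tokens pay for the extra rotation; unshifted tokens keep their cached, already-rotated k unchanged, which is the second benefit listed above.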
Differential Revision: [D65366440](https://our.internmc.facebook.com/intern/diff/D65366440/)
[ghstack-poisoned]
File tree
1 file changed in examples/models/llama/source_transformation (2 additions, 2 deletions)