Implement position encoding for shifted tokens
AttentionSink uses tokens' positions in the KVCache rather than their positions in the actual text. When tokens get shifted in the KVCache, the position embeddings of q and k need to be updated accordingly.
The original [implementation](https://github.com/mit-han-lab/streaming-llm) of AttentionSink with RoPE caches the original (pre-rotation) q and k in the KVCache and applies the position embedding during inference.
This PR adds `RopeWithAttentionSink`. It assumes that q and k are already encoded with their original positions. When we shift tokens, we reapply rope with the position delta (see the sketch after the list below). This has two benefits:
- it minimizes code changes, since the existing `llama_transformer` already applies the rope embedding before the KVCache update
- it avoids a performance regression for tokens that are not shifted, because we don't need to reapply the position encoding for them in the KVCache
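To illustrate the "reapply the position delta" idea, here is a minimal sketch, not the actual `RopeWithAttentionSink` API. It assumes standard RoPE with the interleaved even/odd pairing convention; the names `rope_freqs`, `rerotate_k`, and their signatures are illustrative only and the real llama implementation may use a different layout.

```python
# Hypothetical sketch: since RoPE is a rotation, a k vector already rotated to
# position `old_pos` can be moved to `new_pos` by one extra rotation of
# delta = new_pos - old_pos. Unshifted tokens (delta == 0) get the identity
# rotation, so no work is wasted on them.
import torch

def rope_freqs(head_dim: int, positions: torch.Tensor, base: float = 10000.0):
    # positions: [seq_len]; returns cos/sin of shape [seq_len, head_dim // 2]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]
    return angles.cos(), angles.sin()

def rerotate_k(k: torch.Tensor, old_pos: torch.Tensor, new_pos: torch.Tensor):
    # k: [seq_len, head_dim], already rotated to old_pos.
    delta = new_pos - old_pos
    cos, sin = rope_freqs(k.shape[-1], delta)
    k_even, k_odd = k[..., 0::2], k[..., 1::2]
    out = torch.empty_like(k)
    out[..., 0::2] = k_even * cos - k_odd * sin
    out[..., 1::2] = k_even * sin + k_odd * cos
    return out
```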
Differential Revision: [D65366440](https://our.internmc.facebook.com/intern/diff/D65366440/)
[ghstack-poisoned]