
Commit 7692c67

Update on "implement position encoding for shifted tokens"
In AttentionSink, the positional encoding uses tokens' positions in the KVCache instead of their positions in the actual text. When tokens get shifted in the KVCache, q and k's position embeddings need to be updated. The original [implementation](https://github.com/mit-han-lab/streaming-llm) of AttentionSink with RoPE caches the original q and k in the KVCache and applies the position embedding during inference.

This PR adds `RopeWithAttentionSink`. It assumes that q and k are already encoded with their original positions; when tokens are shifted, we reapply only the position delta (see the sketch below). This has two benefits:

- it minimizes our code, since the existing `llama_transformer` already applies the RoPE embedding before the KVCache update
- it avoids a performance regression when tokens are not shifted, because we don't need to reapply the position encoding in the KVCache for them

Differential Revision: [D65366440](https://our.internmc.facebook.com/intern/diff/D65366440/)

[ghstack-poisoned]
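As an illustration of the delta-rerotation idea, here is a minimal sketch, not the actual `RopeWithAttentionSink` implementation. It assumes an interleaved (even/odd pair) RoPE layout and uses hypothetical names (`rerotate_k`, `theta_base`):

```python
import torch


def rerotate_k(
    k: torch.Tensor,          # (seq_len, n_heads, head_dim), already RoPE-encoded
    original_position: int,   # position the cached k was encoded with
    new_position: int,        # position after the KVCache shift
    theta_base: float = 10000.0,
) -> torch.Tensor:
    """Rerotate cached keys by the position delta instead of re-encoding from scratch."""
    head_dim = k.shape[-1]
    # Per-pair RoPE frequencies: theta_i = theta_base^(-2i / head_dim).
    freqs = theta_base ** (-torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)
    delta = (new_position - original_position) * freqs  # angle to rerotate each pair by
    cos, sin = torch.cos(delta), torch.sin(delta)

    k_even, k_odd = k[..., 0::2], k[..., 1::2]
    out = torch.empty_like(k)
    # 2-D rotation of each (even, odd) feature pair by delta:
    # (cos(delta), -sin(delta); sin(delta), cos(delta)).
    out[..., 0::2] = k_even * cos - k_odd * sin
    out[..., 1::2] = k_even * sin + k_odd * cos
    return out
```

Only the rows that actually move in the cache need this pass; unshifted rows keep their existing encoding, which is where the second benefit above comes from.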
1 parent 51d27f4 commit 7692c67

File tree: 1 file changed (+2, -2 lines)


examples/models/llama/source_transformation/attention_sink.py

Lines changed: 2 additions & 2 deletions
@@ -136,8 +136,8 @@ def forward(
         seq_len: int,
     ):
         """
-        Rerotate keys from original_position to new_position. This is done by rerotating
-        keys with (new_position * theta - original_position * theta) with the following matrix:
+        Rerotate q and k from original_position to new_position. This is done by rerotating q
+        and k with (new_position * theta - original_position * theta) with the following matrix:
         (cos(delta), -sin(delta)
         sin(delta), cos(delta))
         where delta = new_position * theta - original_position * theta
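To make the docstring's formula concrete, here is a small self-contained check (hypothetical scalar values, not repository code) that rotating an already-encoded pair by delta = new_position * theta - original_position * theta lands it in the same place as encoding the raw pair at new_position:

```python
import math


def rotate(x, y, angle):
    """Apply the 2x2 rotation matrix (cos, -sin; sin, cos) to the pair (x, y)."""
    c, s = math.cos(angle), math.sin(angle)
    return c * x - s * y, s * x + c * y


theta = 0.01                             # one RoPE frequency (hypothetical value)
original_position, new_position = 7, 3   # token shifted from cache slot 7 to slot 3
x, y = 0.5, -1.25                        # one (even, odd) feature pair of a key

# Key as it sits in the KVCache: already encoded at original_position.
kx, ky = rotate(x, y, original_position * theta)

# Rerotate by the delta only, as RopeWithAttentionSink does conceptually.
delta = new_position * theta - original_position * theta
rx, ry = rotate(kx, ky, delta)

# Reference: encode the raw pair directly at new_position.
ex, ey = rotate(x, y, new_position * theta)

assert math.isclose(rx, ex) and math.isclose(ry, ey)
```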
