I was comparing the rotary embedding implementation in this repository with the implementations in the official Llama and Deepseek repositories using this Jupyter notebook: link. In the Llama and Deepseek repositories, complex multiplication is used to perform the rotation of the q and k values, whereas it is implemented more explicitly here. Mathematically, I understand these methods are equivalent, since rotating a pair $(x_1, x_2)$ by an angle $\theta$ can be written as

$$(x_1 + i x_2)(\cos\theta + i\sin\theta) = (x_1\cos\theta - x_2\sin\theta) + i\,(x_2\cos\theta + x_1\sin\theta)$$

- LHS: Used in the Llama and Deepseek implementations
- RHS: Used in the GPT-Fast implementation
As demonstrated in the notebook, the complex multiplication approach is significantly faster. Maybe I'm missing something, but is there a reason the explicit method is preferred here?
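For reference, here is a minimal sketch of the two formulations being compared. It is not the exact code from either repository; the shapes and helper names (`apply_rope_complex`, `apply_rope_explicit`) are illustrative, but it shows that the complex-multiplication form and the explicit form produce the same result:

```python
import torch

def apply_rope_complex(x: torch.Tensor, freqs_cis: torch.Tensor) -> torch.Tensor:
    # View the last dimension as complex pairs and rotate via complex multiplication
    # (the Llama/Deepseek-style formulation, LHS above).
    x_c = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    return torch.view_as_real(x_c * freqs_cis).flatten(-2).type_as(x)

def apply_rope_explicit(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # The same rotation written out on the two components explicitly
    # (the gpt-fast-style formulation, RHS above).
    xshaped = x.float().reshape(*x.shape[:-1], -1, 2)
    x1, x2 = xshaped[..., 0], xshaped[..., 1]
    out = torch.stack((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)
    return out.flatten(-2).type_as(x)

# Hypothetical shapes: (batch, seq, heads, head_dim)
B, S, H, D = 2, 16, 4, 64
x = torch.randn(B, S, H, D)
theta = 1.0 / (10000 ** (torch.arange(0, D, 2).float() / D))
angles = torch.outer(torch.arange(S).float(), theta)                 # (S, D/2)
freqs_cis = torch.polar(torch.ones_like(angles), angles)             # complex, (S, D/2)
freqs_cis = freqs_cis.view(1, S, 1, D // 2)
cos = angles.cos().view(1, S, 1, D // 2)
sin = angles.sin().view(1, S, 1, D // 2)

out_complex = apply_rope_complex(x, freqs_cis)
out_explicit = apply_rope_explicit(x, cos, sin)
print(torch.allclose(out_complex, out_explicit, atol=1e-5))          # True
```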