Attention Mechanisms
- Implemented: Multi-Head Attention with scaled dot-product attention, from the "Attention Is All You Need" paper (a minimal sketch follows this list)
- Implemented: Grouped Query Attention as in Llama models; unlike the Llama implementation, mine still uses plain scaled dot-product attention (sketch below)
- Implemented: Multi-Head Latent Attention as in DeepSeek models (sketch below)
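
A minimal sketch of multi-head scaled dot-product attention in the spirit of "Attention Is All You Need". The module and parameter names (`d_model`, `n_heads`) are illustrative assumptions, not necessarily this repository's actual API:

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Project and split into heads: (batch, heads, seq, d_head)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```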
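
A minimal sketch of Grouped Query Attention: there are fewer key/value heads than query heads (`n_kv_heads < n_heads`), and each K/V head is shared by a group of query heads. The attention itself is still plain scaled dot product; names are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.d_head)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.d_head)
        self.out_proj = nn.Linear(n_heads * self.d_head, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Share each K/V head across a group of query heads
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```

The point of the smaller K/V projections is a smaller KV cache at inference time while keeping the full number of query heads.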
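
A minimal sketch of Multi-Head Latent Attention as described for DeepSeek models: keys and values are reconstructed from a shared low-rank latent (`kv_lora_rank`), which is what would be cached at inference time. The decoupled RoPE path used by DeepSeek is omitted for brevity, and all names here are illustrative assumptions:

```python
import math
import torch
import torch.nn as nn

class MultiHeadLatentAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, kv_lora_rank: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-project hidden states to a small shared KV latent...
        self.kv_down = nn.Linear(d_model, kv_lora_rank)
        # ...then up-project the latent to per-head keys and values
        self.k_up = nn.Linear(kv_lora_rank, d_model)
        self.v_up = nn.Linear(kv_lora_rank, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        c_kv = self.kv_down(x)  # (b, t, kv_lora_rank): the cacheable latent
        k = self.k_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = (scores.softmax(dim=-1) @ v).transpose(1, 2).contiguous().view(b, t, -1)
        return self.out_proj(out)
```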