Do you plan to implement Slim Attention? #1428
albertosottile started this conversation in Ideas
Or would you wait for an upstream implementation in llama.cpp?

Context: https://www.reddit.com/r/LocalLLaMA/comments/1j9wkc2/slim_attention_cut_your_context_memory_in_half
Issue in llama.cpp: ggml-org#12359
Implementation in Python: https://colab.research.google.com/github/OpenMachine-ai/transformer-tricks/blob/main/notebooks/slimAttn_paper.ipynb

Replies: 1 comment

Definitely an upstream implementation is preferred.
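For readers landing here without the background: the linked paper's core trick is to cache only K and reconstruct V from it at inference time, roughly halving context memory for standard MHA models. Below is a minimal NumPy sketch of that reconstruction; the dimensions, weights, and variable names are illustrative assumptions, not taken from this project or the linked notebook.

```python
import numpy as np

# Minimal sketch of the Slim Attention idea: cache only K, rebuild V on the fly.
# All sizes and weights here are made up for illustration.
d_model = 64                                   # hypothetical width; MHA, so W_k is square
rng = np.random.default_rng(0)

W_k = rng.standard_normal((d_model, d_model))  # key projection
W_v = rng.standard_normal((d_model, d_model))  # value projection
X = rng.standard_normal((10, d_model))         # 10 cached token embeddings

K = X @ W_k                                    # what a K-only cache stores
V = X @ W_v                                    # what Slim Attention avoids storing

# Since K = X W_k and W_k is square (and in practice invertible) for standard
# MHA, X = K W_k^{-1}, hence V = K (W_k^{-1} W_v). W_kv is precomputed once.
W_kv = np.linalg.inv(W_k) @ W_v
V_from_k = K @ W_kv

print("V reconstructed from K only:", np.allclose(V, V_from_k))
```

This only works when the key projection is square and invertible (standard MHA), which is why the paper and the llama.cpp issue discuss it as a per-architecture optimization rather than a universal one.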