forked from ggml-org/llama.cpp
Note: This issue was copied from ggml-org#13497
Original Author: @ggerganov
Original Issue Number: ggml-org#13497
Created: 2025-05-13T08:09:25Z
Following the optimization in ggml-org#13493, I realized that the defragmentation can be made much better, which would further improve the Flash Attention masking.
Currently we defrag the following cache like this:
# before defrag
00000000...11111.......2222222....2010212012012....
# after defrag
000000001111122222222010212012012..................

I.e. we only "fill" the holes, but the sequences remain scattered. We can do better like this:
# new defrag
000000000000111111111222222222222..................
By doing so, the FA-vec masking logic will remain effective even after many generations.
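
To make the difference between the two layouts concrete, here is a minimal toy sketch in Python that operates on the diagram strings above. It is not the llama.cpp implementation: the function names (`defrag_fill_holes`, `defrag_group_by_seq`, `seq_spans`) are hypothetical, each cell is assumed to belong to exactly one sequence, and the hole-filling model simply reproduces the "after defrag" picture by packing used cells to the front in their existing order.

```python
EMPTY = "."  # an unused cache cell in the diagrams


def defrag_fill_holes(cells: str) -> str:
    """Toy model of the current behavior: pack used cells to the front,
    keeping their existing order, so sequences stay interleaved."""
    used = [c for c in cells if c != EMPTY]
    return "".join(used) + EMPTY * (len(cells) - len(used))


def defrag_group_by_seq(cells: str) -> str:
    """Toy model of the proposed behavior: pack used cells to the front
    and group them by sequence id (sorted() is stable, so the relative
    order of cells within one sequence is preserved)."""
    used = sorted(c for c in cells if c != EMPTY)
    return "".join(used) + EMPTY * (len(cells) - len(used))


def seq_spans(cells: str) -> dict[str, tuple[int, int]]:
    """Return (first index, last index) of the cells of each sequence."""
    spans: dict[str, tuple[int, int]] = {}
    for i, c in enumerate(cells):
        if c == EMPTY:
            continue
        first, _ = spans.get(c, (i, i))
        spans[c] = (first, i)
    return spans


if __name__ == "__main__":
    before = "00000000...11111.......2222222....2010212012012...."
    for name, fn in [("fill holes", defrag_fill_holes),
                     ("group by seq", defrag_group_by_seq)]:
        after = fn(before)
        print(f"{name:>12}: {after}  spans={seq_spans(after)}")
```

With hole filling, every sequence still spans almost the whole used region because of the interleaved tail. With grouping, each sequence occupies one contiguous range, which is presumably what lets the FA-vec masking stay effective.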