kv-cache : improve defrag logic #199

@jakexcosme

Note: This issue was copied from ggml-org#13497

Original Author: @ggerganov
Original Issue Number: ggml-org#13497
Created: 2025-05-13T08:09:25Z


Following the optimization in ggml-org#13493, I realized that the defragmentation logic can be improved considerably, which in turn would further benefit the Flash Attention masking.

Currently, a cache like the following is defragmented like this:

# before defrag
00000000...11111.......2222222....2010212012012....

# after defrag
000000001111122222222010212012012..................

I.e. we only "fill" the holes, but the cells of each sequence remain scattered. We can do better by also grouping the cells by sequence, like this:

# new defrag
000000000000111111111222222222222..................

By doing so, the FA-vec masking logic will remain effective even after many generations.
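
To make the difference concrete, here is a minimal toy model of the diagrams above. It is only a sketch and not the llama.cpp implementation: the cache is modelled as a string with one character per cell (a digit is a sequence id, `.` is an empty cell), and the function names `defrag_fill_holes`, `defrag_group_by_seq` and `seq_extent` are hypothetical names chosen for illustration.

```python
EMPTY = "."  # assumption: '.' marks an empty cell, as in the diagrams


def defrag_fill_holes(cells: str) -> str:
    """Toy model of the current behaviour: close the holes, keep cell order."""
    used = [c for c in cells if c != EMPTY]
    return "".join(used) + EMPTY * (len(cells) - len(used))


def defrag_group_by_seq(cells: str) -> str:
    """Toy model of the proposed behaviour: close the holes and group by sequence.

    sorted() is stable, so cells of the same sequence keep their relative order.
    """
    used = sorted(c for c in cells if c != EMPTY)
    return "".join(used) + EMPTY * (len(cells) - len(used))


def seq_extent(cells: str, seq: str) -> int:
    """Number of cells between the first and last cell of `seq`, inclusive."""
    first = cells.index(seq)
    last = cells.rindex(seq)
    return last - first + 1


if __name__ == "__main__":
    before = "00000000...11111.......2222222....2010212012012...."
    filled = defrag_fill_holes(before)
    grouped = defrag_group_by_seq(before)
    print(filled)
    print(grouped)
    # Compare how far sequence 1 is spread out in each layout.
    print(seq_extent(filled, "1"), seq_extent(grouped, "1"))
```

In the grouped layout each sequence occupies exactly as many consecutive cells as it has tokens (9 for sequence 1 in this example), whereas after hole filling the same sequence is still spread across a much wider range. The extent is used here only as a rough proxy for how much of the cache a per-sequence mask has to touch.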
