kv-cache : improve defrag logic

**Note: This issue was copied from [https://github.com/ggml-org/llama.cpp/issues/13497](https://github.com/ggml-org/llama.cpp/issues/13497)**

**Original Author:** @ggerganov
**Original Issue Number:** #13497
**Created:** 2025-05-13T08:09:25Z

---

Following the optimization in #13493, I realized that the defragmentation logic can be made much better, which would further improve the Flash Attention masking.

Currently, we defrag a cache like the one below as follows (each digit is a cell used by that sequence, `.` is an empty cell):

```
# before defrag
00000000...11111.......2222222....2010212012012....

# after defrag
000000001111122222222010212012012..................
```

I.e. we only "fill" the holes, but the cells of each sequence remain scattered. We can do better by also making each sequence contiguous:

```
# new defrag
000000000000111111111222222222222..................
```

By doing so, the [FA-vec masking logic](#13493) will remain effective even after many generations.
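
To make the difference concrete, here is a minimal toy sketch that reproduces the diagrams above. It is not the llama.cpp implementation: the string representation, the function names `defrag_fill_holes` and `defrag_group_by_seq`, and the assumption that each cell belongs to exactly one sequence are all made up for illustration.

```python
# Toy model (hypothetical, not llama.cpp code): the cache is a string,
# one character per cell. '.' is an empty cell, a digit is the id of the
# sequence that owns the cell. Assumes one sequence per cell.

EMPTY = "."


def defrag_fill_holes(cache: str) -> str:
    """Compact the used cells to the front, keeping their original order."""
    used = [c for c in cache if c != EMPTY]
    return "".join(used) + EMPTY * (len(cache) - len(used))


def defrag_group_by_seq(cache: str) -> str:
    """Compact the used cells and also make each sequence contiguous."""
    used = [c for c in cache if c != EMPTY]
    # sorted() is stable, so cells of the same sequence keep their
    # relative order while sequences are grouped by id.
    grouped = sorted(used, key=int)
    return "".join(grouped) + EMPTY * (len(cache) - len(used))


if __name__ == "__main__":
    before = "00000000...11111.......2222222....2010212012012...."
    print("before:     ", before)
    print("fill holes: ", defrag_fill_holes(before))
    print("group seqs: ", defrag_group_by_seq(before))
```

A real implementation would presumably have to handle cells shared between sequences and minimize the number of cell moves, neither of which this sketch attempts.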
