Compilade/refactor kv cache #9846

ykhrustalev · 2024-10-11T13:33:14Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot
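The idea of treating Mamba layers as having 0 KV heads can be sketched as below. This is a minimal illustration, not the PR's code: `LayerParams`, `kv_row_width` and `kv_cache_elements` are hypothetical names, and the only point is that a per-layer KV head count lets recurrent layers contribute nothing to the KV cache size.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-layer hyperparameters; names are illustrative only.
struct LayerParams {
    uint32_t n_head_kv;   // number of KV heads; 0 marks a recurrent (Mamba) layer
    uint32_t n_embd_head; // dimension of each head
};

// Width, in elements, of one K (or V) row for a single layer.
inline size_t kv_row_width(const LayerParams & layer) {
    return static_cast<size_t>(layer.n_head_kv) * layer.n_embd_head;
}

// Total number of K plus V elements needed to cache n_ctx positions across all layers.
inline size_t kv_cache_elements(const std::vector<LayerParams> & layers, size_t n_ctx) {
    size_t total = 0;
    for (const LayerParams & layer : layers) {
        total += 2 * kv_row_width(layer) * n_ctx; // K and V; zero for recurrent layers
    }
    return total;
}
```

With layers `{0, 128}, {8, 128}, {0, 128}, {2, 64}`, only the second and fourth layers take up KV cache space, and they may use different head counts.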

* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
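Contiguous slot allocation for recurrent states might look roughly like the following first-fit search. This is a sketch under assumptions: `find_contiguous_slot` and the `used` bitmap are hypothetical, and the PR's actual cache bookkeeping is not shown in this page. Returning a sentinel when nothing fits is one way to let the caller fail gracefully.

```cpp
#include <cstddef>
#include <vector>

// Find the first run of n_needed consecutive free cells.
// used[i] is true when cell i already holds a sequence's recurrent state.
// Returns the index of the first cell in the run, or used.size() when no run fits.
size_t find_contiguous_slot(const std::vector<bool> & used, size_t n_needed) {
    if (n_needed == 0 || n_needed > used.size()) {
        return used.size();
    }
    size_t run = 0; // length of the current run of free cells
    for (size_t i = 0; i < used.size(); ++i) {
        run = used[i] ? 0 : run + 1;
        if (run == n_needed) {
            return i + 1 - n_needed;
        }
    }
    return used.size();
}
```

A caller would compare the result against `used.size()` and report an error instead of writing states into scattered cells.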

This reduces overhead when running HellaSwag on thousands of sequences with very small (100k-parameter) Mamba models.

Otherwise, running the HellaSwag benchmark with small batch sizes crashed.

This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary before GGML_OP_SSM_CONV is removed.

* ggml : make ggml_ssm_scan not modify its source tensors
* llama : fix shared recurrent tail cell count for small ubatch sizes
  Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.

This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined KV cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also, 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the KV cache isn't really used anyway.) Still, I'm open to better suggestions.

This also slightly reduces the diff from the master branch

compilade added 30 commits April 3, 2024 20:47

wip: llama : separate recurrent states from the KV cache

271104c

This will be necessary to support Jamba (and other recurrent models mixed with Attention). Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache

8db1e4d

llama : state checkpoints for recurrent models

0028010

llama : correctly handle more edge cases for the rs cache

0c8b3b2

Merge branch 'master' into compilade/refactor-kv-cache

d66849f

llama : rename many llama_kv_cache_* functions

a09db95

Merge branch 'master' into compilade/refactor-kv-cache

c460ff1

llama : remove useless return value for some llama_cache_* functions

b6fafd1

Merge branch 'master' into compilade/refactor-kv-cache

b7ec12e

Merge branch 'master' into compilade/refactor-kv-cache

3b57b55

llama : rethink recurrent state cell counts

7e13f19

* llama : begin work on support for variable GQA
  This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
* llama : gracefully fail when not finding hybrid slot

llama : support Jamba

cbc743e

Merge branch 'master' into compilade/refactor-kv-cache

0fd13e9

llama : fix BERT inference without KV cache

61a88a1

convert-hf : check for unprocessed Jamba experts

ea2e63e

convert-hf : support Mini-Jamba conversion

fc59407

llama : fix Jamba quantization sanity checks

181dadf

llama : sequence-length-aware batch splitting

3a414b0

Merge branch 'master' into compilade/refactor-kv-cache

4e4c41e

llama : use equal-sequence-length sub-batches for recurrent models

3587a94

* ggml : simplify SSM-related operators
* llama : make recurrent state slot allocation contiguous
* llama : adapt internal uses of batches to llama_ubatch
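As a rough illustration of what "equal-sequence-length sub-batches" could mean, the sketch below splits queued tokens so that every sequence in a sub-batch contributes the same number of tokens. It is a simplified assumption, not the PR's implementation: `EqualSplit` and `split_equal` are hypothetical names, and `n_ubatch` is taken to be a cap on the total tokens per sub-batch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One sub-batch: every listed sequence contributes exactly n_seq_tokens tokens.
struct EqualSplit {
    std::vector<size_t> seq_ids;
    size_t n_seq_tokens;
};

// lengths[i] is the number of tokens still queued for sequence i.
// n_ubatch caps the total number of tokens in one sub-batch.
std::vector<EqualSplit> split_equal(std::vector<size_t> lengths, size_t n_ubatch) {
    assert(n_ubatch > 0);
    std::vector<EqualSplit> out;
    for (;;) {
        EqualSplit split{{}, 0};
        size_t min_len = 0;
        // Collect sequences that still have tokens, at most n_ubatch of them.
        for (size_t i = 0; i < lengths.size() && split.seq_ids.size() < n_ubatch; ++i) {
            if (lengths[i] == 0) continue;
            min_len = split.seq_ids.empty() ? lengths[i] : std::min(min_len, lengths[i]);
            split.seq_ids.push_back(i);
        }
        if (split.seq_ids.empty()) break;
        // Equal length: the shortest remaining sequence, capped by the token budget.
        split.n_seq_tokens = std::min(min_len, n_ubatch / split.seq_ids.size());
        for (size_t id : split.seq_ids) lengths[id] -= split.n_seq_tokens;
        out.push_back(split);
    }
    return out;
}
```

For lengths `{3, 1, 2}` and `n_ubatch = 4`, this yields three sub-batches covering sequences `{0, 1, 2}`, then `{0, 2}`, then `{0}`, each taking one token per sequence. Equal lengths keep the per-sequence recurrent state updates aligned within a sub-batch.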

Merge branch 'master' into compilade/refactor-kv-cache

5d3c7b9

llama : fix batch split output count for embeddings

72eea49

llama : minimize swaps when reordering logits

18d1c14

This reduces overhead when running HellaSwag on thousands of sequences with very small (100k-parameter) Mamba models.
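One way to keep the number of row swaps low when restoring output order is a selection sort over the output ids, since it swaps at most once per position. The sketch below assumes a row-major logits buffer and a hypothetical helper `reorder_rows`; it illustrates the technique and is not taken from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reorder the rows of a row-major [ids.size() x n_cols] buffer so that ids becomes ascending.
// Selection sort performs at most ids.size() - 1 row swaps, which matters when rows are large.
// Returns the number of row swaps performed.
size_t reorder_rows(std::vector<size_t> & ids, std::vector<float> & rows, size_t n_cols) {
    assert(rows.size() == ids.size() * n_cols);
    size_t n_swaps = 0;
    const size_t n = ids.size();
    for (size_t i = 0; i + 1 < n; ++i) {
        // Find the smallest remaining id.
        size_t j_min = i;
        for (size_t j = i + 1; j < n; ++j) {
            if (ids[j] < ids[j_min]) j_min = j;
        }
        if (j_min == i) continue; // already in place, no swap needed
        std::swap(ids[i], ids[j_min]);
        for (size_t k = 0; k < n_cols; ++k) {
            std::swap(rows[i * n_cols + k], rows[j_min * n_cols + k]);
        }
        ++n_swaps;
    }
    return n_swaps;
}
```

The comparisons are quadratic in the number of rows, but they only touch the small id array; the expensive part, moving whole rows of logits, happens at most once per row.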

llama : fix edge case finding batch seq_id of split recurrent cell

61200ef

Otherwise, running the HellaSwag benchmark with small batch sizes crashed.

llama : avoid copies for simple batch splits

eb589d5

llama : fix .base() compilation error on Windows

17f6c1e

llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL

fee3c1d

* ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
  The implementation already supported it, and this makes Mamba's conv step slightly faster.
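The commit title above refers to expressing Mamba's convolution step with an elementwise multiply and a row sum. The plain C++ sketch below shows that equivalence on raw arrays; it does not use ggml, and the function name and memory layouts are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Depthwise causal convolution written as an elementwise multiply followed by a row sum.
// x: [d_inner][d_conv - 1 + n_tokens], row-major; previous conv state followed by new tokens.
// w: [d_inner][d_conv], row-major; one small kernel per channel.
// Returns y: [n_tokens][d_inner], row-major.
std::vector<float> conv_mul_sum_rows(const std::vector<float> & x, const std::vector<float> & w,
                                     size_t d_inner, size_t d_conv, size_t n_tokens) {
    assert(d_conv >= 1);
    const size_t n_x = d_conv - 1 + n_tokens; // row length of x
    assert(x.size() == d_inner * n_x);
    assert(w.size() == d_inner * d_conv);
    std::vector<float> y(n_tokens * d_inner, 0.0f);
    std::vector<float> prod(d_conv);
    for (size_t t = 0; t < n_tokens; ++t) {
        for (size_t c = 0; c < d_inner; ++c) {
            // "MUL": elementwise product of the sliding window with the channel's kernel.
            for (size_t k = 0; k < d_conv; ++k) {
                prod[k] = x[c * n_x + t + k] * w[c * d_conv + k];
            }
            // "SUM_ROWS": reduce the product row to a single value.
            float sum = 0.0f;
            for (size_t k = 0; k < d_conv; ++k) {
                sum += prod[k];
            }
            y[t * d_inner + c] = sum;
        }
    }
    return y;
}
```

Because each output is just a short dot product per channel, generic multiply and reduce operators can stand in for a dedicated convolution operator, at the cost of materializing the intermediate products.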

Merge branch 'master' into compilade/refactor-kv-cache

6840ac0

compilade added 11 commits June 11, 2024 23:27

examples : replace llama_kv_cache_seq_* with llama_past_seq_*

43d8d4b

Merge branch 'master' into compilade/refactor-kv-cache

ff794f5

mamba : fix non-contiguous usage of ggml_silu

33425a7

Merge branch 'master' into compilade/refactor-kv-cache

10c3c41

Merge branch 'master' into compilade/refactor-kv-cache

9b38f8b

Merge branch 'master' into compilade/refactor-kv-cache

bc320ef

llama : session saving and reloading for hybrid models

fcb889c

Merge branch 'master' into compilade/refactor-kv-cache

a03e32a

convert_hf : fix Jamba conversion

9d3f44d

llama : fix mixed signedness comparison

5f62db7

llama : use unused n_embd_k_gqa in k_shift

375de5b

This also slightly reduces the diff from the master branch

ykhrustalev marked this pull request as draft October 11, 2024 13:33

github-actions bot added the android, examples, python, server, and ggml labels Oct 11, 2024

trigger build

d2b1d0e

ykhrustalev closed this Oct 11, 2024


2 participants
