Commit 05242ff

Co-authored by Iwan Kawrakow

Faster MLA prompt processing (#205)

* Do not allocate / report caches that are not used

  It is either the standard KV cache or MLA cache, not both.

* Rename X_pe to X_rope

  Much easier to follow, at least for my brain, when we have
    X_rope : rotational position encoding
    X_nope : no position encoding
  instead of X_pe and X_nope, where I was wondering wtf is 'pe' and 'nope'.

* WIP

* WIP

* WIP

* WIP

* Warn user when disabling MLA

* MLA: compile time option to not use transposed KV cache

  Cuts KV cache size in nearly half at the expense of slower TG performance for long contexts (it becomes similar to no-MLA).

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
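The "nearly half" figure for the non-transposed option can be checked with a little arithmetic. The sketch below is not code from this commit. It assumes that the MLA cache holds, per token and per layer, one compressed KV vector of size `kv_lora_rank` plus a RoPE part of size `rope_dim`, and that the transposed cache adds a second copy of the `kv_lora_rank` part. The struct, the function name, and that layout are all hypothetical and only serve the illustration.

```cpp
#include <cstddef>

// Hypothetical description of the per-token, per-layer MLA cache entry.
// These names are illustrative and do not come from llama.cpp.
struct MlaDims {
    std::size_t kv_lora_rank; // size of the compressed (latent) KV vector
    std::size_t rope_dim;     // size of the rotary-position part of K
};

// Number of cached elements per token per layer.
// Assumption: the transposed cache stores one extra copy of the
// kv_lora_rank-sized part in addition to the regular MLA cache.
std::size_t mla_cache_elems(const MlaDims & d, bool with_transposed) {
    const std::size_t base = d.kv_lora_rank + d.rope_dim;
    return with_transposed ? base + d.kv_lora_rank : base;
}
```

With DeepSeek-style values `kv_lora_rank = 512` and `rope_dim = 64` (assumed here, not stated in the commit), this gives 1088 elements with the transposed cache and 576 without it, so the cache shrinks to roughly 53% of its size, which is consistent with "nearly half".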

1 parent 1bbb543, commit 05242ff

1 file changed: src/llama.cpp (+156, -169 lines)
