Commit 0fb5ca6

Author: Guo, Xiang1 (committed)
Parent: e298d2f

src: llama-graph: MLA KV cache: fix split-graph backend assignment when the KV cache is stored on the CPU

src/llama-graph.cpp (4 additions, 0 deletions)

@@ -1156,6 +1156,10 @@ ggml_tensor * llm_graph_context::build_attn_mha(
         // for MLA with the absorption optimization, we need to "decompress" from MQA back to MHA
         if (v_mla) {
             kqv = ggml_mul_mat(ctx0, v_mla, kqv);
+            // all nodes between the KV store and the attention output are run on the CPU
+            if (!cparams.offload_kqv) {
+                ggml_backend_sched_set_tensor_backend(sched, kqv, backend_cpu);
+            }
         }

         cur = ggml_permute(ctx0, kqv, 0, 2, 1, 3);
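Why the pin is needed: with offload_kqv disabled, the KV cache buffers live in host memory and, per the comment in the diff, all nodes between the KV store and the attention output run on the CPU. The MLA path inserts an extra mul_mat (the v_mla "decompress" from MQA back to MHA) into that stretch of the graph, and without an explicit backend assignment the scheduler's graph split could place that node on a GPU backend, bouncing the CPU-resident data through the device. What follows is a minimal standalone sketch of the pinning mechanism, not llama.cpp code: the tensor shapes and the single-backend setup are illustrative, and details such as the arity of ggml_backend_sched_new and the header that declares ggml_backend_cpu_init differ between ggml versions.

#include "ggml.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"   // ggml_backend_cpu_init() lives here in recent ggml; older trees declare it in ggml-backend.h

int main(void) {
    // a real setup would register a GPU backend ahead of the CPU one;
    // a single CPU backend keeps the sketch short
    ggml_backend_t backend_cpu = ggml_backend_cpu_init();
    ggml_backend_t backends[]  = { backend_cpu };

    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead()*8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,   // the scheduler allocates the tensor buffers
    };
    struct ggml_context * ctx = ggml_init(params);

    // stand-ins for v_mla and kqv; the 4x4 shapes are arbitrary
    struct ggml_tensor * a   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * b   = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 4);
    struct ggml_tensor * out = ggml_mul_mat(ctx, a, b);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, out);

    // 5-argument form; some ggml versions take an extra trailing bool (op_offload)
    ggml_backend_sched_t sched =
        ggml_backend_sched_new(backends, NULL, 1, GGML_DEFAULT_GRAPH_SIZE, false);

    // the point of the commit: pin the node to the CPU backend *before* the
    // scheduler splits the graph, so the split can never move it off the CPU
    ggml_backend_sched_set_tensor_backend(sched, out, backend_cpu);

    ggml_backend_sched_alloc_graph(sched, gf);
    // (a real program would upload data into a and b here with ggml_backend_tensor_set)
    ggml_backend_sched_graph_compute(sched, gf);

    ggml_backend_sched_free(sched);
    ggml_free(ctx);
    ggml_backend_free(backend_cpu);
    return 0;
}

The same call is what the diff applies to kqv inside build_attn_mha; there the scheduler also manages GPU backends, so the pin is what keeps the decompress matmul co-located with the CPU-resident KV data.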
