Commit d385f76

ggml-cuda: add more detailed comments about concurrency
1 parent cfa1a02 commit d385f76

1 file changed: 9 additions, 1 deletion

ggml/src/ggml-cuda/ggml-cuda.cu

Lines changed: 9 additions & 1 deletion
@@ -3764,7 +3764,15 @@ static void ggml_backend_cuda_graph_optimize(ggml_backend_t backend, ggml_cgraph
 }
 }

-//Target Q, K, V
+// Target Q, K, V for concurrency
+// this is a more general way to find nodes which can be candidates for concurrency (although it has not been tested for anything else):
+// 1. find fan-out (fork) nodes where the same input is used at least N times (in QKV, it would be "attn-norm")
+// 2. find the join node, where 2 or more of the outputs are required (in QKV, this would be "KQ" or "flash-attn")
+// 3. account for all branches from the fork to the join
+// 4. To extend lifetimes of the tensors, we interleave the branches (see below for more details)
+// 5. save the original cgraph and restore it in graph_compute, to enable fusion within streams
+// See discussion: https://github.com/ggml-org/llama.cpp/pull/16991#issuecomment-3522620030
 const int min_fan_out = 3;
 const int max_fan_out = 3;
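
Steps 1, 2 and 4 of the new comment can be illustrated on a toy graph. The sketch below is not the code in `ggml_backend_cuda_graph_optimize`. It assumes a hypothetical `toy_node` type that stores only input indices, assumes the nodes are already in evaluation order (every input index is smaller than the index of the node that reads it), and uses a simple round-robin interleave, which is one possible reading of "interleave the branches". All function names are made up for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node: only the indices of its inputs.
struct toy_node {
    std::vector<int> srcs;
};

// Step 1: a fork is a node whose output is read by between min_fan_out and
// max_fan_out other nodes (for QKV this would be the attention norm).
std::vector<int> find_fork_nodes(const std::vector<toy_node> & g, int min_fan_out, int max_fan_out) {
    std::vector<int> fan_out(g.size(), 0);
    for (const toy_node & n : g) {
        for (int s : n.srcs) {
            assert(s >= 0 && (size_t) s < g.size());
            fan_out[s]++;
        }
    }
    std::vector<int> forks;
    for (size_t i = 0; i < g.size(); ++i) {
        if (fan_out[i] >= min_fan_out && fan_out[i] <= max_fan_out) {
            forks.push_back((int) i);
        }
    }
    return forks;
}

// Step 2: the join is the first node whose inputs come from two different
// branches of the fork. Each direct consumer of the fork starts its own branch.
// Returns -1 if no join exists.
int find_join_node(const std::vector<toy_node> & g, int fork) {
    assert(fork >= 0 && (size_t) fork < g.size());
    std::vector<int> branch_of(g.size(), -1); // -1: not downstream of the fork
    for (size_t i = (size_t) fork + 1; i < g.size(); ++i) {
        int branch = -1;
        for (int s : g[i].srcs) {
            const int b = (s == fork) ? (int) i : branch_of[s];
            if (b == -1) {
                continue; // input unrelated to the fork
            }
            if (branch != -1 && b != branch) {
                return (int) i; // two distinct branches meet here
            }
            branch = b;
        }
        branch_of[i] = branch;
    }
    return -1;
}

// Step 4 (assumed round-robin): emit one node from each branch in turn, so that
// nodes of different branches sit next to each other in the final order.
std::vector<int> interleave_branches(const std::vector<std::vector<int>> & branches) {
    size_t longest = 0;
    for (const std::vector<int> & b : branches) {
        longest = std::max(longest, b.size());
    }
    std::vector<int> order;
    for (size_t step = 0; step < longest; ++step) {
        for (const std::vector<int> & b : branches) {
            if (step < b.size()) {
                order.push_back(b[step]);
            }
        }
    }
    return order;
}
```

With `min_fan_out` and `max_fan_out` both set to 3, as in the diff, a graph where node 0 feeds three projection chains is the only shape that qualifies, which matches the Q, K, V target described in the comment.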
