Commit 966a6ce

ikawrakow and Iwan Kawrakow authored
Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <[email protected]>
1 parent 931f04a commit 966a6ce

1 file changed (+9, -1 lines)

ggml/src/ggml-cuda.cu

Lines changed: 9 additions & 1 deletion
@@ -2681,7 +2681,15 @@ static bool ggml_cuda_up_gate_unary(ggml_backend_cuda_context & ctx, ggml_tensor
 
     ggml_tensor dst_row = *dst;
 
-    if (src1->ne[2] <= 2048 && // TODO: this depends on number of total vs number of active experts -> need to find optimum threshod
+    // The heuristics src1->ne[2] <= 32*src0->ne[2] to use the mul_mat_id implementation instead of the original version
+    // is derived from
+    //    * DeepSeek-Lite: 64 total, 6 active experts
+    //    * GPT-OSS-20B  : 32 total, 4 active experts
+    //    * Qwen3-30B-A3B: 128 total, 8 active experts
+    // My original hypothesis was that it is dependent on the total/active experts ratio, but from these 3 it
+    // looks like it really depends just on the total number of experts.
+    // TODO: verify with more models, or perhaps make the magic constant '32' to be defined via a compile time define.
+    if (src1->ne[2] <= 32*src0->ne[2] &&
         ggml_is_quantized(src0_1->type) && src0_1->type == src0_2->type && src1->ne[1] == 1 && src1->ne[3] == 1 &&
         ggml_cuda_can_use_mmq_id(src0_1->type, ggml_cuda_info().devices[ctx.device].cc, src1->ne[2])) {
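
The new condition scales the cutoff with the expert count instead of using a fixed 2048. Below is a minimal sketch of that arithmetic for the three models named in the comment. It assumes, following ggml's usual mul_mat_id layout, that src1->ne[2] is the number of tokens in the batch and src0->ne[2] is the total number of experts. The macro MMQ_ID_TOKENS_PER_EXPERT and the helpers prefer_mmq_id and mmq_id_cutoff are hypothetical names introduced for illustration only. They are not part of the commit, and the sketch covers only the first clause of the if condition.

```cpp
#include <cstdint>

// Hypothetical compile-time knob, along the lines of the TODO in the commit
// comment. The name is an assumption and does not appear in the source.
#ifndef MMQ_ID_TOKENS_PER_EXPERT
#define MMQ_ID_TOKENS_PER_EXPERT 32
#endif

// Hypothetical helper mirroring only the first clause of the new condition,
// src1->ne[2] <= 32*src0->ne[2]. Assumes n_tokens corresponds to src1->ne[2]
// and n_experts_total to src0->ne[2].
constexpr bool prefer_mmq_id(int64_t n_tokens, int64_t n_experts_total) {
    return n_tokens <= MMQ_ID_TOKENS_PER_EXPERT * n_experts_total;
}

// Largest token count for which the clause above still holds.
constexpr int64_t mmq_id_cutoff(int64_t n_experts_total) {
    return MMQ_ID_TOKENS_PER_EXPERT * n_experts_total;
}

// Cutoffs implied for the models listed in the commit comment.
static_assert(mmq_id_cutoff(64)  == 2048, "DeepSeek-Lite: 64 total experts");
static_assert(mmq_id_cutoff(32)  == 1024, "GPT-OSS-20B: 32 total experts");
static_assert(mmq_id_cutoff(128) == 4096, "Qwen3-30B-A3B: 128 total experts");

// The boundary is inclusive, matching the <= in the condition.
static_assert(prefer_mmq_id(2048, 64) && !prefer_mmq_id(2049, 64), "inclusive cutoff");
```

Under these assumptions the previous fixed threshold of 2048 coincides with the 64-expert case, while the 32-expert model switches at 1024 and the 128-expert model at 4096.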