Heuristics for mmq_id -> original threshold #734

ikawrakow · 2025-08-27T05:17:30Z

This is a follow-up to #728.

For large enough u-batches, the original implementation of the fused ffn_up_exps+ffn_gate_exps op becomes faster than the mmq_id implementation added in #728. In #728, a fixed threshold of u-batch = 2048 was used to transition to the original implementation. I have now compared the speed of original vs mmq_id for three models with different numbers of total and active experts, and it looks like the best heuristic is to use mmq_id for u-batch <= 32 * total_experts. This PR makes this simple change.
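As a rough illustration of the rule described above, the check could look something like the sketch below. The function name `use_mmq_id` and its parameters are hypothetical and are not taken from the actual diff; only the condition `u-batch <= 32 * total_experts` comes from the description.

```cpp
#include <cstdint>

// Hypothetical helper (name and signature are assumptions, not the PR's code).
// Returns true when the mmq_id path should handle the fused
// ffn_up_exps+ffn_gate_exps op, and false when the original
// implementation should be used instead.
static bool use_mmq_id(int64_t u_batch, int64_t total_experts) {
    // The threshold scales with the total number of experts
    // instead of being a fixed u-batch size.
    return u_batch <= 32 * total_experts;
}
```

Under this rule, a model with 128 total experts would stay on mmq_id up to a u-batch of 4096, while a model with 8 total experts would switch to the original implementation above a u-batch of 256.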

Heuristics for mmq_id -> original threshold

adee949

ikawrakow merged commit 966a6ce into main Aug 27, 2025

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025

Revert "Heuristics for mmq_id -> original threshold (ikawrakow#734)"

f5df544

This reverts commit 966a6ce.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025

Revert "Heuristics for mmq_id -> original threshold (ikawrakow#734)"

55677a8

This reverts commit 966a6ce.
