@ikawrakow
Owner

This is a follow-up to #728.

For large enough u-batches, the original implementation of the fused ffn_up_exps+ffn_gate_exps op becomes faster than the mmq_id implementation added in #728. In #728, a fixed threshold of u-batch = 2048 was used to switch to the original implementation. I have now compared the speed of the original and mmq_id implementations for 3 models with different numbers of total and active experts, and the best heuristic appears to be to use mmq_id for u-batch <= 32 * total_experts. This PR makes this simple change.
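As a rough illustration of the rule described above, here is a minimal sketch. The function and constant names are hypothetical and are not the identifiers used in the repository; only the factor 32 and the `<=` comparison come from the description.

```cpp
#include <cstdint>

// Hypothetical names for illustration only; the real code may be organized differently.
// Factor from the PR description: mmq_id is used while u-batch <= 32 * total_experts.
constexpr std::int64_t kUbatchPerExpert = 32;

// Largest u-batch for which the mmq_id path would still be chosen.
constexpr std::int64_t mmq_id_max_ubatch(std::int64_t n_total_experts) {
    return kUbatchPerExpert * n_total_experts;
}

// True  -> use the mmq_id implementation of the fused up+gate op.
// False -> fall back to the original implementation.
constexpr bool use_mmq_id(std::int64_t n_ubatch, std::int64_t n_total_experts) {
    return n_ubatch <= mmq_id_max_ubatch(n_total_experts);
}
```

Under this rule, a model with 64 total experts switches to the original implementation above u-batch = 2048, while a model with 128 total experts keeps using mmq_id up to u-batch = 4096.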

@ikawrakow ikawrakow merged commit 966a6ce into main Aug 27, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025