Background
As part of the work in PR #14898, the function `build_moe_ffn_from_probs` was introduced to handle SmallThinker's unique architecture, where the MoE router is positioned before the attention block. This has resulted in unfortunate code duplication with the existing `build_moe_ffn` function.
The Task
As suggested in this code review comment, the proposed solution is to merge the logic of `build_moe_ffn_from_probs` into the main `build_moe_ffn` function (see the sketch after this list). This can be achieved by:
- Modifying `build_moe_ffn` to accept an optional `ggml_tensor * probs` parameter, which defaults to `nullptr`.
- Using this parameter as a toggle:
  - If `probs` is provided, the function should use it directly and skip the internal logits/probs calculation.
  - If `probs` is `nullptr`, the function should behave as it currently does.
- Carefully handling the divergent logic paths inside the unified function, especially regarding weight normalization and activation functions.
- Removing the now-redundant `build_moe_ffn_from_probs` function once the merge is complete and verified.
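To make the toggle concrete, here is a minimal sketch of what the unified signature could look like. This is illustrative only: the real `build_moe_ffn` in llama.cpp takes many more parameters (expert tensors, `n_expert`, `n_expert_used`, normalization and activation flags), which are elided here, and the logits/softmax body shown is an assumption about the default path, not the actual implementation.

```cpp
#include "ggml.h" // assumes the ggml headers from the llama.cpp tree

// Sketch only: the parameter list is heavily abbreviated relative to the
// real build_moe_ffn.
static ggml_tensor * build_moe_ffn(
        ggml_context * ctx,
        ggml_tensor  * cur,      // input hidden states
        ggml_tensor  * gate_inp, // router weight matrix
        /* ... expert tensors, n_expert, n_expert_used, flags ... */
        ggml_tensor  * probs = nullptr) { // new optional parameter
    if (probs == nullptr) {
        // Default path: compute the router logits and probabilities
        // internally, as the function does today.
        ggml_tensor * logits = ggml_mul_mat(ctx, gate_inp, cur);
        probs = ggml_soft_max(ctx, logits);
    }
    // From here on, expert selection and weighting proceed from `probs`,
    // covering both the default case and the SmallThinker case where the
    // router output is computed before the attention block.
    /* ... shared expert-FFN logic ... */
    return cur;
}
```

A SmallThinker call site would pass its precomputed router probabilities as the final argument, while every existing call site compiles unchanged thanks to the `nullptr` default.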
I plan to submit a pull request addressing this within the next 1-3 days.
Context
- Original PR: Add support for SmallThinker model series #14898
- Relevant Discussion: Add support for SmallThinker model series #14898 (comment)