UPSTREAM PR #17052: Add Megrez-MoE Architecture Support ggml-org#16724 #103
base: main
Conversation
Implements complete support for Megrez-MoE (Mixture of Experts) models:
- Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings
- Implement build_mergez_moe_ffn() with sigmoid+bias gating
- Add llm_build_megrez_moe class for full model graph construction
- Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE)
- Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Load all model hyperparameters and 372 tensors correctly
- Configure NEOX RoPE type for proper positional encoding

Tested with the Megrez2-3x7B-A3B_Q4_K_M.gguf model. All 39 llama.cpp tests pass successfully. Output verified to match the infinigence/llama.cpp reference implementation.

Note: Use the --no-warmup flag to avoid a warmup memory allocation issue.
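The gating path described in this commit can be illustrated outside the ggml graph. Below is a minimal standalone C++ sketch of the sigmoid+bias gating and top-6 selection for a single token; the logit and bias values, the dimensions, and the final normalization step are illustrative assumptions, not code taken from the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Sizes taken from the description above: 64 routed experts, 6 used per token.
constexpr int N_EXPERT      = 64;
constexpr int N_EXPERT_USED = 6;

int main() {
    // Router logits for one token (dummy values; in the real graph these come
    // from a mat-mul of a hidden state with the router weight).
    std::vector<float> logits(N_EXPERT);
    for (int i = 0; i < N_EXPERT; ++i) {
        logits[i] = std::sin(0.1f*i);
    }
    // Per-expert bias added after the sigmoid activation, as described above.
    std::vector<float> bias(N_EXPERT, 0.01f);

    // 1) sigmoid gating + bias
    std::vector<float> scores(N_EXPERT);
    for (int i = 0; i < N_EXPERT; ++i) {
        scores[i] = 1.0f/(1.0f + std::exp(-logits[i])) + bias[i];
    }

    // 2) top-k expert selection (the graph uses ggml_top_k, i.e. argsort + view)
    std::vector<int> idx(N_EXPERT);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + N_EXPERT_USED, idx.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });

    // 3) normalize the selected weights so they sum to 1
    float sum = 0.0f;
    for (int k = 0; k < N_EXPERT_USED; ++k) sum += scores[idx[k]];
    for (int k = 0; k < N_EXPERT_USED; ++k) {
        printf("expert %2d  weight %.4f\n", idx[k], scores[idx[k]]/sum);
    }
    return 0;
}
```

In the full graph these steps correspond to the sigmoid, add, top_k (argsort + view), get_rows, sum_rows, and div operations enumerated in the next commit message.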
Megrez-MoE creates many intermediate tensors during MoE FFN construction:
- sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations
- ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer)
- Each of 30 MoE layers creates ~35 intermediate tensors during graph construction

During warmup, the graph is built 3 times with different batch sizes, requiring sufficient memory pool space for all intermediate tensors.

Add a 4096 node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate tensors (30 layers × 35 tensors/layer ≈ 1050 nodes, doubled for safety margin).

This fixes the "not enough space in the context's memory pool" error during warmup, allowing Megrez-MoE to work without the --no-warmup flag.

Tested:
- All 39 tests pass
- Megrez-MoE works with warmup enabled (no crashes)
- Other models (e.g., Gemma-2) are unaffected
- Verified with outputs up to 100 tokens
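To make the node-budget arithmetic concrete, here is a small sketch of how a per-architecture overhead of this kind could sit on top of a graph_max_nodes()-style budget. The base formula, struct, and member names are assumptions for illustration, not the actual llama.cpp implementation; only the 4096 overhead and the 30 × 35 estimate come from the commit above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the real enums/structs in llama.cpp.
enum llm_arch { LLM_ARCH_MEGREZ_MOE, LLM_ARCH_OTHER };

struct model_info {
    llm_arch arch      = LLM_ARCH_MEGREZ_MOE;
    uint32_t n_tensors = 372;   // tensor count reported in the first commit
};

// Assumed shape of a graph-node budget: a base derived from the tensor count
// plus a fixed per-architecture overhead for the extra MoE intermediates.
static uint32_t graph_max_nodes(const model_info & model) {
    uint32_t n = std::max<uint32_t>(8192u, 8u*model.n_tensors);
    if (model.arch == LLM_ARCH_MEGREZ_MOE) {
        // 30 MoE layers × ~35 intermediate tensors ≈ 1050 extra nodes;
        // 4096 comfortably covers double that estimate (≈ 2100) and leaves
        // headroom for warmup, where the graph is built three times.
        n += 4096;
    }
    return n;
}

int main() {
    const model_info model;
    printf("graph node budget: %u\n", graph_max_nodes(model));
    return 0;
}
```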
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp
- Add declaration to src/models/models.h
- Update CMakeLists.txt to include megrez-moe.cpp in build
- Resolve merge conflicts in llama-arch.cpp and llama-model.cpp
- Fix PANGU_EMBED case statement closing braces

The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn
- Remove custom build_mergez_moe_ffn implementation (100+ lines)
- Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID
- Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating)
- Pass pre-computed logits via the probs_in parameter
- Maintain exact same behavior and output quality

This addresses review feedback to reuse existing MoE infrastructure instead of duplicating code. The sigmoid gating + bias after activation is already supported by build_moe_ffn.
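As a rough illustration of the pre-computed gating described here, the sketch below derives router logits from a separately carried pre_gate_hidden state rather than the FFN input. The dimensions, weight values, and function name are hypothetical, and the hand-off to build_moe_ffn (via its probs_in parameter with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID, as the commit describes) is only indicated in a comment.

```cpp
#include <cstdio>
#include <vector>

// Toy dimensions, assumptions for illustration only.
constexpr int N_EMBD   = 8;   // hidden size
constexpr int N_EXPERT = 4;   // routed experts

// gate_logits = W_gate * pre_gate_hidden: the router sees a "pre-gate" hidden
// state carried alongside the layer input, not the tensor the experts run on.
static std::vector<float> compute_gate_logits(const std::vector<float> & w_gate,          // N_EXPERT x N_EMBD
                                              const std::vector<float> & pre_gate_hidden) // N_EMBD
{
    std::vector<float> logits(N_EXPERT, 0.0f);
    for (int e = 0; e < N_EXPERT; ++e) {
        for (int i = 0; i < N_EMBD; ++i) {
            logits[e] += w_gate[e*N_EMBD + i]*pre_gate_hidden[i];
        }
    }
    return logits;
}

int main() {
    const std::vector<float> w_gate(N_EXPERT*N_EMBD, 0.05f);
    const std::vector<float> pre_gate_hidden(N_EMBD, 1.0f);

    // These logits would then be handed to the shared MoE routine
    // (per the commit: through its probs_in parameter, with sigmoid gating
    // selected), so no duplicate gating code is needed in the graph builder.
    const auto logits = compute_gate_logits(w_gate, pre_gate_hidden);
    for (int e = 0; e < N_EXPERT; ++e) {
        printf("expert %d logit %.3f\n", e, logits[e]);
    }
    return 0;
}
```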
Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Megrez-MoE Architecture Support

Overview

This analysis examines the performance impact of adding Megrez-MoE (Mixture of Experts) architecture support to llama.cpp. The changes introduce comprehensive MoE functionality including 64 experts with top-6 selection, shared experts, and sigmoid gating mechanisms.

Key Findings

Performance Impact
Power Consumption Analysis
Technical Analysis
Code Review Insights
Conclusion

The implementation successfully adds Megrez-MoE support with localized performance impact limited to architecture lookup operations. The observed regression represents an acceptable trade-off for enabling an entirely new model architecture family without affecting core inference performance.
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp
- Remove extra blank line in llama-context.cpp
Mirrored from ggml-org/llama.cpp#17052
Summary
This PR adds full support for the Megrez-MoE (Mixture of Experts) architecture to llama.cpp, enabling inference on Megrez2-3x7B models and similar MoE variants.
Architecture Details
Megrez-MoE is a Mixture of Experts architecture with:
- 31 layers: layer 0 uses a dense FFN, layers 1-30 use MoE FFNs
- An expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Sigmoid gating with a per-expert bias applied after the activation
- NEOX RoPE positional encoding
Changes Made
1. Architecture Registration
- LLM_ARCH_MEGREZ_MOE architecture enum and mappings

2. MoE FFN Implementation
Implemented build_mergez_moe_ffn() with:
- Sigmoid + bias gating
- Top-k expert selection via ggml_top_k()

3. Model Loading
- llm_build_megrez_moe class for full model graph construction

4. Graph Memory Fix
- Increased graph_max_nodes() for Megrez-MoE (30 layers × 35 tensors ≈ 1050 nodes, doubled for safety)

Testing
Comparison