
Conversation


@DajanaV DajanaV commented Nov 6, 2025

Mirrored from ggml-org/llama.cpp#17052

Summary

This PR adds full support for the Megrez-MoE (Mixture of Experts) architecture to llama.cpp, enabling inference on Megrez2-3x7B models and similar MoE variants.

Architecture Details

Megrez-MoE is a Mixture of Experts architecture with:

  • 64 experts with top-6 selection per layer
  • 4 shared experts across all tokens
  • Sigmoid + bias gating mechanism (different from standard softmax gating; see the example after this list)
  • 30 MoE layers with 2048 embedding dimension
  • Context length up to 163,840 tokens
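
To make the routing rule concrete, here is a minimal, self-contained illustration (not code from this PR): each expert's router logit is passed through a sigmoid and offset by a per-expert bias, and the six highest-scoring experts are kept, instead of taking a softmax over all 64 logits. Function and variable names are illustrative only.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Toy sketch of sigmoid + bias routing with top-k selection.
std::vector<int> select_experts(const std::vector<float> & logits, // one router logit per expert (64)
                                const std::vector<float> & bias,   // learned per-expert bias
                                int k) {                           // experts kept per token (6)
    const int n_expert = (int) logits.size();
    std::vector<float> score(n_expert);
    for (int i = 0; i < n_expert; ++i) {
        score[i] = 1.0f/(1.0f + std::exp(-logits[i])) + bias[i]; // sigmoid + bias, no softmax
    }
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx; // indices of the selected experts
}
```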

Changes Made

1. Architecture Registration

  • Added LLM_ARCH_MEGREZ_MOE architecture enum
  • Registered MoE-specific hyperparameters (expert counts, FFN dimensions)
  • Added tensor mapping for 64 experts × 30 layers

2. MoE FFN Implementation

Implemented build_mergez_moe_ffn() with the following (a hedged sketch follows this list):

  • Sigmoid gating with bias (unique to Megrez-MoE)
  • Top-K expert selection using ggml_top_k()
  • Shared experts processing for all tokens
  • Per-expert feed-forward computation
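
A hedged sketch of how that routing step can be expressed with ggml operations, mirroring the pattern used by llama.cpp's MoE builders; the helper, tensor names, and the choice to use the bias only for selection are illustrative, not the PR's actual build_mergez_moe_ffn.

```cpp
#include "ggml.h"

// Illustrative routing-only fragment; shapes follow ggml convention:
// logits is [n_expert, n_tokens], expert_bias is [n_expert].
static ggml_tensor * route_experts_sketch(
        ggml_context * ctx,
        ggml_tensor  * logits,
        ggml_tensor  * expert_bias,
        int64_t n_expert, int64_t n_expert_used, int64_t n_tokens,
        ggml_tensor ** selected_out) {
    ggml_tensor * probs    = ggml_sigmoid(ctx, logits);              // sigmoid gating
    ggml_tensor * scores   = ggml_add(ctx, probs, expert_bias);      // + bias (broadcast over tokens)
    ggml_tensor * selected = ggml_top_k(ctx, scores, n_expert_used); // expert ids, [n_expert_used, n_tokens]
    // gather the gate weights of the selected experts
    ggml_tensor * weights  = ggml_get_rows(ctx,
            ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), selected);
    *selected_out = selected; // ids drive the per-expert mat-muls (e.g. ggml_mul_mat_id);
                              // the shared experts run on every token regardless of routing
    return weights;
}
```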

3. Model Loading

  • Added llm_build_megrez_moe class
  • Implemented hyperparameter loading (expert_count, expert_used_count, etc.)
  • Implemented tensor loading for all expert weights

4. Graph Memory Fix

  • Problem: Warmup crashed with "not enough space in context's memory pool"
  • Root Cause: MoE FFN creates ~35 intermediate tensors per layer (sigmoid, reshape, top_k, etc.)
  • Solution: Added a 4096-node overhead to graph_max_nodes() for Megrez-MoE (30 layers × ~35 tensors ≈ 1050 extra nodes, doubled and rounded up to 4096 for a safety margin); see the sketch below
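
A minimal sketch of the kind of adjustment described, using a hypothetical wrapper around the node budget; the real graph_max_nodes() lives in src/llama-context.cpp and uses its own baseline formula, so only the Megrez-MoE branch reflects this PR.

```cpp
// Hypothetical illustration of the overhead; not the actual llama.cpp code.
uint32_t graph_max_nodes_sketch(uint32_t base_nodes, llm_arch arch) {
    uint32_t n_nodes = base_nodes;
    if (arch == LLM_ARCH_MEGREZ_MOE) {
        // 30 MoE layers x ~35 intermediate tensors ≈ 1050 extra graph nodes,
        // doubled and rounded up to 4096 so the three warmup graph builds
        // never exhaust the context's memory pool.
        n_nodes += 4096;
    }
    return n_nodes;
}
```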

Testing

  • All 39 existing tests pass
  • No regression in other architectures (verified with Gemma-2)
  • Warmup works without crashes
  • Output generation verified (up to 200 tokens)
  • Performance: ~17 tokens/second on CPU

Comparison

# Without this PR:
$ ./build/bin/llama-cli -m Megrez2-3x7B.gguf -p "Test"
error: unknown model architecture: 'megrez-moe'

# With this PR:
$ ./build/bin/llama-cli -m Megrez2-3x7B.gguf -p "Test"
# Model loads, completes warmup, and generates output

tamarPal added 5 commits November 6, 2025 23:45
Implements complete support for Megrez-MoE (Mixture of Experts) models:

- Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings
- Implement build_mergez_moe_ffn() with sigmoid+bias gating
- Add llm_build_megrez_moe class for full model graph construction
- Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE)
- Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Load all model hyperparameters and 372 tensors correctly
- Configure NEOX RoPE type for proper positional encoding

Tested with Megrez2-3x7B-A3B_Q4_K_M.gguf model.
All 39 llama.cpp tests pass successfully.
Output verified to match infinigence/llama.cpp reference implementation.

Note: Use --no-warmup flag to avoid warmup memory allocation issue.
Megrez-MoE creates many intermediate tensors during MoE FFN construction:
- sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations
- ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer)
- Each of 30 MoE layers creates ~35 intermediate tensors during graph construction

During warmup, the graph is built 3 times with different batch sizes, requiring
sufficient memory pool space for all intermediate tensors.

Add a 4096-node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate
tensors (30 layers × ~35 tensors/layer ≈ 1050 nodes, doubled and rounded up to 4096
for a safety margin).

This fixes the 'not enough space in the context's memory pool' error during warmup,
allowing Megrez-MoE to work without the --no-warmup flag.

Tested:
- All 39 tests pass
- Megrez-MoE works with warmup enabled (no crashes)
- Other models (e.g., Gemma-2) are unaffected
- Verified with outputs up to 100 tokens
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp
- Add declaration to src/models/models.h
- Update CMakeLists.txt to include megrez-moe.cpp in build
- Resolve merge conflicts in llama-arch.cpp and llama-model.cpp
- Fix PANGU_EMBED case statement closing braces

The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn

- Remove custom build_mergez_moe_ffn implementation (100+ lines)
- Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID
- Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating)
- Pass pre-computed logits via probs_in parameter
- Maintain exact same behavior and output quality

This addresses review feedback to reuse existing MoE infrastructure
instead of duplicating code. The sigmoid gating + bias after activation
is already supported by build_moe_ffn.
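
For orientation, a hedged sketch of the reuse argument: build_moe_ffn already dispatches on llama.cpp's llama_expert_gating_func_type, so selecting LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID and handing in the pre-computed logits is enough. The toy enum and helper below are illustrative, not the library's actual dispatch code.

```cpp
#include "ggml.h"

// Toy dispatch mirroring the idea that the gating non-linearity is a
// selectable option of the shared MoE helper rather than bespoke code.
enum gating_func { GATING_SOFTMAX, GATING_SIGMOID };

static ggml_tensor * apply_gating(ggml_context * ctx, ggml_tensor * logits, gating_func fn) {
    switch (fn) {
        case GATING_SOFTMAX: return ggml_soft_max(ctx, logits); // standard MoE routing
        case GATING_SIGMOID: return ggml_sigmoid(ctx, logits);  // Megrez-MoE sigmoid-gated routing
    }
    return logits;
}
```
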
@DajanaV DajanaV force-pushed the upstream-PR17052-branch_tamarPal-feature/megrez-moe branch from 0989d06 to 90cd13d on November 6, 2025 23:34
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Megrez-MoE Architecture Support

Overview

This analysis examines the performance impact of adding Megrez-MoE (Mixture of Experts) architecture support to llama.cpp. The changes introduce comprehensive MoE functionality including 64 experts with top-6 selection, shared experts, and sigmoid gating mechanisms.

Key Findings

Performance Impact

  • Highest degradation identified: _M_const_cast function shows 209% Response Time increase (85 ns → 262 ns) and 277% Throughput degradation (64 ns → 241 ns)
  • Core function impact: The affected function is part of C++ Standard Library red-black tree operations used for architecture lookups, not directly impacting inference functions like llama_decode, llama_encode, or llama_tokenize
  • Inference performance: No direct impact on tokens per second expected since core inference functions remain unchanged

Power Consumption Analysis

  • Primary impact: build.bin.libllama.so shows minimal +0.427% power consumption increase
  • Secondary impact: build.bin.llama-cvector-generator shows negligible increase
  • Overall assessment: Power impact is contained and acceptable for the new functionality added

Technical Analysis

  • Flame graph insights: Performance regression concentrated in the function's own logic (92% of execution time), with additional overhead from stack protection checks and red-black tree iterator operations
  • CFG comparison: Code reorganization separated stack canary setup into distinct basic blocks, adding one extra branch instruction and changing memory access patterns
  • Root cause: Architecture map expansion increases iterator traversal overhead during model initialization

Code Review Insights

  • Memory allocation: Added 4096 graph nodes overhead for Megrez-MoE models to prevent warmup crashes
  • Architecture complexity: Introduced 25 new tensor mappings and sophisticated layer-specific expert allocation patterns
  • Implementation quality: Follows established MoE patterns while accommodating unique Megrez-MoE characteristics

Conclusion

The implementation successfully adds Megrez-MoE support with localized performance impact limited to architecture lookup operations. The observed regression represents an acceptable trade-off for enabling an entirely new model architecture family without affecting core inference performance.

@DajanaV DajanaV force-pushed the main branch 17 times, most recently from f89f4ca to 8ee0b21 on November 9, 2025 11:06
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp
- Remove extra blank line in llama-context.cpp
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10