
Conversation


@DajanaV DajanaV commented Nov 6, 2025

Mirrored from ggml-org/llama.cpp#17052

Summary

This PR adds full support for the Megrez-MoE (Mixture of Experts) architecture to llama.cpp, enabling inference on Megrez2-3x7B models and similar MoE variants.

Architecture Details

Megrez-MoE is a Mixture of Experts architecture with:

  • 64 experts with top-6 selection per layer
  • 4 shared experts across all tokens
  • Sigmoid + bias gating mechanism (different from standard softmax gating; see the example after this list)
  • 30 MoE layers with 2048 embedding dimension
  • Context length up to 163,840 tokens
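
To make the routing rule concrete, here is a minimal, self-contained illustration (not code from this PR): each expert's router logit is passed through a sigmoid and offset by a per-expert bias, and the six highest-scoring experts are kept, instead of taking a softmax over all 64 logits. Function and variable names are illustrative only.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Toy sketch of sigmoid + bias routing with top-k selection.
std::vector<int> select_experts(const std::vector<float> & logits, // one router logit per expert (64)
                                const std::vector<float> & bias,   // learned per-expert bias
                                int k) {                           // experts kept per token (6)
    const int n_expert = (int) logits.size();
    std::vector<float> score(n_expert);
    for (int i = 0; i < n_expert; ++i) {
        score[i] = 1.0f/(1.0f + std::exp(-logits[i])) + bias[i]; // sigmoid + bias, no softmax
    }
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return score[a] > score[b]; });
    idx.resize(k);
    return idx; // indices of the selected experts
}
```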

Changes Made

1. Architecture Registration

  • Added LLM_ARCH_MEGREZ_MOE architecture enum
  • Registered MoE-specific hyperparameters (expert counts, FFN dimensions)
  • Added tensor mapping for 64 experts × 30 layers

2. MoE FFN Implementation

Implemented build_mergez_moe_ffn() with the following (a hedged sketch follows this list):

  • Sigmoid gating with bias (unique to Megrez-MoE)
  • Top-K expert selection using ggml_top_k()
  • Shared experts processing for all tokens
  • Per-expert feed-forward computation
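
A hedged sketch of how that routing step can be expressed with ggml operations, mirroring the pattern used by llama.cpp's MoE builders; the helper, tensor names, and the choice to use the bias only for selection are illustrative, not the PR's actual build_mergez_moe_ffn.

```cpp
#include "ggml.h"

// Illustrative routing-only fragment; shapes follow ggml convention:
// logits is [n_expert, n_tokens], expert_bias is [n_expert].
static ggml_tensor * route_experts_sketch(
        ggml_context * ctx,
        ggml_tensor  * logits,
        ggml_tensor  * expert_bias,
        int64_t n_expert, int64_t n_expert_used, int64_t n_tokens,
        ggml_tensor ** selected_out) {
    ggml_tensor * probs    = ggml_sigmoid(ctx, logits);              // sigmoid gating
    ggml_tensor * scores   = ggml_add(ctx, probs, expert_bias);      // + bias (broadcast over tokens)
    ggml_tensor * selected = ggml_top_k(ctx, scores, n_expert_used); // expert ids, [n_expert_used, n_tokens]
    // gather the gate weights of the selected experts
    ggml_tensor * weights  = ggml_get_rows(ctx,
            ggml_reshape_3d(ctx, probs, 1, n_expert, n_tokens), selected);
    *selected_out = selected; // ids drive the per-expert mat-muls (e.g. ggml_mul_mat_id);
                              // the shared experts run on every token regardless of routing
    return weights;
}
```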

3. Model Loading

  • Added llm_build_megrez_moe class
  • Implemented hyperparameter loading (expert_count, expert_used_count, etc.)
  • Implemented tensor loading for all expert weights

4. Graph Memory Fix

  • Problem: Warmup crashed with "not enough space in context's memory pool"
  • Root Cause: MoE FFN creates ~35 intermediate tensors per layer (sigmoid, reshape, top_k, etc.)
  • Solution: Added a 4096-node overhead to graph_max_nodes() for Megrez-MoE (30 layers × ~35 tensors ≈ 1050 extra nodes, doubled and rounded up to 4096 for a safety margin); see the sketch below
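
A minimal sketch of the kind of adjustment described, using a hypothetical wrapper around the node budget; the real graph_max_nodes() lives in src/llama-context.cpp and uses its own baseline formula, so only the Megrez-MoE branch reflects this PR.

```cpp
// Hypothetical illustration of the overhead; not the actual llama.cpp code.
uint32_t graph_max_nodes_sketch(uint32_t base_nodes, llm_arch arch) {
    uint32_t n_nodes = base_nodes;
    if (arch == LLM_ARCH_MEGREZ_MOE) {
        // 30 MoE layers x ~35 intermediate tensors ≈ 1050 extra graph nodes,
        // doubled and rounded up to 4096 so the three warmup graph builds
        // never exhaust the context's memory pool.
        n_nodes += 4096;
    }
    return n_nodes;
}
```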

Testing

  • All 39 existing tests pass
  • No regression in other architectures (verified with Gemma-2)
  • Warmup works without crashes
  • Output generation verified (up to 200 tokens)
  • Performance: ~17 tokens/second on CPU

Comparison

# Without this PR:
$ ./build/bin/llama-cli -m Megrez2-3x7B.gguf -p "Test"
error: unknown model architecture: 'megrez-moe'

# With this PR:
$ ./build/bin/llama-cli -m Megrez2-3x7B.gguf -p "Test"
# Model loads, completes warmup, and generates output

tamarPal added 5 commits November 6, 2025 23:45
Implements complete support for Megrez-MoE (Mixture of Experts) models:

- Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings
- Implement build_mergez_moe_ffn() with sigmoid+bias gating
- Add llm_build_megrez_moe class for full model graph construction
- Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE)
- Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Load all model hyperparameters and 372 tensors correctly
- Configure NEOX RoPE type for proper positional encoding

Tested with Megrez2-3x7B-A3B_Q4_K_M.gguf model.
All 39 llama.cpp tests pass successfully.
Output verified to match infinigence/llama.cpp reference implementation.

Note: Use --no-warmup flag to avoid warmup memory allocation issue.
Megrez-MoE creates many intermediate tensors during MoE FFN construction:
- sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations
- ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer)
- Each of 30 MoE layers creates ~35 intermediate tensors during graph construction

During warmup, the graph is built 3 times with different batch sizes, requiring
sufficient memory pool space for all intermediate tensors.

Add a 4096-node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate
tensors (30 layers × ~35 tensors/layer ≈ 1050 nodes, doubled and rounded up to 4096
for a safety margin).

This fixes the 'not enough space in the context's memory pool' error during warmup,
allowing Megrez-MoE to work without the --no-warmup flag.

Tested:
- All 39 tests pass
- Megrez-MoE works with warmup enabled (no crashes)
- Other models (e.g., Gemma-2) are unaffected
- Verified with outputs up to 100 tokens
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp
- Add declaration to src/models/models.h
- Update CMakeLists.txt to include megrez-moe.cpp in build
- Resolve merge conflicts in llama-arch.cpp and llama-model.cpp
- Fix PANGU_EMBED case statement closing braces

The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn

- Remove custom build_mergez_moe_ffn implementation (100+ lines)
- Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID
- Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating)
- Pass pre-computed logits via probs_in parameter
- Maintain exact same behavior and output quality

This addresses review feedback to reuse existing MoE infrastructure
instead of duplicating code. The sigmoid gating + bias after activation
is already supported by build_moe_ffn.
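
For orientation, a hedged sketch of the reuse argument: build_moe_ffn already dispatches on llama.cpp's llama_expert_gating_func_type, so selecting LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID and handing in the pre-computed logits is enough. The toy enum and helper below are illustrative, not the library's actual dispatch code.

```cpp
#include "ggml.h"

// Toy dispatch mirroring the idea that the gating non-linearity is a
// selectable option of the shared MoE helper rather than bespoke code.
enum gating_func { GATING_SOFTMAX, GATING_SIGMOID };

static ggml_tensor * apply_gating(ggml_context * ctx, ggml_tensor * logits, gating_func fn) {
    switch (fn) {
        case GATING_SOFTMAX: return ggml_soft_max(ctx, logits); // standard MoE routing
        case GATING_SIGMOID: return ggml_sigmoid(ctx, logits);  // Megrez-MoE sigmoid-gated routing
    }
    return logits;
}
```
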
@DajanaV DajanaV force-pushed the upstream-PR17052-branch_tamarPal-feature/megrez-moe branch from 0989d06 to 90cd13d on November 6, 2025 23:34
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: Megrez-MoE Architecture Support

Overview

This analysis examines the performance impact of adding Megrez-MoE (Mixture of Experts) architecture support to llama.cpp. The changes introduce comprehensive MoE functionality including 64 experts with top-6 selection, shared experts, and sigmoid gating mechanisms.

Key Findings

Performance Impact

  • Highest degradation identified: _M_const_cast function shows 209% Response Time increase (85 ns → 262 ns) and 277% Throughput degradation (64 ns → 241 ns)
  • Core function impact: The affected function is part of C++ Standard Library red-black tree operations used for architecture lookups, not directly impacting inference functions like llama_decode, llama_encode, or llama_tokenize
  • Inference performance: No direct impact on tokens per second expected since core inference functions remain unchanged

Power Consumption Analysis

  • Primary impact: build.bin.libllama.so shows minimal +0.427% power consumption increase
  • Secondary impact: build.bin.llama-cvector-generator shows negligible increase
  • Overall assessment: Power impact is contained and acceptable for the new functionality added

Technical Analysis

  • Flame graph insights: Performance regression concentrated in the function's own logic (92% of execution time), with additional overhead from stack protection checks and red-black tree iterator operations
  • CFG comparison: Code reorganization separated stack canary setup into distinct basic blocks, adding one extra branch instruction and changing memory access patterns
  • Root cause: Architecture map expansion increases iterator traversal overhead during model initialization

Code Review Insights

  • Memory allocation: Added 4096 graph nodes overhead for Megrez-MoE models to prevent warmup crashes
  • Architecture complexity: Introduced 25 new tensor mappings and sophisticated layer-specific expert allocation patterns
  • Implementation quality: Follows established MoE patterns while accommodating unique Megrez-MoE characteristics

Conclusion

The implementation successfully adds Megrez-MoE support with localized performance impact limited to architecture lookup operations. The observed regression represents an acceptable trade-off for enabling an entirely new model architecture family without affecting core inference performance.

@DajanaV DajanaV force-pushed the main branch 17 times, most recently from f89f4ca to 8ee0b21 on November 9, 2025 11:06
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp
- Remove extra blank line in llama-context.cpp
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from f333350 to 9c4623f on November 18, 2025 09:10