tamarPal (Contributor) commented on Nov 10, 2025

Replaces closed PR #17052 (which hit a force-push issue). Contains only the Megrez-MoE changes, fully tested.

Architecture

  • 64 routed experts (top-6) + 4 shared experts
  • Sigmoid + bias gating
  • 30 MoE layers, 2048-dim, 163K context

Changes

  1. Conversion (convert_hf_to_gguf.py): MegrezMoEModel class

    • Shared expert FFN: hidden_size × 2.75 = 5632
    • Gate bias mapping: e_score_correction_bias → exp_probs_b
    • Expert tensor merging (64 experts/layer)
  2. Architecture (constants.py): MODEL_TENSORS.MEGREZ_MOE

  3. Inference (llama.cpp): MoE FFN + graph memory fix

Testing

  • Real model: Infinigence/Megrez2-3x7B-A3B (13.92 GB → 14 GB GGUF)
  • Conversion: 372 tensors, all parameters correct
  • Inference: coherent output, 5.77 tok/s
  • All 40 tests pass

Example

python3 convert_hf_to_gguf.py models/Megrez2-3x7B-A3B/
./build/bin/llama-cli -m megrez2-3x7b-a3b-f16.gguf -p "Hello" -n 50

tamarPal added 11 commits November 9, 2025 19:55
Implements complete support for Megrez-MoE (Mixture of Experts) models:

- Add LLM_ARCH_MEGREZ_MOE architecture enum and mappings
- Implement build_mergez_moe_ffn() with sigmoid+bias gating (a standalone sketch of the routing math follows this list)
- Add llm_build_megrez_moe class for full model graph construction
- Support 31-layer architecture (layer 0: dense FFN, layers 1-30: MoE)
- Implement expert sharing pattern with 64 experts, 6 used per token, 4 shared
- Load all model hyperparameters and 372 tensors correctly
- Configure NEOX RoPE type for proper positional encoding
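For reference, the routing math described above can be sketched in a standalone way. The sketch below is illustrative only: the logit and bias values are placeholders, and the convention that the bias influences only expert selection while the mixing weights come from the unbiased sigmoid scores (then renormalized, matching the sum_rows/div ops) follows the usual e_score_correction_bias behaviour; that split is an assumption here, not code copied from this PR.

```cpp
// Illustrative sketch of sigmoid + bias gating for one token, not the PR's code.
// Assumptions: 64 routed experts, top-6 selection, bias added only for selection,
// and the selected weights renormalized to sum to 1 (the sum_rows/div step).
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    const int n_expert      = 64;
    const int n_expert_used = 6;

    // Router logits for one token and the learned per-expert correction bias
    // (mapped from e_score_correction_bias to exp_probs_b during conversion).
    std::vector<float> logits(n_expert);
    std::vector<float> bias(n_expert);
    for (int e = 0; e < n_expert; ++e) {
        logits[e] = std::sin(0.37f * e);   // placeholder values
        bias[e]   = 0.01f * (e % 4);       // placeholder values
    }

    // 1. sigmoid over the logits -> routing probabilities
    std::vector<float> probs(n_expert);
    for (int e = 0; e < n_expert; ++e) {
        probs[e] = 1.0f / (1.0f + std::exp(-logits[e]));
    }

    // 2. add the bias, but only for expert *selection* (top-6)
    std::vector<int> order(n_expert);
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + n_expert_used, order.end(),
                      [&](int a, int b) { return probs[a] + bias[a] > probs[b] + bias[b]; });

    // 3. mixing weights come from the *unbiased* probabilities, renormalized
    float sum = 0.0f;
    for (int i = 0; i < n_expert_used; ++i) sum += probs[order[i]];
    for (int i = 0; i < n_expert_used; ++i) {
        printf("expert %2d  weight %.4f\n", order[i], probs[order[i]] / sum);
    }
    return 0;
}
```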

Tested with Megrez2-3x7B-A3B_Q4_K_M.gguf model.
All 39 llama.cpp tests pass successfully.
Output verified to match infinigence/llama.cpp reference implementation.

Note: use the --no-warmup flag to avoid a warmup memory-allocation issue.

Megrez-MoE creates many intermediate tensors during MoE FFN construction:
- sigmoid, add, reshape (3x), get_rows, sum_rows, div, view_2d, mul_mat operations
- ggml_top_k internally calls ggml_argsort + ggml_view_4d (2 more tensors per layer)
- Each of 30 MoE layers creates ~35 intermediate tensors during graph construction

During warmup, the graph is built 3 times with different batch sizes, requiring
sufficient memory pool space for all intermediate tensors.

Add 4096 node overhead for LLM_ARCH_MEGREZ_MOE to accommodate these intermediate
tensors (30 layers × 35 tensors/layer ≈ 1050 nodes, doubled for safety margin).

This fixes the 'not enough space in the context's memory pool' error during warmup,
allowing Megrez-MoE to work without the --no-warmup flag.
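The sizing logic can be summarized with a small sketch. This is a hypothetical fragment, not the function actually modified: extra_graph_nodes is an invented name, and it assumes llama.cpp's llm_arch enum together with the LLM_ARCH_MEGREZ_MOE value added by this PR.

```cpp
// Hypothetical sketch, not the PR's actual code path: the real change adds a
// per-architecture node overhead when sizing the graph memory pool for warmup.
static uint32_t extra_graph_nodes(llm_arch arch) {
    // ~35 intermediate tensors per MoE layer x 30 MoE layers ~= 1050 nodes,
    // doubled for a safety margin and rounded up to a power of two -> 4096.
    return arch == LLM_ARCH_MEGREZ_MOE ? 4096 : 0;
}
```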

Tested:
- All 39 tests pass
- Megrez-MoE works with warmup enabled (no crashes)
- Other models (e.g., Gemma-2) are unaffected
- Verified with outputs up to 100 tokens
- Move llm_build_megrez_moe from llama-model.cpp to src/models/megrez-moe.cpp
- Add declaration to src/models/models.h
- Update CMakeLists.txt to include megrez-moe.cpp in build
- Resolve merge conflicts in llama-arch.cpp and llama-model.cpp
- Fix PANGU_EMBED case statement closing braces

The model loads successfully, all tests pass (40/40), and inference works correctly.
…oe_ffn

- Remove custom build_mergez_moe_ffn implementation (100+ lines)
- Use existing build_moe_ffn with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID
- Pre-compute gate logits from pre_gate_hidden (Megrez-MoE's unique gating)
- Pass pre-computed logits via probs_in parameter
- Maintain exact same behavior and output quality

This addresses review feedback to reuse existing MoE infrastructure
instead of duplicating code. The sigmoid gating + bias after activation
is already supported by build_moe_ffn.
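Roughly, the reuse works as in the fragment below. This is a sketch, not the PR's exact code: it assumes llama.cpp's graph-building context (ctx0, model.layers[il]) and its ffn_gate_inp tensor name, and the build_moe_ffn call itself is only described in comments because its full argument list is not reproduced here.

```cpp
// Illustrative fragment only, not the PR's exact code. Megrez-MoE computes the
// router logits from pre_gate_hidden instead of from the current FFN input:
ggml_tensor * gate_logits = ggml_mul_mat(ctx0, model.layers[il].ffn_gate_inp, pre_gate_hidden);

// The pre-computed logits are then handed to the shared build_moe_ffn helper via
// its probs_in parameter, together with LLAMA_EXPERT_GATING_FUNC_TYPE_SIGMOID and
// the exp_probs_b bias tensor, so the existing sigmoid gating, biased top-k
// selection, and weight normalization are reused unchanged.
```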
- Restore PANGU_EMBED and COGVLM tensor mappings in llama-arch.cpp
- Remove extra blank line in llama-context.cpp
Restore HunYuanMoE code to upstream version - no modifications needed
github-actions bot added the model (Model specific) and python (python script changes) labels on Nov 10, 2025
tamarPal changed the title from "Add complete **Megrez-MoE** support: GGUF conversion + inference." to "Add complete Megrez-MoE support: GGUF conversion + inference." on Nov 10, 2025
- Use explicit expert_layer_stride variable instead of hard-coded expression
- Apply clang-format to ensure consistent code style
- Fix trailing whitespace issues
@ngxson (Collaborator) left a comment:

honestly, I think this PR is taking so much time for maintainers to review.

many parts of the code are not clean, I suggest that you should have a deeper look into how other models structure their code and follow the existing pattern.

tamarPal added 2 commits November 11, 2025 09:44
- Merge layer 0 into main loop (il=0 to n_layer); a schematic follows this list
- Add fallback values for n_ff_shexp and n_ff_exp
- Add safety checks for inp_out_ids and pre_gate_hidden
- Remove verbose comments and extra whitespace
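Schematically, the merged loop looks like the sketch below; build_attention_block, build_dense_ffn, and build_megrez_moe_ffn are placeholder names for this illustration, not functions from the PR.

```cpp
// Schematic only: placeholder helper names, not the PR's code.
for (int il = 0; il < n_layer; ++il) {
    cur = build_attention_block(cur, il);     // attention + residual (placeholder)

    if (il == 0) {
        cur = build_dense_ffn(cur, il);       // layer 0 uses a plain dense FFN
    } else {
        cur = build_megrez_moe_ffn(cur, il);  // layers 1..n_layer-1 use the MoE FFN
    }
}
```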
@pwilkin (Collaborator) left a comment:

Please run editorconfig-checker (https://github.com/editorconfig-checker/editorconfig-checker) and flake8 on your PR and fix the whitespace / indentation errors.
