Port cpu moe options from mainline #672

TheLegendOfKitty · 2025-08-06T01:09:30Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

Simple port of ggml-org/llama.cpp#15077.

Co-Authored-By: Parsa <[email protected]>

ikawrakow · 2025-08-07T05:31:38Z

Do people think that this is really useful?

-cmoe is the same as -ot exps=CPU, and now we have -fmoe and -cmoe.

The regex required with -ot for -ncmoe X is slightly more complicated, but it isn't clear to me why keeping the first X routed MoE tensors on the CPU is better than any alternative option of having X layers on the CPU.
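To make the comparison concrete, here is a minimal sketch of how the two shortcuts could be spelled as `-ot` overrides. It assumes GGUF-style tensor names such as `blk.<layer>.ffn_up_exps.weight`, assumes that `-ncmoe X` targets layers 0 to X-1, and assumes that an override applies when its regex is found anywhere in the tensor name. The helper names `ncmoe_overrides` and `matched_buffer` are hypothetical, and the patterns actually generated by the port may differ.

```python
import re
from typing import List, Optional


def ncmoe_overrides(first_n_layers: int) -> List[str]:
    """One `pattern=CPU` override per layer for layers 0..first_n_layers-1 (hypothetical helper)."""
    return [rf"blk\.{i}\.ffn_(up|down|gate)_exps=CPU" for i in range(first_n_layers)]


def matched_buffer(tensor_name: str, overrides: List[str]) -> Optional[str]:
    """Return the buffer type of the first override whose regex is found in the tensor name."""
    for override in overrides:
        pattern, _, buffer_type = override.rpartition("=")
        if re.search(pattern, tensor_name):
            return buffer_type
    return None


cmoe = ["exps=CPU"]         # the `-ot exps=CPU` form quoted above
ncmoe = ncmoe_overrides(2)  # assumed expansion of `-ncmoe 2`

for name in ("blk.0.ffn_up_exps.weight", "blk.2.ffn_up_exps.weight", "blk.0.attn_q.weight"):
    print(name, matched_buffer(name, cmoe), matched_buffer(name, ncmoe))
```

Under these assumptions the `exps` pattern catches the routed expert tensors of every layer, the per-layer patterns catch only layers 0 and 1, and attention tensors are left alone by both.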

enbeec · 2025-08-07T07:59:08Z

> Do people think that this is really useful?

I did before I looked at mainline and realized it's just what I'm doing in bash, while wondering if it's even any faster than some alternatives.

> The regex required with -ot for -ncmoe X is slightly more complicated, but it isn't clear to me why keeping the first X routed MoE tensors on the CPU is better than any alternative option of having X layers on the CPU.

Yeah, speaking as a noob here but with some user-facing experience as a dev, this feels like solving a papercut by potentially misleading people. Speaking as someone who has been on the internet, I get the feeling we're looking at something like blogspam making one piece of isolated advice into a pseudo best practice. The most credible sources I can find for this exact advice are the Unsloth guides for big MoE models and this Reddit post.

As I understand it with my beginner mind, it's a combination of "FFN on CPU is fast" and "prefer to offload routed tensors to reduce PCIe chatter", probably with the caveat that splitting layers drastically increases PCIe chatter because of mid-layer round trips. As this top-level comment on that Reddit post points out, this will work best when the CPU is heavily bottlenecked. The last thing I've seen is people doing statistical analysis on activations, but that seems like a lot, to me at least.

I do think we're going to see a lot of users like me trying to cram fairly sizeable MoEs into modest-to-meager hardware. I've got a 3090 on PCIe gen3, ~192GB DDR4 2666MHz and 16 threads dedicated to inference. Until I can afford to upgrade the host and leapfrog the GPU, this is a topic that really matters to me.

If someone with deeper technical knowledge can help fill in some of the details about the tradeoffs of whole-layer vs. expert-tensor offload, I can happily test on my setup and share results on Reddit to get some evidence-based thinking going out there. I also am fuzzy on exactly why I'm told to offload from GPU to CPU starting with the last layers and working forward -- prompt processing? I think there is a need for "hybrid MoE strategies" and it would be a good backdoor for getting people to think beyond VRAM+RAM+TC.
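As a starting point for that comparison, here is a rough sketch of what each strategy moves to the CPU for a single layer. The tensor suffixes are illustrative assumptions rather than a listing read from a real model, `split_placement` is a hypothetical helper, and the sketch only classifies names; it says nothing about speed.

```python
import re
from typing import List, Tuple

# Illustrative per-layer tensor suffixes for a MoE block (assumed, not read from a real model).
LAYER_TENSORS = [
    "attn_q.weight", "attn_k.weight", "attn_v.weight", "attn_output.weight",
    "ffn_gate_inp.weight",  # router
    "ffn_up_exps.weight", "ffn_gate_exps.weight", "ffn_down_exps.weight",  # routed experts
]


def split_placement(pattern: str, layer: int) -> Tuple[List[str], List[str]]:
    """Split one layer's tensor names into (matched -> CPU, unmatched -> stays on GPU)."""
    names = [f"blk.{layer}.{suffix}" for suffix in LAYER_TENSORS]
    on_cpu = [n for n in names if re.search(pattern, n)]
    on_gpu = [n for n in names if n not in on_cpu]
    return on_cpu, on_gpu


whole_layer = r"blk\.3\."              # everything in layer 3
experts_only = r"blk\.3\.ffn_.*_exps"  # only the routed expert tensors of layer 3

for label, pattern in (("whole layer", whole_layer), ("experts only", experts_only)):
    cpu, gpu = split_placement(pattern, layer=3)
    print(f"{label}: {len(cpu)} tensors on CPU, {len(gpu)} left on GPU")
```

With the experts-only pattern, the attention tensors and the router stay on the GPU, so activations cross PCIe around the expert FFN within the layer; with the whole-layer pattern nothing in that layer stays on the GPU. Which of the two is faster on a given machine is exactly the open question here and would need measuring.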

Also, I know there are people who have done statistical analysis of activations to guide offloading choices but that seems heavily workload dependent. I think any efforts on this front should be leading towards teaching someone how to read the architecture diagram and deduce which layers/tensors belong where on their hardware.

ikawrakow

Given the 3 🚀, I'll merge.

* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines

* mxfp4: basics
* mxfp4: Zen4 GEMM
* mxfp4: repacked GEMM (AVX2/Zen4)
* mxfp4: AVX2 GEMM
* mxfp4: NEON GEMM
* mxfp4: repacked GEMM (NEON)
* mxfp4: Metal
* Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA
* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <[email protected]>

* Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2
* Implement basic tool choice
* Backport llama.cpp tool calls support
* Enhance function calls with improved chat parser and string utilities
  - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
  - Improve function calls parsing with fallback to llama.cpp builder pattern
  - Add string utility functions (starts_with, ends_with, find_partial_stop)
  - Update README with function calls testing instructions
  - Enhance Kimi K2 parser and function calls documentation
  - Add comprehensive test suite for function calls
  - Update CMakeLists.txt and Makefile for new components
* Enhance function calling with unified streaming and parser improvements
  - Fix streaming content cleanup to prevent function syntax in output
  - Unify content extraction patterns with llama.cpp approach
  - Improve Kimi K2 parser robustness and partial content handling
  - Add comprehensive test coverage for function call scenarios
  - Optimize chat message parsing and diff computation
* Replace hardcoded values in kimi_k2_parser.hpp with named constants
  - Add compile-time constants for all token format markers
  - Add compile-time constants for XML format markers
  - Add compile-time constants for simple format patterns
  - Replace all hardcoded string literals with named constants
  - Use compile-time length calculation to avoid manual counting
  - Improve maintainability and reduce magic numbers throughout parser
* Fix duplicate common_chat_parse definition
  - Remove duplicate implementation from chat-parser.cpp
  - Keep single implementation in chat.cpp following llama.cpp patterns
  - Resolves linker error: multiple definition of common_chat_parse
* Fix JSON assertion failure in function call parsing
  - Add proper validation that 'function' field is an object before accessing nested keys
  - Handle missing 'arguments' field gracefully with default "{}"
  - Prevents crash when parsing malformed tool call JSON structures
* Add comprehensive Qwen3 XML tool calling support with unit tests
  - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
  - Add model detection and routing for Qwen3 vs Kimi-K2 formats
  - Create 8 comprehensive unit tests covering parsing, streaming, error handling
  - Fix token format cleaning bug in kimi_k2_parser.hpp processing order
  - Remove progressive parsing code and related utilities
  - Add tool injection support for Qwen3 format in server utils
* Add DeepSeek R1 function calling support with comprehensive unit tests
  - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
  - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
  - Update function_calls.hpp with DeepSeek R1 integration and content extraction
  - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
  - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
  - Port exact implementation patterns from original llama.cpp for compatibility
  Key features:
  - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜>
  - Reasoning content extraction from <think>...</think> tags
  - Multiple tool calls support with separate call blocks
  - Model detection for deepseek-r1, deepseek_r1 naming patterns
  - Integration with incremental parsing and streaming support
* Add partial parsing support for JSON and regex
  - json-partial.h/cpp: JSON partial parsing functionality
  - regex-partial.h/cpp: Regex partial parsing functionality
* Add format_chat integration tests for Qwen3 tool injection
  - Add test_qwen3_format_chat_integration() to validate tool injection pipeline
  - Test tool injection conditions and system message enhancement
  - Verify JSON formatting and anti-preamble instructions
  - Add comprehensive test documentation
  Tests confirm tool injection works correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration.
* Fix Qwen3 tool call parsing - pass model name to parser
  Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array.
* Fix non-streaming path to use model-specific parsing
  Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency.
* Update Qwen3 function call handling in server and tests
  - Enhanced server function call detection and response formatting
  - Improved test coverage for Qwen3 tool call scenarios
  - Refined XML parsing for better tool execution support
* Add DeepSeek-R1 function call parsing support
  Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
  - Format 1: Standard function call syntax (already supported)
  - Format 2: Alternative function call patterns (already supported)
  - Format 3: Tools array format - function\n```json\n{"tools": [...]}
  - Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>
  Key changes:
  - Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
  - Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
  - Integrated both parsers into exception handling chain for robust fallback
  - Added comprehensive TDD test coverage for all formats
  - Anonymized all confidential information while preserving functionality
  Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls but server failed to parse them correctly.
* Update function_calls.md documentation for DeepSeek-R1 Format 4
  - Added Format 4 (XML wrapped) documentation with examples
  - Updated implementation notes with correct parser order (3→4→1→2)
  - Marked all DeepSeek-R1 formats as working (July 2025 update)
  - Updated test status for Format 3 and 4 as passing
  - Added parse_deepseek_r1_xml_wrapped() function reference
  - Corrected implementation file line numbers
* Fix merge conflict in test-function-calls.cpp
  - Removed incomplete merge conflict marker from line 3027
  - Ensured all tests compile and pass successfully
  - All DeepSeek-R1 formats (1-4) working correctly
  - All streaming and content cleaning tests passing
* Fix DeepSeek R1 parsing issue with responses wrapped in think tags
  Restore missing consume_rest() call from working PR #648 implementation. When responses don't contain tool calls, remaining content after reasoning parsing must be preserved as displayable content. Fixes issue where entire responses wrapped in <think> tags resulted in empty content output.
* Implement proper reasoning handling following original llama.cpp patterns
  - Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
  - Update try_parse_reasoning to match original llama.cpp logic exactly
  - Add TDD test case with reasoning_in_content=true for DeepSeek R1
  - Following TDD: test should now pass with proper syntax configuration
  Based on original llama.cpp implementation patterns.
* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue
  ✅ Test passes with reasoning_in_content=true configuration
  - Content properly preserved: '<think>content</think>' displays fully
  - Reasoning field empty as expected
  - Following TDD: test-first approach validates the fix
  Next: Update server to automatically apply this configuration.
* Complete server integration fix for DeepSeek R1 thinking tag termination
  - Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
  - Fixes issue where responses wrapped in <think> tags appear empty to users
* Add TDD test case for DeepSeek R1 thinking tag termination issue
  - Test reproduces the exact failure scenario reported by user
  - Validates that reasoning_in_content=true fixes the issue
  - Demonstrates empty content problem and working solution
* Add remaining TDD test changes for DeepSeek R1 thinking tag fix
* Add debug output after upstream merge
* Remove temporary benchmark and debug files
  - Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
  - Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
* Port cpu moe options from mainline (#672)
* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines
* maxfp4: CUDA dequantize
* mxfp4: CUDA GEMV
* mxfp4: CUDA MMQ
* mxfp4: minor CUDA tweaks

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
Co-authored-by: Anton Sokolchenko <[email protected]>
Co-authored-by: Parsa <[email protected]>

TheLegendOfKitty added 2 commits August 5, 2025 18:02

Port cpu moe options from mainline

e6e8e16

Use strdup and int32_t to follow coding guidelines

841db22

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Aug 6, 2025

Port cpu moe options from mainline ikawrakow#672

6228689

Co-Authored-By: Parsa <[email protected]>

ikawrakow approved these changes Aug 8, 2025


ikawrakow merged commit 293f4aa into ikawrakow:main Aug 8, 2025

ikawrakow pushed a commit that referenced this pull request Aug 8, 2025

Port cpu moe options from mainline (#672)

fd8384e

* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines
