Conversation

@DajanaV (Contributor) commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16946

Adds chat support for MiniMax M2, together with tool calling and simple (non-interleaved) reasoning.

Uses the fixed Unsloth template (https://huggingface.co/unsloth/MiniMax-M2-GGUF).

Includes the upstream minja fix: google/minja#87
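
For illustration, a minimal, self-contained C++ sketch of what non-interleaved parsing of a completion into reasoning, visible content, and tool calls can look like. The tag names (`<think>`, `<tool_call>`), struct, and function names are placeholders invented for this example; the actual PR goes through llama.cpp's chat-format machinery in common/chat.cpp and the tokens defined by the Unsloth template, which may differ.

```cpp
// Hypothetical sketch only: split a raw completion into reasoning, content, and tool calls.
// Tag names are placeholders; the real template tokens and parsing code may differ.
#include <iostream>
#include <optional>
#include <string>
#include <vector>

struct parsed_msg {
    std::string reasoning;               // text inside the reasoning block
    std::string content;                 // visible assistant text
    std::vector<std::string> tool_calls; // raw tool-call payloads (e.g. JSON)
};

// Extract the first block delimited by open/close tags and erase it from `text`.
static std::optional<std::string> take_block(std::string & text,
                                             const std::string & open,
                                             const std::string & close) {
    const auto b = text.find(open);
    if (b == std::string::npos) return std::nullopt;
    const auto e = text.find(close, b + open.size());
    if (e == std::string::npos) return std::nullopt;
    std::string inner = text.substr(b + open.size(), e - (b + open.size()));
    text.erase(b, e + close.size() - b);
    return inner;
}

// Non-interleaved parsing: one leading reasoning block, then content, then tool calls.
static parsed_msg parse_response(std::string text) {
    parsed_msg out;
    if (auto r = take_block(text, "<think>", "</think>")) {
        out.reasoning = *r;
    }
    while (auto c = take_block(text, "<tool_call>", "</tool_call>")) {
        out.tool_calls.push_back(*c);
    }
    out.content = text;
    return out;
}

int main() {
    const std::string raw =
        "<think>The user wants the weather.</think>"
        "Let me check that for you."
        "<tool_call>{\"name\":\"get_weather\",\"arguments\":{\"city\":\"Berlin\"}}</tool_call>";
    const parsed_msg msg = parse_response(raw);
    std::cout << "reasoning : " << msg.reasoning << "\n"
              << "content   : " << msg.content   << "\n"
              << "tool calls: " << msg.tool_calls.size() << "\n";
}
```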

@loci-agentic-ai commented

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of PR #83, which adds MiniMax M2 chat support, shows no measurable performance impact on core inference functions. The abort@GLIBC_2.17@plt function showed a 0% change in Response Time (7 ns baseline vs. 7 ns current), indicating stable PLT stub performance.

Key Findings

Performance Metrics:

  • Highest percentage change: 0% in Response Time for abort@GLIBC_2.17@plt (7 ns)
  • Core function impact: No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)
  • Tokens per second: No impact expected, since the core tokenization and inference functions remain unchanged

Power Consumption Analysis:

  • Significant reductions: Power consumption reported as eliminated (a 100% reduction) in five binaries:
    • libllama.so: 280,667 nJ → 0 nJ
    • libmtmd.so: 213,079 nJ → 0 nJ
    • llama-cvector-generator: 314,116 nJ → 0 nJ
    • llama-run: 266,867 nJ → 0 nJ
    • llama-tts: 322,783 nJ → 0 nJ
  • Stable components: Core GGML libraries maintain consistent power consumption

Flame Graph & CFG Analysis:

  • Identical structure: CFG shows byte-for-byte identical assembly code across versions
  • No branching changes: Same 4-instruction PLT sequence with identical memory access patterns
  • Stable execution: 7 ns execution time confirms unchanged dynamic linking overhead

Code Review Insights:

  • New functionality: 175 lines added for MiniMax M2 chat format support
  • Implementation scope: Changes limited to the chat processing system (common/chat.cpp)
  • Performance considerations: New XML parsing and grammar generation may add overhead to chat processing, but they do not affect the core inference pipeline (see the sketch after this list)
  • Architecture: Clean integration following existing patterns without modifying core LLM inference functions
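
To make the chat-processing point concrete, here is a hypothetical, self-contained C++ microbenchmark sketch of how one might confirm that the new parsing is a per-message, host-side cost that runs outside the decode loop. The scan_tags function is a stand-in invented for this example, not the PR's actual parser, and the message size and iteration count are arbitrary assumptions.

```cpp
// Hypothetical microbenchmark: chat-format parsing is a per-message, host-side cost.
// It runs once per assistant message, outside llama_decode(), so tokens/s is unaffected.
#include <chrono>
#include <cstdio>
#include <string>

// Stand-in for the new chat parsing step (tag scanning over the raw completion).
static size_t scan_tags(const std::string & text) {
    size_t n = 0;
    for (size_t pos = text.find('<'); pos != std::string::npos; pos = text.find('<', pos + 1)) {
        ++n;
    }
    return n;
}

int main() {
    const std::string msg(4096, 'x'); // stand-in for a typical-sized assistant message
    const int iters = 10000;

    const auto t0 = std::chrono::steady_clock::now();
    size_t sink = 0;
    for (int i = 0; i < iters; ++i) {
        sink += scan_tags(msg);
    }
    const auto t1 = std::chrono::steady_clock::now();

    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("avg parse cost: %.3f us per message (sink=%zu)\n", us, sink);
}
```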

Conclusion:
The changes represent architectural restructuring of chat components rather than core performance modifications. The reported 100% power reduction in specific binaries points to build-configuration changes or component removal rather than any real runtime effect. Core inference performance remains unaffected.

@DajanaV force-pushed the main branch 20 times, most recently from 0eeb29b to 5714a80 on November 7, 2025 at 19:07
@DajanaV force-pushed the main branch 30 times, most recently from 39290d7 to 2742f63 on November 16, 2025 at 08:09