
Conversation

@DajanaV (Collaborator) commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#16477

Adds support for upcoming AfmoeForCausalLM

Tokenizer is public ahead of model launch to avoid breaking conversion code

Make sure to read the contributing guidelines before submitting a PR

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 948dcfd to 6f3825c on November 5, 2025 19:07
@DajanaV DajanaV force-pushed the upstream-PR16477-branch_bartowski1182-master branch from a7dd86a to 763e822 on November 5, 2025 19:33
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: AfmoeForCausalLM Integration

Overview

Pull Request #95 introduces support for the AfmoeForCausalLM model architecture, adding model conversion, inference, and tokenization support. While the implementation is architecturally sound, it introduces a measurable performance regression in Unicode processing functions.

Key Findings

Performance Impact

The highest performance degradation occurs in the __val_comp_iter function within Unicode processing:

  • Response Time: +134% increase (128 ns → 301 ns)
  • Throughput: +186% degradation (93 ns → 265 ns)

This function is not directly part of the core inference functions (llama_decode, llama_encode, llama_tokenize); it supports Unicode normalization during tokenization preprocessing. The impact on overall inference tokens per second is minimal, since Unicode processing represents a small fraction of the total computational workload.
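
To make the hot spot concrete, here is a minimal sketch — not the actual llama.cpp code; the table contents and names are illustrative — of the kind of codepoint-range lookup that tokenization preprocessing performs. With libstdc++, the comparator plumbing of std::upper_bound is where internal helpers such as __val_comp_iter show up in profiles:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only: a sorted table of codepoint ranges and a flag per range.
// llama.cpp keeps comparable tables for Unicode categories; the exact layout differs.
struct cpt_range {
    uint32_t first;   // first codepoint in the range
    uint32_t flags;   // category flags assigned to the range
};

static const std::vector<cpt_range> k_ranges = {
    {0x0000, 0x01}, {0x0041, 0x02}, {0x4E00, 0x04} /* CJK ideographs */, {0xE0000, 0x00},
};

// Look up the flags for a codepoint via binary search over range starts.
// std::upper_bound with a custom comparator is what pulls in libstdc++
// helpers such as __gnu_cxx::__ops::_Val_comp_iter (seen as __val_comp_iter).
static uint32_t cpt_flags(uint32_t cpt) {
    auto it = std::upper_bound(k_ranges.begin(), k_ranges.end(), cpt,
        [](uint32_t c, const cpt_range & r) { return c < r.first; });
    return it == k_ranges.begin() ? 0 : (it - 1)->flags;
}

int main() {
    // U+4E2D falls in the range starting at U+4E00, so this prints 0x04.
    std::printf("flags(U+4E2D) = 0x%02X\n", (unsigned) cpt_flags(0x4E2D));
    return 0;
}
```

Each tokenized string drives many such lookups, which is consistent with the report above: a slower comparator path is visible in isolation but negligible for end-to-end tokens per second.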

Core Function Impact

No direct changes were made to critical inference functions:

  • llama_decode() - Unchanged
  • llama_encode() - Unchanged
  • llama_tokenize() - Unchanged

The performance regression is isolated to Unicode support functions triggered by the new AFMOE tokenizer's complex regex patterns for CJK character processing.

Power Consumption Analysis

Minimal impact across binaries:

  • libllama.so: 0.153% power reduction (280,779 nJ → 280,351 nJ)
  • Other binaries show no measurable change
  • Net effect indicates slight power efficiency improvement despite localized performance degradation

Technical Analysis

Flame Graph Insights: The __val_comp_iter function spends 88.7% of its execution time in direct operations rather than function calls, indicating intensive inline template operations and memory-access patterns.

CFG Comparison: The regression stems from a code-layout reorganization in which stack-canary initialization was moved to a separate basic block, introducing additional branch overhead (+373% entry-path execution time).
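
For context on the canary observation, the snippet below is a generic illustration rather than the affected llama.cpp function: any function with a local buffer compiled with -fstack-protector-strong gets a canary store in its prologue and a re-check before return, and the compiler is free to place that setup in its own basic block — the kind of layout change the CFG comparison describes.

```cpp
// Build with: g++ -O2 -fstack-protector-strong canary_demo.cpp
// The prologue loads the canary from thread-local storage (%fs:0x28 on
// x86-64 Linux) and stores it on the stack; before returning, the value is
// re-checked and a mismatch branches to __stack_chk_fail.
#include <cstdio>

void copy_name(const char * src) {
    char buf[64];   // a local array is what triggers the protector
    std::snprintf(buf, sizeof(buf), "%s", src);
    std::puts(buf);
    // canary check + conditional branch emitted here, before the return
}

int main() {
    copy_name("afmoe");
    return 0;
}
```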

Code Review Findings

The implementation adds 625 lines across 16 files, introducing:

  • New AFMOE model architecture with MoE and attention gating
  • Complex Unicode regex patterns for enhanced CJK support
  • Comprehensive tensor mapping and conversion logic

The regression appears to be an unintended consequence of compiler optimization changes and expanded Unicode processing requirements rather than algorithmic issues in the core implementation.

@DajanaV DajanaV force-pushed the main branch 5 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV DajanaV force-pushed the upstream-PR16477-branch_bartowski1182-master branch from 763e822 to 93a2fb4 on November 6, 2025 17:36
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: AfmoeForCausalLM Support Implementation

Overview

Pull Request #95 introduces comprehensive support for the AfmoeForCausalLM architecture, adding Mixture of Experts (MoE) capabilities with attention gating and sliding window attention. The implementation expands the llama_layer structure and introduces new tensor management for expert routing.

Key Findings

Performance Impact

  • Highest degradation: std::__new_allocator<llama_layer>::deallocate shows a 150% increase in its throughput metric (22 ns → 55 ns) and a 112% increase in response time (30 ns → 63 ns)
  • Core function impact: No changes to critical inference functions (llama_decode, llama_encode, llama_tokenize), therefore no impact on tokens per second performance
  • Scope limitation: Performance regression isolated to memory management during model cleanup operations, not inference execution

Root Cause Analysis

Memory Layout Changes:

  • llama_layer structure expanded by 8 bytes (1592 → 1600 bytes) due to new wqkv_gate tensor pointer
  • The instruction count for the size calculation during deallocation tripled (2 → 6 instructions)
  • The compiler now emits a shift-based sequence instead of a single multiplication for the new object size (see the sketch below)
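
As a hedged illustration of where that size calculation lives — the struct below is a placeholder, not the real llama_layer, and only the wqkv_gate name comes from this PR — std::allocator<T>::deallocate hands n * sizeof(T) to sized operator delete, and the compiler lowers that constant multiply differently depending on sizeof(T):

```cpp
#include <cstddef>
#include <memory>

struct ggml_tensor;   // opaque, as in ggml

// Placeholder stand-in for llama_layer on a 64-bit build: 199 tensor pointers
// give 1592 bytes, and the extra wqkv_gate pointer grows it to 1600 bytes,
// matching the sizes reported above.
struct layer_like {
    ggml_tensor * weights[199];   // illustrative filler, not real field names
    ggml_tensor * wqkv_gate;      // the new member added by the AFMOE support
};

void release(std::allocator<layer_like> & a, layer_like * p, std::size_t n) {
    // With libstdc++ this ends up as sized operator delete(p, n * sizeof(layer_like));
    // the n * 1600 constant multiply is the arithmetic the compiler may lower to a
    // shift/add sequence, accounting for the instruction-count change noted above.
    a.deallocate(p, n);
}

int main() {
    std::allocator<layer_like> a;
    layer_like * p = a.allocate(4);
    release(a, p, 4);
    return 0;
}
```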

Power Consumption Analysis

  • Minimal impact: only build.bin.libllama.so shows a 0.177% power reduction (280,780 nJ → 280,282 nJ)
  • Other binaries: No measurable power consumption changes across remaining 14 binaries
  • Net effect: Slight overall efficiency improvement despite localized performance regression

Technical Implementation

Flame Graph Analysis: Confirms that 89% of the regression occurs in the deallocate function's self-time, with only 11% in system calls, indicating internal processing overhead rather than external dependencies.

CFG Comparison: The control-flow structure is identical between versions; the performance difference stems from the arithmetic complexity of the size-calculation logic.

Code Quality Assessment

The implementation successfully integrates the MoE architecture without affecting core inference paths. The memory-allocation regression is an acceptable trade-off for the expanded model capabilities, and the 8-byte increase improves cache-line alignment: 1600 bytes is an exact multiple of the 64-byte cache line, whereas 1592 bytes is not.
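
The alignment remark checks out arithmetically, and a quick standalone check (independent of the real struct definition) makes it explicit:

```cpp
#include <cstdio>

int main() {
    // 1592 % 64 == 56 -> successive 1592-byte elements drift across cache lines
    // 1600 % 64 == 0  -> every 1600-byte element starts on a 64-byte boundary
    std::printf("1592 %% 64 = %d\n", 1592 % 64);
    std::printf("1600 %% 64 = %d\n", 1600 % 64);
    return 0;
}
```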

Actionable Recommendations:

  • Monitor model loading/unloading performance in production workloads where frequent model swapping occurs
  • Consider structure field reordering to minimize memory overhead while maintaining alignment benefits
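
On the reordering suggestion, a generic sketch of the idea — the member names below are hypothetical, not actual llama_layer fields: grouping small members together removes the padding holes that mixed ordering creates.

```cpp
#include <cstdio>

// Mixed ordering: each 4-byte int is padded so the following pointer stays
// 8-byte aligned, giving 32 bytes on a typical LP64 target.
struct padded {
    int   n_expert;        // 4 bytes + 4 bytes padding
    void *ffn_gate;        // 8 bytes
    int   n_expert_used;   // 4 bytes + 4 bytes padding
    void *ffn_up;          // 8 bytes
};

// Same members with pointers first and the small ints packed together:
// 24 bytes, saving 8 bytes per instance with no semantic change.
struct reordered {
    void *ffn_gate;
    void *ffn_up;
    int   n_expert;
    int   n_expert_used;
};

int main() {
    std::printf("padded:    %zu bytes\n", sizeof(padded));     // 32 on LP64
    std::printf("reordered: %zu bytes\n", sizeof(reordered));  // 24 on LP64
    return 0;
}
```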

The changes enhance architectural flexibility without compromising inference performance, making this a net positive addition to the codebase.

@DajanaV DajanaV force-pushed the main branch 13 times, most recently from 6b50572 to 733e776 on November 8, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e4d885f to 01af7c7 on December 9, 2025 17:09