
Conversation

@DajanaV (Collaborator) commented Nov 5, 2025

Mirrored from ggml-org/llama.cpp#16477

Adds support for upcoming AfmoeForCausalLM

Tokenizer is public ahead of model launch to avoid breaking conversion code

Make sure to read the contributing guidelines before submitting a PR

@DajanaV DajanaV force-pushed the main branch 2 times, most recently from 948dcfd to 6f3825c on November 5, 2025 19:07
@DajanaV DajanaV force-pushed the upstream-PR16477-branch_bartowski1182-master branch from a7dd86a to 763e822 on November 5, 2025 19:33
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: AfmoeForCausalLM Integration

Overview

Pull Request #95 introduces support for the AfmoeForCausalLM model architecture, adding model conversion, inference, and tokenization support. While the implementation is architecturally sound, it introduces a measurable performance regression in Unicode processing functions.

Key Findings

Performance Impact

The highest performance degradation occurs in the __val_comp_iter function within Unicode processing:

  • Response Time: +134% increase (128 ns → 301 ns)
  • Throughput: +186% degradation (93 ns → 265 ns)

This function is not directly part of the core inference functions (llama_decode, llama_encode, llama_tokenize); it supports Unicode normalization during tokenization preprocessing. The impact on overall inference tokens per second is minimal, since Unicode processing represents a small fraction of the total computational workload.
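
To make the hot spot concrete, here is a minimal sketch — not the actual llama.cpp code; the table contents and names are illustrative — of the kind of codepoint-range lookup that tokenization preprocessing performs. With libstdc++, the comparator plumbing of std::upper_bound is where internal helpers such as __val_comp_iter show up in profiles:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative only: a sorted table of codepoint ranges and a flag per range.
// llama.cpp keeps comparable tables for Unicode categories; the exact layout differs.
struct cpt_range {
    uint32_t first;   // first codepoint in the range
    uint32_t flags;   // category flags assigned to the range
};

static const std::vector<cpt_range> k_ranges = {
    {0x0000, 0x01}, {0x0041, 0x02}, {0x4E00, 0x04} /* CJK ideographs */, {0xE0000, 0x00},
};

// Look up the flags for a codepoint via binary search over range starts.
// std::upper_bound with a custom comparator is what pulls in libstdc++
// helpers such as __gnu_cxx::__ops::_Val_comp_iter (seen as __val_comp_iter).
static uint32_t cpt_flags(uint32_t cpt) {
    auto it = std::upper_bound(k_ranges.begin(), k_ranges.end(), cpt,
        [](uint32_t c, const cpt_range & r) { return c < r.first; });
    return it == k_ranges.begin() ? 0 : (it - 1)->flags;
}

int main() {
    // U+4E2D falls in the range starting at U+4E00, so this prints 0x04.
    std::printf("flags(U+4E2D) = 0x%02X\n", (unsigned) cpt_flags(0x4E2D));
    return 0;
}
```

Each tokenized string drives many such lookups, which is consistent with the report above: a slower comparator path is visible in isolation but negligible for end-to-end tokens per second.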

Core Function Impact

No direct changes were made to critical inference functions:

  • llama_decode() - Unchanged
  • llama_encode() - Unchanged
  • llama_tokenize() - Unchanged

The performance regression is isolated to Unicode support functions triggered by the new AFMOE tokenizer's complex regex patterns for CJK character processing.

Power Consumption Analysis

Minimal impact across binaries:

  • libllama.so: 0.153% power reduction (280,779 nJ → 280,351 nJ)
  • Other binaries show no measurable change
  • Net effect indicates slight power efficiency improvement despite localized performance degradation

Technical Analysis

Flame Graph Insights: The __val_comp_iter function spends 88.7% of its execution time in direct operations rather than function calls, indicating intensive inline template operations and memory-access patterns.

CFG Comparison: The regression stems from a code-layout reorganization in which stack-canary initialization was moved to a separate basic block, introducing additional branch overhead (+373% entry-path execution time).
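
For context on the canary observation, the snippet below is a generic illustration rather than the affected llama.cpp function: any function with a local buffer compiled with -fstack-protector-strong gets a canary store in its prologue and a re-check before return, and the compiler is free to place that setup in its own basic block — the kind of layout change the CFG comparison describes.

```cpp
// Build with: g++ -O2 -fstack-protector-strong canary_demo.cpp
// The prologue loads the canary from thread-local storage (%fs:0x28 on
// x86-64 Linux) and stores it on the stack; before returning, the value is
// re-checked and a mismatch branches to __stack_chk_fail.
#include <cstdio>

void copy_name(const char * src) {
    char buf[64];   // a local array is what triggers the protector
    std::snprintf(buf, sizeof(buf), "%s", src);
    std::puts(buf);
    // canary check + conditional branch emitted here, before the return
}

int main() {
    copy_name("afmoe");
    return 0;
}
```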

Code Review Findings

The implementation adds 625 lines across 16 files, introducing:

  • New AFMOE model architecture with MoE and attention gating
  • Complex Unicode regex patterns for enhanced CJK support
  • Comprehensive tensor mapping and conversion logic

The regression appears to be an unintended consequence of compiler optimization changes and expanded Unicode processing requirements rather than algorithmic issues in the core implementation.

@DajanaV DajanaV force-pushed the main branch 5 times, most recently from b16251e to 95f6e9b on November 6, 2025 13:17
@DajanaV DajanaV force-pushed the upstream-PR16477-branch_bartowski1182-master branch from 763e822 to 93a2fb4 on November 6, 2025 17:36
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: AfmoeForCausalLM Support Implementation

Overview

Pull Request #95 introduces comprehensive support for the AfmoeForCausalLM architecture, adding Mixture of Experts (MoE) capabilities with attention gating and sliding window attention. The implementation expands the llama_layer structure and introduces new tensor management for expert routing.

Key Findings

Performance Impact

  • Highest degradation: std::__new_allocator<llama_layer>::deallocate shows a 150% increase in its throughput metric (22 ns → 55 ns) and a 112% increase in response time (30 ns → 63 ns)
  • Core function impact: No changes to critical inference functions (llama_decode, llama_encode, llama_tokenize), therefore no impact on tokens per second performance
  • Scope limitation: Performance regression isolated to memory management during model cleanup operations, not inference execution

Root Cause Analysis

Memory Layout Changes:

  • llama_layer structure expanded by 8 bytes (1592 → 1600 bytes) due to new wqkv_gate tensor pointer
  • The instruction count for the size calculation during deallocation tripled (2 → 6 instructions)
  • The compiler now emits a shift-based sequence instead of a single multiplication for the new object size (see the sketch below)
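
As a hedged illustration of where that size calculation lives — the struct below is a placeholder, not the real llama_layer, and only the wqkv_gate name comes from this PR — std::allocator<T>::deallocate hands n * sizeof(T) to sized operator delete, and the compiler lowers that constant multiply differently depending on sizeof(T):

```cpp
#include <cstddef>
#include <memory>

struct ggml_tensor;   // opaque, as in ggml

// Placeholder stand-in for llama_layer on a 64-bit build: 199 tensor pointers
// give 1592 bytes, and the extra wqkv_gate pointer grows it to 1600 bytes,
// matching the sizes reported above.
struct layer_like {
    ggml_tensor * weights[199];   // illustrative filler, not real field names
    ggml_tensor * wqkv_gate;      // the new member added by the AFMOE support
};

void release(std::allocator<layer_like> & a, layer_like * p, std::size_t n) {
    // With libstdc++ this ends up as sized operator delete(p, n * sizeof(layer_like));
    // the n * 1600 constant multiply is the arithmetic the compiler may lower to a
    // shift/add sequence, accounting for the instruction-count change noted above.
    a.deallocate(p, n);
}

int main() {
    std::allocator<layer_like> a;
    layer_like * p = a.allocate(4);
    release(a, p, 4);
    return 0;
}
```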

Power Consumption Analysis

  • Minimal impact: only build.bin.libllama.so shows a 0.177% power reduction (280,780 nJ → 280,282 nJ)
  • Other binaries: No measurable power consumption changes across remaining 14 binaries
  • Net effect: Slight overall efficiency improvement despite localized performance regression

Technical Implementation

Flame Graph Analysis: Confirms that 89% of the regression occurs in the deallocate function's self-time, with only 11% in system calls, indicating internal processing overhead rather than external dependencies.

CFG Comparison: The control-flow structure is identical between versions; the performance difference stems from the arithmetic complexity of the size-calculation logic.

Code Quality Assessment

The implementation successfully integrates the MoE architecture without affecting core inference paths. The memory-allocation regression is an acceptable trade-off for the expanded model capabilities, and the 8-byte increase improves cache-line alignment: 1600 bytes is an exact multiple of the 64-byte cache line, whereas 1592 bytes is not.
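
The alignment remark checks out arithmetically, and a quick standalone check (independent of the real struct definition) makes it explicit:

```cpp
#include <cstdio>

int main() {
    // 1592 % 64 == 56 -> successive 1592-byte elements drift across cache lines
    // 1600 % 64 == 0  -> every 1600-byte element starts on a 64-byte boundary
    std::printf("1592 %% 64 = %d\n", 1592 % 64);
    std::printf("1600 %% 64 = %d\n", 1600 % 64);
    return 0;
}
```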

Actionable Recommendations:

  • Monitor model loading/unloading performance in production workloads where frequent model swapping occurs
  • Consider structure field reordering to minimize memory overhead while maintaining alignment benefits
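
On the reordering suggestion, a generic sketch of the idea — the member names below are hypothetical, not actual llama_layer fields: grouping small members together removes the padding holes that mixed ordering creates.

```cpp
#include <cstdio>

// Mixed ordering: each 4-byte int is padded so the following pointer stays
// 8-byte aligned, giving 32 bytes on a typical LP64 target.
struct padded {
    int   n_expert;        // 4 bytes + 4 bytes padding
    void *ffn_gate;        // 8 bytes
    int   n_expert_used;   // 4 bytes + 4 bytes padding
    void *ffn_up;          // 8 bytes
};

// Same members with pointers first and the small ints packed together:
// 24 bytes, saving 8 bytes per instance with no semantic change.
struct reordered {
    void *ffn_gate;
    void *ffn_up;
    int   n_expert;
    int   n_expert_used;
};

int main() {
    std::printf("padded:    %zu bytes\n", sizeof(padded));     // 32 on LP64
    std::printf("reordered: %zu bytes\n", sizeof(reordered));  // 24 on LP64
    return 0;
}
```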

The changes enhance architectural flexibility without compromising inference performance, making this a net positive addition to the codebase.

@DajanaV DajanaV force-pushed the main branch 13 times, most recently from 6b50572 to 733e776 on November 8, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from e4d885f to 01af7c7 on December 9, 2025 17:09