@DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#16333

This is related to the PR ggml-org/llama.cpp#16239

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of llama.cpp version 7f3d9888 compared to baseline 5ce9b30a reveals minimal performance impact from PR #123, which enhances ARM CPU detection in the build system. The changes are confined to CMake configuration files and do not modify core inference functions.

Key Findings

Performance Metrics:
• Highest Response Time change: llm_graph_input_out_ids::can_reuse() in build.bin.libllama.so increased by 0.096% (+0.063 ns, from 65.10 ns to 65.16 ns)
• Highest Throughput change: std::_Optional_base constructor in build.bin.llama-cvector-generator improved by 0.17% (-0.04 ns, from 23.56 ns to 23.52 ns)

Core Function Impact:
The changes do not affect critical inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second performance. The modified functions are utility components for graph optimization and JSON processing, not part of the main inference pipeline.

Inference Performance:
No measurable impact on tokens per second expected. The affected functions are not in the tokenization or inference critical path. Core functions like llama_decode() show no performance changes.

Power Consumption:
Negligible changes across all 15 binaries with total consumption remaining constant at ~1.7 million nanojoules. Largest change: -0.61 nJ in build.bin.libllama.so (effectively zero impact).

Technical Analysis:
• Flame Graph: llm_graph_input_out_ids::can_reuse() shows single-frame execution with 100% self-time, indicating highly optimized inline operations
• CFG Comparison: Identical assembly code between versions confirms the 0.096% regression stems from external factors (binary layout, instruction cache alignment) rather than algorithmic changes
• Code Review: PR #123 improves ARM CPU detection logic with better -march/-mcpu flag handling and enhanced GCC compatibility
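
The detection approach referenced above can be illustrated with a minimal CMake sketch. This is an assumption-laden illustration, not the actual change in ggml-org/llama.cpp#16333: the idea is to ask the compiler what `-mcpu=native` resolves to and forward the concrete CPU name, which keeps builds reproducible and works with GCC versions that print resolved target options via `-Q --help=target`.

```cmake
# Hedged sketch (not the PR's actual code): resolve -mcpu=native to a
# concrete CPU name on ARM targets and pass it through explicitly.
if(CMAKE_SYSTEM_PROCESSOR MATCHES "arm|aarch64")
    # GCC and Clang print the resolved target options with -Q --help=target.
    execute_process(
        COMMAND ${CMAKE_C_COMPILER} -mcpu=native -Q --help=target
        OUTPUT_VARIABLE ARM_TARGET_HELP
        ERROR_QUIET
    )
    # Extract the concrete CPU name (e.g. "cortex-a76") from the help output.
    string(REGEX MATCH "-mcpu=[ \t]*([a-zA-Z0-9.+-]+)" _match "${ARM_TARGET_HELP}")
    if(CMAKE_MATCH_1)
        message(STATUS "Detected ARM CPU: ${CMAKE_MATCH_1}")
        add_compile_options(-mcpu=${CMAKE_MATCH_1})
    endif()
endif()
```

Because this only touches compiler-flag selection at configure time, it is consistent with the report's finding that generated assembly for the measured functions is identical between versions.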

Conclusion:
The performance variations are within measurement noise levels and represent build system improvements rather than functional regressions. No actionable performance optimizations required.

@DajanaV DajanaV force-pushed the main branch 19 times, most recently from aa2fc28 to 0ad40ce Compare November 9, 2025 17:06
@DajanaV DajanaV force-pushed the upstream-PR16333-branch_angt-inspect-march-and-mcpu-to-found-the-cpu branch from 267f8d5 to 475b1d3 Compare November 10, 2025 08:41
@DajanaV DajanaV force-pushed the main branch 6 times, most recently from 6aa5dc2 to 81cedf2 Compare November 10, 2025 16:10
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from a9fcc24 to ea62cd5 Compare December 10, 2025 00:37