Conversation

@DajanaV (Collaborator) commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17081

This PR should fix ggml-org/llama.cpp#17076. It also adds a test case in test-backend-ops to capture the bug.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 425c8a15 compared to base 0797ab8c reveals minimal performance variations across the llama.cpp codebase. The changes are primarily focused on a CUDA transpose copy bug fix with negligible impact on core inference performance.

Key Findings

Performance Metrics:

  • Highest Response Time Change: llm_graph_input_out_ids::can_reuse() improved by 0.096% (65.16 ns → 65.10 ns)
  • Highest Throughput Change: std::_Optional_base constructor degraded by 0.171% (23.52 ns → 23.56 ns)
  • Core Function Impact: No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize)

Inference Performance Impact:
No impact on tokens per second is expected. The modified functions are not part of the core tokenization/inference pipeline: llm_graph_input_out_ids::can_reuse() operates on graph optimization logic, and the std::_Optional_base constructor affects memory management utilities; neither directly influences token processing throughput.

Power Consumption Analysis:
Negligible power consumption changes across all binaries (< 0.001% variation):

  • build.bin.libllama.so: -0.0003% (280,780 nJ → 280,779 nJ)
  • build.bin.llama-cvector-generator: -0.0004%
  • All other binaries show no measurable change

Flame Graph and CFG Analysis:
The llm_graph_input_out_ids::can_reuse() function exhibits identical assembly code between versions, with a simple leaf-node execution pattern (a single 65 ns block). Given the unchanged assembly, the 0.06 ns improvement is attributable to measurement or microarchitectural variation in the build environment rather than to a change in the function itself.

GitHub Code Review:
The primary change addresses a CUDA transpose copy bug (PR #120) that previously caused crashes when nb00 == nb02. The fix changes the condition from nb00 < nb02 to nb00 <= nb02 and removes a problematic assertion. This resolves runtime failures for specific tensor layouts without affecting performance of existing operations.
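To make the boundary condition concrete, here is a minimal, self-contained C++ sketch of the dispatch check described above. It assumes ggml-style byte strides (nb00 for dimension 0 of the source tensor, nb02 for dimension 2); the src_strides struct and the use_transpose_copy_path helper are hypothetical illustrations, not the actual CUDA code changed in the PR.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical stand-in for ggml's byte-stride fields: nb00 is the byte stride
// of the source tensor's dimension 0, nb02 the byte stride of dimension 2.
// Illustrative only; not the actual ggml tensor layout struct.
struct src_strides {
    size_t nb00;
    size_t nb02;
};

// Hypothetical dispatch predicate mirroring the condition described above:
// the transposed-copy path must also accept the boundary layout where
// nb00 == nb02, which the old strict comparison excluded and which then
// fell through to the assertion that the fix removes.
static bool use_transpose_copy_path(const src_strides & s) {
    // before the fix: return s.nb00 < s.nb02;   // misses nb00 == nb02
    return s.nb00 <= s.nb02;                     // after the fix
}

int main() {
    const src_strides typical  = { /*nb00=*/4,  /*nb02=*/64 }; // ordinary transposed view
    const src_strides boundary = { /*nb00=*/64, /*nb02=*/64 }; // edge case from the bug report

    std::printf("typical : %s\n", use_transpose_copy_path(typical)  ? "transpose path" : "generic path");
    std::printf("boundary: %s\n", use_transpose_copy_path(boundary) ? "transpose path" : "generic path");
    return 0;
}
```

With the relaxed comparison, the boundary layout where nb00 == nb02 is routed to the transpose-copy path instead of tripping the removed assertion.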

Conclusion:
The analysis reveals stable performance with a targeted bug fix that enhances robustness without performance regressions. No actionable performance optimizations are required as the changes maintain existing efficiency while resolving edge-case crashes.

@DajanaV force-pushed the main branch 27 times, most recently from 81cedf2 to 4c7638f on November 10, 2025 at 19:07
@loci-dev force-pushed the main branch 30 times, most recently from 10ad295 to 84f6117 on December 7, 2025 at 18:11