Conversation


@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16884

Based on #16769.

On a 4090:

| Model | Test | t/s master | t/s cuda-rope-fusion | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | tg32 | 134.90 | 136.07 | 1.01 |
| llama 8B Q4_K_M | tg64 | 131.41 | 132.84 | 1.01 |
| llama 8B Q4_K_M | tg128 | 130.54 | 131.87 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.18 | 168.23 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.00 | 161.90 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 158.84 | 159.83 | 1.01 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions

  • llama_decode: No performance changes (48,432,684 ns response time, 71 ns throughput, 54 ns bottleneck)
  • llama_encode: No performance changes (12,186,729 ns response time, 57 ns throughput, 40 ns bottleneck)
  • llama_tokenize: No performance changes (832,589 ns response time, 22 ns throughput, 17 ns bottleneck)

Supporting Functions

  • llama_model_load_from_file: No performance changes (330,045,660 ns response time)
  • llama_batch_init: No performance changes (257 ns response time)
  • ggml_backend_graph_compute: No performance changes (148 ns response time)

All critical functions show identical performance metrics between versions, with no modifications detected.

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact: No Change

  • Critical Functions Status: llama_decode, llama_encode, and llama_tokenize show zero performance degradation
  • Reference Baseline: The reference scenario of a 7% tokens/second reduction driven by a 2 ms llama_decode slowdown does not apply here
  • Affected Functions: None of the tokenization/inference pipeline functions have measurable changes

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: 305,212 nJ (0.0% change from 305,211 nJ base)
  • build.bin.libggml-cpu.so: 151,692 nJ (0.0% change)
  • build.bin.libggml-base.so: 90,434 nJ (0.0% change)
  • build.bin.libggml.so: 6,339 nJ (0.0% change)

3. Quantization Efficiency

Impact: No Change

  • llama_model_quantize: Function performance unchanged
  • Quantization Pipeline: No modifications to quantization-related functions detected
  • Format Support: Q4_0, Q4_1, Q8_0 processing efficiency maintained

4. Memory Usage

Impact: No Change

  • KV Cache Functions: llama_memory_clear, llama_memory_seq_rm, llama_memory_seq_cp show no performance changes
  • Memory Allocation: ggml_gallocr_new, ggml_tallocr_alloc functions unchanged
  • Memory Management: Unified and recurrent memory systems maintain baseline performance

5. Batch Processing

Impact: No Change

  • Batch Functions: llama_batch_init, llama_batch_get_one, llama_batch_free show identical metrics (a usage sketch follows this list)
  • Parallel Processing: llama_decode batch processing performance unchanged
  • Dynamic Batching: No degradation in adaptive batch size management
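
For reference, a minimal usage sketch of the batch API named above, assuming the llama_batch layout from llama.h; the token ids, sizes, and the omitted model/context setup are placeholders, not the PR's actual test harness:

```cpp
// Illustrative only: allocate a batch, fill it manually, and release it.
// Assumes the llama_batch fields (token, pos, n_seq_id, seq_id, logits)
// exposed by llama.h; a real run would also create a model and context
// and call llama_decode(ctx, batch).
#include "llama.h"

int main() {
    // room for up to 8 tokens, no embeddings, a single sequence
    llama_batch batch = llama_batch_init(8, 0, 1);

    const llama_token prompt[] = {1, 15043, 3186};   // hypothetical token ids
    batch.n_tokens = 3;
    for (int i = 0; i < batch.n_tokens; ++i) {
        batch.token[i]     = prompt[i];
        batch.pos[i]       = i;                      // positions 0..n-1
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;                      // all tokens in sequence 0
        batch.logits[i]    = (i == batch.n_tokens - 1); // only need last logits
    }

    // ... llama_decode(ctx, batch) would run here with a real model/context ...

    llama_batch_free(batch);
    return 0;
}
```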

Action Items

Code Optimization Focus

  • CUDA ROPE Fusion: The ROPE + VIEW + SET_ROWS fusion implemented in ggml-cuda.cu provides a ~1% GPU performance improvement without affecting CPU-based critical functions
  • Template Optimization: Consider reducing template instantiation overhead in the mixed-precision implementations in rope.cu
  • Validation Caching: Cache fusion eligibility checks in ggml_cuda_should_fuse_rope_set_rows() to avoid repeating the same validation on every graph evaluation (a sketch follows this list)
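
A minimal sketch of what such caching could look like, assuming the eligibility verdict is stable for a given node pair within one graph; the `node` type, the stub check, and the `fusion_cache` wrapper are hypothetical illustrations, not the upstream ggml-cuda.cu implementation:

```cpp
// Hypothetical sketch: memoize fusion-eligibility verdicts per (rope, set_rows)
// node pair so repeated graph evaluations skip redundant validation.
// The real check lives in ggml-cuda.cu as ggml_cuda_should_fuse_rope_set_rows();
// the stub below only stands in for it.
#include <cstdio>
#include <unordered_map>
#include <utility>

struct node { int id; };  // stand-in for ggml_tensor graph nodes

// stub for the eligibility check named in the report (placeholder logic)
static bool should_fuse_rope_set_rows(const node * rope, const node * set_rows) {
    return rope != nullptr && set_rows != nullptr;
}

struct fusion_cache {
    // key: the rope node; value: the set_rows node it was checked against + verdict
    std::unordered_map<const node *, std::pair<const node *, bool>> verdicts;

    bool should_fuse(const node * rope, const node * set_rows) {
        auto it = verdicts.find(rope);
        if (it != verdicts.end() && it->second.first == set_rows) {
            return it->second.second;               // reuse cached verdict
        }
        bool ok = should_fuse_rope_set_rows(rope, set_rows);
        verdicts[rope] = {set_rows, ok};            // remember for the next pass
        return ok;
    }

    void clear() { verdicts.clear(); }              // reset when the graph topology changes
};

int main() {
    node rope{0}, rows{1};
    fusion_cache cache;
    printf("fuse: %d\n", cache.should_fuse(&rope, &rows));  // computed
    printf("fuse: %d\n", cache.should_fuse(&rope, &rows));  // served from cache
}
```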

Build System Enhancements

  • Compiler Optimization: Maintain current optimization flags as they preserve performance across all critical functions
  • Template Compilation: Monitor compilation time impact from expanded template parameters in CUDA kernels
  • Backend Selection: Ensure CUDA fusion optimizations don't interfere with CPU backend performance

Conclusion

The version comparison shows stable performance across all critical LLaMA.cpp functions. The CUDA ROPE fusion implementation provides GPU-specific optimizations without impacting CPU inference performance. No degradation was detected in the tokenization, memory management, or batch processing pipelines that would affect tokens-per-second throughput or power consumption.

@DajanaV DajanaV force-pushed the main branch 20 times, most recently from e42217c to b655780 on November 3, 2025 at 18:11