Conversation


@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16884

Based on #16769.

On a 4090:

| Model | Test | t/s master | t/s cuda-rope-fusion | Speedup |
| --- | --- | --- | --- | --- |
| llama 8B Q4_K_M | tg32 | 134.90 | 136.07 | 1.01 |
| llama 8B Q4_K_M | tg64 | 131.41 | 132.84 | 1.01 |
| llama 8B Q4_K_M | tg128 | 130.54 | 131.87 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg32 | 167.18 | 168.23 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg64 | 161.00 | 161.90 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 158.84 | 159.83 | 1.01 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions

  • llama_decode: No performance changes (48,432,684 ns response time, 71 ns throughput, 54 ns bottleneck)
  • llama_encode: No performance changes (12,186,729 ns response time, 57 ns throughput, 40 ns bottleneck)
  • llama_tokenize: No performance changes (832,589 ns response time, 22 ns throughput, 17 ns bottleneck)

Supporting Functions

  • llama_model_load_from_file: No performance changes (330,045,660 ns response time)
  • llama_batch_init: No performance changes (257 ns response time)
  • ggml_backend_graph_compute: No performance changes (148 ns response time)

All critical functions show identical performance metrics between versions, with no modifications detected.

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact: No Change

  • Critical Functions Status: llama_decode, llama_encode, and llama_tokenize show zero performance degradation
  • Reference Baseline: The reference scenario of a 7% tokens/second reduction driven by a 2 ms llama_decode slowdown does not apply here
  • Affected Functions: None of the tokenization/inference pipeline functions have measurable changes

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: 305,212 nJ (0.0% change from 305,211 nJ base)
  • build.bin.libggml-cpu.so: 151,692 nJ (0.0% change)
  • build.bin.libggml-base.so: 90,434 nJ (0.0% change)
  • build.bin.libggml.so: 6,339 nJ (0.0% change)

3. Quantization Efficiency

Impact: No Change

  • llama_model_quantize: Function performance unchanged
  • Quantization Pipeline: No modifications to quantization-related functions detected
  • Format Support: Q4_0, Q4_1, Q8_0 processing efficiency maintained

4. Memory Usage

Impact: No Change

  • KV Cache Functions: llama_memory_clear, llama_memory_seq_rm, llama_memory_seq_cp show no performance changes
  • Memory Allocation: ggml_gallocr_new, ggml_tallocr_alloc functions unchanged
  • Memory Management: Unified and recurrent memory systems maintain baseline performance

5. Batch Processing

Impact: No Change

  • Batch Functions: llama_batch_init, llama_batch_get_one, llama_batch_free show identical metrics (a usage sketch follows this list)
  • Parallel Processing: llama_decode batch processing performance unchanged
  • Dynamic Batching: No degradation in adaptive batch size management
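
For reference, a minimal usage sketch of the batch API named above, assuming the llama_batch layout from llama.h; the token ids, sizes, and the omitted model/context setup are placeholders, not the PR's actual test harness:

```cpp
// Illustrative only: allocate a batch, fill it manually, and release it.
// Assumes the llama_batch fields (token, pos, n_seq_id, seq_id, logits)
// exposed by llama.h; a real run would also create a model and context
// and call llama_decode(ctx, batch).
#include "llama.h"

int main() {
    // room for up to 8 tokens, no embeddings, a single sequence
    llama_batch batch = llama_batch_init(8, 0, 1);

    const llama_token prompt[] = {1, 15043, 3186};   // hypothetical token ids
    batch.n_tokens = 3;
    for (int i = 0; i < batch.n_tokens; ++i) {
        batch.token[i]     = prompt[i];
        batch.pos[i]       = i;                      // positions 0..n-1
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;                      // all tokens in sequence 0
        batch.logits[i]    = (i == batch.n_tokens - 1); // only need last logits
    }

    // ... llama_decode(ctx, batch) would run here with a real model/context ...

    llama_batch_free(batch);
    return 0;
}
```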

Action Items

Code Optimization Focus

  • CUDA ROPE Fusion: The ROPE + VIEW + SET_ROWS fusion implemented in ggml-cuda.cu provides a ~1% GPU performance improvement without affecting CPU-based critical functions
  • Template Optimization: Consider reducing template instantiation overhead in the mixed-precision implementations in rope.cu
  • Validation Caching: Cache fusion eligibility checks in ggml_cuda_should_fuse_rope_set_rows() to avoid repeating the same validation on every graph evaluation (a sketch follows this list)
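
A minimal sketch of what such caching could look like, assuming the eligibility verdict is stable for a given node pair within one graph; the `node` type, the stub check, and the `fusion_cache` wrapper are hypothetical illustrations, not the upstream ggml-cuda.cu implementation:

```cpp
// Hypothetical sketch: memoize fusion-eligibility verdicts per (rope, set_rows)
// node pair so repeated graph evaluations skip redundant validation.
// The real check lives in ggml-cuda.cu as ggml_cuda_should_fuse_rope_set_rows();
// the stub below only stands in for it.
#include <cstdio>
#include <unordered_map>
#include <utility>

struct node { int id; };  // stand-in for ggml_tensor graph nodes

// stub for the eligibility check named in the report (placeholder logic)
static bool should_fuse_rope_set_rows(const node * rope, const node * set_rows) {
    return rope != nullptr && set_rows != nullptr;
}

struct fusion_cache {
    // key: the rope node; value: the set_rows node it was checked against + verdict
    std::unordered_map<const node *, std::pair<const node *, bool>> verdicts;

    bool should_fuse(const node * rope, const node * set_rows) {
        auto it = verdicts.find(rope);
        if (it != verdicts.end() && it->second.first == set_rows) {
            return it->second.second;               // reuse cached verdict
        }
        bool ok = should_fuse_rope_set_rows(rope, set_rows);
        verdicts[rope] = {set_rows, ok};            // remember for the next pass
        return ok;
    }

    void clear() { verdicts.clear(); }              // reset when the graph topology changes
};

int main() {
    node rope{0}, rows{1};
    fusion_cache cache;
    printf("fuse: %d\n", cache.should_fuse(&rope, &rows));  // computed
    printf("fuse: %d\n", cache.should_fuse(&rope, &rows));  // served from cache
}
```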

Build System Enhancements

  • Compiler Optimization: Maintain current optimization flags as they preserve performance across all critical functions
  • Template Compilation: Monitor compilation time impact from expanded template parameters in CUDA kernels
  • Backend Selection: Ensure CUDA fusion optimizations don't interfere with CPU backend performance

Conclusion

The version comparison shows stable performance across all critical LLaMA.cpp functions. The CUDA ROPE fusion implementation provides GPU-specific optimizations without impacting CPU inference performance. No degradation was detected in the tokenization, memory management, or batch processing pipelines that would affect tokens-per-second throughput or power consumption.

@DajanaV DajanaV force-pushed the main branch 20 times, most recently from e42217c to b655780 on November 3, 2025 at 18:11