Conversation

@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16991

Possibly supersedes #16786.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment it only targets the Q, K, V branch. I feel this is the "correct" approach for the case where the Q, K, V tensors are of different types or do not live in the same place in memory. The downside is that this doesn't come for free and there is some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.
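For readers unfamiliar with the mechanism, the fork/join pattern on a single device looks roughly like the sketch below. This is a minimal illustration using the plain CUDA runtime API, not the actual ggml-cuda implementation; the stream handles and the `launch_projection` call are assumptions made for the example.

```cpp
// Minimal fork/join sketch for running the Q, K, V projections on separate
// CUDA streams of one GPU. Illustrative only -- not the ggml-cuda code.
#include <cuda_runtime.h>

void qkv_fork_join(cudaStream_t main_stream, cudaStream_t branch[3]) {
    cudaEvent_t fork;
    cudaEvent_t join[3];
    cudaEventCreateWithFlags(&fork, cudaEventDisableTiming);

    // Fork: each branch stream waits until the main stream reaches this point.
    cudaEventRecord(fork, main_stream);
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(branch[i], fork, 0);
        // launch_projection(i, branch[i]);  // hypothetical Q/K/V matmul launch
        cudaEventCreateWithFlags(&join[i], cudaEventDisableTiming);
        cudaEventRecord(join[i], branch[i]);
    }

    // Join: the main stream waits for all three branches before attention runs.
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(main_stream, join[i], 0);
        cudaEventDestroy(join[i]);
    }
    cudaEventDestroy(fork);
}
```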

Currently this is hidden behind an environment variable flag. To enable it, set GGML_CUDA_ENABLE_GRAPH_OPT=1.

TG performance is in line with the previous PR (2-7% gain). We leave some performance on the table because operations are not fused within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL); I couldn't find a simple enough way to enable fusion there.
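To make the missed fusion concrete, each parallel branch is conceptually a short chain of ggml nodes like the sketch below (the tensor names and the helper are assumptions for illustration). On the single-stream path, adjacent pairs such as MUL_MAT + BIAS or RMS_NORM + MUL are candidates for fused kernels; inside a parallel stream they currently run as separate launches.

```cpp
// Assumed shape of one projection branch (illustrative, not the llama.cpp
// graph-building code). Each commented pair is a fusion opportunity that is
// currently not taken when the branch runs on its own stream.
struct ggml_tensor * build_q_branch(struct ggml_context * ctx,
                                    struct ggml_tensor * cur,     // shared input activation
                                    struct ggml_tensor * wq,      // Q projection weight
                                    struct ggml_tensor * bq,      // Q projection bias
                                    struct ggml_tensor * q_norm,  // per-head norm weight
                                    float eps) {
    struct ggml_tensor * q = ggml_mul_mat(ctx, wq, cur); // MUL_MAT
    q = ggml_add(ctx, q, bq);                            // + BIAS   (unfused here)
    q = ggml_rms_norm(ctx, q, eps);                      // RMS_NORM
    q = ggml_mul(ctx, q, q_norm);                        // + MUL    (unfused here)
    return q;
}
```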

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

@DajanaV DajanaV force-pushed the main branch 28 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@DajanaV DajanaV force-pushed the main branch 2 times, most recently from ef7ca13 to c65ae84 on November 14, 2025 at 15:09
@DajanaV DajanaV force-pushed the upstream-PR16991-branch_am17an-fused-qkv-stream branch from d3a8d93 to c9b06ad on November 15, 2025 at 02:09
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

The analysis examined PR #72 implementing CUDA stream-based concurrency for Q, K, V branch parallelization in llama.cpp. The changes introduce concurrent CUDA streams to improve GPU utilization, with demonstrated 2-7% throughput gains in benchmarks.

Performance Impact Assessment

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed +0.096% (+0.063 ns)
  • Throughput Time: std::make_unique<llm_graph_input_attn_no_cache>() showed +0.111% (+0.078 ns)

Core Function Impact:
The performance changes do not affect critical inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second. The modified functions are utility functions in the graph construction layer, not the primary inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with changes below 0.001%. The core inference library (build.bin.libllama.so) shows negligible variation at 280,731 nanojoules, indicating no significant computational intensity changes.

Flame Graph and CFG Analysis:
The can_reuse() function shows identical assembly code between versions with a flat execution profile (single 65ns operation). The 0.063ns performance difference represents measurement noise rather than algorithmic changes, likely caused by binary layout shifts affecting instruction cache alignment.

Code Review Insights:
The implementation adds sophisticated CUDA stream-management infrastructure (sketched after this list):

  • New concurrent event structures for stream synchronization
  • Graph optimization engine targeting 3-branch fan-out patterns
  • Per-stream memory pool management
  • Dynamic stream switching during execution

The changes are well-architected with appropriate safeguards and demonstrate measurable performance improvements in the target use case (Q, K, V parallelization).
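As a rough illustration of the fan-out detection mentioned above, a pass over the compute graph might look like the sketch below. The helper name and the matching criterion are assumptions for this example; the actual optimizer in the PR is more involved (per-stream pools, event bookkeeping, dynamic stream switching).

```cpp
// Illustrative sketch of spotting a 3-branch fan-out (e.g. Q/K/V projections)
// in a ggml compute graph: three MUL_MAT nodes that consume the same source
// tensor and could therefore be dispatched on separate streams between a
// fork event and a join event. Not the actual ggml-cuda optimizer.
static bool is_three_way_fanout(struct ggml_cgraph * gf,
                                struct ggml_tensor * shared_src,
                                int branch_idx[3]) {
    int n_branches = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_MUL_MAT) {
            continue;
        }
        // the shared activation is the second operand of each projection matmul
        if (node->src[1] == shared_src) {
            if (n_branches < 3) {
                branch_idx[n_branches] = i;
            }
            n_branches++;
        }
    }
    return n_branches == 3; // exactly three independent consumers
}
```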

Conclusion

The observed performance variations (sub-nanosecond changes) fall within measurement precision limits and do not impact inference performance. The CUDA concurrency implementation represents a positive enhancement to GPU utilization without introducing performance regressions in critical paths. No actionable performance optimizations are required based on the current analysis.

@DajanaV DajanaV force-pushed the main branch 21 times, most recently from a6141bf to e336e72 on November 17, 2025 at 12:14