Conversation

@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#16991

Possibly supersedes #16786.

This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment it only targets the Q, K, V branch. I feel this is the "correct" approach for the case where the Q, K, V tensors are of different types or do not live in the same place in memory. The downside is that this doesn't come for free and there is some complexity involved; I'm not an expert on the ggml graph, and I feel it could be simplified.
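For readers unfamiliar with the mechanism, the fork/join pattern on a single device looks roughly like the sketch below. This is a minimal illustration using the plain CUDA runtime API, not the actual ggml-cuda implementation; the stream handles and the `launch_projection` call are assumptions made for the example.

```cpp
// Minimal fork/join sketch for running the Q, K, V projections on separate
// CUDA streams of one GPU. Illustrative only -- not the ggml-cuda code.
#include <cuda_runtime.h>

void qkv_fork_join(cudaStream_t main_stream, cudaStream_t branch[3]) {
    cudaEvent_t fork;
    cudaEvent_t join[3];
    cudaEventCreateWithFlags(&fork, cudaEventDisableTiming);

    // Fork: each branch stream waits until the main stream reaches this point.
    cudaEventRecord(fork, main_stream);
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(branch[i], fork, 0);
        // launch_projection(i, branch[i]);  // hypothetical Q/K/V matmul launch
        cudaEventCreateWithFlags(&join[i], cudaEventDisableTiming);
        cudaEventRecord(join[i], branch[i]);
    }

    // Join: the main stream waits for all three branches before attention runs.
    for (int i = 0; i < 3; ++i) {
        cudaStreamWaitEvent(main_stream, join[i], 0);
        cudaEventDestroy(join[i]);
    }
    cudaEventDestroy(fork);
}
```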

Currently this is hidden behind an environment variable flag. To enable it, set GGML_CUDA_ENABLE_GRAPH_OPT=1.

TG performance is in line with the previous PR (2-7% gain). We leave some performance on the table because operations are not fused within the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL); I couldn't find a simple enough way to enable fusion there.
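To make the missed fusion concrete, each parallel branch is conceptually a short chain of ggml nodes like the sketch below (the tensor names and the helper are assumptions for illustration). On the single-stream path, adjacent pairs such as MUL_MAT + BIAS or RMS_NORM + MUL are candidates for fused kernels; inside a parallel stream they currently run as separate launches.

```cpp
// Assumed shape of one projection branch (illustrative, not the llama.cpp
// graph-building code). Each commented pair is a fusion opportunity that is
// currently not taken when the branch runs on its own stream.
struct ggml_tensor * build_q_branch(struct ggml_context * ctx,
                                    struct ggml_tensor * cur,     // shared input activation
                                    struct ggml_tensor * wq,      // Q projection weight
                                    struct ggml_tensor * bq,      // Q projection bias
                                    struct ggml_tensor * q_norm,  // per-head norm weight
                                    float eps) {
    struct ggml_tensor * q = ggml_mul_mat(ctx, wq, cur); // MUL_MAT
    q = ggml_add(ctx, q, bq);                            // + BIAS   (unfused here)
    q = ggml_rms_norm(ctx, q, eps);                      // RMS_NORM
    q = ggml_mul(ctx, q, q_norm);                        // + MUL    (unfused here)
    return q;
}
```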

Before:

Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 172.10 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 164.89 ± 0.07 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 162.47 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 124.67 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 121.77 ± 0.21 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 121.21 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 210.46 ± 0.07 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 207.49 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 205.36 ± 0.03 |

After:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | fa | test | t/s |
| --- | ---: | ---: | --- | --: | -: | --: | ---: |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg32 | 181.60 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg64 | 173.92 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | CUDA | 99 | 1 | tg128 | 170.95 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg32 | 128.16 ± 0.05 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg64 | 125.28 ± 0.03 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CUDA | 99 | 1 | tg128 | 124.18 ± 0.02 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg32 | 214.24 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg64 | 211.05 ± 0.04 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | CUDA | 99 | 1 | tg128 | 208.83 ± 0.03 |

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

@DajanaV DajanaV force-pushed the main branch 28 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@DajanaV DajanaV force-pushed the main branch 2 times, most recently from ef7ca13 to c65ae84 on November 14, 2025 at 15:09
@DajanaV DajanaV force-pushed the upstream-PR16991-branch_am17an-fused-qkv-stream branch from d3a8d93 to c9b06ad on November 15, 2025 at 02:09
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

The analysis examined PR #72 implementing CUDA stream-based concurrency for Q, K, V branch parallelization in llama.cpp. The changes introduce concurrent CUDA streams to improve GPU utilization, with demonstrated 2-7% throughput gains in benchmarks.

Performance Impact Assessment

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse() showed +0.096% (+0.063 ns)
  • Throughput Time: std::make_unique<llm_graph_input_attn_no_cache>() showed +0.111% (+0.078 ns)

Core Function Impact:
The performance changes do not affect critical inference functions (llama_decode, llama_encode, llama_tokenize) that directly impact tokens per second. The modified functions are utility functions in the graph construction layer, not the primary inference pipeline.

Power Consumption Analysis:
All binaries maintain stable power consumption with changes below 0.001%. The core inference library (build.bin.libllama.so) shows negligible variation at 280,731 nanojoules, indicating no significant computational intensity changes.

Flame Graph and CFG Analysis:
The can_reuse() function shows identical assembly code between versions with a flat execution profile (single 65ns operation). The 0.063ns performance difference represents measurement noise rather than algorithmic changes, likely caused by binary layout shifts affecting instruction cache alignment.

Code Review Insights:
The implementation adds sophisticated CUDA stream-management infrastructure (sketched after this list):

  • New concurrent event structures for stream synchronization
  • Graph optimization engine targeting 3-branch fan-out patterns
  • Per-stream memory pool management
  • Dynamic stream switching during execution

The changes are well-architected with appropriate safeguards and demonstrate measurable performance improvements in the target use case (Q, K, V parallelization).
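As a rough illustration of the fan-out detection mentioned above, a pass over the compute graph might look like the sketch below. The helper name and the matching criterion are assumptions for this example; the actual optimizer in the PR is more involved (per-stream pools, event bookkeeping, dynamic stream switching).

```cpp
// Illustrative sketch of spotting a 3-branch fan-out (e.g. Q/K/V projections)
// in a ggml compute graph: three MUL_MAT nodes that consume the same source
// tensor and could therefore be dispatched on separate streams between a
// fork event and a join event. Not the actual ggml-cuda optimizer.
static bool is_three_way_fanout(struct ggml_cgraph * gf,
                                struct ggml_tensor * shared_src,
                                int branch_idx[3]) {
    int n_branches = 0;
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op != GGML_OP_MUL_MAT) {
            continue;
        }
        // the shared activation is the second operand of each projection matmul
        if (node->src[1] == shared_src) {
            if (n_branches < 3) {
                branch_idx[n_branches] = i;
            }
            n_branches++;
        }
    }
    return n_branches == 3; // exactly three independent consumers
}
```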

Conclusion

The observed performance variations (sub-nanosecond changes) fall within measurement precision limits and do not impact inference performance. The CUDA concurrency implementation represents a positive enhancement to GPU utilization without introducing performance regressions in critical paths. No actionable performance optimizations are required based on the current analysis.

@DajanaV DajanaV force-pushed the main branch 21 times, most recently from a6141bf to e336e72 on November 17, 2025 at 12:14