CUDA: add stream-based concurrency #16991
base: master
Conversation
Force-pushed 1e97a91 to 1c4d8f3
Sorry, I wanted to tell you this but I forgot: a long time ago I tried something similar, see #4719. There the performance did not improve; I think the reason was the lack of CUDA graphs to reduce the overhead.
Yeah, I think CUDA graphs are essential for this to work (hence this PR only looks at batch_size=1).
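For context, a minimal standalone sketch (not the PR's code; the kernel, sizes and node count are made up) of why graph capture matters here: at batch_size = 1 the kernels are tiny, so replaying a captured graph with a single cudaGraphLaunch amortizes the per-kernel launch overhead that would otherwise swallow any gains from extra streams.

```cuda
// Toy example of CUDA graph capture + replay; not the ggml-cuda implementation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void tiny_node(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 4096;                       // tiny, launch-overhead-bound work
    float * x; cudaMalloc(&x, n * sizeof(float));
    cudaStream_t stream; cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    for (int node = 0; node < 64; ++node) {   // stand-in for the per-token graph nodes
        tiny_node<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
    }
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);    // CUDA 12 signature; CUDA 11 takes extra (node, log, size) args
    for (int token = 0; token < 100; ++token) {
        cudaGraphLaunch(exec, stream);        // one launch replays all 64 kernels
    }
    cudaStreamSynchronize(stream);
    std::printf("replayed captured graph\n");
    return 0;
}
```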
Force-pushed 1c4d8f3 to 70a5a01
Minimal changes to make this work on HIP: if used for real, the cudaStreamWaitEvent error needs to be handled of course with
The almost exact same numbers make me think that this change is not launching the streams. I would expect a shift in performance either for the worse or the better.
Yeah, I'll run a trace on it later.
Force-pushed 70a5a01 to 2c3cfa9
Force-pushed 12d5f82 to d3a8d93
@ggerganov would you mind testing this on your DGX Spark? I want to see if low-memory-bandwidth GPUs benefit from this change.
I'm not really clear on what problem you're trying to solve here. If the order is MUL_MAT+ADD+MUL_MAT+ADD+MUL_MAT+ADD, then you have the nodes conveniently consecutive (for fusion), the intermediate MUL_MAT outputs aren't needed, and the ADDs will all have different outputs. This is the order ggml-vulkan will use, and it gets both fusion and concurrency.
My problem is that the buffer gets reused in this case. The graph assumes serial execution and thinks the first mul-mat's buffer is no longer required. (Assume the no-fusion case for now.)
If you're not doing fusion, then you'd want graph_optimize to reorder these to MUL_MAT+MUL_MAT+MUL_MAT+ADD+ADD+ADD. Then the MUL_MAT results will stay live until the ADDs.
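Purely as an illustration of that reorder (a toy sketch with made-up op names, not ggml's actual graph_optimize):

```cpp
// Reorder an interleaved MUL_MAT+ADD sequence into MUL_MAT*3 followed by ADD*3,
// keeping relative order within each group so each ADD still follows "its" MUL_MAT.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> ops = {"MUL_MAT_Q", "ADD_Q", "MUL_MAT_K", "ADD_K", "MUL_MAT_V", "ADD_V"};
    std::stable_partition(ops.begin(), ops.end(), [](const std::string & op) {
        return op.rfind("MUL_MAT", 0) == 0;  // MUL_MATs first; their outputs stay live until the ADDs
    });
    for (const auto & op : ops) std::printf("%s\n", op.c_str());
    return 0;
}
```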
Yeah, that's what the current re-order does. But that doesn't allow for fusion. I don't want these two things to be intertwined. Ideally I want something that just lets me extend the lifetime of a particular output until a certain node.
IMO they are fundamentally intertwined. The code that detects fusions looks for specific sequences of operations, and graph_optimize should generate or preserve those sequences. If the backend supports fusing something, then graph_optimize should make them consecutive, both to make the fusion logic simpler and to shorten the lifetime of transient allocations.
I think it would be good to isolate these two behaviours. If you look at the graph above, it can launch a concurrent graph from the mul-mat until set_rows. We don't have fusion for that entire sequence, and reasoning about which output stays alive would involve inspecting the graph in any case. Secondly, fusion is a common source of bugs in the CUDA backend, and I don't want to add another layer of complexity on top of it.
Did you pass the env flag?
Why of course not 🫠
Updated numbers
If my current understanding of ggml is correct, we should be able to get the same behavior (fusion + concurrency) on both Vulkan and CUDA, as Q should go out of life after flash attention (and K + V go out of life after being inserted into the KV cache). Have we root-caused this?
@am17an please cherry-pick in bc93319 to fix the HIP build. I can confirm that multiple streams are successfully launching kernels at the same time on HIP; the MI100 performance is indeed just unchanged. On the other hand, the RX 6800 XT sees a small perf decrease:
But that doesn't matter of course, as long as the PR remains hidden behind an env var.
Force-pushed d3a8d93 to 95da850
Sorry for the long radio silence, what is the current status of this PR? Is there still something to be done, or do you want to move towards merging it?
I don't quite follow what you're saying here, but maybe this is what you mean. Let's say the original graph is: With this order, you could fuse each of QMul+QAdd, KMul+KAdd, and VMul+VAdd, and run those three fusions on separate streams, and join before the Norms. I don't think there would be a memory allocation/aliasing problem. This is basically what ggml-vulkan will do today. However, if you want to extend the duration of the streams so that all Q work is on a stream, all K work is on a stream, all V work is on a stream, and only do a single join right before the final
Yes, that's exactly what I want to do with this PR, at least since we then have to orchestrate only one event with cudaEventWait.
Let me clean up the code a bit and we can look at merging. Since this is behind an env flag for now, it should not cause problems. When it's better tested we can look at enabling it by default.
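A rough sketch of the fork/join plumbing being discussed, under my assumptions about the structure (three branch streams, one fork event, one join point right before the node that consumes Q, K and V); this is not the PR's actual implementation:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void branch_work(float * x, int n, float scale) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * scale + 1.0f;    // stand-in for MUL_MAT + NORM + ROPE on one branch
}

int main() {
    const int n = 1 << 16;
    float * qkv[3];
    cudaStream_t main_stream, branch[3];
    cudaEvent_t forked, done[3];

    cudaStreamCreate(&main_stream);
    cudaEventCreateWithFlags(&forked, cudaEventDisableTiming);
    for (int b = 0; b < 3; ++b) {
        cudaMalloc(&qkv[b], n * sizeof(float));
        cudaStreamCreate(&branch[b]);
        cudaEventCreateWithFlags(&done[b], cudaEventDisableTiming);
    }

    // fork: every branch stream waits on one event recorded on the main stream
    cudaEventRecord(forked, main_stream);
    for (int b = 0; b < 3; ++b) {
        cudaStreamWaitEvent(branch[b], forked, 0);
        branch_work<<<(n + 255) / 256, 256, 0, branch[b]>>>(qkv[b], n, 0.5f + b);
        cudaEventRecord(done[b], branch[b]);
    }

    // join: the main stream waits for all branches before the node that needs all three
    for (int b = 0; b < 3; ++b) {
        cudaStreamWaitEvent(main_stream, done[b], 0);
    }
    cudaStreamSynchronize(main_stream);
    std::printf("joined all branches\n");
    return 0;
}
```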
@am17an Should I compare
Yes, with the latest change it should be different from what @ORippler tested earlier.
I'm sorry, I don't know how the PPL reported OK; there was a bug in launching streams. I fixed it now and the results are more believable. For this run I manually tested llama-cli on a few prompts and they seem OK.
5090:
4090:
And just for comparison, this is without fusing ops inside a stream:
5090:
4090:
Not sure why PP would be affected in your case, perhaps I need to rebase?
I think for now let's ignore the DGX Spark numbers that I posted. I am observing some large variance between runs (even on
ORippler left a comment:
Thanks for the effort that has gone into this PR so far! Left some comments, will try to finish tomorrow.
```cpp
static bool enable_graph_optimization = [] {
    const char * env = getenv("GGML_CUDA_GRAPH_OPT");
    return env != nullptr && atoi(env) == 1;
}();

if (!enable_graph_optimization) {
    return;
}
```
Suggested change:
```diff
-static bool enable_graph_optimization = [] {
-    const char * env = getenv("GGML_CUDA_GRAPH_OPT");
-    return env != nullptr && atoi(env) == 1;
-}();
-if (!enable_graph_optimization) {
-    return;
-}
+static bool enable_cgraph_optimization = [] {
+    const char * env = getenv("GGML_CUDA_CGRAPH_OPT");
+    return env != nullptr && atoi(env) == 1;
+}();
+if (!enable_cgraph_optimization) {
+    return;
+}
```
This makes it clearer we are not talking about optimizing CUDA Graphs, but rather ggml_cgraph objects.
I think this is the correct name because the function is unfortunately named ggml_backend_cuda_graph_optimize, which has nothing to do with CUDA Graphs. I don't know how to reconcile these two at the moment.
I would tend to disagree, and would propose renaming ggml_backend_cuda_graph_optimize to ggml_backend_cuda_cgraph_optimize instead.
A backend's interface is defined fully in ggml-backend-impl.h, and that is what the CUDA backend needs to adhere to (and does). Since the CUDA backend has to handle ggml_cgraph and CUDA Graphs internally, I personally think that moving forward it makes sense to explicitly encode what we are handling in which function, as this saves a look at the function declaration. Consequently, I would also like to rename ggml_backend_cuda_graph_compute to ggml_backend_cuda_cgraph_compute. This makes the above suggestion not strictly tied to this PR.
How about for now we keep this, and change it later when your points are addressed (possibly by yourself) in the larger ggml scheme of things?
```cpp
    }
    for (int src_idx = 0; src_idx < GGML_MAX_SRC; ++src_idx) {
        const ggml_tensor * node = cgraph->nodes[node_idx]->src[src_idx];
        //TODO: check why nrows > 1 fails, probably related to CUDA graphs
```
Curious: how does this failure manifest itself?
Garbled outputs; I'm not really sure what happens.
On my side I could not repro garbled outputs for
GGML_CUDA_GRAPH_OPT=1 ./build-x64-linux-gcc-debug/bin/llama-cli -m /mnt/share/gguf/gpt-oss-20b-mxfp4.gguf -p "What is the Capital of Sweden? Please be very elaborative and jokey in your answer."
on commit 6562b7797 when reducing the test to if (node && !is_empty(node)).
Please specify a repro so we/I can try to root-cause (I am afraid of another heuristic that we do not fully understand, akin to the batch-size heuristic for disabling CUDA Graphs). If it does not repro on your side either, I would say we enable this for the pre-fill phase also (if it gives perf improvements).
The repro is with qwen3-30b. I also can't repro it with gpt-oss. However, perf improvements won't come in prefill unless we can launch the streams concurrently, which happens solely because of CUDA graphs at the moment. The original graph (which we now come back to, to enable fusion) has the DFS form (Q things first, K next and V last).
If you can root-cause the repro, we can disable fusion within a stream in the prefill phase (i.e. not revert to the orig graph), which would interleave execution. From what I notice it can speed things up by about 4-5% too, but at the cost of higher peak memory usage.
```cpp
    for (int src_idx = 0; src_idx < GGML_MAX_SRC; ++src_idx) {
        const ggml_tensor * node = cgraph->nodes[node_idx]->src[src_idx];
        //TODO: check why nrows > 1 fails, probably related to CUDA graphs
        if (node && !is_empty(node) && ggml_nrows(node) <= 1) {
```
Suggested change:
```diff
-        if (node && !is_empty(node) && ggml_nrows(node) <= 1) {
+        if (node && !is_empty(node) && ggml_nrows(node) == 1) {
```
Do we have nodes with 0 rows? I thought we always have at least 1 element so we can multiply & divide safely.
No, it doesn't, but this check is more like ggml_nrows <= X, where X is something we can probably configure once it works for nrows > 1.
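For reference, my understanding of ggml's row count (paraphrased with a toy struct; check ggml.c for the authoritative definition): it is the product of all dimensions except the first, so 0 rows would require a zero-sized trailing dimension, and single-token decode activations have exactly one row.

```cpp
#include <cstdint>
#include <cstdio>

struct toy_tensor { int64_t ne[4]; };  // stand-in for ggml_tensor's shape fields

static int64_t toy_nrows(const toy_tensor & t) {
    return t.ne[1] * t.ne[2] * t.ne[3];  // everything except the leading dimension
}

int main() {
    toy_tensor decode_activation  = {{4096,   1, 1, 1}};  // batch_size = 1 decode
    toy_tensor prefill_activation = {{4096, 512, 1, 1}};  // 512-token prompt
    std::printf("decode nrows=%lld, prefill nrows=%lld\n",
                (long long) toy_nrows(decode_activation),
                (long long) toy_nrows(prefill_activation));
    return 0;
}
```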
See above
Force-pushed 2c7a707 to 6562b77
ORippler left a comment:
First of all, thanks again for the tremendous effort that went into this PR, which gives a lot of perf! While the main logic is solid for QKV projections (and it's awesome to see we managed to achieve concurrency + fusion for the CUDA backend as well), I still have some concerns revolving around maintainability. Some recommendations/what I think we should address before merging this:
- Clean up the code (see detailed comments)
- Add some docs that state the pattern ggml_backend_cuda_graph_optimize matches to enable concurrency (fan_out == fan_in == 3 + no other work happens between the root/join node that does not belong to one of the 3 branches; a rough sketch of such a check is given after this list). Also, I feel matching for a different pattern will be difficult with the current implementation, but that's a problem to tackle if we see other patterns where significant perf can be gained.
- Resolve the nrows repro

While I would also love to see some tests, I would not consider this blocking given the effort that has already gone into this PR as is + the fact that it is hidden behind a flag for the moment (though we could also consider making it the default to find potential bugs quicker).
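A rough sketch of the kind of pattern check described in the second bullet above (hypothetical names and structure, not the actual ggml_backend_cuda_graph_optimize code): starting from a fork node with three consumers, every subsequent node must belong to exactly one branch until a single join node consumes all three branches; any unrelated node in between disqualifies the region.

```cpp
#include <cstdio>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct toy_node {
    int id;
    std::vector<int> src;  // ids of producer nodes
};

// returns the index of the join node, or -1 if no clean 3-branch region starts at fork_idx
int find_three_branch_region(const std::vector<toy_node> & nodes, int fork_idx) {
    std::unordered_map<int, int> branch_of;  // producer id -> branch number
    int n_branches = 0;
    const int fork_id = nodes[fork_idx].id;

    for (size_t i = fork_idx + 1; i < nodes.size(); ++i) {
        std::unordered_set<int> branches_used;
        bool uses_fork = false;
        for (int s : nodes[i].src) {
            if (s == fork_id) uses_fork = true;
            auto it = branch_of.find(s);
            if (it != branch_of.end()) branches_used.insert(it->second);
        }
        if (branches_used.size() >= 2) {
            // join node: require that it consumes all three branches (fan_in == 3)
            return (n_branches == 3 && branches_used.size() == 3) ? (int) i : -1;
        }
        if (branches_used.size() == 1) {
            branch_of[nodes[i].id] = *branches_used.begin();  // continues an existing branch
        } else if (uses_fork) {
            if (n_branches == 3) return -1;                   // more than 3 consumers of the fork
            branch_of[nodes[i].id] = n_branches++;            // starts a new branch
        } else {
            return -1;  // unrelated work between fork and join -> no concurrency region
        }
    }
    return -1;
}

int main() {
    // fork = node 0; Q/K/V mul-mats 1..3; norms 4..6; join (e.g. attention) = 7
    std::vector<toy_node> g = {
        {0, {}}, {1, {0}}, {2, {0}}, {3, {0}},
        {4, {1}}, {5, {2}}, {6, {3}}, {7, {4, 5, 6}},
    };
    std::printf("join index: %d\n", find_three_branch_region(g, 0));
    return 0;
}
```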
```cpp
std::unique_ptr<ggml_cuda_pool> ggml_backend_cuda_context::new_pool_for_device(int device,
                                                                               [[maybe_unused]] int stream_no) {
```
Nit: I'm personally not a fan of adding unused function arguments (it's not only maybe_unused, but simply not used at all)
It is actually used, but not in this function. new_pool_for_device is static inside ggml_backend_cuda_context, which leads to creating a new std::unique_ptr for a different stream_no.
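To illustrate the point (an assumed shape with toy names, not the actual ggml-cuda code): the stream_no parameter matters at the call site, which keeps one lazily created pool per stream so kernels running concurrently on different streams never share a scratch allocator.

```cpp
#include <array>
#include <cstdio>
#include <memory>

struct toy_pool {
    int device, stream_no;
    toy_pool(int d, int s) : device(d), stream_no(s) {}
};

struct toy_cuda_context {
    static constexpr int MAX_STREAMS = 4;  // assumed limit, for the sketch only
    int device = 0;
    std::array<std::unique_ptr<toy_pool>, MAX_STREAMS> pools;

    // analogous role to new_pool_for_device: construct a fresh pool object
    static std::unique_ptr<toy_pool> new_pool_for_device(int device, int stream_no) {
        return std::make_unique<toy_pool>(device, stream_no);
    }

    toy_pool & pool(int stream_no) {
        if (!pools[stream_no]) {
            pools[stream_no] = new_pool_for_device(device, stream_no);  // lazy, one per stream
        }
        return *pools[stream_no];
    }
};

int main() {
    toy_cuda_context ctx;
    std::printf("pool for stream 2 created on device %d\n", ctx.pool(2).device);
    return 0;
}
```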
```cpp
            GGML_LOG_DEBUG("Setting stream no to %d for node %s\n", cuda_ctx->curr_stream_no, node->name);
        }
    }
    prev_i = i;
```
Suggested change:
```diff
     prev_i = i;
+    GGML_ASSERT((cuda_ctx->curr_stream_no == 0 && !is_concurrent_event_active) ||
+                (cuda_ctx->curr_stream_no > 0 && is_concurrent_event_active));
```
Let's also assert that only sequential nodes that do not belong to concurrent events are placed onto the "main" stream, if we reserve the "main" stream for this use.
This is asserted here; I feel it also doesn't necessarily tie stream 0 to the default stream, it is more like: if you are in a concurrent event, you need to find a stream for all your nodes in between.
https://github.com/am17an/llama.cpp/blob/fused-qkv-stream/ggml/src/ggml-cuda/ggml-cuda.cu#L3242
✔️
Added some comments. The other pattern which is quite common is ffn_up + gate. That is easily detected by this pattern (though with fan_out = 2). However, we already have a pretty solid fusion for that and it would probably not help here.
Force-pushed b8b08e3 to c9b06ad
Possibly supersedes #16813.
This PR adds support for running concurrent CUDA streams on single-GPU setups.
At the moment this only targets the Q, K, V branch. I feel this is the "correct" approach in case the Q, K, V tensors are of different types/not in the same place in memory. The downside is that this approach doesn't come for free and there's some complexity involved, but I'm not an expert at the ggml graph and I feel it could be simplified.
Currently this is hidden behind an env variable flag. To run, you can use GGML_CUDA_GRAPH_OPT=1.
TG performance gain is more than the previous PR (1-9% gain depending on the model/GPU), probably because we parallelize MUL_MAT + NORM + ROPE rather than just MUL_MAT. At the moment we leave some performance on the table where we don't fuse operations in the parallel streams themselves (e.g. MUL_MAT + BIAS, RMS_NORM + MUL, etc.); I couldn't find a simple enough way to enable fusion there.
Performance details:
5090:
4090:
And just for comparison, this is without fusing ops inside a stream
5090:
4090:
TODO: