Use multiple streams to parallelize the kernels in the ambi-solver #1117
Conversation
🤔 That 5% benefit... Does it come from a multi-threaded throughput test? I would rather keep the algorithms using just a single stream each, such that we would get overlaps from kernels working on different events, instead of trying to minimise the latency of a single event's processing. 🤔
We can also overlap kernels from multiple events, but does that mean we should not use multiple streams within an algorithm?
I found that there is effectively no limit on the number of CUDA streams, so using a couple of streams in the ambiguity solver should be OK for our multi-threaded chain.
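For reference, a minimal standalone sketch of the pattern under discussion: two independent kernels launched on separate non-blocking streams, which the device is then free to overlap. The kernels here are trivial stand-ins, not traccc code; error checking is elided for brevity.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Trivial kernels standing in for independent pieces of the solver's work.
__global__ void fill_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1.f;
}
__global__ void scan_kernel(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.f;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Non-blocking streams do not synchronize with the legacy default
    // stream, so the two kernels below may overlap on the device.
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    fill_kernel<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    scan_kernel<<<(n + 255) / 256, 256, 0, s2>>>(b, n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```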
krasznaa
left a comment
It's, among other things, an API question for me.
Right now all the CUDA algorithms receive a traccc::cuda::stream object in their constructors, and they use that one stream to do all their work. This allows any framework executing these algorithms to tell them how/where to run.
If you want to use multiple streams in an algorithm, I guess you could do this by making the algorithm receive something like std::array<traccc::cuda::stream, 3> in its constructor. But with all the other algorithms expecting just one stream, there really needs to be a strong reason for doing this.
Streams do need to have a long lifetime, as creating/deleting them is not cheap at all. So it's generally meant to be done "at the framework level". At least so far that's how I've been looking at the design of all of the traccc algorithms. 🤔
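A sketch of the constructor-based variant being suggested. The class and member names below are hypothetical illustrations, not the actual traccc interface, and raw cudaStream_t is used in place of traccc::cuda::stream:

```cuda
#include <cuda_runtime.h>
#include <array>

// Hypothetical sketch: the algorithm receives its streams at construction
// time, so their lifetime is owned by the framework and no stream is
// created or destroyed per event inside the execute function.
class ambiguity_solver_sketch {
public:
    explicit ambiguity_solver_sketch(std::array<cudaStream_t, 3> streams)
        : m_streams(streams) {}

    void operator()(/* event data */) const {
        // Launch kernels on m_streams[0..2]; no cudaStreamCreate here.
    }

private:
    // Framework-owned streams; the algorithm only borrows them.
    std::array<cudaStream_t, 3> m_streams;
};
```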
    cudaStream_t stream_fill, stream_scan;
    cudaStreamCreateWithFlags(&stream_fill, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&stream_scan, cudaStreamNonBlocking);
Not only do you create streams inside of the execute function of the algorithm (as opposed to its constructor), but you do this inside of the while-loop of the algorithm.
Plus I don't see any cudaStreamDestroy(...) statements.
Good catch. I added cudaStreamDestroy.
I took stream_fill and stream_scan out of the loop, so they are created and destroyed only once per event.
I don't know why you think creating and destroying streams is expensive - do you have any reliable reference on this? I believe it is on the microsecond scale, and the usual processing time per event is already a few hundred milliseconds. Unless you really need to save the <<0.1% of the performance, I don't think changing this PR for that is worth much 🤔
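One way to settle the cost question would be a host-side micro-benchmark along these lines. This is only a sketch, and the numbers will vary with driver version and GPU, so it is not a definitive measurement:

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Rough measurement of the host-side cost of creating and destroying
// a CUDA stream, averaged over many iterations.
int main() {
    cudaFree(0);  // force CUDA context initialization up front

    const int iters = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        cudaStreamDestroy(s);
    }
    auto t1 = std::chrono::steady_clock::now();

    double us =
        std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("avg create+destroy: %.2f us\n", us);
    return 0;
}
```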
    cudaEvent_t event_main, event_fill, event_scan;
    cudaaEventCreate(&event_main);
    cudaEventCreate(&event_fill);
    cudaEventCreate(&event_scan);

    cudaEventRecord(event_main, stream);
    cudaEventRecord(event_fill, stream_fill);
    cudaEventRecord(event_scan, stream_scan);

    // Synchronize the events with main stream
    cudaStreamWaitEvent(stream, event_main, 0);
    cudaStreamWaitEvent(stream, event_fill, 0);
    cudaStreamWaitEvent(stream, event_scan, 0);
These are completely equivalent to just:

    TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream));
    TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_fill));
    TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_scan));

Creating events makes sense if you need to pass such events between independent code blocks. Here you just want to synchronize on the stream(s).
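The equivalence being pointed out can be sketched as follows, using plain CUDA runtime calls without the error-checking macro. The stream names mirror the diff; from the host's point of view, once the host blocks on all three streams the two variants wait for the same work:

```cuda
#include <cuda_runtime.h>

// Variant A: the event-based pattern from the diff. The events are
// recorded and immediately waited on from the host via the streams.
void sync_with_events(cudaStream_t stream, cudaStream_t stream_fill,
                      cudaStream_t stream_scan) {
    cudaEvent_t e_fill, e_scan;
    cudaEventCreate(&e_fill);
    cudaEventCreate(&e_scan);
    cudaEventRecord(e_fill, stream_fill);
    cudaEventRecord(e_scan, stream_scan);
    // Make the main stream wait for the side streams, then block the host.
    cudaStreamWaitEvent(stream, e_fill, 0);
    cudaStreamWaitEvent(stream, e_scan, 0);
    cudaStreamSynchronize(stream);
    cudaEventDestroy(e_fill);
    cudaEventDestroy(e_scan);
}

// Variant B: equivalent host-side behavior, with no events at all.
void sync_plain(cudaStream_t stream, cudaStream_t stream_fill,
                cudaStream_t stream_scan) {
    cudaStreamSynchronize(stream);
    cudaStreamSynchronize(stream_fill);
    cudaStreamSynchronize(stream_scan);
}
```

Events earn their keep when a dependency has to be handed between independent code blocks, or when one stream must wait on another without blocking the host; for a plain host-side barrier, variant B is simpler.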
It should be TRACCC_CUDA_ERROR_CHECK instead of TRACCC_CUDA_CHECK.
BTW, it is not obviously equivalent as I see a crash like the following:
    terminate called after throwing an instance of 'vecmem::cuda::runtime_error'
      what():  /mnt/nvme0n1/byeo/projects/traccc/traccc_build/_deps/vecmem-src/cuda/src/memory/managed_memory_resource.cpp:51 Failed to execute: cudaFree(p) (operation not permitted when stream is capturing)
    Aborted (core dumped)
There is a strong reason - it improves the performance noticeably. I understand the concern that this PR does not elegantly follow the existing design, but I don't think we have to restrain ourselves to a single stream per algorithm unless there is a strong reason for that.
That's a neat idea, but I am afraid this would require too much engineering, which is far beyond the actual benefit.
Can we go ahead with this PR? We can roll back the change if it causes trouble in the full chain - I don't think that would happen, though.



This PR parallelizes some of the kernels in the greedy ambiguity solver using multiple streams. This gives about a 5% improvement in computation speed.