Conversation

@beomki-yeo
Contributor

This PR parallelizes some kernels in the greedy ambiguity solver using multiple CUDA streams. This gives about a 5% improvement in computation speed.

    CUDA kernel sequence with multiple streams
    │
    ├── reset_status
    │
    ├── find_max_shared
    │
    ├── remove_tracks
    │       │
    │       └── [record event_removal]
    │
    ├── sort_updated_tracks  (Main stream — executed after event_removal)
    │       │
    │       └── [record event_main]
    │
    ├───▶ stream_fill
    │       │
    │       └── [wait for event_removal]
    │       │
    │       └── fill_inverted_ids
    │               │
    │               └── [record event_fill]
    │
    ├───▶ stream_scan
    │       │
    │       └── [wait for event_removal]
    │       │
    │       ├── block_inclusive_scan
    │       ├── scan_block_offsets
    │       └── add_block_offset
    │               │
    │               └── [record event_scan]
    │
    ├── [wait for event_main, event_fill, and event_scan] ← sync point
    │
    ├── rearrange_tracks
    │
    └── gather_tracks
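
The fork/join structure above can be sketched with CUDA events roughly as follows. This is a minimal sketch, not the PR's actual code: kernel launch parameters are elided, event/stream creation and error checking are omitted, and the kernel argument lists are placeholders.

```cuda
// Branch point: remove_tracks finishes on the main stream; the two helper
// streams pick up from that point via event_removal.
remove_tracks<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_removal, stream);

// The main stream continues with the sort.
sort_updated_tracks<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_main, stream);

// stream_fill: gated on remove_tracks, runs fill_inverted_ids.
cudaStreamWaitEvent(stream_fill, event_removal, 0);
fill_inverted_ids<<<grid, block, 0, stream_fill>>>(/* ... */);
cudaEventRecord(event_fill, stream_fill);

// stream_scan: gated on remove_tracks, runs the three scan kernels in order.
cudaStreamWaitEvent(stream_scan, event_removal, 0);
block_inclusive_scan<<<grid, block, 0, stream_scan>>>(/* ... */);
scan_block_offsets<<<grid, block, 0, stream_scan>>>(/* ... */);
add_block_offset<<<grid, block, 0, stream_scan>>>(/* ... */);
cudaEventRecord(event_scan, stream_scan);

// Join point: the main stream waits for both helper branches before the
// final two kernels run. (Waiting on event_main from the same stream would
// be a no-op, since work on one stream is already ordered.)
cudaStreamWaitEvent(stream, event_fill, 0);
cudaStreamWaitEvent(stream, event_scan, 0);
rearrange_tracks<<<grid, block, 0, stream>>>(/* ... */);
gather_tracks<<<grid, block, 0, stream>>>(/* ... */);
```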

@krasznaa
Member

🤔 That 5% benefit... Does it come from a multi-threaded throughput test?

I would rather keep each algorithm using just a single stream, so that we get overlap from kernels working on different events, instead of trying to minimise the latency of a single event's processing. 🤔

@beomki-yeo
Contributor Author

We can also overlap kernels from multiple events, but does that mean we should not use multiple streams within an algorithm?
As far as I know, we use fewer than 10 event-level streams, which is not going to exceed the maximum number of streams, and they don't show a clear speed improvement anyway (the timing just fluctuates a lot; please let me know if this has changed).

@beomki-yeo
Contributor Author

I found that there is no hard limit on the number of CUDA streams, so using a couple of streams in the ambiguity solver should be OK for our multi-threaded chain.

Member

@krasznaa krasznaa left a comment


It's, among other things, an API question for me.

Right now all the CUDA algorithms receive a traccc::cuda::stream object in their constructors, and they use that one stream to do all their work. This allows any framework that executes these algorithms to tell them how/where to run.

If you want to use multiple streams in an algorithm, I guess you could do this by making the algorithm receive something like std::array<traccc::cuda::stream, 3> in its constructor. But with all the other algorithms expecting just one stream, there really needs to be a strong reason for doing this.

Streams do need to have a long lifetime, as creating/deleting them is not cheap at all. So it is generally meant to be done "at the framework level". At least that's how I've been looking at the design of all of the traccc algorithms so far. 🤔
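
For illustration, the constructor shape suggested above might look like the following. This is a hypothetical sketch only: the class name, the pointer-based storage (chosen here on the assumption that stream objects are not cheaply copyable), and the stream count are assumptions, not traccc code.

```cpp
#include <array>
#include <cstddef>

// Forward declaration standing in for the real traccc class.
namespace traccc::cuda { class stream; }

// Hypothetical algorithm that receives all of its streams at construction
// time, following the existing one-stream-per-algorithm traccc pattern.
class greedy_ambiguity_resolution_algorithm {
public:
    static constexpr std::size_t n_streams = 3;

    explicit greedy_ambiguity_resolution_algorithm(
        std::array<traccc::cuda::stream*, n_streams> streams)
        : m_streams(streams) {}

private:
    // The streams are created once, at the framework level, and outlive the
    // algorithm, so per-event creation/destruction cost is avoided.
    std::array<traccc::cuda::stream*, n_streams> m_streams;
};
```

The framework would own the stream objects and pass the same three streams to the algorithm for its whole lifetime.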

Comment on lines 506 to 508
cudaStream_t stream_fill, stream_scan;
cudaStreamCreateWithFlags(&stream_fill, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream_scan, cudaStreamNonBlocking);
Member


Not only do you create the streams inside the execute function of the algorithm (as opposed to its constructor), you do it inside the algorithm's while-loop.

Plus I don't see any cudaStreamDestroy(...) statements.

Contributor Author

@beomki-yeo beomki-yeo Aug 20, 2025


Good catch. I added cudaStreamDestroy.

I took stream_fill and stream_scan out of the loop so they are created and destroyed only once per event.
I don't know why you think creating and destroying the streams is expensive; do you have a reliable reference on this? I believe it is on the microsecond scale, while the usual processing time per event is already a few hundred milliseconds. Unless we really need to save that <<0.1% of the run time, I don't think restructuring this PR is worth it 🤔
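
One way to settle the create/destroy cost question empirically is a quick host-side micro-benchmark. This is a sketch, not code from the PR; the absolute numbers will depend on the driver, device, and whether other work is in flight.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    constexpr int n = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        cudaStreamDestroy(s);
    }
    auto t1 = std::chrono::steady_clock::now();
    // Average cost of one create+destroy pair, in microseconds.
    std::printf("avg create+destroy: %.2f us\n",
        std::chrono::duration<double, std::micro>(t1 - t0).count() / n);
    return 0;
}
```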

Comment on lines +628 to +640
cudaEvent_t event_main, event_fill, event_scan;
cudaEventCreate(&event_main);
cudaEventCreate(&event_fill);
cudaEventCreate(&event_scan);

cudaEventRecord(event_main, stream);
cudaEventRecord(event_fill, stream_fill);
cudaEventRecord(event_scan, stream_scan);

// Synchronize the events with main stream
cudaStreamWaitEvent(stream, event_main, 0);
cudaStreamWaitEvent(stream, event_fill, 0);
cudaStreamWaitEvent(stream, event_scan, 0);
Member


These are completely equivalent to just:

TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream));
TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_fill));
TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_scan));

Creating events makes sense if you need to pass such events between independent code blocks. Here you just want to synchronize on the stream(s).

Contributor Author


TRACCC_CUDA_ERROR_CHECK instead of TRACCC_CUDA_CHECK

By the way, it is not obviously equivalent, as I see a crash like the following:

terminate called after throwing an instance of 'vecmem::cuda::runtime_error'
  what():  /mnt/nvme0n1/byeo/projects/traccc/traccc_build/_deps/vecmem-src/cuda/src/memory/managed_memory_resource.cpp:51 Failed to execute: cudaFree(p) (operation not permitted when stream is capturing)
Aborted (core dumped)
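
One plausible reason the two forms are not interchangeable here, assuming the failing configuration records the loop with CUDA graph stream capture (which the "stream is capturing" message suggests): host-side calls such as cudaStreamSynchronize, and allocation APIs like cudaFree, are not permitted while a stream is being captured, whereas event record/wait operations are captured as graph dependencies. A minimal sketch of the distinction, with placeholder kernels and elided setup:

```cuda
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

kernel_a<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_a, stream);               // captured as a graph node
cudaStreamWaitEvent(stream_other, event_a, 0);  // captured cross-stream edge
kernel_b<<<grid, block, 0, stream_other>>>(/* ... */);

// The side stream must be joined back before capture ends.
cudaEventRecord(event_b, stream_other);
cudaStreamWaitEvent(stream, event_b, 0);

// cudaStreamSynchronize(stream);  // invalid during capture: the host may
//                                 // not synchronize with a capturing stream

cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);
```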

@beomki-yeo
Contributor Author

There is a strong reason: it improves the performance noticeably.

I understand the concern that this PR does not elegantly follow the design of traccc::stream or the hand-made API deeply embedded in traccc. But in my humble opinion, there is no real hazard in this approach (the stream creation and destruction overhead is negligible compared to the processing time). I also don't understand why @krasznaa does not want to benefit from multiple streams or CUDA graphs; those features exist because they are useful in advanced CUDA programming.

I don't think we have to restrain ourselves to a single stream per algorithm unless there is a strong reason for that.

I guess you could do this with making this algorithm receive something like std::array<traccc::cuda::stream, 3> in its constructor.

That's a neat idea, but I am afraid it would require too much engineering, far beyond the actual benefit.


@beomki-yeo beomki-yeo requested a review from krasznaa August 20, 2025 00:30
@beomki-yeo
Contributor Author

beomki-yeo commented Aug 29, 2025

Can we go ahead with this PR? We can roll back the change if it causes trouble in the full chain, though I don't think that would happen.
