Conversation

@beomki-yeo
Contributor

This PR parallelizes some kernels in the greedy ambiguity solver using multiple CUDA streams. This gives about a 5% improvement in computation speed.

    CUDA kernel sequence with multiple streams
    │
    ├── reset_status
    │
    ├── find_max_shared
    │
    ├── remove_tracks
    │       │
    │       └── [record event_removal]
    │
    ├── sort_updated_tracks  (Main stream — executed after event_removal)
    │       │
    │       └── [record event_main]
    │
    ├───▶ stream_fill
    │       │
    │       └── [wait for event_removal]
    │       │
    │       └── fill_inverted_ids
    │               │
    │               └── [record event_fill]
    │
    ├───▶ stream_scan
    │       │
    │       └── [wait for event_removal]
    │       │
    │       ├── block_inclusive_scan
    │       ├── scan_block_offsets
    │       └── add_block_offset
    │               │
    │               └── [record event_scan]
    │
    ├── [wait for event_main, event_fill, and event_scan] ← sync point
    │
    ├── rearrange_tracks
    │
    └── gather_tracks
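
The fork/join structure above can be sketched with CUDA events roughly as follows. This is a minimal sketch, not the PR's actual code: kernel launch parameters are elided, event/stream creation and error checking are omitted, and the kernel argument lists are placeholders.

```cuda
// Branch point: remove_tracks finishes on the main stream; the two helper
// streams pick up from that point via event_removal.
remove_tracks<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_removal, stream);

// The main stream continues with the sort.
sort_updated_tracks<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_main, stream);

// stream_fill: gated on remove_tracks, runs fill_inverted_ids.
cudaStreamWaitEvent(stream_fill, event_removal, 0);
fill_inverted_ids<<<grid, block, 0, stream_fill>>>(/* ... */);
cudaEventRecord(event_fill, stream_fill);

// stream_scan: gated on remove_tracks, runs the three scan kernels in order.
cudaStreamWaitEvent(stream_scan, event_removal, 0);
block_inclusive_scan<<<grid, block, 0, stream_scan>>>(/* ... */);
scan_block_offsets<<<grid, block, 0, stream_scan>>>(/* ... */);
add_block_offset<<<grid, block, 0, stream_scan>>>(/* ... */);
cudaEventRecord(event_scan, stream_scan);

// Join point: the main stream waits for both helper branches before the
// final two kernels run. (Waiting on event_main from the same stream would
// be a no-op, since work on one stream is already ordered.)
cudaStreamWaitEvent(stream, event_fill, 0);
cudaStreamWaitEvent(stream, event_scan, 0);
rearrange_tracks<<<grid, block, 0, stream>>>(/* ... */);
gather_tracks<<<grid, block, 0, stream>>>(/* ... */);
```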

@krasznaa
Member

🤔 That 5% benefit... Does it come from a multi-threaded throughput test?

I would rather keep each algorithm using just a single stream, so that we get overlap from kernels working on different events, instead of trying to minimise the latency of a single event's processing. 🤔

@beomki-yeo
Contributor Author

We can also overlap kernels from multiple events, but does that mean we should not use multiple streams within an algorithm?
As far as I know, we use fewer than 10 event-level streams, which is not going to exceed the maximum number of streams, and they don't show a clear speed improvement anyway (the timing just fluctuates a lot; please let me know if this has changed).

@beomki-yeo
Contributor Author

I found that there is no hard limit on the number of CUDA streams, so using a couple of streams in the ambiguity solver should be OK for our multi-threaded chain.

Member

@krasznaa krasznaa left a comment


It's, among other things, an API question for me.

Right now all the CUDA algorithms receive a traccc::cuda::stream object in their constructors, and they use that one stream to do all their work. This allows any framework that executes these algorithms to tell them how/where to run.

If you want to use multiple streams in an algorithm, I guess you could do this by making the algorithm receive something like std::array<traccc::cuda::stream, 3> in its constructor. But with all the other algorithms expecting just one stream, there really needs to be a strong reason for doing this.

Streams do need to have a long lifetime, as creating/deleting them is not cheap at all. So it is generally meant to be done "at the framework level". At least that's how I've been looking at the design of all of the traccc algorithms so far. 🤔
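
For illustration, the constructor shape suggested above might look like the following. This is a hypothetical sketch only: the class name, the pointer-based storage (chosen here on the assumption that stream objects are not cheaply copyable), and the stream count are assumptions, not traccc code.

```cpp
#include <array>
#include <cstddef>

// Forward declaration standing in for the real traccc class.
namespace traccc::cuda { class stream; }

// Hypothetical algorithm that receives all of its streams at construction
// time, following the existing one-stream-per-algorithm traccc pattern.
class greedy_ambiguity_resolution_algorithm {
public:
    static constexpr std::size_t n_streams = 3;

    explicit greedy_ambiguity_resolution_algorithm(
        std::array<traccc::cuda::stream*, n_streams> streams)
        : m_streams(streams) {}

private:
    // The streams are created once, at the framework level, and outlive the
    // algorithm, so per-event creation/destruction cost is avoided.
    std::array<traccc::cuda::stream*, n_streams> m_streams;
};
```

The framework would own the stream objects and pass the same three streams to the algorithm for its whole lifetime.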

Comment on lines 506 to 508
cudaStream_t stream_fill, stream_scan;
cudaStreamCreateWithFlags(&stream_fill, cudaStreamNonBlocking);
cudaStreamCreateWithFlags(&stream_scan, cudaStreamNonBlocking);
Member


Not only do you create the streams inside the execute function of the algorithm (as opposed to its constructor), you do it inside the algorithm's while-loop.

Plus I don't see any cudaStreamDestroy(...) statements.

Contributor Author

@beomki-yeo beomki-yeo Aug 20, 2025


Good catch. I added cudaStreamDestroy.

I took stream_fill and stream_scan out of the loop so they are created and destroyed only once per event.
I don't know why you think creating and destroying the streams is expensive; do you have a reliable reference on this? I believe it is on the microsecond scale, while the usual processing time per event is already a few hundred milliseconds. Unless we really need to save that <<0.1% of the run time, I don't think restructuring this PR is worth it 🤔
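
One way to settle the create/destroy cost question empirically is a quick host-side micro-benchmark. This is a sketch, not code from the PR; the absolute numbers will depend on the driver, device, and whether other work is in flight.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    constexpr int n = 1000;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n; ++i) {
        cudaStream_t s;
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        cudaStreamDestroy(s);
    }
    auto t1 = std::chrono::steady_clock::now();
    // Average cost of one create+destroy pair, in microseconds.
    std::printf("avg create+destroy: %.2f us\n",
        std::chrono::duration<double, std::micro>(t1 - t0).count() / n);
    return 0;
}
```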

Comment on lines +628 to +640
cudaEvent_t event_main, event_fill, event_scan;
cudaEventCreate(&event_main);
cudaEventCreate(&event_fill);
cudaEventCreate(&event_scan);

cudaEventRecord(event_main, stream);
cudaEventRecord(event_fill, stream_fill);
cudaEventRecord(event_scan, stream_scan);

// Synchronize the events with main stream
cudaStreamWaitEvent(stream, event_main, 0);
cudaStreamWaitEvent(stream, event_fill, 0);
cudaStreamWaitEvent(stream, event_scan, 0);
Member


These are completely equivalent to just:

TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream));
TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_fill));
TRACCC_CUDA_CHECK(cudaStreamSynchronize(stream_scan));

Creating events makes sense if you need to pass such events between independent code blocks. Here you just want to synchronize on the stream(s).

Contributor Author


TRACCC_CUDA_ERROR_CHECK instead of TRACCC_CUDA_CHECK

By the way, it is not obviously equivalent, as I see a crash like the following:

terminate called after throwing an instance of 'vecmem::cuda::runtime_error'
  what():  /mnt/nvme0n1/byeo/projects/traccc/traccc_build/_deps/vecmem-src/cuda/src/memory/managed_memory_resource.cpp:51 Failed to execute: cudaFree(p) (operation not permitted when stream is capturing)
Aborted (core dumped)
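
One plausible reason the two forms are not interchangeable here, assuming the failing configuration records the loop with CUDA graph stream capture (which the "stream is capturing" message suggests): host-side calls such as cudaStreamSynchronize, and allocation APIs like cudaFree, are not permitted while a stream is being captured, whereas event record/wait operations are captured as graph dependencies. A minimal sketch of the distinction, with placeholder kernels and elided setup:

```cuda
cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

kernel_a<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(event_a, stream);               // captured as a graph node
cudaStreamWaitEvent(stream_other, event_a, 0);  // captured cross-stream edge
kernel_b<<<grid, block, 0, stream_other>>>(/* ... */);

// The side stream must be joined back before capture ends.
cudaEventRecord(event_b, stream_other);
cudaStreamWaitEvent(stream, event_b, 0);

// cudaStreamSynchronize(stream);  // invalid during capture: the host may
//                                 // not synchronize with a capturing stream

cudaGraph_t graph;
cudaStreamEndCapture(stream, &graph);
```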

@beomki-yeo
Contributor Author

There is a strong reason: it improves the performance noticeably.

I understand the concern that this PR does not elegantly follow the design of traccc::stream or the hand-made API deeply embedded in traccc. But in my humble opinion, there is no real hazard in this approach (the stream creation and destruction overhead is negligible compared to the processing time). I also don't understand why @krasznaa does not want to benefit from multiple streams or CUDA graphs; those features exist because they are useful in advanced CUDA programming.

I don't think we have to restrain ourselves to a single stream per algorithm unless there is a strong reason for that.

I guess you could do this with making this algorithm receive something like std::array<traccc::cuda::stream, 3> in its constructor.

That's a neat idea, but I am afraid it would require too much engineering, far beyond the actual benefit.


@beomki-yeo beomki-yeo requested a review from krasznaa August 20, 2025 00:30
@beomki-yeo
Contributor Author

beomki-yeo commented Aug 29, 2025

Can we go ahead with this PR? We can roll back the change if it causes trouble in the full chain, though I don't think that would happen.
