[UR][CUDA][HIP] Unify queue handling between adapters #17641

npmiller · 2025-03-25T17:06:17Z

The CUDA and HIP adapters both use a nearly identical, complicated queue implementation that builds an out-of-order UR queue on top of in-order CUDA/HIP streams.

This patch extracts all of the queue logic into a separate templated class that can be used by both adapters. Beyond removing a lot of duplicated code, this also makes the queue much easier to maintain.
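To illustrate the general shape of such a shared class, here is a minimal sketch. It is not the code in `stream_queue.hpp`: the names `StreamQueueSketch`, `FakeCudaStream`, and `FakeHipStream` are hypothetical, the native handles are replaced by plain structs so the example compiles without CUDA or HIP, and the simple round-robin stream selection is an assumption made for illustration.

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Hypothetical stand-ins for the native in-order stream handles
// (CUstream / hipStream_t in the real adapters).
struct FakeCudaStream { int Id; };
struct FakeHipStream { int Id; };

// One generic queue, parameterized on the native stream type. It owns a
// fixed pool of in-order streams and hands them out round-robin, so that
// independent submissions can run out of order relative to each other.
template <typename StreamT, std::size_t NumStreams>
class StreamQueueSketch {
public:
  StreamQueueSketch() {
    for (std::size_t I = 0; I < NumStreams; ++I)
      Streams[I] = StreamT{static_cast<int>(I)};
  }

  // Returns the next stream in round-robin order. The atomic counter lets
  // several threads pick streams concurrently.
  StreamT &getNextStream() {
    std::size_t Idx =
        NextIdx.fetch_add(1, std::memory_order_relaxed) % NumStreams;
    return Streams[Idx];
  }

private:
  std::array<StreamT, NumStreams> Streams;
  std::atomic<std::size_t> NextIdx{0};
};

// Each adapter only instantiates the shared template with its own types.
using CudaQueueSketch = StreamQueueSketch<FakeCudaStream, 4>;
using HipQueueSketch = StreamQueueSketch<FakeHipStream, 4>;
```

With this layout, a fix made in the template applies to both instantiations at once, which is the maintenance benefit described above.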

There were a few functional differences between the queues in the two adapters, mostly due to fixes made in the CUDA adapter that were never ported to the HIP adapter. There may be more, but I found at least one race condition (#15100) and one performance issue (#6333) that weren't fixed in the HIP adapter.

This patch uses the CUDA version of the queue as the base for the generic queue, and thus fixes the race condition and performance issue mentioned above for HIP.

This code is quite complex, so this patch also aims to minimize any changes beyond the structural ones needed to share the code. However, it does make the following changes:

`stream_queue.hpp`:

* Remove `urDeviceRetain`/`urDeviceRelease`: essentially a no-op

CUDA:

* Rename `ur_stream_guard_` to `ur_stream_guard`
* Rename `getNextEventID` to `getNextEventId`
* Remove duplicate `get_device` getter, use `getDevice` instead

HIP:

* Fix queue finish so it doesn't fail when no streams need to be synchronized
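As an illustration of the queue-finish behavior described in the HIP item, here is a minimal sketch of a finish routine that treats "nothing to synchronize" as success. It is not the adapter's actual `urQueueFinish`: `Result`, `FakeStream`, `syncStream`, and `finishSketch` are hypothetical names, and `syncStream` merely stands in for a native stream synchronization call.

```cpp
#include <vector>

enum class Result { Success, Error };

// Hypothetical stand-in for a native stream handle.
struct FakeStream { bool Healthy; };

// Hypothetical stand-in for the native synchronize call.
inline Result syncStream(const FakeStream &S) {
  return S.Healthy ? Result::Success : Result::Error;
}

// Synchronizes every pending stream and stops at the first failure.
// An empty list never enters the loop, so it falls through to Success
// rather than reporting an error.
inline Result finishSketch(const std::vector<FakeStream> &Pending) {
  for (const FakeStream &S : Pending) {
    if (syncStream(S) != Result::Success)
      return Result::Error;
  }
  return Result::Success;
}
```

Returning early on failure means no result variable has to be carried across loop iterations, so there is no initial value that could leak out when the loop body never runs.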


aarongreig · 2025-03-26T12:11:48Z

unified-runtime/source/common/cuda-hip/stream_queue.hpp

+        LastSyncComputeStreams{0}, LastSyncTransferStreams{0}, Flags(Flags),
+        URFlags(URFlags), Priority(Priority), HasOwnership{BackendOwns} {
+    urContextRetain(Context);
+    urDeviceRetain(Device);


DeviceRetain/Release can be removed: it's a no-op unless we expect Device to be a subdevice (I have a ticket open to rename and better document those entry points).

aelovikov-intel

CODEOWNERS LGTM

npmiller · 2025-04-02T16:13:57Z

@intel/llvm-gatekeepers I believe this is ready to merge

Jenkins/Precommit: CI failed to start properly (it passed in the previous run before I merged the sycl branch, so should be fine)
PVC: issue with the PVC node (no gpu found)
Arc: issue with the Arc node (no gpu found)

And this patch only affects CUDA and HIP, so missing PVC and Arc testing shouldn't be an issue.

npmiller requested review from a team as code owners March 25, 2025 17:06

npmiller requested review from keyradical and ldrumm March 25, 2025 17:06

npmiller temporarily deployed to WindowsCILock March 25, 2025 17:06 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock March 25, 2025 17:37 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock March 26, 2025 09:40 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock March 26, 2025 10:01 — with GitHub Actions Inactive

npmiller added 3 commits March 26, 2025 11:19

[UR][HIP] Cleanup urQueueFinish

1c59e6e

Capturing the result is no longer needed

[UR] Move stream_queue.hpp to common UR directory

8b178a0

npmiller force-pushed the cuda-hip-common branch from cb3bdac to 8b178a0 Compare March 26, 2025 11:57

npmiller requested review from a team as code owners March 26, 2025 11:57

npmiller had a problem deploying to WindowsCILock March 26, 2025 11:57 — with GitHub Actions Error

aarongreig reviewed Mar 26, 2025

[UR][CUDA-HIP] Remove unnecessary device retain/release

924ac2f

Device retain release is a no-op

npmiller temporarily deployed to WindowsCILock March 26, 2025 12:15 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock March 26, 2025 12:25 — with GitHub Actions Inactive

aarongreig changed the title ~~[UR][CUDA][HIP] Unifiy queue handling between adapters~~ [UR][CUDA][HIP] Unify queue handling between adapters Mar 26, 2025

aarongreig approved these changes Mar 26, 2025

aelovikov-intel approved these changes Mar 26, 2025

keyradical reviewed Mar 27, 2025

unified-runtime/source/common/cuda-hip/stream_queue.hpp (resolved)

keyradical approved these changes Mar 27, 2025

jchlanda approved these changes Apr 2, 2025

Merge branch 'sycl' into cuda-hip-common

bd47804

npmiller temporarily deployed to WindowsCILock April 2, 2025 09:15 — with GitHub Actions Inactive

[UR][CUDA] Fix stream guard in async alloc

43f83a4

npmiller had a problem deploying to WindowsCILock April 2, 2025 09:26 — with GitHub Actions Error

npmiller temporarily deployed to WindowsCILock April 2, 2025 09:26 — with GitHub Actions Inactive

[UR][CUDA-HIP] Mark stream queue destructor as virtual

52de284

npmiller temporarily deployed to WindowsCILock April 2, 2025 09:45 — with GitHub Actions Inactive

npmiller temporarily deployed to WindowsCILock April 2, 2025 10:01 — with GitHub Actions Inactive

sarnex merged commit 24b7bc3 into intel:sycl Apr 2, 2025
38 of 44 checks passed
