[UR] [L0v2] Add support of out-of-order command buffers to L0 adapter v2 #18570

Xewar313 · 2025-05-20T14:18:30Z

No description provided.

pbalcer · 2025-05-20T15:00:49Z

This patch improves out of order command buffer performance as expected, see results below:

https://oneapi-src.github.io/unified-runtime/performance/?runs=Baseline_PVC_L0v2%2CPR18570_PVC_L0v2&tags=graph

igchor · 2025-05-20T15:10:36Z

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

+ur_exp_command_buffer_sync_point_t
+ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
+  auto syncPoint = NextSyncPoint++;
+  syncPoints[syncPoint] = event;


are syncPoints ever cleared or will they grow indefinitely?

Xewar313 · 2025-05-21

I forgot to clear the events in the destructor, but that is the only place where syncPoints are cleared. Since the user can add a dependency on any previously created sync point, we always need to remember all of them.

Xewar313 · 2025-05-21

To be honest, I have now realized that having a map there does not make much sense: since we increase the counter by one for every sync point created, we can just as well use a vector. That will be faster and simpler, and I think it may also solve the other issue you commented on below.
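A minimal sketch of the vector-based idea described above, not the adapter's actual code. `Event`, `EventHandle`, `SyncPoint`, and `SyncPointTable` are hypothetical stand-ins for the real UR types; the only assumption taken from the thread is that a sync point is a counter that increases by one per created sync point, so it can double as a vector index.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for the real UR event handle and sync point types.
struct Event {
  int id;
};
using EventHandle = Event *;
using SyncPoint = std::uint32_t;

class SyncPointTable {
public:
  // Registers an event and returns its sync point, which is simply the
  // index of the event in the vector.
  SyncPoint getSyncPoint(EventHandle event) {
    events.push_back(event);
    return static_cast<SyncPoint>(events.size() - 1);
  }

  // Translates sync points into the events to wait on. Each lookup is a
  // bounds-checked index rather than a hash-map search.
  std::vector<EventHandle>
  getWaitList(const std::vector<SyncPoint> &syncPoints) const {
    std::vector<EventHandle> waitList;
    waitList.reserve(syncPoints.size());
    for (SyncPoint sp : syncPoints) {
      if (sp >= events.size())
        throw std::out_of_range("unknown sync point");
      waitList.push_back(events[sp]);
    }
    return waitList;
  }

private:
  // Never shrinks: any earlier sync point may still be used as a dependency.
  std::vector<EventHandle> events;
};
```

Because every previously created sync point stays valid, the vector only grows, which matches the behaviour discussed in this thread.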

EwanC · 2025-05-21

They are allowed to grow indefinitely, up to the max size a uint32_t can hold, by how the spec is currently defined. If we really wanted, we could put an internal limit on it and return UR_RESULT_ERROR_OUT_OF_RESOURCES if that was violated, but that would be a v2 adapter specific limitation.

For a similar discussion see KhronosGroup/OpenCL-Docs#844

igchor · 2025-05-20T15:28:41Z

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

+
+ur_exp_command_buffer_sync_point_t
+ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
+  auto syncPoint = NextSyncPoint++;


Could we just make ur_exp_command_buffer_sync_point_t a pointer to the node in the map? (or an iterator?) That would speed up the search in getWaitListFromSyncPoints().

We would need to change ur_exp_command_buffer_sync_point_t type to be uint64_t but is that a problem? @EwanC?

EwanC · 2025-05-21T08:06:20Z

This patch improves out of order command buffer performance as expected, see results below:

https://oneapi-src.github.io/unified-runtime/performance/?runs=Baseline_PVC_L0v2%2CPR18570_PVC_L0v2&tags=graph

Noting that the SYCL benchmark also sees a performance improvement with this, though not as big. That is because the SYCL-RT doesn't set the in-order property on UR command-buffer creation unless the graph is perfectly linear (which isn't the case in the SYCL benchmark unless using the in-order variant).
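To illustrate what "perfectly linear" could mean here, the sketch below checks whether a dependency graph is a single chain. It is a hypothetical example, not the SYCL-RT heuristic: `Node` and `isSinglePath` are invented names, and the code assumes the graph is acyclic and that all predecessor indices are valid.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical graph node: indices of the nodes this node depends on.
struct Node {
  std::vector<std::size_t> predecessors;
};

// Returns true when the nodes form one dependency chain: exactly one root,
// and every node has at most one predecessor and at most one successor.
// Assumes an acyclic graph with valid indices. An empty graph is treated
// as linear, which is a choice made for this sketch.
bool isSinglePath(const std::vector<Node> &nodes) {
  if (nodes.empty())
    return true;
  std::vector<std::size_t> successorCount(nodes.size(), 0);
  std::size_t roots = 0;
  for (const Node &node : nodes) {
    if (node.predecessors.empty()) {
      ++roots;
      continue;
    }
    if (node.predecessors.size() > 1)
      return false; // a join: more than one incoming edge
    if (++successorCount[node.predecessors[0]] > 1)
      return false; // a fork: more than one outgoing edge
  }
  return roots == 1;
}
```

A graph with a fork or a join fails this check, so under such a heuristic it would be submitted without the in-order flag.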

igchor · 2025-05-21T14:56:51Z

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

+ur_exp_command_buffer_sync_point_t
+ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
+  syncPoints.push_back(event);
+  return static_cast<ur_exp_command_buffer_sync_point_t>(syncPoints.size() - 1);


nit: even though it's unlikely, can we just throw an exception when syncPoints.size() overflows uint32_t?
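One way the suggested check could look, as a sketch under assumptions: `SyncPoint` and `toSyncPoint` are hypothetical names, and `std::length_error` is used only for illustration, since the thread does not say which exception or error code the adapter should raise.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Assumed to be 32-bit, matching the uint32_t mentioned in this thread.
using SyncPoint = std::uint32_t;

// Converts a container index to a sync point, throwing instead of silently
// truncating when the index does not fit in 32 bits.
SyncPoint toSyncPoint(std::size_t index) {
  if (index > std::numeric_limits<SyncPoint>::max())
    throw std::length_error("sync point index overflows uint32_t");
  return static_cast<SyncPoint>(index);
}
```

Calling such a helper on `syncPoints.size() - 1` right after the `push_back` would turn a silent wrap-around into a visible failure.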

EwanC · 2025-05-21T16:11:25Z

Do we have any data on how this PR affects the performance of the GROMACS benchmark that was added to the automated benchmarking (#17934)? Not sure if all the issues with that are worked through yet. Although this patch makes the out-of-order compute-benchmark faster, that benchmark is embarrassingly parallel. A gromacs grappa PME graph structure looks like below, and currently won't have the in-order UR command flag set.

I'm wondering if I need to reconsider the heuristic/method in the SYCL-RT that decides whether UR command-buffer in-order flag is set as a follow-on from this PR.

pbalcer · 2025-05-21T16:14:56Z

The gromacs benchmark hasn't run successfully yet. Hopefully it will tomorrow (#18563 should solve it).

igchor · 2025-05-21T16:58:32Z

I'm wondering if I need to reconsider the heuristic/method in the SYCL-RT that decides whether UR command-buffer in-order flag is set as a follow-on from this PR.

@EwanC yeah, that would also help in applying the optimization in #18277 to graphs for more use-cases. Right now, I can only make a change that will avoid storing the last event when checkIfGraphIsSinglePath() is true

EwanC

The gromacs benchmark hasn't run successfully yet. Hopefully it will tomorrow (#18563 should solve it).

Nice, I see it now 🎉

unified-runtime/source/adapters/level_zero/v2/command_buffer.hpp

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

unified-runtime/source/adapters/level_zero/v2/command_buffer.hpp

pbalcer

mostly lgtm, just a few minor comments.

unified-runtime/source/adapters/level_zero/v2/command_buffer.cpp

unified-runtime/source/adapters/level_zero/v2/command_buffer.hpp

unified-runtime/source/adapters/level_zero/v2/context.hpp

unified-runtime/source/adapters/level_zero/v2/command_list_manager.cpp

Xewar313 · 2025-05-23T11:34:31Z

@intel/llvm-gatekeepers please merge

Mikołaj Komar added 3 commits May 19, 2025 13:26

Initial solution

bc4946d

Remove logging

3120832

First solution passing all tests

c1aca2b

Xewar313 requested review from a team as code owners May 20, 2025 14:18

Xewar313 requested a review from reble May 20, 2025 14:18

Reformat code and add event reset

9747878

igchor reviewed May 20, 2025

Replace unordered_map with vector

0f3154e

igchor reviewed May 21, 2025

igchor approved these changes May 21, 2025

EwanC reviewed May 22, 2025

pbalcer reviewed May 22, 2025

pbalcer approved these changes May 23, 2025

Apply PR changes

e05588d

EwanC approved these changes May 23, 2025

martygrant merged commit 3ac73a5 into intel:sycl May 23, 2025
33 checks passed
