Conversation

Xewar313
Contributor

No description provided.

@Xewar313 Xewar313 requested review from a team as code owners May 20, 2025 14:18
@Xewar313 Xewar313 requested a review from reble May 20, 2025 14:18
@Xewar313 Xewar313 temporarily deployed to WindowsCILock May 20, 2025 14:18 — with GitHub Actions Inactive
@Xewar313 Xewar313 temporarily deployed to WindowsCILock May 20, 2025 14:38 — with GitHub Actions Inactive
@pbalcer
Contributor

pbalcer commented May 20, 2025

This patch improves out-of-order command buffer performance as expected; see the results below:

[Chart: SubmitGraph 32 benchmark results]

https://oneapi-src.github.io/unified-runtime/performance/?runs=Baseline_PVC_L0v2%2CPR18570_PVC_L0v2&tags=graph

ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  auto syncPoint = NextSyncPoint++;
  syncPoints[syncPoint] = event;
Contributor

are syncPoints ever cleared or will they grow indefinitely?

Contributor Author

I had forgotten to clear the events in the destructor, but that is the only place where syncPoints are cleared - since the user can add a dependency on any previously created sync point, we need to remember all previously created syncPoints.

Contributor Author

To be honest, I now realize that having a map there does not make much sense - since we increase the counter by one for every sync point created, we can just as well use a vector. That will be faster and simpler - I think it may also solve the other issue you commented on below.
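
For illustration, a minimal standalone sketch of the vector idea, where the sync point value is simply the index into the vector (the struct and member names here are assumptions made for the sketch, not the actual patch):

#include <cstdint>
#include <vector>

// Standalone illustration: the sync point *is* the index into a vector,
// so both creation and lookup are O(1) and no map search is needed.
using sync_point_t = uint32_t; // mirrors ur_exp_command_buffer_sync_point_t
using event_handle_t = void *; // stand-in for ur_event_handle_t

struct SyncPointStore {
  std::vector<event_handle_t> events;

  sync_point_t add(event_handle_t ev) {
    events.push_back(ev);
    return static_cast<sync_point_t>(events.size() - 1);
  }

  event_handle_t lookup(sync_point_t sp) const {
    return events[sp]; // direct index instead of a map find
  }
};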

Contributor

They are allowed to grow indefinitely, up to the maximum value a uint32_t can hold, by how the spec is currently defined. If we really wanted, we could put an internal limit on it and return UR_RESULT_ERROR_OUT_OF_RESOURCES if that limit were exceeded, but that would be a v2-adapter-specific limitation.

For a similar discussion see KhronosGroup/OpenCL-Docs#844
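
As a rough sketch of what such an adapter-internal limit could look like (this helper is hypothetical; it assumes the UR headers and a std::vector<ur_event_handle_t> syncPoints container like the one discussed later in the thread):

#include <cstdint>
#include <limits>
#include <vector>
#include <ur_api.h>

// Hypothetical guard, not part of the spec or the patch: refuse to create a
// new sync point once the uint32_t id space would be exhausted.
ur_result_t createSyncPointChecked(std::vector<ur_event_handle_t> &syncPoints,
                                   ur_event_handle_t event,
                                   ur_exp_command_buffer_sync_point_t *outSyncPoint) {
  if (syncPoints.size() >= std::numeric_limits<uint32_t>::max())
    return UR_RESULT_ERROR_OUT_OF_RESOURCES; // v2-adapter-specific limit
  syncPoints.push_back(event);
  *outSyncPoint =
      static_cast<ur_exp_command_buffer_sync_point_t>(syncPoints.size() - 1);
  return UR_RESULT_SUCCESS;
}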


ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  auto syncPoint = NextSyncPoint++;
Contributor

Could we just make ur_exp_command_buffer_sync_point_t a pointer to the node in the map? (or an iterator?) That would speed up the search in getWaitListFromSyncPoints().

We would need to change the ur_exp_command_buffer_sync_point_t type to uint64_t, but is that a problem? @EwanC?
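
A rough sketch of the idea being floated here, purely for illustration (the encode/decode helpers are hypothetical and assume the spec type were widened to 64 bits as asked above):

#include <cstdint>

// Hypothetical: encode the event handle (or map-node pointer) directly in a
// 64-bit sync point value, so resolving a wait list needs no map search.
using wide_sync_point_t = uint64_t;
using event_handle_t = void *; // stand-in for ur_event_handle_t

wide_sync_point_t encodeSyncPoint(event_handle_t event) {
  return reinterpret_cast<wide_sync_point_t>(event);
}

event_handle_t decodeSyncPoint(wide_sync_point_t sp) {
  return reinterpret_cast<event_handle_t>(sp);
}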

@EwanC
Contributor

EwanC commented May 21, 2025

This patch improves out-of-order command buffer performance as expected; see the results below:

[Chart: SubmitGraph 32 benchmark results]

https://oneapi-src.github.io/unified-runtime/performance/?runs=Baseline_PVC_L0v2%2CPR18570_PVC_L0v2&tags=graph

Noting that the SYCL benchmark also sees a performance improvement with this, though not as big a one. That will be because the SYCL-RT doesn't set the in-order property on UR command-buffer creation unless the graph is perfectly linear, which isn't the case in the SYCL benchmark unless the in-order variant is used.
[Chart: SYCL benchmark results]

ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  syncPoints.push_back(event);
  return static_cast<ur_exp_command_buffer_sync_point_t>(syncPoints.size() - 1);
Contributor

nit: even though it's unlikely, can we just throw an exception when syncPoints.size() overflows uint32_t?
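
For illustration, the guard being asked for could look roughly like this variant of the snippet above (the exception type is just a placeholder; assumes <limits> and <stdexcept> are available, and conservatively reserves the maximum id value):

ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  syncPoints.push_back(event);
  // Hypothetical overflow guard: fail loudly instead of letting the
  // returned uint32_t sync point id wrap around.
  if (syncPoints.size() > std::numeric_limits<uint32_t>::max())
    throw std::runtime_error("command-buffer sync point count overflowed uint32_t");
  return static_cast<ur_exp_command_buffer_sync_point_t>(syncPoints.size() - 1);
}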

@EwanC
Contributor

EwanC commented May 21, 2025

Do we have any data on how this PR affects the performance of the GROMACS benchmark that was added to the automated benchmarking (#17934)? Not sure if all the issues with that are worked through yet. Although this patch makes the out-of-order compute-benchmark faster, that benchmark is embarrassingly parallel. A GROMACS grappa PME graph structure looks like the image below, and currently won't have the in-order UR command flag set.
[Image: GROMACS grappa PME graph structure]

I'm wondering if I need to reconsider the heuristic/method in the SYCL-RT that decides whether the UR command-buffer in-order flag is set, as a follow-on from this PR.

@pbalcer
Contributor

pbalcer commented May 21, 2025

The GROMACS benchmark hasn't run successfully yet. Hopefully it does tomorrow (#18563 should solve it).

@igchor
Contributor

igchor commented May 21, 2025

I'm wondering if I need to reconsider the heuristic/method in the SYCL-RT that decides whether the UR command-buffer in-order flag is set, as a follow-on from this PR.

@EwanC yeah, that would also help in applying the optimization in #18277 to graphs for more use cases. Right now, I can only make a change that avoids storing the last event when checkIfGraphIsSinglePath() is true.

Contributor

@EwanC EwanC left a comment

The GROMACS benchmark hasn't run successfully yet. Hopefully it does tomorrow (#18563 should solve it).

Nice, I see it now 🎉

Contributor

@pbalcer pbalcer left a comment

mostly lgtm, just a few minor comments.

@Xewar313
Contributor Author

@intel/llvm-gatekeepers please merge

@martygrant martygrant merged commit 3ac73a5 into intel:sycl May 23, 2025
33 checks passed