[UR] [L0v2] Add support of out-of-order command buffers to L0 adapter v2 #18570
This patch improves out-of-order command buffer performance as expected, see results below:
ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  auto syncPoint = NextSyncPoint++;
  syncPoints[syncPoint] = event;
are syncPoints ever cleared or will they grow indefinitely?
I forgot to clear the events in the destructor, but that is the only place where syncPoints are cleared: since a user can add a dependency on any previously created sync point, we always need to remember all of them.
To be honest, I have now realized that having a map there does not make much sense: since we increase the counter by one for every sync point created, we can just as well use a vector. That will be faster and simpler, and I think it may also solve the other issue that you commented on below.
They are allowed to grow indefinitely, up to the maximum value a uint32_t can hold, given how the spec is currently defined. If we really wanted to, we could put an internal limit on it and return UR_RESULT_ERROR_OUT_OF_RESOURCES if that was violated, but that would be a v2-adapter-specific limitation.
For a similar discussion see KhronosGroup/OpenCL-Docs#844
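As a rough illustration of the vector idea discussed above, here is a minimal sketch, not the adapter's actual code. The type aliases stand in for the real UR types, `SyncPointRegistry` is a hypothetical name, and the signature given to `getWaitListFromSyncPoints` is assumed for the example. Each sync point is simply the index of its event, so lookups need no map search.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-ins for the real UR types so the sketch compiles on its own.
using sync_point_t = std::uint32_t; // mirrors ur_exp_command_buffer_sync_point_t
struct event_t { int id; };         // hypothetical placeholder for an event object
using event_handle_t = event_t *;   // mirrors ur_event_handle_t

class SyncPointRegistry { // hypothetical name, for illustration only
public:
  // The sync point is the index of the event in the vector.
  sync_point_t getSyncPoint(event_handle_t event) {
    syncPoints.push_back(event);
    return static_cast<sync_point_t>(syncPoints.size() - 1);
  }

  // Translate a list of sync points into events with one indexed
  // lookup per entry (assumed signature).
  std::vector<event_handle_t>
  getWaitListFromSyncPoints(const std::vector<sync_point_t> &waitList) const {
    std::vector<event_handle_t> events;
    events.reserve(waitList.size());
    for (sync_point_t sp : waitList) {
      if (sp >= syncPoints.size())
        throw std::out_of_range("unknown sync point");
      events.push_back(syncPoints[sp]);
    }
    return events;
  }

  std::size_t size() const { return syncPoints.size(); }

private:
  // Entries are kept for the registry's whole lifetime, because a later
  // command may depend on any earlier sync point.
  std::vector<event_handle_t> syncPoints;
};
```

The entries are never erased while the registry is alive, which matches the point made in the thread that any earlier sync point can still be named as a dependency.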
ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  auto syncPoint = NextSyncPoint++;
Could we just make ur_exp_command_buffer_sync_point_t a pointer to the node in the map? (or an iterator?) That would speed up the search in getWaitListFromSyncPoints().
We would need to change ur_exp_command_buffer_sync_point_t type to be uint64_t but is that a problem? @EwanC?
ur_exp_command_buffer_sync_point_t
ur_exp_command_buffer_handle_t_::getSyncPoint(ur_event_handle_t event) {
  syncPoints.push_back(event);
  return static_cast<ur_exp_command_buffer_sync_point_t>(syncPoints.size() - 1);
nit: even though it's unlikely, can we just throw an exception when syncPoints.size() overflows uint32_t?
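One way the guard from this nit could look, as a sketch under assumptions: `addSyncPoint` is a hypothetical free function, the exception type is an arbitrary choice, and the `maxSyncPoints` parameter exists only so the limit can be exercised without allocating billions of entries. The default limit is slightly conservative, since it rejects the very last representable index.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

using sync_point_t = std::uint32_t; // mirrors ur_exp_command_buffer_sync_point_t

// Hypothetical helper: append an event and return its index, throwing
// instead of silently truncating once the limit has been reached.
template <typename Event>
sync_point_t addSyncPoint(
    std::vector<Event> &syncPoints, Event event,
    std::size_t maxSyncPoints = std::numeric_limits<sync_point_t>::max()) {
  if (syncPoints.size() >= maxSyncPoints)
    throw std::length_error("sync point index would overflow");
  syncPoints.push_back(event);
  return static_cast<sync_point_t>(syncPoints.size() - 1);
}
```

The check runs before `push_back`, so a rejected call leaves the vector unchanged. An adapter that prefers error codes could map the same condition to UR_RESULT_ERROR_OUT_OF_RESOURCES, as suggested earlier in the thread.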
Do we have any data on how this PR affects the performance of the [GROMACS benchmark that was added to the automated benchmarking](#17934)? Not sure if all the issues with that are worked through yet. Although this patch makes the out-of-order compute-benchmark faster, that benchmark is embarrassingly parallel. A GROMACS grappa PME graph structure looks like below, and currently won't have the in-order UR command flag set. I'm wondering if I need to reconsider the heuristic/method in the SYCL-RT that decides whether the UR command-buffer in-order flag is set, as a follow-on from this PR.
The gromacs benchmark hasn't run yet successfully. Hopefully it does tomorrow (#18563 this should solve it).
@EwanC yeah, that would also help in applying the optimization in #18277 to graphs for more use-cases. Right now, I can only make a change that will avoid storing the last event when checkIfGraphIsSinglePath() is true.
> The gromacs benchmark hasn't run yet successfully. Hopefully it does tomorrow (#18563 this should solve it).
Nice, I see it now 🎉
mostly lgtm, just a few minor comments.
unified-runtime/source/adapters/level_zero/v2/command_list_manager.cpp
@intel/llvm-gatekeepers please merge