[SYCL] Throttled Wait extension, proposal and implementation #15716
```cpp
while (e.get_info<sycl::info::event::command_execution_status>() !=
       sycl::info::event_command_status::complete) {
  std::this_thread::sleep_for(sleep);
}
e.wait();
```
This isn't guaranteed to work. From the specification:
> SYCL commands submitted to a queue are not guaranteed to begin executing until a host thread blocks on their completion. In the absence of multiple host threads, there is no guarantee that host and device code will execute concurrently.
Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait.
Would it be sufficient to call queue::ext_oneapi_prod() on the associated queue prior to the polling?
Unfortunately, no. `prod()` is also defined as a hint, and doesn't provide a strong guarantee that anything will actually start executing.
Pretty much everything related to the forward progress of the device as a whole is currently defined as a hint, because there are valid implementations (e.g., SimSYCL) where everything executed by the "device" is actually executed by the host thread which eventually calls wait.
Being able to reason about cases where a device could execute kernels concurrently with the host thread and/or request for that to happen would require some new extension work.
> Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait
This might not necessarily be the case. The spec wording you quote above is true in general, but the code being added here only needs to work for the DPC++ implementation. Does DPC++ already have a guarantee that commands will start executing even before wait is called? If not, we could add an internal function call here that does provide that guarantee.
> Does DPC++ already have a guarantee that commands will start executing even before `wait` is called? If not, we could add an internal function call here that does provide that guarantee.
Honestly, I'm not sure. I'm worried that the answer is really complicated, though, and depends on a bunch of configuration options.
OpenCL has similar wording to SYCL regarding the guarantees about when kernels execute, so I don't think DPC++ can provide that guarantee when running on the OpenCL backend. Our OpenCL implementation for GPUs used to batch kernels before execution, and unless that's changed recently I don't think kernels are guaranteed to begin execution. Our OpenCL implementation for CPUs has a mode where kernel execution begins immediately on a pool of TBB threads, and the host thread simply joins the pool when it reaches wait, but I don't know if that's the default.
For Level Zero, it will depend on the value of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, which according to the documentation takes different default values for Windows and Linux.
For the native CPU backend, I don't know for sure. Their behavior might be the same as TBB's above, or they might wait until wait to use all the logical cores for kernel execution.
For CUDA and HIP, I have no idea. I suspect that submitted kernels always begin executing on the GPU in practice, but I don't know if this is actually guaranteed by the runtime or not.
> This extension adds simple APIs for an alternate "sleeping" wait implementation. This is for scenarios (such as IoT) where one might want to trade a bit of performance in exchange for having the host CPU be more available, not burning cycles intently waiting.
I wonder if we could up-level this a little, and possibly even combine it with the extension that @steffenlarsen proposed over in #15704. They seem closely related, and as a user I don't think it would be clear when to prefer a "low powered event" vs a "throttled wait". It's also not clear what would happen if somebody tried to use these extensions together (i.e., by requesting a low-powered event and then waiting on it with throttling).
One simple idea would just be to implement the "low powered event" extension using throttling when running on an IoT device, and using hardware acceleration on systems where it's available.
Another (half-baked) idea would be to replace this with something like an "expected duration" property that could be passed to submit alongside a request for a "low-powered event". The implementation could then decide for itself whether to sleep or not, based on the expected duration of the events it's waiting on, and any information it can query about whether certain commands have already begun executing.
Yes, I agree here. This seems very similar to #15704, and it seems like we should have a common extension API.
Assuming the implementation is indeed as trivial as the one here suggests (see https://github.com/intel/llvm/pull/15716/files#r1802565892 for a counter-comment), I am not convinced that we need an extension for it: any user who needs this could implement the behavior themselves.
This PR proposes a new extension for a `wait` that sleeps rather than running the CPU full tilt, as has been requested for IoT and similar applications. Because it is fairly trivial, I am including an implementation.