[SYCL] Throttled Wait extension, proposal and implementation #15716
```cpp
while (e.get_info<sycl::info::event::command_execution_status>() !=
       sycl::info::event_command_status::complete) {
  std::this_thread::sleep_for(sleep);
}
e.wait();
```
This isn't guaranteed to work. From the specification:
> SYCL commands submitted to a queue are not guaranteed to begin executing until a host thread blocks on their completion. In the absence of multiple host threads, there is no guarantee that host and device code will execute concurrently.
Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait.
Would it be sufficient to call queue::ext_oneapi_prod() on the associated queue prior to the polling?
Unfortunately, no. `prod()` is also defined as a hint, and doesn't provide a strong guarantee that anything will actually start executing.
Pretty much everything related to the forward progress of the device as a whole is currently defined as a hint, because there are valid implementations (e.g., SimSYCL) where everything executed by the "device" is actually executed by the host thread which eventually calls wait.
Being able to reason about cases where a device could execute kernels concurrently with the host thread and/or request for that to happen would require some new extension work.
> Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait
This might not necessarily be the case. The spec wording you quote above is true in general, but the code being added here only needs to work for the DPC++ implementation. Does DPC++ already have a guarantee that commands will start executing even before wait is called? If not, we could add an internal function call here that does provide that guarantee.
> Does DPC++ already have a guarantee that commands will start executing even before `wait` is called? If not, we could add an internal function call here that does provide that guarantee.
Honestly, I'm not sure. I'm worried that the answer is really complicated, though, and depends on a bunch of configuration options.
OpenCL has similar wording to SYCL regarding the guarantees about when kernels execute, so I don't think DPC++ can provide that guarantee when running on the OpenCL backend. Our OpenCL implementation for GPUs used to batch kernels before execution, and unless that's changed recently I don't think kernels are guaranteed to begin execution. Our OpenCL implementation for CPUs has a mode where kernel execution begins immediately on a pool of TBB threads, and the host thread simply joins the pool when it reaches wait, but I don't know if that's the default.
For Level Zero, it will depend on the value of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, which according to the documentation takes different default values for Windows and Linux.
For the native CPU backend, I don't know for sure. Their behavior might be the same as TBB's above, or they might wait until wait to use all the logical cores for kernel execution.
For CUDA and HIP, I have no idea. I suspect that submitted kernels always begin executing on the GPU in practice, but I don't know if this is actually guaranteed by the runtime or not.
> This extension adds simple APIs for an alternate "sleeping" wait implementation. This is for scenarios (such as IoT) where one might want to trade a bit of performance in exchange for having the host CPU be more available, not burning cycles intently waiting.
I wonder if we could up-level this a little, and possibly even combine it with the extension that @steffenlarsen proposed over in #15704. They seem closely related, and as a user I don't think it would be clear when to prefer a "low powered event" vs a "throttled wait". It's also not clear what would happen if somebody tried to use these extensions together (i.e., by requesting a low-powered event and then waiting on it with throttling).
One simple idea would just be to implement the "low powered event" extension using throttling when running on an IoT device, and using hardware acceleration on systems where it's available.
Another (half-baked) idea would be to replace this with something like an "expected duration" property that could be passed to submit alongside a request for a "low-powered event". The implementation could then decide for itself whether to sleep or not, based on the expected duration of the events it's waiting on, and any information it can query about whether certain commands have already begun executing.
Yes, I agree here. This seems very similar to #15704, and it seems like we should have a common extension API.
Assuming the implementation is indeed as trivial as the one here suggests (see https://github.com/intel/llvm/pull/15716/files#r1802565892 for a counter-comment), I am not convinced that we need an extension for it: any user who needs this could implement the behavior themselves.
This PR proposes a new extension for a `wait` that sleeps rather than running the CPU full tilt, as has been requested for IoT and similar applications. Because it is fairly trivial, I am including an implementation.