Conversation

@cperkinsintel
Contributor

This PR proposes a new extension for a wait that sleeps between polls rather than running the CPU at full tilt, as has been requested for IoT and similar applications.

Because it is fairly trivial, I am including an implementation.

Comment on lines +27 to +31
```cpp
while (e.get_info<sycl::info::event::command_execution_status>() !=
       sycl::info::event_command_status::complete) {
  std::this_thread::sleep_for(sleep);
}
e.wait();
```
Contributor

This isn't guaranteed to work. From the specification:

SYCL commands submitted to a queue are not guaranteed to begin executing until a host thread blocks on their completion. In the absence of multiple host threads, there is no guarantee that host and device code will execute concurrently.

Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait.
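One way to sidestep that hazard is to perform the blocking wait on a helper thread, which satisfies the spec's "a host thread blocks on their completion" condition, while the calling thread sleeps until completion is flagged. A minimal sketch in plain C++ (the `sleeping_wait` helper is invented for illustration; the callback stands in for the blocking `e.wait()` of a real SYCL program):

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Sketch: run the blocking wait on a helper thread so that some host
// thread is always blocked on completion (the condition the SYCL spec
// requires for commands to be guaranteed to begin executing), while the
// caller sleeps instead of busy-waiting.
void sleeping_wait(const std::function<void()> &blocking_wait,
                   std::chrono::milliseconds sleep) {
  std::atomic<bool> done{false};
  std::thread waiter([&] {
    blocking_wait();  // e.g. e.wait() in a real SYCL program
    done.store(true, std::memory_order_release);
  });
  // The calling thread yields the CPU between polls of the flag.
  while (!done.load(std::memory_order_acquire))
    std::this_thread::sleep_for(sleep);
  waiter.join();
}
```

This trades one extra host thread for the forward-progress guarantee that pure status polling lacks.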

Contributor

Would it be sufficient to call queue::ext_oneapi_prod() on the associated queue prior to the polling?

Contributor

Unfortunately, no. prod() is also defined as a hint, and doesn't provide a strong guarantee that anything will actually start executing.

Pretty much everything related to the forward progress of the device as a whole is currently defined as a hint, because there are valid implementations (e.g., SimSYCL) where everything executed by the "device" is actually executed by the host thread which eventually calls wait.

Being able to reason about cases where a device could execute kernels concurrently with the host thread, or to request that it do so, would require some new extension work.

Contributor

Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait

This might not necessarily be the case. The spec wording you quote above is true in general, but the code being added here only needs to work for the DPC++ implementation. Does DPC++ already have a guarantee that commands will start executing even before wait is called? If not, we could add an internal function call here that does provide that guarantee.

Contributor

Does DPC++ already have a guarantee that commands will start executing even before wait is called? If not, we could add an internal function call here that does provide that guarantee.

Honestly, I'm not sure. I'm worried that the answer is really complicated, though, and depends on a bunch of configuration options.

OpenCL has similar wording to SYCL regarding the guarantees about when kernels execute, so I don't think DPC++ can provide that guarantee when running on the OpenCL backend. Our OpenCL implementation for GPUs used to batch kernels before execution, and unless that's changed recently I don't think kernels are guaranteed to begin execution. Our OpenCL implementation for CPUs has a mode where kernel execution begins immediately on a pool of TBB threads, and the host thread simply joins the pool when it reaches wait, but I don't know if that's the default.

For Level Zero, it will depend on the value of SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS, which according to the documentation takes different default values for Windows and Linux.

For the native CPU backend, I don't know for sure. Their behavior might be the same as TBB's above, or they might wait until wait to use all the logical cores for kernel execution.

For CUDA and HIP, I have no idea. I suspect that submitted kernels always begin executing on the GPU in practice, but I don't know if this is actually guaranteed by the runtime or not.

Comment on lines +57 to +60
This extension adds simple APIs for an alternate "sleeping" wait implementation.
This is for scenarios (such as IoT) where one might want to trade a bit of
performance in exchange for keeping the host CPU more available, rather than
burning cycles in a busy-wait.
Contributor

I wonder if we could up-level this a little, and possibly even combine it with the extension that @steffenlarsen proposed over in #15704. They seem closely related, and as a user I don't think it would be clear when to prefer a "low powered event" vs a "throttled wait". It's also not clear what would happen if somebody tried to use these extensions together (i.e., by requesting a low-powered event and then waiting on it with throttling).

One simple idea would just be to implement the "low powered event" extension using throttling when running on an IoT device, and using hardware acceleration on systems where it's available.

Another (half-baked) idea would be to replace this with something like an "expected duration" property that could be passed to submit alongside a request for a "low-powered event". The implementation could then decide for itself whether to sleep, based on the expected duration of the events it is waiting on and any information it can query about whether certain commands have already begun executing.
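To make the "expected duration" idea slightly more concrete, here is a pure C++ sketch of the kind of decision an implementation could make internally. The names and the 1 ms threshold are invented for illustration and are not part of any proposed SYCL API:

```cpp
#include <chrono>

enum class WaitStrategy { Busy, Sleep };

// Hypothetical heuristic: below some threshold, the latency of putting a
// thread to sleep and waking it likely exceeds the time saved, so a
// busy-wait is preferable; above it, sleeping saves power. The 1 ms
// threshold is purely illustrative.
WaitStrategy choose_strategy(std::chrono::microseconds expected_duration) {
  constexpr auto threshold = std::chrono::microseconds(1000);
  return expected_duration < threshold ? WaitStrategy::Busy
                                       : WaitStrategy::Sleep;
}
```

An implementation could also fold in runtime information (e.g. whether the command has already begun executing) rather than relying on the hint alone.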

Contributor

Yes, I agree here. This seems very similar to #15704, and it seems like we should have a common extension API.

Contributor

@steffenlarsen left a comment

Assuming the implementation is indeed as trivial as the one included here suggests (see https://github.com/intel/llvm/pull/15716/files#r1802565892 for a counter comment), I am not convinced that we need an extension for it, as it seems any user who needs this behavior could just implement it themselves.
