[SYCL] Throttled Wait extension, proposal and implementation #15716
@@ -0,0 +1,165 @@
= sycl_ext_oneapi_throttled_wait

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]
:endnote: —{nbsp}end{nbsp}note

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}
== Notice

[%hardbreaks]
Copyright (C) 2024 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc.
OpenCL(TM) is a trademark of Apple Inc. used by permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 8 specification.
All references below to the "core SYCL specification" or to section numbers in
the SYCL specification refer to that revision.
== Status

This is an experimental extension specification, intended to provide early
access to features and gather community feedback. Interfaces defined in
this specification are implemented in DPC++, but they are not finalized
and may change incompatibly in future versions of DPC++ without prior notice.
*Shipping software products should not rely on APIs defined in
this specification.*
== Overview

This extension adds simple APIs for an alternate "sleeping" wait implementation.
It is intended for scenarios (such as IoT) where one might want to trade a bit
of performance in exchange for keeping the host CPU more available, rather than
burning cycles in a busy wait.
== Specification

=== Additional Inclusion

As throttled_wait is presently an experimental extension, it requires an
additional header inclusion to use.

```c++
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/experimental/throttled_wait.hpp>

// now the extension API is available
```
=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification.
An implementation supporting this extension must predefine the macro
`SYCL_EXT_ONEAPI_THROTTLED_WAIT`
to one of the values defined in the table below.
Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===
=== API of the extension

This extension adds the following free functions, where `sleep` is any
specialization of `std::chrono::duration` (e.g. `std::chrono::milliseconds`,
`std::chrono::microseconds`, etc.).

For each of these calls, while waiting for the SYCL object to complete, the
host thread sleeps for the given sleep duration between status checks.
```c++
namespace sycl::ext::oneapi::experimental {

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait(sycl::event& e, const std::chrono::duration<Rep, Period>& sleep);

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait(std::vector<sycl::event>& eventList, const std::chrono::duration<Rep, Period>& sleep);

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait_and_throw(sycl::event& e, const std::chrono::duration<Rep, Period>& sleep);

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait_and_throw(std::vector<sycl::event>& eventList, const std::chrono::duration<Rep, Period>& sleep);

} // namespace sycl::ext::oneapi::experimental
```
== Example

The following example demonstrates simple usage of this API.

```c++
#include <sycl/sycl.hpp>
#include <sycl/ext/oneapi/experimental/throttled_wait.hpp>
#include <iostream>
namespace syclex = sycl::ext::oneapi::experimental;

constexpr uint64_t N = 1000000000; // a very big N for looping.

int main() {
  sycl::queue q;
  uint64_t a = 0;

  {
    sycl::buffer<uint64_t, 1> buf(&a, sycl::range<1>(1));

    sycl::event e = q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc(buf, cgh, sycl::read_write);
      cgh.single_task<class hello_world>([=]() {
        for (uint64_t i = 0; i < N; i++) {
          acc[0] = acc[0] + 1;
        }
      });
    });
#ifdef SYCL_EXT_ONEAPI_THROTTLED_WAIT
    syclex::ext_oneapi_throttled_wait(e, std::chrono::milliseconds(100));
#else
    e.wait();
#endif
  } // buffer goes out of scope, data copied back to 'a'.

  std::cout << "a: " << a << std::endl;

  return 0;
}
```
@@ -0,0 +1,72 @@
//===------- throttled_wait.hpp - sleeping implementation of wait ------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#pragma once

#include <chrono>
#include <thread>

// The throttled_wait extension requires the inclusion of this header.
// If we instead want it to be included with sycl.hpp, then this definition
// will need to be removed from here and
// added to llvm/sycl/source/feature_test.hpp.in instead.
#define SYCL_EXT_ONEAPI_THROTTLED_WAIT 1

namespace sycl {
inline namespace _V1 {
namespace ext::oneapi::experimental {

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait(
    sycl::event &e, const std::chrono::duration<Rep, Period> &sleep) {
  while (e.get_info<sycl::info::event::command_execution_status>() !=
         sycl::info::event_command_status::complete) {
    std::this_thread::sleep_for(sleep);
  }
  e.wait();
Comment on lines +27 to +31:

> This isn't guaranteed to work. From the specification: [...] Polling on the event status could put an application into an infinite loop, because you'll never reach the call to wait.

> Would it be sufficient to call [...]?

> Unfortunately, no. Pretty much everything related to the forward progress of the device as a whole is currently defined as a hint, because there are valid implementations (e.g., SimSYCL) where everything executed by the "device" is actually executed by the host thread which eventually calls [...]. Being able to reason about cases where a device could execute kernels concurrently with the host thread and/or request for that to happen would require some new extension work.

> This might not necessarily be the case. The spec wording you quote above is true in general, but the code being added here only needs to work for the DPC++ implementation. Does DPC++ already have a guarantee that commands will start executing even before [...]?

> Honestly, I'm not sure. I'm worried that the answer is really complicated, though, and depends on a bunch of configuration options. OpenCL has similar wording to SYCL regarding the guarantees about when kernels execute, so I don't think DPC++ can provide that guarantee when running on the OpenCL backend. Our OpenCL implementation for GPUs used to batch kernels before execution, and unless that's changed recently I don't think kernels are guaranteed to begin execution. Our OpenCL implementation for CPUs has a mode where kernel execution begins immediately on a pool of TBB threads, and the host thread simply joins the pool when it reaches [...]. For Level Zero, it will depend on the value of [...]. For the native CPU backend, I don't know for sure. Their behavior might be the same as TBB's above, or they might wait until [...]. For CUDA and HIP, I have no idea. I suspect that submitted kernels always begin executing on the GPU in practice, but I don't know if this is actually guaranteed by the runtime or not.
}

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait(
    std::vector<sycl::event> &eventList,
    const std::chrono::duration<Rep, Period> &sleep) {
  for (sycl::event &e : eventList) {
    while (e.get_info<sycl::info::event::command_execution_status>() !=
           sycl::info::event_command_status::complete) {
      std::this_thread::sleep_for(sleep);
    }
    e.wait();
  }
}

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait_and_throw(
    sycl::event &e, const std::chrono::duration<Rep, Period> &sleep) {
  while (e.get_info<sycl::info::event::command_execution_status>() !=
         sycl::info::event_command_status::complete) {
    std::this_thread::sleep_for(sleep);
  }
  e.wait_and_throw();
}

template <typename Rep, typename Period>
void ext_oneapi_throttled_wait_and_throw(
    std::vector<sycl::event> &eventList,
    const std::chrono::duration<Rep, Period> &sleep) {
  for (sycl::event &e : eventList) {
    while (e.get_info<sycl::info::event::command_execution_status>() !=
           sycl::info::event_command_status::complete) {
      std::this_thread::sleep_for(sleep);
    }
    e.wait_and_throw();
  }
}

} // namespace ext::oneapi::experimental
} // namespace _V1
} // namespace sycl
@@ -0,0 +1,111 @@
// RUN: %{build} -o %t.out
// RUN: %{run} %t.out

#include <sycl/ext/oneapi/experimental/throttled_wait.hpp>
#include <sycl/sycl.hpp>

#include <cassert>
#include <iostream>

namespace syclex = sycl::ext::oneapi::experimental;

// a very big N for looping in long running kernel
constexpr uint64_t N = 1000000000;

void test_wait_and_throw(sycl::queue &q) {
  try {
    sycl::event e = q.submit([&](sycl::handler &CGH) {
      CGH.host_task([=]() {
        throw std::runtime_error("Exception thrown from host_task.");
      });
    });
    syclex::ext_oneapi_throttled_wait_and_throw(e,
                                                std::chrono::milliseconds(100));

    assert(false &&
           "We should not be here. Exception should have been thrown.");
  } catch (std::runtime_error &e) {
    assert(std::string(e.what()) == "Exception thrown from host_task.");
    std::cout << "Caught exception: " << e.what() << std::endl;
  }
}

void test_wait(sycl::queue &q) {
  // fast kernel
  sycl::event fast =
      q.submit([&](sycl::handler &cgh) { cgh.single_task([=]() {}); });
  syclex::ext_oneapi_throttled_wait(fast, std::chrono::milliseconds(100));

  // slow kernel
  uint64_t a = 0;
  {
    sycl::buffer<uint64_t, 1> buf(&a, sycl::range<1>(1));

    sycl::event slow = q.submit([&](sycl::handler &cgh) {
      sycl::accessor acc(buf, cgh, sycl::read_write);
      cgh.single_task<class hello_world>([=]() {
        for (uint64_t i = 0; i < N; i++) {
          acc[0] = acc[0] + 1;
        }
      });
    });
    syclex::ext_oneapi_throttled_wait(slow, std::chrono::milliseconds(100));
  } // buffer goes out of scope, data copied back to 'a'.

  std::cout << "a: " << a << std::endl;
  assert(a == N);

  // Ensure compatible with discarded events.
  auto DiscardedEvent = q.ext_oneapi_submit_barrier();
  syclex::ext_oneapi_throttled_wait(DiscardedEvent,
                                    std::chrono::milliseconds(100));
}

std::vector<sycl::event> create_event_list(sycl::queue &q) {
  std::vector<sycl::event> events;
  sycl::event slow = q.submit([&](sycl::handler &cgh) {
    cgh.single_task([=]() {
      for (uint64_t i = 0; i < N; i++) {
      }
    });
  });
  events.push_back(slow);

  sycl::event fast =
      q.submit([&](sycl::handler &cgh) { cgh.single_task([=]() {}); });
  events.push_back(fast);

  sycl::event DiscardedEvent = q.ext_oneapi_submit_barrier();
  events.push_back(DiscardedEvent);

  return events;
}

void test_wait_event_list(sycl::queue &q) {
  auto events = create_event_list(q);
  syclex::ext_oneapi_throttled_wait(events, std::chrono::milliseconds(100));
}

void test_wait_and_throw_event_list(sycl::queue &q) {
  auto events = create_event_list(q);
  syclex::ext_oneapi_throttled_wait_and_throw(events,
                                              std::chrono::milliseconds(100));
}

int main() {
  auto asyncHandler = [](sycl::exception_list el) {
    for (auto &e : el) {
      std::rethrow_exception(e);
    }
  };
  sycl::queue q(asyncHandler);

#ifdef SYCL_EXT_ONEAPI_THROTTLED_WAIT
  test_wait(q);
  test_wait_and_throw(q);
  test_wait_event_list(q);
  test_wait_and_throw_event_list(q);
#else
  assert(false &&
         "SYCL_EXT_ONEAPI_THROTTLED_WAIT feature test macro not defined");
#endif

  return 0;
}
> I wonder if we could up-level this a little, and possibly even combine it with the extension that @steffenlarsen proposed over in #15704. They seem closely related, and as a user I don't think it would be clear when to prefer a "low-powered event" vs a "throttled wait". It's also not clear what would happen if somebody tried to use these extensions together (i.e., by requesting a low-powered event and then waiting on it with throttling).
>
> One simple idea would just be to implement the "low-powered event" extension using throttling when running on an IoT device, and using hardware acceleration on systems where it's available.
>
> Another (half-baked) idea would be to replace this with something like an "expected duration" property that could be passed to submit alongside a request for a "low-powered event". The implementation could then decide for itself whether to sleep or not, based on the expected duration of the events it's waiting on, and any information it can query about whether certain commands have already begun executing.

> Yes, I agree here. This seems very similar to #15704, and it seems like we should have a common extension API.