Skip to content
Closed
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions sycl/include/syclcompat/launch.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -118,6 +118,22 @@ launch(const dim3 &grid, const dim3 &threads, Args... args) {
return launch<F>(grid, threads, get_default_queue(), args...);
}

// Overload taking zero-argument function object
template <class F, int Dim>
std::enable_if_t<std::is_invocable_v<const F &>, sycl::event>
launch(const F &f, const sycl::nd_range<Dim> &range,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure what order of argument is best:

  • Advantages of lambda first (this): function name (as part of lambda) is first like in cuda and other overloads
  • Advantage of lambda last (like in https://github.com/codeplaysoftware/cutlass-fork/pull/305/files): consistent with SYCL parallel_for and sycl_khr_free_function_commands. Function arguments (as part of lambda) are last like in cuda and other overloads.

sycl::queue q = get_default_queue()) {
return q.parallel_for(detail::transform_nd_range<Dim>(range),
[=](sycl::nd_item<Dim>) { f(); });
}
// Alternative launch through dim3 objects
template <class F>
std::enable_if_t<std::is_invocable_v<const F &>, sycl::event>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that this is unchanged, but I think you will want an overload of these launch functions that do not return an event, and that use the new enqueue_functions API (nd_launch): see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_enqueue_functions.asciidoc
This is much closer to native cuda and does lead to a very noticeable performance improvement for small kernels (relevant to cutlass), because the overhead of creating events in cuda/hip is big.

I do not know whether having the function calling nd_launch in the implementation might change other design decisions here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we want a version which doesn't return an event and calls nd_launch. I don't think internally calling nd_launch will change the design decision here.

An overload isn't a good way to add a version which doesn't return an event. Given that one can't overload by return type, it would need to be a property or something like it to opt-out of the event. But getting an event back should be opt-in nor opt-out.

It would be best if we could just remove the return value. But I assume we don't want to break API.

If so I think we need to add a new function like launch2 or launch_fast or something similar which doesn't return an event.

launch(const F &f, const dim3 &grid, const dim3 &threads,
sycl::queue q = get_default_queue()) {
return launch(f, sycl::nd_range<3>{grid * threads, threads}, q);
}

} // namespace syclcompat

namespace syclcompat::experimental {
Expand Down
Loading