Skip to content
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
cc71f38
[SYCL] Postpone creation of HostKernel copy
Alexandr-Konovalov Sep 29, 2025
c917410
Fix code formatting.
Alexandr-Konovalov Sep 29, 2025
941a5ff
Fix code formatting.
Alexandr-Konovalov Sep 29, 2025
6258fed
Make HostKernelRef::InstantiateKernelOnHost() empty.
Alexandr-Konovalov Sep 30, 2025
5e89d2f
Fix code formatting.
Alexandr-Konovalov Sep 30, 2025
a90a0ea
Enable move semantic for kernels.
Alexandr-Konovalov Oct 1, 2025
f65d5ba
Fix code formatting.
Alexandr-Konovalov Oct 1, 2025
91deab6
Return unique_ptr from takeOrCopyOwnership().
Alexandr-Konovalov Oct 1, 2025
cf65f74
Add unit test.
Alexandr-Konovalov Oct 1, 2025
4c85aa5
Fix code formatting.
Alexandr-Konovalov Oct 1, 2025
63b7572
Update sycl/include/sycl/detail/cg_types.hpp
Alexandr-Konovalov Oct 1, 2025
855d6a2
Remove redundant cast.
Alexandr-Konovalov Oct 1, 2025
cb688cd
Update sycl/include/sycl/detail/cg_types.hpp
Alexandr-Konovalov Oct 1, 2025
b81d48b
Use C++17-style metafunctions.
Alexandr-Konovalov Oct 1, 2025
a64c17c
Addressing code review.
Alexandr-Konovalov Oct 2, 2025
8de865a
Fix code formatting.
Alexandr-Konovalov Oct 2, 2025
79d19a9
Delete assignment operator and add layout test for HostKernelRefBase.
Alexandr-Konovalov Oct 2, 2025
a091ac4
Explicitely delete move ctor for HostKernelRef.
Alexandr-Konovalov Oct 2, 2025
5de3ae6
Update sycl_symbols_(linux|windows).dump
Alexandr-Konovalov Oct 2, 2025
06e3a92
Explicitely delete copy ctor from HostKernelRefBase.
Alexandr-Konovalov Oct 2, 2025
6d38d49
Merge branch 'sycl' into Alexandr-Konovalov/vptr_HostKernelRefBase
Alexandr-Konovalov Oct 2, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
87 changes: 87 additions & 0 deletions sycl/include/sycl/detail/cg_types.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -235,6 +235,93 @@ class HostKernel : public HostKernelBase {
#endif
};

// the class keeps reference to a lambda allocated externally on stack
class HostKernelRefBase : public HostKernelBase {
public:
virtual std::shared_ptr<HostKernelBase> takeOrCopyOwnership() const = 0;
};

template <class KernelType, class KernelArgType, int Dims>
class HostKernelRef : public HostKernelRefBase {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to have HostKernelRefBase->HostKernelRef (we create objects once and then use in several places by referring to the type-erased base type) and HostKernelRef->HostKernelRefImpl or something like that, but I see how aligning with a poor name choice for the existing HostKernelBase is a counterargument to that.

Maybe adding

template <typename KernelType>
static HostKernelRef HostKernelRef::create(KernelType &&Kernel) {}

and avoiding using the HostKernelRefImpl would help with clarity too (and we'd isolate choosing the right template (rvalue ref vs lvalue ref) to this single create function).

@vinser52 , WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I did not get you idea and why do we need one more derived class (HostKernelRefImpl)? What are you trying to achieve with that?

Maybe adding

template <typename KernelType>
static HostKernelRef HostKernelRef::create(KernelType &&Kernel) {}

Are you trying to avoid HostKernelref specialization for the const MKerne &?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most APIs will be accepting HostKernelRefBase, I'm trying to avoid the Base suffix in those names. I'm not suggesting any new types in the hierarchy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, now I got your point.

HostKernelRefBase->HostKernelRef

This means rename. For me *Base clearly indicates that we are dealing with the base class. But it is subjective.

const KernelType &MKernel;

public:
HostKernelRef(const KernelType &Kernel) : MKernel(Kernel) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we delete copy ctor here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to create HostKernelRef from constant reference, as in sycl/include/sycl/queue.hpp, so we can't.

  HostKernelRef<KernelType, KernelTypeUniversalRef, TransformedArgType, Dims>
      HostKernel(std::forward<KernelTypeUniversalRef>(KernelFunc));

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean add HostKernelRef(const HostKernelRef&) = delete;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly delete copy ctor from HostKernelRefBase?


virtual char *getPtr() override {
return const_cast<char *>(reinterpret_cast<const char *>(&MKernel));
}
virtual std::shared_ptr<HostKernelBase> takeOrCopyOwnership() const override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, this should std::unique_ptr because it has no overhead and one can always easily create shared via unique_ptr::release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally agree. Overhead is the creation/destructor of unique_ptr, meanwhile caller needs shared_ptr. (And for shared_ptr we return 2 pointers vs 1 for unique_ptr, so it's hard to judge). Is that a chance that someday caller would need unique_ptr?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overhead is the creation/destructor of unique_ptr

is exactly zero with optimizations enabled: https://godbolt.org/z/fcaos1Wr7

That is not true for std::shared_ptr (which not only has extra memory alloc but also involves atomics).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good point. Done.

std::shared_ptr<HostKernelBase> Kernel;
Kernel.reset(new HostKernel<KernelType, KernelArgType, Dims>(MKernel));
return Kernel;
}

~HostKernelRef() noexcept override = default;
#ifndef __INTEL_PREVIEW_BREAKING_CHANGES
// This function is needed for host-side compilation to keep kernels
// instantitated. This is important for debuggers to be able to associate
// kernel code instructions with source code lines.
// NOTE: InstatiateKernelOnHost() should not be called.
void InstantiateKernelOnHost() override {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be empty. HostKernelRef<...> instantiates HostKernel<...> on line 256, and its InstantiateKernelOnHost already does the right thing (outside preview). And for preview we need a mechanism that doesn't require copy-paste.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it can be empty. I try to describe the reason in comment.

And for preview we need a mechanism that doesn't require copy-paste.

Are we talking about GetInstantiateKernelOnHostPtr() call, right? It required template parameter, so unclear what can we done other then adding the call to templated constructor.

using IDBuilder = sycl::detail::Builder;
constexpr bool HasKernelHandlerArg =
KernelLambdaHasKernelHandlerArgT<KernelType, KernelArgType>::value;
if constexpr (std::is_same_v<KernelArgType, void>) {
runKernelWithoutArg(MKernel, std::bool_constant<HasKernelHandlerArg>());
} else if constexpr (std::is_same_v<KernelArgType, sycl::id<Dims>>) {
sycl::id ID = InitializedVal<Dims, id>::template get<0>();
runKernelWithArg<const KernelArgType &>(
MKernel, ID, std::bool_constant<HasKernelHandlerArg>());
} else if constexpr (std::is_same_v<KernelArgType, item<Dims, true>> ||
std::is_same_v<KernelArgType, item<Dims, false>>) {
constexpr bool HasOffset =
std::is_same_v<KernelArgType, item<Dims, true>>;
if constexpr (!HasOffset) {
KernelArgType Item = IDBuilder::createItem<Dims, HasOffset>(
InitializedVal<Dims, range>::template get<1>(),
InitializedVal<Dims, id>::template get<0>());
runKernelWithArg<KernelArgType>(
MKernel, Item, std::bool_constant<HasKernelHandlerArg>());
} else {
KernelArgType Item = IDBuilder::createItem<Dims, HasOffset>(
InitializedVal<Dims, range>::template get<1>(),
InitializedVal<Dims, id>::template get<0>(),
InitializedVal<Dims, id>::template get<0>());
runKernelWithArg<KernelArgType>(
MKernel, Item, std::bool_constant<HasKernelHandlerArg>());
}
} else if constexpr (std::is_same_v<KernelArgType, nd_item<Dims>>) {
sycl::range<Dims> Range = InitializedVal<Dims, range>::template get<1>();
sycl::id<Dims> ID = InitializedVal<Dims, id>::template get<0>();
sycl::group<Dims> Group =
IDBuilder::createGroup<Dims>(Range, Range, Range, ID);
sycl::item<Dims, true> GlobalItem =
IDBuilder::createItem<Dims, true>(Range, ID, ID);
sycl::item<Dims, false> LocalItem =
IDBuilder::createItem<Dims, false>(Range, ID);
KernelArgType NDItem =
IDBuilder::createNDItem<Dims>(GlobalItem, LocalItem, Group);
runKernelWithArg<const KernelArgType>(
MKernel, NDItem, std::bool_constant<HasKernelHandlerArg>());
} else if constexpr (std::is_same_v<KernelArgType, sycl::group<Dims>>) {
sycl::range<Dims> Range = InitializedVal<Dims, range>::template get<1>();
sycl::id<Dims> ID = InitializedVal<Dims, id>::template get<0>();
KernelArgType Group =
IDBuilder::createGroup<Dims>(Range, Range, Range, ID);
runKernelWithArg<KernelArgType>(
MKernel, Group, std::bool_constant<HasKernelHandlerArg>());
} else {
// Assume that anything else can be default-constructed. If not, this
// should fail to compile and the implementor should implement a generic
// case for the new argument type.
runKernelWithArg<KernelArgType>(
MKernel, KernelArgType{}, std::bool_constant<HasKernelHandlerArg>());
}
}
#endif
};

// This function is needed for host-side compilation to keep kernels
// instantitated. This is important for debuggers to be able to associate
// kernel code instructions with source code lines.
Expand Down
15 changes: 11 additions & 4 deletions sycl/include/sycl/queue.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -65,14 +65,14 @@ auto get_native(const SyclObjectT &Obj)
template <int Dims>
event __SYCL_EXPORT submit_kernel_direct_with_event_impl(
const queue &Queue, const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template <int Dims>
void __SYCL_EXPORT submit_kernel_direct_without_event_impl(
const queue &Queue, const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

Expand Down Expand Up @@ -180,8 +180,15 @@ auto submit_kernel_direct(
"must be either sycl::nd_item or be convertible from sycl::nd_item");
using TransformedArgType = sycl::nd_item<Dims>;

std::shared_ptr<detail::HostKernelBase> HostKernel = std::make_shared<
detail::HostKernel<KernelType, TransformedArgType, Dims>>(KernelFunc);
HostKernelRef<KernelType, TransformedArgType, Dims> HostKernel(KernelFunc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wish we could do

const HostKernelRefBase &TypeErasedKernel = HostKernerlRef<...>{KernelFunc};

(https://godbolt.org/z/h9v9s3TrG), but getPtr() isn't marked as const 😞


// Instantiating the kernel on the host improves debugging.
// Passing this pointer to another translation unit prevents optimization.
#ifndef NDEBUG
// TODO: call library to prevent dropping call due to optimization
(void)
detail::GetInstantiateKernelOnHostPtr<KernelType, LambdaArgType, Dims>();
#endif

detail::DeviceKernelInfo *DeviceKernelInfoPtr =
&detail::getDeviceKernelInfo<NameT>();
Expand Down
10 changes: 6 additions & 4 deletions sycl/source/detail/queue_impl.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -421,15 +421,17 @@ queue_impl::submit_impl(const detail::type_erased_cgfo_ty &CGF,
}

detail::EventImplPtr queue_impl::submit_kernel_direct_impl(
const NDRDescT &NDRDesc,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
const NDRDescT &NDRDesc, detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo, bool CallerNeedsEvent,
const detail::code_location &CodeLoc, bool IsTopCodeLoc) {

KernelData KData;

std::shared_ptr<detail::HostKernelBase> HostKernelPtr =
HostKernel.takeOrCopyOwnership();

KData.setDeviceKernelInfoPtr(DeviceKernelInfo);
KData.setKernelFunc(HostKernel->getPtr());
KData.setKernelFunc(HostKernelPtr->getPtr());
KData.setNDRDesc(NDRDesc);

auto SubmitKernelFunc =
Expand All @@ -441,7 +443,7 @@ detail::EventImplPtr queue_impl::submit_kernel_direct_impl(
KData.extractArgsAndReqsFromLambda();

CommandGroup.reset(new detail::CGExecKernel(
KData.getNDRDesc(), HostKernel,
KData.getNDRDesc(), std::move(HostKernelPtr),
nullptr, // Kernel
nullptr, // KernelBundle
std::move(CGData), std::move(KData).getArgs(),
Expand Down
9 changes: 3 additions & 6 deletions sycl/source/detail/queue_impl.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -361,8 +361,7 @@ class queue_impl : public std::enable_shared_from_this<queue_impl> {

template <int Dims>
event submit_kernel_direct_with_event(
const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
const nd_range<Dims> &Range, detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc) {
detail::EventImplPtr EventImpl =
Expand All @@ -373,8 +372,7 @@ class queue_impl : public std::enable_shared_from_this<queue_impl> {

template <int Dims>
void submit_kernel_direct_without_event(
const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
const nd_range<Dims> &Range, detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc) {
submit_kernel_direct_impl(NDRDescT{Range}, HostKernel, DeviceKernelInfo,
Expand Down Expand Up @@ -905,8 +903,7 @@ class queue_impl : public std::enable_shared_from_this<queue_impl> {
///
/// \return a SYCL event representing submitted command group or nullptr.
detail::EventImplPtr submit_kernel_direct_impl(
const NDRDescT &NDRDesc,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
const NDRDescT &NDRDesc, detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo, bool CallerNeedsEvent,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

Expand Down
16 changes: 8 additions & 8 deletions sycl/source/queue.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -466,7 +466,7 @@ const property_list &queue::getPropList() const { return impl->getPropList(); }
template <int Dims>
event submit_kernel_direct_with_event_impl(
const queue &Queue, const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc) {
return getSyclObjImpl(Queue)->submit_kernel_direct_with_event(
Expand All @@ -475,26 +475,26 @@ event submit_kernel_direct_with_event_impl(

template event __SYCL_EXPORT submit_kernel_direct_with_event_impl<1>(
const queue &Queue, const nd_range<1> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template event __SYCL_EXPORT submit_kernel_direct_with_event_impl<2>(
const queue &Queue, const nd_range<2> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template event __SYCL_EXPORT submit_kernel_direct_with_event_impl<3>(
const queue &Queue, const nd_range<3> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template <int Dims>
void submit_kernel_direct_without_event_impl(
const queue &Queue, const nd_range<Dims> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc) {
getSyclObjImpl(Queue)->submit_kernel_direct_without_event(
Expand All @@ -503,19 +503,19 @@ void submit_kernel_direct_without_event_impl(

template void __SYCL_EXPORT submit_kernel_direct_without_event_impl<1>(
const queue &Queue, const nd_range<1> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template void __SYCL_EXPORT submit_kernel_direct_without_event_impl<2>(
const queue &Queue, const nd_range<2> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

template void __SYCL_EXPORT submit_kernel_direct_without_event_impl<3>(
const queue &Queue, const nd_range<3> &Range,
std::shared_ptr<detail::HostKernelBase> &HostKernel,
detail::HostKernelRefBase &HostKernel,
detail::DeviceKernelInfo *DeviceKernelInfo,
const detail::code_location &CodeLoc, bool IsTopCodeLoc);

Expand Down
Loading