Skip to content
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
140 changes: 140 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/group-barrier.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,140 @@
// FIXME: replace unsupported with an aspect check once we have it
// UNSUPPORTED: cuda, hip, acc
//
// REQUIRES: aspect-usm_shared_allocations
//
// Fails with UR_RESULT_ERROR_PROGRAM_LINK_FAILURE. SPIR-V files produced by
// SYCL_DUMP_IMAGES can be linked just fine (using llvm-spirv -r + llvm-link),
// so it seems to be a problem on IGC side.
// Reported in https://github.com/intel/llvm/issues/15068
// On CPU it segfaults within the kernel that performs virtual function call.
// https://github.com/intel/llvm/issues/15080
// XFAIL: gpu, cpu
//
// This test checks that group operations (barrier in this case) work correctly
// inside virtual functions.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/detail/core.hpp>
#include <sycl/group_barrier.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>
#include <numeric>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *, sycl::nd_item<1>) = 0;
};

class SumOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::nd_item<1> It) {
LocalData[It.get_local_id()] += It.get_local_id();
sycl::group_barrier(It.get_group());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid that people might copy-paste this example thoughtlessly in divergent control flow resulting in UB. I'm not sure if adding a comment here would be enough or if "convergent" functions should be prohibited under indirectly_callable by default and require explicit buy-in from the programmer (e.g. indirectly_callable_in_uniform_control_flow attribute).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't exactly share the concern. I.e. apply could have been a regular function which can also be blindly copy-pasted and called from a non-convergent/non-uniform context resulting in the very same UB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe using apply(int *, sycl::group) would be a better pattern? group in arguments is what the spec uses for such interfaces.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more. I can't just pass group only, because I need local IDs which aren't available in group. And nd_item already includes group, so passing them both together would be a bit weird>

I suppose that we should assume that if nd_item is passed, then some group operations can be performed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is group::get_local_id in core SYCL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is group::get_local_id in core SYCL.

Didn't know that! Switched to use group instead of nd_item in 0ea83a8

Host reference calculation has also been fixed by that commit: I've verified it on CPU (test passes there with internal newer version of OCL CPU RT).

int Res = 0;
if (It.get_group().leader()) {
for (size_t I = 0; I < It.get_local_range().size(); ++I) {
Res += LocalData[I];
}
}

return Res;
}
};

class MultiplyOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::nd_item<1> It) {
LocalData[It.get_local_id()] += It.get_local_id();
sycl::group_barrier(It.get_group());
int Res = 1;
if (It.get_group().leader()) {
for (size_t I = 0; I < It.get_local_range().size(); ++I) {
Res *= LocalData[I];
}
}

return Res;
}
};

int main() try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, C++ never stops to surprise me with something I didn't know before...

using storage_t = obj_storage_t<SumOp, MultiplyOp>;

auto asyncHandler = [](sycl::exception_list list) {
for (auto &e : list)
std::rethrow_exception(e);
};

sycl::queue q(asyncHandler);

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);
sycl::range G{512};
sycl::range L{32};

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (unsigned TestCase = 0; TestCase < 2; ++TestCase) {
std::vector<int> HostData(G.size());
std::iota(HostData.begin(), HostData.end(), 0);
std::vector<int> DeviceData = HostData;
sycl::buffer<int> DataStorage(DeviceData.data(), G);

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();
Comment on lines +120 to +124
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just create a derived subclass normally and then pass it into the next kernel through its baseclass pointer? That would eliminate the dependency on "helpers.hpp" in this "uniform" tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason the obj_storage_t helper was introduced is to make sure that the storage we allocated is large enough and has correct alignment.

As noted in #14209 (comment) attempting to construct an object in a misaligned memory is a UB.

Here we have two different classes instances of which we may construct: SumOp and MultipleOp. Even though they are the same in their layout, I would still prefer not to hardcode their size and alignment, but instead use this generic helper which allows to change them as we wish without worrying about alignment and allocation size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easier to write, or easier to debug when it will fail when a mistake is made in some future PR? I won't insist on the change here, but IMO, over-complicating simple tests usually leads to manually simplifying them in future whenever they catch regressions.


q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
sycl::local_accessor<int> LocalAcc(L, CGH);
CGH.parallel_for(sycl::nd_range{G, L}, props, [=](auto It) {
LocalAcc[It.get_local_id()] = DataAcc[It.get_global_id()];
auto *Ptr = DeviceStorage->getAs<BaseOp>();
DataAcc[It.get_global_id()] = Ptr->apply(
LocalAcc.template get_multi_ptr<sycl::access::decorated::no>()
.get(),
It);
});
});

// We can't call group_barrier on host and therefore here we have a
// reference function instead of calling the same methods on host.
for (size_t GID = 0; GID < G.size() / L.size(); ++GID) {
for (size_t LID = 0; LID < L.size(); ++LID)
HostData[GID * L.size() + LID] += LID;

int Res = (TestCase == 0) ? 0 : 1;
for (size_t LID = 0; LID < L.size(); ++LID) {
if (TestCase == 0)
Res += HostData[GID * L.size() + LID];
else
Res *= HostData[GID * L.size() + LID];
}

for (size_t LID = 0; LID < L.size(); ++LID)
HostData[GID * L.size() + LID] = Res;
}

sycl::host_accessor HostAcc(DataStorage);
for (size_t I = 0; I < HostData.size(); ++I)
assert(HostAcc[I] == HostData[I]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest, this requires some focus to understand... Can we use #ifdef __SYCL_DEVICE_ONLY__ to unify the paths instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we can use #ifdefs here, because nd_item is not user-constructible, i.e. the diff between host and device version of the function would be too huge.

But I will try to add some comments here which should help map this function to apply functions that we have above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code was re-written (11db515) to be closer to apply function that we have

}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test generates two device images. One contains definitions of indirectly-callable functions and the other one contains kernel functions. In AOT mode they are not linked together before calling opencl-aot. When will this be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, AOT support for virtual functions is incomplete yet, that will be addressed in separate PRs. For now that's a second priority, because there is plenty enough bugs even on JIT path

93 changes: 93 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/math.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,93 @@
// FIXME: replace unsupported with an aspect check once we have it
// UNSUPPORTED: cuda, hip, acc
//
// REQUIRES: aspect-usm_shared_allocations
//
// This test checks that SYCL math built-in functions work correctly
// inside virtual functions.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/builtins.hpp>
#include <sycl/detail/core.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float) = 0;
};

class FloorOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::floor(V); }
};

class CeilOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::ceil(V); }
};

class RoundOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::round(V); }
};

int main() try {
using storage_t = obj_storage_t<FloorOp, CeilOp, RoundOp>;

storage_t HostStorage;

auto asyncHandler = [](sycl::exception_list list) {
for (auto &e : list)
std::rethrow_exception(e);
};

sycl::queue q(asyncHandler);

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (unsigned TestCase = 0; TestCase < 3; ++TestCase) {
float HostData = 3.56;
float Data = HostData;
sycl::buffer<float> DataStorage(&Data, sycl::range{1});

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();

q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
CGH.single_task(props, [=]() {
auto *Ptr = DeviceStorage->getAs<BaseOp>();
DataAcc[0] = Ptr->apply(DataAcc[0]);
});
});

auto *Ptr = HostStorage.construct</* ret type = */ BaseOp>(TestCase);
HostData = Ptr->apply(HostData);

sycl::host_accessor HostAcc(DataStorage);
assert(HostAcc[0] == HostData);
}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
110 changes: 110 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/range-non-uniform-vf-2.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
// FIXME: replace unsupported with an aspect check once we have it
// UNSUPPORTED: cuda, hip, acc
//
// REQUIRES: aspect-usm_shared_allocations
//
// This test checks that virtual functions work correctly in simple range
// kernels when different work-items perform calls to different virtual
// functions using the same object.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/detail/core.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>
#include <numeric>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int) = 0;

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int) = 0;
};

class OpA : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int V) { return V + 2; }

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int V) { return V - 2; }
};

class OpB : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int V) { return V * 2; }

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int V) { return V / 2; }
};

int main() try {
using storage_t = obj_storage_t<OpA, OpB>;

storage_t HostStorage;

auto asyncHandler = [](sycl::exception_list list) {
for (auto &e : list)
std::rethrow_exception(e);
};

sycl::queue q(asyncHandler);

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);
sycl::range R{1024};

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (size_t TestCase = 0; TestCase < 2; ++TestCase) {
std::vector<int> HostData(R.size());
std::iota(HostData.begin(), HostData.end(), 0);
std::vector<int> DeviceData = HostData;
sycl::buffer<int> DataStorage(DeviceData.data(), R);

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();

q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
CGH.parallel_for(R, props, [=](auto It) {
// Select method that corresponds to this work-item
auto *Ptr = DeviceStorage->template getAs<BaseOp>();
if (It % 2)
DataAcc[It] = Ptr->foo(DataAcc[It]);
else
DataAcc[It] = Ptr->bar(DataAcc[It]);
});
});

BaseOp *Ptr = HostStorage.construct</* ret type = */ BaseOp>(TestCase);

for (size_t I = 0; I < HostData.size(); ++I) {
if (I % 2)
HostData[I] = Ptr->foo(HostData[I]);
else
HostData[I] = Ptr->bar(HostData[I]);
}

sycl::host_accessor HostAcc(DataStorage);
for (size_t I = 0; I < HostData.size(); ++I)
assert(HostAcc[I] == HostData[I]);
}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
Loading