Skip to content
Merged
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions sycl/test-e2e/VirtualFunctions/lit.local.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,6 @@ import os
# paths like "../../../helper.hpp" in them, so let's just register a
# substitution to add directory with helper headers into include search path
config.substitutions.append(("%helper-includes", "-I {}".format(os.path.dirname(os.path.abspath(__file__)))))

# FIXME: replace unsupported with an aspect check once we have it
config.unsupported_features += ['cuda', 'hip', 'acc']
175 changes: 175 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/group-barrier.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
// REQUIRES: aspect-usm_shared_allocations
//
// On CPU it segfaults within the kernel that performs virtual function call.
// https://github.com/intel/llvm/issues/15080
// XFAIL: cpu
// UNSUPPORTED: gpu
// On GPU this test (its older version which used nd_item instead of group)
// used to fail with UR_RESULT_ERROR_PROGRAM_LINK_FAILURE.
// SPIR-V files produced by SYCL_DUMP_IMAGES could be linked just fine (using
// both llvm-spirv -r + llvm-link and ocloc).
// Current version hangs and therefore it is marked as unsupported to avoid
// wasting time in CI and potentially blocking a machine.
// Reported in https://github.com/intel/llvm/issues/15068
//
// This test checks that group operations (barrier in this case) work correctly
// inside virtual functions.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/detail/core.hpp>
#include <sycl/group_algorithm.hpp>
#include <sycl/group_barrier.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>
#include <numeric>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *, sycl::group<1>) = 0;
};

class SumOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::group<1> WG) {
LocalData[WG.get_local_id()] += WG.get_local_id();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is read/write, but I'm not sure "read" part is really important for this test. Can we change it to write-only (e.g. g.get_group_linear() + g.get_local_id()). Then we'd be able to create /* virtual ? */ int calc_ref_value(auto global_size, auto local_size) { return /* formula */ }.

That would simplify lines 109-157 a lot, and would also move the reference value compute close to the device code so that they'd fit in a single screen.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, thanks. I've applied that approach in f92ac85

sycl::group_barrier(WG);
int Res = 0;
if (WG.leader()) {
for (size_t I = 0; I < WG.get_local_range().size(); ++I) {
Res += LocalData[I];
}
}

return sycl::group_broadcast(WG, Res);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This likely contains another group_barrier inside. Would it make sense to change the code to store the leader's value in line 50, then have a barrier and then read leader's value in each of the WIs before returning?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, group_broadcast implies group_barrier. Considering that the test is named group-barrier, I've replaced group_broadcast with "manual broadcast" in b948b36

}
};

class MultiplyOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int apply(int *LocalData, sycl::group<1> WG) {
LocalData[WG.get_local_id()] += WG.get_local_id();
sycl::group_barrier(WG);
int Res = 1;
if (WG.leader()) {
for (size_t I = 0; I < WG.get_local_range().size(); ++I) {
Res *= LocalData[I];
}
}

return sycl::group_broadcast(WG, Res);
}
};

int main() try {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, C++ never stops to surprise me with something I didn't know before...

using storage_t = obj_storage_t<SumOp, MultiplyOp>;

sycl::queue q;

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);
// Let's keep ranges small, or otherwise we will encounter integer overflow
// (which is a UB) in MultiplyOp::apply.
sycl::range G{16};
sycl::range L{4};

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (unsigned TestCase = 0; TestCase < 2; ++TestCase) {
std::vector<int> HostData(G.size());
std::iota(HostData.begin(), HostData.end(), 1);
std::vector<int> DeviceData = HostData;
sycl::buffer<int> DataStorage(DeviceData.data(), G);

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();
Comment on lines +120 to +124
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just create a derived subclass normally and then pass it into the next kernel through its baseclass pointer? That would eliminate the dependency on "helpers.hpp" in this "uniform" tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason the obj_storage_t helper was introduced is to make sure that the storage we allocated is large enough and has correct alignment.

As noted in #14209 (comment) attempting to construct an object in a misaligned memory is a UB.

Here we have two different classes instances of which we may construct: SumOp and MultipleOp. Even though they are the same in their layout, I would still prefer not to hardcode their size and alignment, but instead use this generic helper which allows to change them as we wish without worrying about alignment and allocation size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it easier to write, or easier to debug when it will fail when a mistake is made in some future PR? I won't insist on the change here, but IMO, over-complicating simple tests usually leads to manually simplifying them in future whenever they catch regressions.


q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
sycl::local_accessor<int> LocalAcc(L, CGH);
CGH.parallel_for(sycl::nd_range{G, L}, props, [=](auto It) {
LocalAcc[It.get_local_id()] = DataAcc[It.get_global_id()];
auto *Ptr = DeviceStorage->getAs<BaseOp>();
DataAcc[It.get_global_id()] = Ptr->apply(
LocalAcc.get_multi_ptr<sycl::access::decorated::no>().get(),
It.get_group());
});
});

// We can't call group_barrier on host and therefore here we have a
// reference function instead of calling the same methods on host.
//
// 'apply' function is written as a kernel, i.e. it describes a single
// work-item in an nd-range. Here we emulate that nd-range by looping over
// all work-groups and then over each work-item within that group.
for (size_t WorkGroupID = 0; WorkGroupID < G.size() / L.size();
++WorkGroupID) {
// Equivalent of a local accessor (LocalData)
std::vector<int> LocalHostData(L.size());
// For each work-item within a group, LID - local id
for (size_t LID = 0; LID < L.size(); ++LID) {
// GID - global id
size_t GID = WorkGroupID * L.size() + LID;
LocalHostData[LID] = HostData[GID];

// Below (including other loops) is an equivalent of apply's body, but
// it combains both SumOp and MultiplyOp and hence conditions based on
// TestCase.
LocalHostData[LID] += LID;
}

// Group barrier is simulated by splitting work-group loop in two.
// Even though Res is a private variable in the kernel, here we have to
// declare it in an outer scope (making it local) so it survies our
// barriers emulation.
int Res = (TestCase == 0) ? 0 : 1;

for (size_t LID = 0; LID < L.size(); ++LID) {
if (LID == 0) { // if that is a group leader
for (size_t NestedLID = 0; NestedLID < L.size(); ++NestedLID) {
if (TestCase == 0)
Res += LocalHostData[NestedLID];
else
Res *= LocalHostData[NestedLID];
}
}
}

// Group broadcast involves a barrier, so we once again splitting
// work-group loop.
for (size_t LID = 0; LID < L.size(); ++LID) {
// GID - global id
size_t GID = WorkGroupID * L.size() + LID;
// The broadcast itself: all work-items get result computed by a
// work-group leader.
HostData[GID] = Res;
}
}

sycl::host_accessor HostAcc(DataStorage);
for (size_t I = 0; I < HostData.size(); ++I) {
if (HostAcc[I] != HostData[I]) {
std::cout << "Mismatch at index " << I << ": " << HostAcc[I]
<< " != " << HostData[I] << std::endl;
assert(HostAcc[I] == HostData[I]);
}
}
}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test generates two device images. One contains definitions of indirectly-callable functions and the other one contains kernel functions. In AOT mode they are not linked together before calling opencl-aot. When will this be fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, AOT support for virtual functions is incomplete yet, that will be addressed in separate PRs. For now that's a second priority, because there is plenty enough bugs even on JIT path

85 changes: 85 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/math.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,85 @@
// REQUIRES: aspect-usm_shared_allocations
//
// This test checks that SYCL math built-in functions work correctly
// inside virtual functions.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/builtins.hpp>
#include <sycl/detail/core.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float) = 0;
};

class FloorOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::floor(V); }
};

class CeilOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::ceil(V); }
};

class RoundOp : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual float apply(float V) { return sycl::round(V); }
};

int main() try {
using storage_t = obj_storage_t<FloorOp, CeilOp, RoundOp>;

storage_t HostStorage;

sycl::queue q;

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (unsigned TestCase = 0; TestCase < 3; ++TestCase) {
float HostData = 3.56;
float Data = HostData;
sycl::buffer<float> DataStorage(&Data, sycl::range{1});

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();

q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
CGH.single_task(props, [=]() {
auto *Ptr = DeviceStorage->getAs<BaseOp>();
DataAcc[0] = Ptr->apply(DataAcc[0]);
});
});

auto *Ptr = HostStorage.construct</* ret type = */ BaseOp>(TestCase);
HostData = Ptr->apply(HostData);

sycl::host_accessor HostAcc(DataStorage);
assert(HostAcc[0] == HostData);
}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
102 changes: 102 additions & 0 deletions sycl/test-e2e/VirtualFunctions/misc/range-non-uniform-vf-2.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,102 @@
// REQUIRES: aspect-usm_shared_allocations
//
// This test checks that virtual functions work correctly in simple range
// kernels when different work-items perform calls to different virtual
// functions using the same object.
//
// RUN: %{build} -o %t.out %helper-includes
// RUN: %{run} %t.out

#include <sycl/detail/core.hpp>
#include <sycl/usm.hpp>

#include "helpers.hpp"

#include <iostream>
#include <numeric>

namespace oneapi = sycl::ext::oneapi::experimental;

class BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int) = 0;

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int) = 0;
};

class OpA : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int V) { return V + 2; }

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int V) { return V - 2; }
};

class OpB : public BaseOp {
public:
SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int foo(int V) { return V * 2; }

SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable)
virtual int bar(int V) { return V / 2; }
};

int main() try {
using storage_t = obj_storage_t<OpA, OpB>;

storage_t HostStorage;

sycl::queue q;

auto *DeviceStorage = sycl::malloc_shared<storage_t>(1, q);
sycl::range R{1024};

constexpr oneapi::properties props{oneapi::assume_indirect_calls};
for (size_t TestCase = 0; TestCase < 2; ++TestCase) {
std::vector<int> HostData(R.size());
std::iota(HostData.begin(), HostData.end(), 0);
std::vector<int> DeviceData = HostData;
sycl::buffer<int> DataStorage(DeviceData.data(), R);

q.submit([&](sycl::handler &CGH) {
CGH.single_task([=]() {
DeviceStorage->construct</* ret type = */ BaseOp>(TestCase);
});
}).wait_and_throw();

q.submit([&](sycl::handler &CGH) {
sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write);
CGH.parallel_for(R, props, [=](auto It) {
// Select method that corresponds to this work-item
auto *Ptr = DeviceStorage->template getAs<BaseOp>();
if (It % 2)
DataAcc[It] = Ptr->foo(DataAcc[It]);
else
DataAcc[It] = Ptr->bar(DataAcc[It]);
});
});

BaseOp *Ptr = HostStorage.construct</* ret type = */ BaseOp>(TestCase);

for (size_t I = 0; I < HostData.size(); ++I) {
if (I % 2)
HostData[I] = Ptr->foo(HostData[I]);
else
HostData[I] = Ptr->bar(HostData[I]);
}

sycl::host_accessor HostAcc(DataStorage);
for (size_t I = 0; I < HostData.size(); ++I)
assert(HostAcc[I] == HostData[I]);
}

sycl::free(DeviceStorage, q);

return 0;
} catch (sycl::exception &e) {
std::cout << "Unexpected exception was thrown: " << e.what() << std::endl;
return 1;
}
Loading
Loading