-
Notifications
You must be signed in to change notification settings - Fork 795
[SYCL][E2E] Add more tests for virtual functions #15067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][E2E] Add more tests for virtual functions #15067
Conversation
This commit still doesn't bring an exhaustive coverage for the feature, but still improves the situation by checking the following scenarios: - using math built-ins from virtual functions - using group barriers from virtual functions - using virtual functions in nd-range kernels where every work-item calls a different virtual function - using virtual functions when the code is scattered across several translation units Some tests are disabled, because we do not support those scenarios yet and more changes are required to make them work.
| SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable) | ||
| virtual int apply(int *LocalData, sycl::nd_item<1> It) { | ||
| LocalData[It.get_local_id()] += It.get_local_id(); | ||
| sycl::group_barrier(It.get_group()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid that people might copy-paste this example thoughtlessly in divergent control flow resulting in UB. I'm not sure if adding a comment here would be enough or if "convergent" functions should be prohibited under indirectly_callable by default and require explicit buy-in from the programmer (e.g. indirectly_callable_in_uniform_control_flow attribute).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't exactly share the concern. I.e. apply could have been a regular function which can also be blindly copy-pasted and called from a non-convergent/non-uniform context resulting in the very same UB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe using apply(int *, sycl::group) would be a better pattern? group in arguments is what the spec uses for such interfaces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more. I can't just pass group only, because I need local IDs which aren't available in group. And nd_item already includes group, so passing them both together would be a bit weird>
I suppose that we should assume that if nd_item is passed, then some group operations can be performed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is group::get_local_id in core SYCL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is
group::get_local_idin core SYCL.
Didn't know that! Switched to use group instead of nd_item in 0ea83a8
Host reference calculation has also been fixed by that commit: I've verified it on CPU (test passes there with internal newer version of OCL CPU RT).
| // We can't call group_barrier on host and therefore here we have a | ||
| // reference function instead of calling the same methods on host. | ||
| for (size_t GID = 0; GID < G.size() / L.size(); ++GID) { | ||
| for (size_t LID = 0; LID < L.size(); ++LID) | ||
| HostData[GID * L.size() + LID] += LID; | ||
|
|
||
| int Res = (TestCase == 0) ? 0 : 1; | ||
| for (size_t LID = 0; LID < L.size(); ++LID) { | ||
| if (TestCase == 0) | ||
| Res += HostData[GID * L.size() + LID]; | ||
| else | ||
| Res *= HostData[GID * L.size() + LID]; | ||
| } | ||
|
|
||
| for (size_t LID = 0; LID < L.size(); ++LID) | ||
| HostData[GID * L.size() + LID] = Res; | ||
| } | ||
|
|
||
| sycl::host_accessor HostAcc(DataStorage); | ||
| for (size_t I = 0; I < HostData.size(); ++I) | ||
| assert(HostAcc[I] == HostData[I]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, this requires some focus to understand... Can we use #ifdef __SYCL_DEVICE_ONLY__ to unify the paths instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we can use #ifdefs here, because nd_item is not user-constructible, i.e. the diff between host and device version of the function would be too huge.
But I will try to add some comments here which should help map this function to apply functions that we have above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code was re-written (11db515) to be closer to apply function that we have
| } | ||
| }; | ||
|
|
||
| int main() try { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, C++ never stops to surprise me with something I didn't know before...
| q.submit([&](sycl::handler &CGH) { | ||
| CGH.single_task([=]() { | ||
| DeviceStorage->construct</* ret type = */ BaseOp>(TestCase); | ||
| }); | ||
| }).wait_and_throw(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just create a derived subclass normally and then pass it into the next kernel through its baseclass pointer? That would eliminate the dependency on "helpers.hpp" in this "uniform" tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason the obj_storage_t helper was introduced is to make sure that the storage we allocated is large enough and has correct alignment.
As noted in #14209 (comment) attempting to construct an object in a misaligned memory is a UB.
Here we have two different classes instances of which we may construct: SumOp and MultipleOp. Even though they are the same in their layout, I would still prefer not to hardcode their size and alignment, but instead use this generic helper which allows to change them as we wish without worrying about alignment and allocation size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it easier to write, or easier to debug when it will fail when a mistake is made in some future PR? I won't insist on the change here, but IMO, over-complicating simple tests usually leads to manually simplifying them in future whenever they catch regressions.
| } catch (sycl::exception &e) { | ||
| std::cout << "Unexpected exception was thrown: " << e.what() << std::endl; | ||
| return 1; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test generates two device images. One contains definitions of indirectly-callable functions and the other one contains kernel functions. In AOT mode they are not linked together before calling opencl-aot. When will this be fixed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, AOT support for virtual functions is incomplete yet, that will be addressed in separate PRs. For now that's a second priority, because there is plenty enough bugs even on JIT path
…h the actual kernel
| public: | ||
| SYCL_EXT_ONEAPI_FUNCTION_PROPERTY(oneapi::indirectly_callable) | ||
| virtual int apply(int *LocalData, sycl::group<1> WG) { | ||
| LocalData[WG.get_local_id()] += WG.get_local_id(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is read/write, but I'm not sure "read" part is really important for this test. Can we change it to write-only (e.g. g.get_group_linear() + g.get_local_id()). Then we'd be able to create /* virtual ? */ int calc_ref_value(auto global_size, auto local_size) { return /* formula */ }.
That would simplify lines 109-157 a lot, and would also move the reference value compute close to the device code so that they'd fit in a single screen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thanks. I've applied that approach in f92ac85
| } | ||
| } | ||
|
|
||
| return sycl::group_broadcast(WG, Res); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This likely contains another group_barrier inside. Would it make sense to change the code to store the leader's value in line 50, then have a barrier and then read leader's value in each of the WIs before returning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, group_broadcast implies group_barrier. Considering that the test is named group-barrier, I've replaced group_broadcast with "manual broadcast" in b948b36
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think benefits of these in-tree outweigh perfecting the tests in the review, so formally LGTM.
| q.submit([&](sycl::handler &CGH) { | ||
| sycl::accessor DataAcc(DataStorage, CGH, sycl::read_write); | ||
| CGH.parallel_for(R, props, [=](auto it) { | ||
| // Select VF that corresponds to this work-item |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Select VF that corresponds to this work-item | |
| // Select virtual function that corresponds to this work-item |
although I'm biased here as VF usually means "vector factor" to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually select an object and not a virtual function here, fixed in 21adc25
sycl/test-e2e/VirtualFunctions/multiple-translation-units/separate-call.cpp
Outdated
Show resolved
Hide resolved
|
Considering that there is a formal approval and all most recent comments were applied, I've merged the PR. If there is any other feedback, I will apply it as a follow-up PR |
This commit still doesn't bring an exhaustive coverage for the feature, but still improves the situation by checking the following scenarios:
Some tests are disabled, because we do not support those scenarios yet and more changes are required to make them work.