Modified sycl reduction algorithm#56
Conversation
uphoffc
left a comment
There was a problem hiding this comment.
ntload(&buffer[id]);
Just use a regular load unless you have evidence that the specific "ntload" boosts performance across various platforms.
idx.barrier(sycl::access::fence_space::local_space);
Not necessary, can be removed.
davschneller
left a comment
There was a problem hiding this comment.
Thanks for updating it like this. 10e7 plasticity cells might still occur at least in GTS. :D
So, ACPP (presumably) doesn't have a wrapper for CUDA yet in these min/max cases? Does it work with sm_75? (or with integers?)
Running the tests themselves is alas not possible right now—I've set up a self-hosted runner for that which should manage running some basic NVIDIA/Intel tests, but it doesn't work quite yet (if it does, I'll also run the respective SeisSol tests).
algorithms/sycl/Reduction.cpp
Outdated
| idx.barrier(sycl::access::fence_space::local_space); | ||
|
|
||
| value = sycl::reduce_over_group(subgroup, value, operation); | ||
| auto reducedValue = sycl::reduce_over_group(idx.get_group(), threadAcc, operation); |
algorithms/sycl/Reduction.cpp
Outdated
| for (std::size_t i = currentWarp; i < warpsNeeded; i += warpCount) { | ||
| const auto id = threadInWarp + i * sgSize; | ||
| auto value = (id < size) ? static_cast<AccT>(ntload(&buffer[id])) : DefaultValue; | ||
| size_t numWorkGroups = (size + (workGroupSize * itemsPerWorkItem) - 1) |
| template <ReductionType Type, typename AccT, typename VecT, typename OpT> void launchReduction(AccT* result, const VecT *buffer, size_t size, OpT operation, bool overrideResult, void* streamPtr) { | ||
|
|
||
| constexpr auto DefaultValue = neutral<Type, AccT>(); | ||
| constexpr size_t workGroupSize = 256; |
There was a problem hiding this comment.
Why 256 and not 1024? (or are you at the bandwidth already like that?)
There was a problem hiding this comment.
Yes, and for higher numbers, there is no real improvement on PVC at least.
algorithms/sycl/Reduction.cpp
Outdated
| // Explicity pass MO to load | ||
| // AccT expected = atomic.load(MO); | ||
|
|
||
| while(true){ |
There was a problem hiding this comment.
Make while(true) { if (condition) break; -> while(!condition)
algorithms/sycl/Reduction.cpp
Outdated
| // Using our own CAS loop | ||
| // AccT expected = atomic.load(MO); | ||
|
|
||
| while(true){ |
The improvement numbers I mentioned are for vector sizes of 1e4, and 1e5. I have not talked about 1e8 because that sounds unrealistic. For 1e8, 1e9 vector sizes, it shows ~ 130 times improvement, and that is in most cases not really applicable in production scenarios for us. We mostly do GTS these days only for scaling studies.
EDIT: I tried it on my workstation with
and it seems that
|
It seems to be present throughout the project. I will check these later with any relevant benchmarks, and if we decide to remove them, I will remove them across the project.
I seem to have missed this barrier. Removed it, thank you! |
I know about the GTS thing. :) Ok; if that already helps with 1e4/1e5 (as... almost to be expected), then we should also adjust the CUDA/HIP kernels; probably the |
|
Also the failing test can be ignored; that's due to something running out of space in the GHA CI side. Not sure if we can really do anything about it at this point. |
|
@davschneller, just FYI, I tried the acpp installation locally again, and with the updated acpp, |
Hmm... Since not many people use the SYCL variant for NVIDIA hardware probably anyways, I think we can make the switch. |
I think my message was unclear. What I meant to say was that we need to update the acpp to something more recent. The recent version of it from their |
Review is taken care of, and Carsten's comments are implemented.
|
Yes recent versions of AdaptiveCpp have cmpxchg emulation when native fetch_min/max is not available. |
Modified sycl reduction algorithm based on @uphoffc's recommendation. Tested the fetch_add implementation and benchmarked it with the current speed on the PVC machine; the average runtime is around 1.5x-4x faster just for sycl reduction on the PVC machines. I did not test it with SeisSol, and as reduction is only a small component of SeisSol, I do not anticipate any significant speed-up in SeisSol runs due to this.
One problem was the
fetch_max()andfetch_min()methods with sm60, which throw some errors with registers, about which I am not entirely clear. I attempted to implement a manual CAS loop to overcome this. If anyone has ideas on how to surpass this, please let me know.The current failed CI test is due to some docker issue, which I am not clear about.
Can we also add the tests to the CI? There is a folder called
testswhose tests are not being called in CI as of now, if I understand correctly.