groupreduction and subgroupreduction #421
brabreda wants to merge 18 commits into JuliaGPU:main from brabreda:feature/reduction
Conversation
removed config include; removed using GPUArrays versions
Thank you! As you found, the notion of a subgroup is not necessarily consistent across backends. So I am wondering if we need to expose that as a feature, or if backends could use them internally to implement reduction. So as an example, could we "just" add …? What is …? Do we need a broadcast/shuffle operation? I don't have the answers to those questions. I think adding …
I will fix this; meanwhile we can think about the warps, whether we will introduce them and in what way.
Yeah, requiring a neutral value sounds good. Couldn't each backend choose to use warp-level ops, if available, instead of having the user make this decision?
The problem with that approach is that warp intrinsics don’t work for every type. We then would need something like this:
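A minimal sketch of the kind of type specialisation meant here (the names `__groupreduce` and `ShuffleTypes` are illustrative assumptions, not the PR's actual code): warp intrinsics only support a fixed set of element types, so the shuffle-based path is restricted to a `Union` and everything else falls back to local memory.

```julia
# Illustrative sketch, not the PR's actual code: restrict the
# shuffle-based path to element types the warp intrinsics support.
const ShuffleTypes = Union{Int32, UInt32, Int64, UInt64, Float32, Float64}

# Shuffle/warp-based reduction for supported element types.
function __groupreduce(op, val::T, neutral::T) where {T<:ShuffleTypes}
    # ... use subgroup/warp shuffle intrinsics ...
end

# Generic fallback through local memory for all other types.
function __groupreduce(op, val::T, neutral::T) where {T}
    # ... use @localmem + @synchronize ...
end
```

With these two methods, Julia's dispatch picks the shuffle method whenever `T` is one of the types in the `Union`, and the generic method otherwise.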
In this case you don't override the function. I am also not sure whether this kind of type specialisation works, where it takes the first function when the type is one of those in the Union. You probably have more expertise on this. Would it be preferred that every thread returns the right rval? I do think in that case I will have to rethink the warp-reduction-based groupreduce.
No, it just needs to be documented :) and we might want to add a …. So it looks like CUDA.jl makes this decision based on the element type?
src/KernelAbstractions.jl
Outdated
| """ | ||
| @subgroupsize() | ||
|
|
||
| returns the GPUs subgroupsize. | ||
| """ | ||
| macro subgroupsize() | ||
| quote | ||
| $__subgroupsize() | ||
| end | ||
| end |
| """ | |
| @subgroupsize() | |
| returns the GPUs subgroupsize. | |
| """ | |
| macro subgroupsize() | |
| quote | |
| $__subgroupsize() | |
| end | |
| end |
src/reduce.jl
Outdated
```julia
idx_in_group = @index(Local)
groupsize = @groupsize()[1]

localmem = @localmem(T, groupsize)
```
I see. If we can do a subgroupreduce, the memory we need here is much reduced.
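For intuition, a sketch under assumed names (not the PR's code): with a subgroup-level reduce, each subgroup first collapses its values with shuffles, so only one partial result per subgroup has to go through local memory.

```julia
# Sketch (assumed names): one local-memory slot per subgroup
# instead of one per thread.
subgroups = cld(groupsize, @subgroupsize())
localmem  = @localmem(T, subgroups)   # vs. @localmem(T, groupsize)
```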
src/reduce.jl
Outdated
- `op`: the operator of the reduction
- `val`: the value that each thread contributes to the values that need to be reduced
- `neutral`: the neutral element of the operator, so that `op(neutral, x) == x`
- `use_subgroups`: make use of subgroup reduction within the group reduction
I see the value of having a common implementation.

So I would define:

```julia
@reduce(op, val, neutral)
__reduce(op, val, neutral)
```

And then maybe:

```julia
__subgroupreduce    # no @ version
__subgroupsize      # no @ version
__can_subgroup_reduce(T) = false
```

And then we could define:

```julia
__reduce(__ctx__, op, val, neutral, ::Val{true})
__reduce(__ctx__, op, val, neutral, ::Val{false})
```

as you have here.

```julia
function __reduce(op, val::T, neutral::T) where T
    __reduce(op, val, neutral, Val(__can_subgroup_reduce(T)))
end
```
Yeah, so CUDA (and I think most GPU backends) makes the decision based on type. I was thinking whether something like this would be possible: … For the implementation without warps: … And every back-end like CUDAKernels just implements something like this: … It would allow for one common implementation and a few specializations that are done by the backends themselves. But …
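One hedged sketch of what such a backend specialization could look like, building on the `__can_subgroup_reduce` trait proposed earlier in the thread (all names are assumptions, not merged API):

```julia
# Sketch of backend opt-in via the proposed trait (assumed names).
# Generic default in KernelAbstractions: no subgroup reduction.
__can_subgroup_reduce(::Type) = false

# A backend such as CUDAKernels could opt in for the element types
# its warp intrinsics actually support:
__can_subgroup_reduce(::Type{T}) where {T<:Union{Int32, UInt32, Float32, Float64}} = true

# Dispatch then selects the specialized implementation without the
# user having to choose:
__reduce(op, val::T, neutral::T) where {T} =
    __reduce(op, val, neutral, Val(__can_subgroup_reduce(T)))
```

This keeps the decision with the backend, matching the suggestion above that the user should not have to make it.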
I recommend …
Only the group reduction without subgroups remains. It is ready to merge; once merged, I will add subgroup support to CUDAKernels, MetalKernels, and oneAPI.
Okay, so this is API-neutral? A pure addition, and we don't need anything new?
```julia
# perform the reduction
d = 1
while d < groupsize
    @synchronize()
```
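For context, the excerpt above is the opening of the classic tree reduction over local memory; a full loop along these lines is the standard pattern (the body shown is an assumption based on that pattern, not necessarily the PR's exact code):

```julia
# Standard tree reduction over localmem (sketch; assumes groupsize
# is a power of two and localmem was padded with `neutral`).
d = 1
while d < groupsize
    @synchronize()
    index = 2 * d * (idx_in_group - 1) + 1
    if index + d <= groupsize
        localmem[index] = op(localmem[index], localmem[index + d])
    end
    d *= 2
end
# after the loop, thread 1 holds the group's result in localmem[1]
```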
I am unsure why my previous PR was closed, but here are the changes.
It was my first time writing tests, and they passed. How are these tested on GPUs, and do I need more tests?