adding atomic support with atomix #299
KernelAbstractions.jl doesn't have to depend on UnsafeAtomicsLLVM.jl (and LLVM.jl)
Co-authored-by: Takafumi Arakaki <[email protected]>
Looks great!
Probably needs docs as well as AMDGPU support.
Co-authored-by: Valentin Churavy <[email protected]>
| Hi! As for 3. What about atomic primitives like atomic_add!(...), I'd like to say that I have several kernels that use  Also I'm curious if it will support things like: @atomic max(x[i], v) | 
Co-authored-by: Takafumi Arakaki <[email protected]>
| I don't mind reworking this PR and #282 so we get both the macro and better ordering support from Atomix and also the  I figure most people will want to use the macro, but some people will prefer the  | 
| Let's merge this for now and then you can open a second PR? | 
| This one is not ready to be merged | 
| Oops. I got excited that it passed tests :) | 
| It was missing docs and tests, at least... I will add them when I get the chance. To be fair, atomix should have all the necessary tests, I just wanted to double check here. Documentation does not need to be long, but having a section for atomics with an example would go a long way. | 
| I was just waiting to add docs until we settled the atomic "primitive" discussion. | 
| I've tried this PR and it looks like on CPU it only supports integer types.
Error:
ERROR: LoadError: InvalidIRError: compiling kernel #gpu_splat!(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to modify!)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/F9VIX/src/core.jl:33
 [2] macro expansion
   @ ~/code/a.jl:28
 [3] gpu_splat!
   @ ~/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80
 [4] gpu_splat!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/4VLF4/src/CUDAKernels.jl:272
 [15] main()
    @ Main ~/code/a.jl:40
 [16] top-level scope
    @ ~/code/a.jl:42
in expression starting at /home/pxl-th/code/a.jl:42
MWE:
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic
CUDA.allowscalar(false)
n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512
Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)
to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)
@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    @atomic max(grid[idx], mlp_out[i])
end
function main()
    #device = CPU()
    device = CUDADevice()
    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n) # errors on CPU with Float32
    grid = rand(device, Int64, n) # errors on CPU with Float32
    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main() | 
| I was unable to replicate this error by running the provided code with 1.7.1 and 1.8.0-beta3 (just pulled from git). What OS are you using? Also, could you show the outputs of  | 
| I'm on Ubuntu 22.04 
  | 
| I've just updated the MWE code; before, I had included code that does not error :) | 
| Right, I see the comments now, sorry! try  | 
| Yes, that works, thanks! Although there is another issue, which is not critical for me, but might be worth mentioning:
MWE:
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic
CUDA.allowscalar(false)
const NERF_STEPS = UInt32(1024)
const MIN_CONE_STEPSIZE = √3f0 / NERF_STEPS
n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512
Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)
Base.zeros(::CPU, T, shape) = zeros(T, shape)
Base.zeros(::CUDADevice, T, shape) = CUDA.zeros(T, shape)
to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)
@inline density_activation(x) = exp(x)
@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    old, new = @atomic max(grid[idx], mlp_out[i])
    @atomic grid[idx] = old
end
function main()
    # device = CPU()
    device = CUDADevice()
    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n)
    grid = zeros(device, Int64, n)
    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()
Error:
ERROR: LoadError: LLVM error: Cannot select: 0x77adc60: ch = AtomicStore<(store seq_cst (s64) into %ir.41, addrspace 1)> 0x4926e08:1, 0x6f629c8, 0x4926e08, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/pxl-th/code/a.jl:29 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
  0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
    0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x4926cd0: i64 = Register %0
      0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x66aca88: i64 = Register %9
        0x66ac0c8: i32 = Constant<3>
    0x4926720: i64 = Constant<-8>
  0x4926e08: i64,ch = AtomicLoadMax<(load store seq_cst (s64) on %ir.39, addrspace 1)> 0x7195c50:1, 0x6f629c8, 0x7195c50, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:374 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ]
    0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
          0x4926cd0: i64 = Register %0
        0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x66aca88: i64 = Register %9
          0x66ac0c8: i32 = Constant<3>
      0x4926720: i64 = Constant<-8>
    0x7195c50: i64,ch = llvm.nvvm.ldg.global.i<(load (s64) from %ir.34, addrspace 1)> 0x65cc278, TargetConstant:i64<5104>, 0x6f62b68, Constant:i32<8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:219 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:40 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ] ] ]
      0x49271b0: i64 = TargetConstant<5104>
      0x6f62b68: i64 = add 0x4926b98, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x4926b98: i64 = add 0x6f62278, 0x4926ed8, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x6f62278: i64,ch = CopyFromReg 0x65cc278, Register:i64 %4, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x7803a58: i64 = Register %4
          0x4926ed8: i64 = shl nuw nsw 0x71ce2d8, Constant:i32<3>, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce2d8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %8, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x77ae620: i64 = Register %8
            0x66ac0c8: i32 = Constant<3>
        0x4926720: i64 = Constant<-8>
      0x7196880: i32 = Constant<8>
In function: _Z21julia_gpu_splat__430516CompilerMetadataI10StaticSizeI5_16__E12DynamicCheckvv7NDRangeILi1ES0_I4_1__ES0_I6_512__EvvEE13CuDeviceArrayI5Int64Li1ELi1EES3_I6UInt32Li1ELi1EES3_IS4_Li1ELi1EE
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/core/context.jl:105
  [2] LLVMTargetMachineEmitToMemoryBuffer
    @ ~/.julia/packages/LLVM/YSJ2s/lib/13/libLLVM_h.jl:947 [inlined]
  [3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/targetmachine.jl:45
  [4] mcgen(job::GPUCompiler.CompilerJob, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/mcgen.jl:74
  [5] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:421 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [8] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:418 [inlined]
  [9] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
 [10] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
 [11] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
 [12] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
 [13] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [14] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [15] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [16] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:292
 [17] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [18] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.StaticSize{(16,)}, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Nothing, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:273
 [19] Kernel
    @ ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:268 [inlined]
 [20] main()
    @ Main ~/code/a.jl:42
 [21] top-level scope
    @ ~/code/a.jl:45
in expression starting at /home/pxl-th/code/a.jl:45 | 
| Ah, I can replicate this error, but I am not sure if it is an Atomix or KernelAbstractions issue. It seems like the CPU version works fine, so maybe it's with UnsafeAtomicsLLVM? Would you be willing to open up a new issue either here or on Atomix (https://github.com/JuliaConcurrent/Atomix.jl) and ping @tkf? | 
| It's an LLVM issue but workaroundable at the level of (e.g.) CUDA.jl. See: JuliaConcurrent/Atomix.jl#33 | 
| @pxl-th, if you are still having trouble with Atomix, I created a separate PR with the atomic support from Core.Intrinsics and CUDA directly in #306. I also added the pkg commands to load in the subdirectory of CUDAKernels in a comment so you can just use it for now if you need. I've been struggling to get things to work as well, so I also added testing infrastructure for Atomix in #308. Hopefully we can iron out all the details there and get all this sorted. If you have run into any issues, please document them there! | 
After some discussions on #282, we decided to use Atomix for atomic support in KA.
A few quick questions:
@atomic macro, we need to specify that we are using the Atomix.@atomic macro in code that needs atomic operations. Should we overdub any @atomic macros in KA to specifically use Atomix?
What about atomic primitives like atomic_add!(...), and atomic_sub!(...) from Atomic attempts #282? These come from either CUDA or Core.Intrinsics. Maybe it's a good idea to use Atomix on top of Atomic attempts #282? I don't know how many people will use the primitives over the macro, to be honest.
Note, this should not be merged until JuliaRegistries/General#61002 is automerged.
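For readers weighing the macro against the primitives, here is a minimal CPU-only sketch of the two styles using Base Julia (1.7 or later) alone. It does not use KernelAbstractions, Atomix, or CUDA, so it only illustrates the shape of each API, not how this PR implements it. The `Cell` type and the variable names are hypothetical, chosen for illustration.

```julia
# CPU-only illustration with Base Julia; not the KernelAbstractions/Atomix API.
using Base.Threads: Atomic, atomic_add!, atomic_max!

# Macro style: Base's @atomic operates on fields declared with @atomic.
mutable struct Cell              # hypothetical type, for illustration only
    @atomic value::Int
end

cell = Cell(3)
old, new = @atomic max(cell.value, 10)   # yields the pair old => new, here 3 => 10
@atomic cell.value += 1                  # atomic read-modify-write; field is now 11

# Primitive style: explicit function calls on an Atomic{T} container.
counter = Atomic{Int}(3)
previous = atomic_max!(counter, 10)      # returns the value held before the update (3)
atomic_add!(counter, 1)                  # counter[] is now 11
```

The macro form reads like ordinary array or field code and returns both the old and new values, which is what the `old, new = @atomic max(...)` pattern in the MWE above relies on. The primitive form names the operation explicitly and returns only the previous value.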