@leios
Contributor

@leios leios commented May 25, 2022

After some discussions on #282, we decided to use Atomix for atomic support in KA.

A few quick questions:

  1. Because both Base and CUDA have an @atomic macro, code that needs atomic operations has to specify that it is using the Atomix.@atomic macro. Should we overdub any @atomic macros in KA to specifically use Atomix?
  2. Should we add the tests from #282 (Atomic attempts)?
  3. What about atomic primitives like atomic_add!(...) and atomic_sub!(...) from #282? These come from either CUDA or Core.Intrinsics. Maybe it's a good idea to use Atomix on top of #282? I don't know how many people will use the primitives over the macro, to be honest.

Note, this should not be merged until JuliaRegistries/General#61002 is automerged.
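
A toy sketch of the name clash behind question 1, under stated assumptions: ToyAtomics is a hypothetical stand-in for a package that ships its own @atomic (it is not Atomix and performs no atomic operation), and Consumer is a hypothetical user module. It only illustrates that an explicit import decides what the bare macro name means inside a module, while Base's macro stays reachable when qualified:

```julia
# Hypothetical stand-in for a package that defines its own `@atomic`.
# It performs no atomic operation; it only reports which macro ran.
module ToyAtomics
macro atomic(ex)
    return "ToyAtomics.@atomic received: $(ex)"
end
end

# Hypothetical user module.
module Consumer
using ..ToyAtomics: @atomic        # bare `@atomic` now means ToyAtomics.@atomic here

which_macro() = @atomic x[1] += 1  # expands to a plain string; `x` is never evaluated

mutable struct Counter
    Base.@atomic n::Int            # Base's macro is still available when qualified
end

bump!(c::Counter) = Base.@atomic c.n += 1
end

println(Consumer.which_macro())
println(Consumer.bump!(Consumer.Counter(0)))
```

Qualifying every call (Atomix.@atomic, Base.@atomic) avoids relying on which import a module happens to use, at the cost of some verbosity.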

Contributor

@tkf tkf left a comment

KernelAbstractions.jl doesn't have to depend on UnsafeAtomicsLLVM.jl (and LLVM.jl)

Co-authored-by: Takafumi Arakaki <[email protected]>
Member

@vchuravy vchuravy left a comment

Looks great!

Probably needs docs as well as AMDGPU support.

Co-authored-by: Valentin Churavy <[email protected]>
@pxl-th
Member

pxl-th commented May 31, 2022

Hi!

Regarding question 3 (atomic primitives like atomic_add!(...)): I have several kernels that use atomic_add! specifically because it returns the old value after adding. I'm not sure whether this is achievable with the macros.

Also I'm curious if it will support things like:

@atomic max(x[i], v)
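
For reference, a Base-only sketch (Julia 1.7 or later, CPU, no Atomix or GPU involved) of the two behaviours asked about here. Threads.atomic_add! returns the value held before the addition, and Base's @atomic with a call such as max(...) returns an old => new pair. The Cell type and the variable names are made up for illustration, and the sketch does not establish whether the macro added in this PR behaves the same way on array elements or on the GPU:

```julia
# Hypothetical container; Base's @atomic works on fields declared atomic.
mutable struct Cell
    @atomic value::Int
end

cell = Cell(4)

# Function-call form: atomically stores max(old, 10) and returns `old => new`.
result = @atomic max(cell.value, 10)
old_value, new_value = result      # a Pair can be destructured like a tuple

# Primitive form: atomic_add! returns the value from before the addition.
counter = Threads.Atomic{Int}(5)
previous = Threads.atomic_add!(counter, 3)

println((old_value, new_value, previous, counter[]))
```

Either form hands back the previous value, so a kernel that needs the old value is not tied to the primitive style in plain Base code.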

@leios leios changed the title from "atting atomic support with atomix" to "adding atomic support with atomix" May 31, 2022
@vchuravy vchuravy marked this pull request as ready for review May 31, 2022 18:25
Co-authored-by: Takafumi Arakaki <[email protected]>
@leios
Contributor Author

leios commented May 31, 2022

I don't mind reworking this PR and #282 so we get both the macro and better ordering support from Atomix and also the atomic_... functions from either Core.Intrinsics or CUDA. I have a branch locally that basically does this and it works fine for my purposes.

I figure most people will want to use the macro, but some people will prefer the atomic_... functions, so why not just do both?

@vchuravy
Member

Let's merge this for now and then you can open a second PR?

@vchuravy vchuravy merged commit 6374613 into JuliaGPU:master May 31, 2022
@leios
Contributor Author

leios commented May 31, 2022

This one is not ready to be merged

@vchuravy
Member

Oops. I got excited that it passed tests :)

@leios
Contributor Author

leios commented May 31, 2022

It was missing docs and tests, at least... I will add them when I get the chance. To be fair, Atomix should have all the necessary tests; I just wanted to double-check here. Documentation does not need to be long, but having a section on atomics with an example would go a long way.

@leios
Contributor Author

leios commented May 31, 2022

I was just waiting to add docs until we settled the atomic "primitive" discussion.

@claforte

claforte commented May 31, 2022

Thanks a lot @leios! @pxl-th and a few others on my team are very much looking forward to this PR being merged for our Instant NeRF (3D reconstruction) Julia implementation. If you'd like a sneak preview, let me know; I can invite you to our private Discord and GitHub. :-)

@pxl-th
Member

pxl-th commented Jun 4, 2022

I've tried this PR, and it looks like on the CPU it only supports integer types.
On the GPU, I get unsupported dynamic function invocation (call to modify!) for any type.
I'm on Julia 1.8.0-rc1, but the same errors are present on 1.7.2.

Error
ERROR: LoadError: InvalidIRError: compiling kernel #gpu_splat!(KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to modify!)
Stacktrace:
 [1] modify!
   @ ~/.julia/packages/Atomix/F9VIX/src/core.jl:33
 [2] macro expansion
   @ ~/code/a.jl:28
 [3] gpu_splat!
   @ ~/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80
 [4] gpu_splat!
   @ ./none:0
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code
Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/validation.jl:139
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:409 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:407 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
  [6] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
  [7] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
  [8] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
  [9] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [10] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [11] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [12] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.StaticSize{(512,)}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:293
 [13] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.DynamicSize, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/4VLF4/src/CUDAKernels.jl:272
 [15] main()
    @ Main ~/code/a.jl:40
 [16] top-level scope
    @ ~/code/a.jl:42
in expression starting at /home/pxl-th/code/a.jl:42

MWE:

Code
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    @atomic max(grid[idx], mlp_out[i])
end

function main()
    #device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n) # errors on CPU with Float32
    grid = rand(device, Int64, n) # errors on CPU with Float32

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()

@leios
Contributor Author

leios commented Jun 4, 2022

I was unable to replicate this error by running the provided code with 1.7.1 and 1.8.0-beta3 (just pulled from git). What OS are you using? Also, could you show the outputs of ] st?

@pxl-th
Member

pxl-th commented Jun 4, 2022

I'm on Ubuntu 22.04
CPU: AMD Ryzen 7 5800HS
GPU: NVIDIA GeForce RTX 3060

]st:

(@v1.8) pkg> st
Status `~/.julia/environments/v1.8/Project.toml`
  [a9b6321e] Atomix v0.1.0
  [052768ef] CUDA v3.10.1
  [72cfdca4] CUDAKernels v0.4.1
  [5789e2e9] FileIO v1.14.0
  [a09fc81d] ImageCore v0.9.3
  [82e4d734] ImageIO v0.6.5
  [02fcd773] ImageTransformations v0.9.4
  [b835a17e] JpegTurbo v0.1.1
  [63c18a36] KernelAbstractions v0.8.1 `https://github.com/JuliaGPU/KernelAbstractions.jl.git#master`

@pxl-th
Member

pxl-th commented Jun 4, 2022

I've just updated the MWE; the version I posted before did not error :)
You can also change the eltypes of grid and mlp_out to Float32 to see that it does not work with them.

@leios
Contributor Author

leios commented Jun 4, 2022

Right, I see the comments now, sorry!

try ]add CUDAKernels#master?

@pxl-th
Member

pxl-th commented Jun 4, 2022

Yes, that works, thanks!

Although there is another issue, which is not critical for me, but might be worth mentioning:

MWE:

Code
using CUDA
using CUDAKernels
using KernelAbstractions
using KernelAbstractions: @atomic

CUDA.allowscalar(false)

const NERF_STEPS = UInt32(1024)
const MIN_CONE_STEPSIZE = 3f0 / NERF_STEPS

n_threads(::CPU) = Threads.nthreads()
n_threads(::CUDADevice) = 512

Base.rand(::CPU, T, shape) = rand(T, shape)
Base.rand(::CUDADevice, T, shape) = CUDA.rand(T, shape)

Base.zeros(::CPU, T, shape) = zeros(T, shape)
Base.zeros(::CUDADevice, T, shape) = CUDA.zeros(T, shape)

to_device(::CPU, x) = copy(x)
to_device(::CUDADevice, x) = CuArray(x)

@inline density_activation(x) = exp(x)

@kernel function splat!(grid, @Const(indices), @Const(mlp_out))
    i = @index(Global)
    idx = indices[i]
    old, new = @atomic max(grid[idx], mlp_out[i])
    @atomic grid[idx] = old
end

function main()
    # device = CPU()
    device = CUDADevice()

    n = 16
    indices = to_device(device, UInt32.(collect(1:n)))
    mlp_out = rand(device, Int64, n)
    grid = zeros(device, Int64, n)

    wait(splat!(device, n_threads(device), n)(grid, indices, mlp_out))
end
main()
Error
ERROR: LoadError: LLVM error: Cannot select: 0x77adc60: ch = AtomicStore<(store seq_cst (s64) into %ir.41, addrspace 1)> 0x4926e08:1, 0x6f629c8, 0x4926e08, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:245 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:201 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:11 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:14 @[ /home/pxl-th/code/a.jl:29 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
  0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
    0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x4926cd0: i64 = Register %0
      0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x66aca88: i64 = Register %9
        0x66ac0c8: i32 = Constant<3>
    0x4926720: i64 = Constant<-8>
  0x4926e08: i64,ch = AtomicLoadMax<(load store seq_cst (s64) on %ir.39, addrspace 1)> 0x7195c50:1, 0x6f629c8, 0x7195c50, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:270 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/atomics.jl:374 @[ /home/pxl-th/.julia/packages/UnsafeAtomicsLLVM/i4GMj/src/internal.jl:18 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:33 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ]
    0x6f629c8: i64 = add 0x6f626f0, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
      0x6f626f0: i64 = add 0x6f62ea8, 0x7195f28, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
        0x6f62ea8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %0, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ]
          0x4926cd0: i64 = Register %0
        0x7195f28: i64 = shl nuw nsw 0x77ae5b8, Constant:i32<3>, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x77ae5b8: i64 = AssertZext 0x71ce208, ValueType:ch:i32, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce208: i64,ch = CopyFromReg 0x65cc278, Register:i64 %9, int.jl:88 @[ abstractarray.jl:1189 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:80 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/references.jl:99 @[ /home/pxl-th/.julia/packages/Atomix/F9VIX/src/core.jl:30 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x66aca88: i64 = Register %9
          0x66ac0c8: i32 = Constant<3>
      0x4926720: i64 = Constant<-8>
    0x7195c50: i64,ch = llvm.nvvm.ldg.global.i<(load (s64) from %ir.34, addrspace 1)> 0x65cc278, TargetConstant:i64<5104>, 0x6f62b68, Constant:i32<8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/base.jl:40 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:120 @[ /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:219 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:40 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ] ] ] ]
      0x49271b0: i64 = TargetConstant<5104>
      0x6f62b68: i64 = add 0x4926b98, Constant:i64<-8>, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
        0x4926b98: i64 = add 0x6f62278, 0x4926ed8, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
          0x6f62278: i64,ch = CopyFromReg 0x65cc278, Register:i64 %4, /home/pxl-th/.julia/packages/LLVM/YSJ2s/src/interop/pointer.jl:110 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x7803a58: i64 = Register %4
          0x4926ed8: i64 = shl nuw nsw 0x71ce2d8, Constant:i32<3>, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
            0x71ce2d8: i64,ch = CopyFromReg 0x65cc278, Register:i64 %8, int.jl:88 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:39 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/pointer.jl:48 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:184 @[ /home/pxl-th/.julia/packages/CUDA/GGwVa/src/device/array.jl:232 @[ /home/pxl-th/code/a.jl:28 @[ /home/pxl-th/.julia/packages/KernelAbstractions/I3MhZ/src/macros.jl:80 @[ none:0 ] ] ] ] ] ] ]
              0x77ae620: i64 = Register %8
            0x66ac0c8: i32 = Constant<3>
        0x4926720: i64 = Constant<-8>
      0x7196880: i32 = Constant<8>
In function: _Z21julia_gpu_splat__430516CompilerMetadataI10StaticSizeI5_16__E12DynamicCheckvv7NDRangeILi1ES0_I4_1__ES0_I6_512__EvvEE13CuDeviceArrayI5Int64Li1ELi1EES3_I6UInt32Li1ELi1EES3_IS4_Li1ELi1EE
Stacktrace:
  [1] handle_error(reason::Cstring)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/core/context.jl:105
  [2] LLVMTargetMachineEmitToMemoryBuffer
    @ ~/.julia/packages/LLVM/YSJ2s/lib/13/libLLVM_h.jl:947 [inlined]
  [3] emit(tm::LLVM.TargetMachine, mod::LLVM.Module, filetype::LLVM.API.LLVMCodeGenFileType)
    @ LLVM ~/.julia/packages/LLVM/YSJ2s/src/targetmachine.jl:45
  [4] mcgen(job::GPUCompiler.CompilerJob, mod::LLVM.Module, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/mcgen.jl:74
  [5] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [6] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:421 [inlined]
  [7] macro expansion
    @ ~/.julia/packages/TimerOutputs/LDL7n/src/TimerOutput.jl:252 [inlined]
  [8] macro expansion
    @ ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:418 [inlined]
  [9] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/utils.jl:64
 [10] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:354
 [11] #224
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:347 [inlined]
 [12] JuliaContext(f::CUDA.var"#224#225"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(gpu_splat!), Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}}})
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/driver.jl:74
 [13] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:346
 [14] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/XyxTy/src/cache.jl:90
 [15] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}}; name::Nothing, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:299
 [16] cufunction(f::typeof(gpu_splat!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(16,)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.StaticSize{(1,)}, KernelAbstractions.NDIteration.StaticSize{(512,)}, Nothing, Nothing}}, CuDeviceVector{Int64, 1}, CuDeviceVector{UInt32, 1}, CuDeviceVector{Int64, 1}}})
    @ CUDA ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:292
 [17] macro expansion
    @ ~/.julia/packages/CUDA/GGwVa/src/compiler/execution.jl:102 [inlined]
 [18] (::KernelAbstractions.Kernel{CUDADevice, KernelAbstractions.NDIteration.StaticSize{(512,)}, KernelAbstractions.NDIteration.StaticSize{(16,)}, typeof(gpu_splat!)})(::CuArray{Int64, 1, CUDA.Mem.DeviceBuffer}, ::Vararg{Any}; ndrange::Nothing, dependencies::CUDAKernels.CudaEvent, workgroupsize::Nothing, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:273
 [19] Kernel
    @ ~/.julia/packages/CUDAKernels/JJJ1U/src/CUDAKernels.jl:268 [inlined]
 [20] main()
    @ Main ~/code/a.jl:42
 [21] top-level scope
    @ ~/code/a.jl:45
in expression starting at /home/pxl-th/code/a.jl:45

@leios
Contributor Author

leios commented Jun 4, 2022

Ah, I can replicate this error, but I am not sure if it is an Atomix or KernelAbstractions issue. It seems like the CPU version works fine, so maybe it's with UnsafeAtomicsLLVM?

Would you be willing to open up a new issue either here or on Atomix (https://github.com/JuliaConcurrent/Atomix.jl) and ping @tkf?

@tkf
Contributor

tkf commented Jun 5, 2022

It's an LLVM issue, but it can be worked around at the level of (e.g.) CUDA.jl. See: JuliaConcurrent/Atomix.jl#33

@leios
Contributor Author

leios commented Jun 14, 2022

@pxl-th, if you are still having trouble with Atomix, I created a separate PR with the atomic support from Core.Intrinsics and CUDA directly in #306. I also added the pkg commands to load in the subdirectory of CUDAKernels in a comment so you can just use it for now if you need.

I've been struggling to get things to work as well, so I also added testing infrastructure for Atomix in #308. Hopefully we can iron out all the details there and get all this sorted. If you have run into any issues, please document them there!
