See https://github.com/JuliaConcurrent/Atomix.jl/tree/main/lib/AtomixCUDA and https://github.com/JuliaGPU/CUDA.jl/pull/1790