When an array has 16 or more dimensions, indexing a `CartesianIndices` allocates dynamically, and many CUDA array functions break as a result.
```julia
julia> x = (fill(2, 15)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     84.947 ns (0.00% GC)
  median time:      85.537 ns (0.00% GC)
  mean time:        86.939 ns (0.00% GC)
  maximum time:     183.274 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     963
```
```julia
julia> x = (fill(2, 16)...,)
(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2)

julia> @benchmark CartesianIndices($x)[3]
BenchmarkTools.Trial:
  memory estimate:  4.39 KiB
  allocs estimate:  73
  --------------
  minimum time:     3.644 μs (0.00% GC)
  median time:      3.713 μs (0.00% GC)
  mean time:        4.002 μs (0.98% GC)
  maximum time:     68.644 μs (92.66% GC)
  --------------
  samples:          10000
  evals/sample:     8
```
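The 16-element cutoff lines up with the `Any16` fallback of `map` on tuples in Base (the `map at tuple.jl` frame in the stacktrace further down): tuples of 16 or more elements are mapped through a heap-allocated `Vector{Any}` and then splatted back into a tuple, which is exactly what a GPU kernel cannot do. A minimal sketch to probe the cutoff (the exact allocation counts may vary across Julia versions):

```julia
# Compare map over a 15-tuple vs a 16-tuple; the latter hits the
# long-tuple fallback in Base's tuple.jl, which allocates a Vector{Any}.
x15 = ntuple(_ -> 2, 15)
x16 = ntuple(_ -> 2, 16)

f(x) = map(Base.OneTo, x)
f(x15); f(x16)            # warm up the compiler first

@show @allocated f(x15)   # unrolled for short tuples
@show @allocated f(x16)   # falls back to the allocating Vector{Any} path
```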
I think it might be related to tuple splatting. The current `getindex` reached by `CartesianIndices` is:

```julia
function getindex(A::AbstractArray, I...)
    @_propagate_inbounds_meta
    error_if_canonical_getindex(IndexStyle(A), A, I...)
    _getindex(IndexStyle(A), A, to_indices(A, I)...)
end
```
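One way to sidestep the long-tuple `map` fallback, sketched here as an illustration rather than a proposed Base change, is to build the axes with `ntuple` and a `Val` length, which the compiler unrolls for any `N`:

```julia
# Hypothetical allocation-free axes construction: ntuple with Val(N)
# is unrolled at compile time, avoiding the long-tuple map fallback
# that CartesianIndices(dims) hits via map(Base.OneTo, dims).
static_axes(dims::NTuple{N,Int}) where {N} =
    ntuple(i -> Base.OneTo(dims[i]), Val(N))

dims = ntuple(_ -> 2, 20)
inds = CartesianIndices(static_axes(dims))  # axes built without the fallback
```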
MWE to trigger the error:
```julia
julia> cz = CUDA.zeros(fill(2, 20)...);

julia> cz / Float32(0.3)
ERROR: InvalidIRError: compiling kernel broadcast_kernel(CUDA.CuKernelContext, CuDeviceArray{Float32,20,1}, Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, Int64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [2] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to getindex)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported dynamic function invocation (call to setindex!)
Stacktrace:
 [1] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:62
Reason: unsupported call through a literal pointer (call to jl_alloc_array_1d)
Stacktrace:
 [1] Array at boot.jl:406
 [2] map at tuple.jl:168
 [3] axes at abstractarray.jl:75
 [4] CartesianIndices at multidimensional.jl:264
 [5] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [6] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported call to the Julia runtime (call to jl_f__apply_iterate)
Stacktrace:
 [1] map at tuple.jl:172
 [2] axes at abstractarray.jl:75
 [3] CartesianIndices at multidimensional.jl:264
 [4] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [5] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Reason: unsupported dynamic function invocation (call to CartesianIndices)
Stacktrace:
 [1] CartesianIndices at multidimensional.jl:264
 [2] macro expansion at /home/leo/.julia/dev/GPUArrays/src/device/indexing.jl:81
 [3] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:61
Stacktrace:
 [1] check_ir(::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget,CUDA.CUDACompilerParams}, ::LLVM.Module) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/validation.jl:123
 [2] macro expansion at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:239 [inlined]
 [3] macro expansion at /home/leo/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [4] codegen(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:237
 [5] compile(::Symbol, ::GPUCompiler.CompilerJob; libraries::Bool, deferred_codegen::Bool, optimize::Bool, strip::Bool, validate::Bool, only_entry::Bool) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:39
 [6] compile at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/driver.jl:35 [inlined]
 [7] cufunction_compile(::GPUCompiler.FunctionSpec; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:310
 [8] cufunction_compile(::GPUCompiler.FunctionSpec) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:305
 [9] check_cache(::Dict{UInt64,Any}, ::Any, ::Any, ::GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#12",Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}, ::UInt64; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:40
 [10] broadcast_kernel at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:60 [inlined]
 [11] cached_compilation at /home/leo/.julia/packages/GPUCompiler/uTpNx/src/cache.jl:65 [inlined]
 [12] cufunction(::GPUArrays.var"#broadcast_kernel#12", ::Type{Tuple{CUDA.CuKernelContext,CuDeviceArray{Float32,20,1},Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuDeviceArray{Float32,20,1},NTuple{20,Bool},NTuple{20,Int64}},Float32}},Int64}}; name::Nothing, kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:297
 [13] cufunction at /home/leo/.julia/packages/CUDA/YeS8q/src/compiler/execution.jl:294 [inlined]
 [14] launch_heuristic(::CUDA.CuArrayBackend, ::GPUArrays.var"#broadcast_kernel#12", ::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{Base.Broadcast.Extruded{CuArray{Float32,20},NTuple{20,Bool},NTuple{20,Int64}},Float32}}, ::Int64; maximize_blocksize::Bool) at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:19
 [15] launch_heuristic at /home/leo/.julia/packages/CUDA/YeS8q/src/gpuarrays.jl:17 [inlined]
 [16] copyto!(::CuArray{Float32,20}, ::Base.Broadcast.Broadcasted{Nothing,NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at /home/leo/.julia/dev/GPUArrays/src/host/broadcast.jl:66
 [17] copyto! at ./broadcast.jl:886 [inlined]
 [18] copy(::Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{20},NTuple{20,Base.OneTo{Int64}},typeof(/),Tuple{CuArray{Float32,20},Float32}}) at ./broadcast.jl:862
 [19] materialize at ./broadcast.jl:837 [inlined]
 [20] broadcast_preserving_zero_d at ./broadcast.jl:826 [inlined]
 [21] /(::CuArray{Float32,20}, ::Float32) at ./arraymath.jl:55
 [22] top-level scope at REPL[16]:1
```
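As a possible user-side workaround (an assumption on my part, not a verified fix for this issue), the elementwise division can be routed through a 1-dimensional reshape, so the kernel never has to do index arithmetic in the >16-dimension regime:

```julia
using CUDA

cz = CUDA.zeros(fill(2, 20)...)
# reshape to a vector (no copy), broadcast there, then reshape back
result = reshape(reshape(cz, :) ./ Float32(0.3), size(cz))
```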
Related issue: JuliaLang/julia#27398, but that one seems to have been fixed back in 2018...
@Roger-luo do you know what is happening in this case?
Originally posted by @GiggleLiu in #334 (comment)