The following works:
function broken_kernel()
    c_frags = LocalArray{Tuple{16}, CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}}(undef)
    frag = CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}(ntuple(_->Float16(0), 8))
    setindex(c_frags, frag, 1)
    return
end
CUDA.code_llvm(GemmKernels.Kernel.broken_kernel, Tuple{})
But bumping the size parameter of the LocalArray from Tuple{16} to Tuple{64} (the element type stays the same) results in an apply_iterate in the generated code.
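For reference, the failing variant would presumably look like the sketch below. The name `broken_kernel_64` is hypothetical, and the function is assumed to be defined in the same module as `broken_kernel` (so that `LocalArray` and `setindex` resolve the same way). The only intended difference from the working version is the size parameter.

```julia
# Hypothetical variant of the kernel above; assumed to live next to
# broken_kernel inside GemmKernels.Kernel.
function broken_kernel_64()
    # Only change: the LocalArray size parameter goes from Tuple{16} to Tuple{64}.
    c_frags = LocalArray{Tuple{64}, CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}}(undef)
    frag = CUDA.WMMA.Fragment{16, 16, 16, 8, Float16, CUDA.WMMA.Unspecified, CUDA.WMMA.Accumulator}(ntuple(_->Float16(0), 8))
    setindex(c_frags, frag, 1)
    return
end

# Inspect the generated LLVM IR, as for the working version.
CUDA.code_llvm(GemmKernels.Kernel.broken_kernel_64, Tuple{})
```

Comparing the IR of the two versions side by side should show where the apply_iterate appears.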