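Currently `NNlibCUDA.gather!` does not check bounds. In the example below, index `4` is out of range for a `src` with 3 columns, yet the call returns a result instead of throwing an error:

```julia
using NNlib, CUDA
using NNlibCUDA

src = CUDA.rand(2, 3)
NNlib.gather(src, cu[1, 4])  # index 4 exceeds size(src, 2) == 3
# 2×2 CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}:
#  0.430532  0.0
#  0.474528  0.0
```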
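For illustration, here is a minimal sketch of the kind of check that seems to be missing, written as a standalone pre-check on plain CPU arrays. `check_gather_bounds` is a hypothetical helper, not part of NNlib or NNlibCUDA, and it assumes integer indices that select along the last dimension of `src`, as in the example above.

```julia
# Hypothetical helper; not part of NNlib or NNlibCUDA.
# Assumes integer indices that select along the last dimension of `src`.
function check_gather_bounds(src::AbstractArray, idx::AbstractArray{<:Integer})
    n = size(src, ndims(src))  # number of valid positions along the last dimension
    all(i -> 1 <= i <= n, idx) ||
        throw(ArgumentError("gather index out of range 1:$n"))
    return nothing
end

src = rand(Float32, 2, 3)
check_gather_bounds(src, [1, 3])    # in range: returns nothing
# check_gather_bounds(src, [1, 4])  # would throw an ArgumentError
```

Calling such a check before `NNlib.gather` could serve as a stopgap; whether the check belongs in `gather!` itself or in a wrapper is a design question for the maintainers.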