Does KernelAbstractions.jl support automatically choosing the workgroupsize when the kernel's local memory usage depends on the groupsize? For example, CUDA.launch_configuration accepts a shmem callback that maps a candidate number of threads to the shared memory that configuration would use; CUDA.jl relies on this to implement mapreduce. Since the shmem argument to CUDA.launch_configuration is not used in Kernel{CUDADevice}, I guess this isn't implemented yet? Is it related to #19?
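For context, here is a minimal sketch of the CUDA.jl pattern I mean. The kernel body and array names are hypothetical placeholders; the real API pieces are `@cuda launch=false`, `launch_configuration`, and its `shmem` keyword, which accepts either a byte count or a function of the thread count:

```julia
using CUDA

# Hypothetical kernel: allocates one Float32 of dynamic shared memory per thread.
function reduce_kernel(out, x)
    shared = CuDynamicSharedArray(Float32, blockDim().x)
    # ... reduction body elided ...
    return
end

out = CUDA.zeros(Float32, 1)
x = CUDA.rand(Float32, 1024)

# Compile without launching, then query an occupancy-based configuration.
kernel = @cuda launch=false reduce_kernel(out, x)
config = launch_configuration(kernel.fun;
                              shmem = threads -> threads * sizeof(Float32))

# config.threads now reflects a thread count chosen with the
# per-thread shared-memory cost taken into account.
```

The point is that the occupancy calculator can only pick a good thread count here if it knows how shared memory scales with it, which is exactly the coupling my question is about.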