Labels: enhancement (New feature or request)
Describe the bug
Any use of `shfl_sync` throws an error saying that `shfl_recurse` is a dynamic function.
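If "any use" holds, the failure should not depend on the compaction logic in the MWE below. A reduced sketch (hypothetical, not verified here) that just shuffles an Int64 — which, per the error message, is presumably lowered through `shfl_recurse` since the payload is wider than 32 bits — would be:

```julia
using CUDA

function shuffle64!(out)
    # shuffle a 64-bit value from lane 0 (same lane argument as the MWE below);
    # the wide payload is what presumably routes through shfl_recurse
    v = Int64(threadIdx().x)
    out[threadIdx().x] = CUDA.shfl_sync(0xffffffff, v, Int32(0))
    return
end

out = CUDA.zeros(Int64, 32)
@cuda threads=32 shuffle64!(out)
```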
To reproduce
The minimal working example (MWE) for this bug attempts a stream compaction:
```julia
using CUDA

# fill an array of 64 elements with random zeros and ones
a = rand(0:1, 64)
a_gpu = CuArray(a)
b_gpu = CUDA.zeros(Int64, 64)
count = CUDA.zeros(Int64, 1)

function mykernel!(in, out, count)
    threadNum = threadIdx().x + blockDim().x * (blockIdx().x - 1) # 1-indexed
    warpNum = (threadIdx().x - 1) ÷ 32                            # 0-indexed
    laneNum = (threadIdx().x - 1) % 32                            # 0-indexed
    shared_count = CuDynamicSharedArray(Int64, 1)

    if threadNum == 1
        shared_count[1] = 0
    end
    sync_threads()
    if threadNum <= 64
        is_nonzero = in[threadNum] != 0
        mask = CUDA.vote_ballot_sync(0xffffffff, is_nonzero)
        warp_count = count_ones(mask)
        warp_offset = 0
        if laneNum == 0
            warp_offset = CUDA.atomic_add!(pointer(shared_count, 1), warp_count)
        end
        warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0)) # <<< this line triggers the error
        if is_nonzero
            index = count_ones(mask & ((UInt32(1) << laneNum) - UInt32(1))) + warp_offset
            out[index + 1] = threadNum
        end
    end
    sync_threads()
    if threadIdx().x == 1
        CUDA.atomic_add!(pointer(count), shared_count[1])
    end
    return
end

@cuda threads=64 blocks=1 shmem=sizeof(Int64) mykernel!(a_gpu, b_gpu, count)
println("nonzeros:$(collect(count))")
println(collect(b_gpu))
```

Manifest.toml
Package versions:
Status `~/.julia/environments/v1.11/Project.toml`
  [052768ef] CUDA v5.5.2
No Matches in `~/.julia/environments/v1.11/Project.toml`
CUDA details:
CUDA runtime version: 12.6.0
CUDA driver version: 12.6.0
CUDA capability: 9.0.0
Expected behavior
The expected behavior is that the shuffle function does not throw, and that all zeros in `a` are removed when the data is compacted into `b`.
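A possible workaround, assuming the error is triggered by the 64-bit payload (the error message points at `shfl_recurse`, which handles values wider than 32 bits), is to keep the per-warp offset as an Int32 so the shuffle stays a single 32-bit operation. A hedged sketch of just the changed lines of the MWE (lane argument kept as in the original; verify the source-lane convention against your CUDA.jl version):

```julia
# inside the kernel: keep the broadcast value 32-bit so shfl_sync does not
# have to split a 64-bit payload (assumption: the Int64 payload is the trigger)
warp_offset = Int32(0)
if laneNum == 0
    # truncate the Int64 returned by atomic_add! to Int32 before shuffling
    warp_offset = Int32(CUDA.atomic_add!(pointer(shared_count, 1), warp_count))
end
warp_offset = CUDA.shfl_sync(0xffffffff, warp_offset, Int32(0))
```

The later `index = count_ones(...) + warp_offset` still promotes to Int64, so the rest of the kernel is unchanged; this only sidesteps the wide shuffle, it does not fix the underlying dynamic-dispatch error.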
Version info
Details on Julia:
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 64 × Intel(R) Xeon(R) Platinum 8462Y+
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, sapphirerapids)
Threads: 1 default, 0 interactive, 1 GC (on 64 virtual cores)
Details on CUDA:
CUDA driver 12.6
NVIDIA driver 550.90.7
CUDA libraries: 
- CUBLAS: 12.6.3
- CURAND: 10.3.7
- CUFFT: 11.3.0
- CUSOLVER: 11.7.1
- CUSPARSE: 12.5.4
- CUPTI: 2024.3.2 (API 24.0.0)
- NVML: 12.0.0+550.90.7
Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0
Toolchain:
- Julia: 1.11.1
- LLVM: 16.0.6
1 device:
  0: NVIDIA H100 80GB HBM3 (sm_90, 77.409 GiB / 79.647 GiB available)