Hi, thanks for the great work.
I am working on deploying a model in a distributed environment. Since `dequantize_weights` is launched on the default stream, its kernels can overlap with the NCCL all_reduce kernel, which leads to unexpected results.

    dequantize_weights<<<num_blocks, threads_per_block>>>(kernel, scaling_factors, zeros, de_kernel, G, in_c, out_c);

Is there any reason to use the default stream here?
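
For illustration, here is a minimal standalone sketch of launching a kernel on an explicit stream instead of the default one. It is not the project's code: `scale_kernel` is a hypothetical stand-in for `dequantize_weights`, and the stream is created locally only so that the example is self-contained. If this code is built as a PyTorch extension (an assumption on my part), the stream would more likely come from `at::cuda::getCurrentCUDAStream()`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for dequantize_weights; the real kernel takes
// (kernel, scaling_factors, zeros, de_kernel, G, in_c, out_c).
__global__ void scale_kernel(const float* in, float* out, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

int main() {
    const int n = 1024;
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    // Explicit stream; assumed here to play the role of the stream that the
    // rest of the pipeline (e.g. the NCCL collectives) is ordered against.
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

    cudaMemcpyAsync(d_in, h_in.data(), n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    const int threads_per_block = 256;
    const int num_blocks = (n + threads_per_block - 1) / threads_per_block;

    // The fourth launch parameter selects the stream; the third is the
    // dynamic shared memory size (0 here).
    scale_kernel<<<num_blocks, threads_per_block, 0, stream>>>(d_in, d_out, 0.5f, n);

    cudaMemcpyAsync(h_out.data(), d_out, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // wait for all work queued on this stream

    bool ok = (cudaGetLastError() == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) ok = (h_out[i] == h_in[i] * 0.5f);
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

With the kernel queued on the same stream as the surrounding work, ordering relative to that work follows from stream semantics rather than from whatever the default stream happens to synchronize with.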