Why not set stream in `gemmv2_forward_cuda` and `dequantize_weights_cuda` #23

Hi, thanks for the great work.

I am working on deploying a model in a distributed environment. Since `dequantize_weights` launches its kernels on the default stream, they can overlap with the NCCL `all_reduce` kernel, which leads to unexpected results.

https://github.com/casper-hansen/AutoAWQ_kernels/blob/83d1f4b326a9067d0f94f089ef1bb47cf5377134/awq_ext/quantization/gemm_cuda_gen.cu#L1162

![W1WQQyhNsv](https://github.com/casper-hansen/AutoAWQ_kernels/assets/1239736/8238c6fa-7f65-4479-970b-a5f772542048)

Is there any reason to use the default stream here?
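
To make the request concrete, here is a minimal sketch of what I mean by setting the stream. It is plain CUDA runtime code, not the AutoAWQ code: `scale_kernel` and `scale_on_stream` are hypothetical stand-ins for the dequantization kernel and its host wrapper. In a PyTorch extension I would expect the stream to come from something like `at::cuda::getCurrentCUDAStream()` rather than being created by hand, but that is an assumption about how the fix would look here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for a dequantization kernel: scales each element.
__global__ void scale_kernel(const float* in, float* out, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * scale;
    }
}

// Hypothetical host wrapper: the caller decides which stream orders the work.
void scale_on_stream(const float* in, float* out, float scale, int n,
                     cudaStream_t stream) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    // The fourth launch parameter selects the stream.
    // Leaving it out launches on the default stream instead.
    scale_kernel<<<blocks, threads, 0, stream>>>(in, out, scale, n);
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);

    // Stands in for the framework's current stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy in, compute, and copy out are all ordered on the same stream.
    cudaMemcpyAsync(d_in, host.data(), bytes, cudaMemcpyHostToDevice, stream);
    scale_on_stream(d_in, d_out, 2.0f, n, stream);
    cudaMemcpyAsync(host.data(), d_out, bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);

    printf("host[0] = %f\n", host[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

With the stream passed explicitly, the kernel is ordered relative to whatever else the caller enqueues or synchronizes on that stream, instead of always landing on the default stream.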
