
Commit 3ae202f

GGUF fix for unquantized types when using unquantize kernels
Even if the `qweight_type` is one of the `UNQUANTIZED_TYPES`, `qweight` still has to be "dequantized" because it is stored as an 8-bit tensor. Without doing so, the following matmul fails with a shape mismatch.

Side notes:
- Why isn't `DIFFUSERS_GGUF_CUDA_KERNELS` on by default? It is significantly faster and is only used when the kernels are installed.
- https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
1 parent dbe4136 commit 3ae202f

File tree

1 file changed: +2 -1 lines changed


src/diffusers/quantizers/gguf/utils.py

Lines changed: 2 additions & 1 deletion
@@ -79,7 +79,8 @@
 def _fused_mul_mat_gguf(x: torch.Tensor, qweight: torch.Tensor, qweight_type: int) -> torch.Tensor:
     # there is no need to call any kernel for fp16/bf16
     if qweight_type in UNQUANTIZED_TYPES:
-        return x @ qweight.T
+        weight = dequantize_gguf_tensor(qweight)
+        return x @ weight.T
 
     # TODO(Isotr0py): GGUF's MMQ and MMVQ implementation are designed for
     # contiguous batching and inefficient with diffusers' batching,
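To illustrate the shape mismatch the commit message describes, here is a minimal, hypothetical sketch: an unquantized bf16 weight kept as raw 8-bit storage has twice as many columns as the logical tensor, so the old `x @ qweight.T` cannot work. The uint8/bf16 views below are only stand-ins for GGUF's raw storage and for `dequantize_gguf_tensor`, and the shapes are made up for the example; this is not the repository's actual code.

import torch

out_features, in_features = 8, 16
x = torch.randn(4, in_features, dtype=torch.bfloat16)

# A bf16 weight the way GGUF stores it: raw bytes, 2 per element, so the
# stored tensor has shape (8, 32) instead of the logical (8, 16).
weight = torch.randn(out_features, in_features, dtype=torch.bfloat16)
qweight = weight.view(torch.uint8)

# Old path: x @ qweight.T would be (4, 16) @ (32, 8) -> fails
# (mismatched shapes, and mismatched dtypes on top of that).

# Fixed path: recover the logical bf16 tensor first, then matmul.
dequantized = qweight.view(torch.bfloat16)  # stand-in for dequantize_gguf_tensor(qweight)
out = x @ dequantized.T
print(out.shape)  # torch.Size([4, 8])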
