ggml WebGPU: add support for quantization types #15440
Conversation
The NaN present in a …
@slaren thanks for the pointer to where things might be going wrong. I guess my main question is whether debugging the potential conversion failure needs to happen before this PR is merged, or whether it can be delayed. I'm sure I could do some debugging, but floating point conversion is not my area of expertise 😅. The error is not very consistent either; it seems to occur on different specific tests, so my guess is that some seeds for the random initialization trigger it (see the sketch after this comment).

* To get this PR merged without debugging, what I would do is disable …
* Another issue is that as the WebGPU implementation becomes more optimized, we will start using subgroup/simdgroup operations as well, in which case the CI won't be able to run the WebGPU backend either.
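As an aside on why these mismatches show up as hard failures: below is a minimal sketch (my own illustration, not code from `test-backend-ops`) showing that a `NaN` in either output poisons an element-wise error metric, because `NaN` never compares equal and propagates through sums.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Pretend these are the CPU reference output and the WebGPU backend output.
    std::vector<float> cpu = {1.0f, std::nanf(""), 3.0f};
    std::vector<float> gpu = {1.0f, std::nanf(""), 3.0f};

    bool   any_nan = false;
    double err     = 0.0;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        any_nan = any_nan || std::isnan(cpu[i]) || std::isnan(gpu[i]);
        err    += (cpu[i] - gpu[i]) * (cpu[i] - gpu[i]);  // NaN - NaN is NaN
    }

    // Prints "any_nan=1 err=nan": even though both sides produced "the same"
    // NaN, any error metric computed over it is NaN, so the comparison fails.
    std::cout << "any_nan=" << any_nan << " err=" << err << "\n";
}
```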
I think I found the issue; I'll open a PR soon that should fix it. While looking into it, I found that these statements cause buffer overflows (llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp, lines 1241 to 1251 at commit 6552e2e).
This is likely because these strings are not null-terminated, so converting them to `std::string` requires passing the length explicitly, e.g.:

```cpp
device_ctx.device_desc = std::string(info.description.data, info.description.data + info.description.length);
device_ctx.device_desc = info.description; // alternatively, rely on the conversion to std::string_view
```

You should be able to reproduce this by building with the address sanitizer flag enabled.
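To make the failure mode concrete, here is a small standalone sketch (the struct and names are my own stand-ins, not Dawn's API or the PR's code): constructing a `std::string` from a `{data, length}` view without the length scans for a terminating `'\0'` that may not exist, while the length-based constructors stay within bounds.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Stand-in for a WebGPU-style string view: `data` is NOT guaranteed to be
// null-terminated, so only `length` bytes are valid.
struct StringViewSketch {
    const char * data;
    std::size_t  length;
};

int main() {
    const char backing[] = {'D', 'a', 'w', 'n'};  // no trailing '\0'
    StringViewSketch desc = { backing, sizeof(backing) };

    // BUG: std::string(desc.data) keeps reading until it finds a '\0',
    // which can run past the buffer (this is what ASan reports).
    // std::string bad(desc.data);

    // OK: pass the length (or the end pointer) explicitly.
    std::string ok1(desc.data, desc.length);
    std::string ok2(desc.data, desc.data + desc.length);

    std::cout << ok1 << " " << ok2 << "\n";  // prints "Dawn Dawn"
}
```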
The reason the tests fail is because …

Some lessons learned: …
@slaren thanks for the investigation! Sorry I didn't find the issues earlier; I think I was misled by the …. There was indeed a bug in the ….

There's also another issue still, which is that once the WebGPU code starts to use more optimized matrix multiplication, it seems like the CI as set up now won't be able to run the more optimized code. But I'm glad it's catching bugs as the implementation progresses!
To improve the CI we can add a WebGPU node to ggml-ci.
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get first i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
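The commits above mention getting q4_0 mul_mat working and simplifying the bit-shifting; as a rough illustration of the kind of per-block unpacking the quantized matmul shaders have to do, here is a hedged C++ sketch of q4_0 dequantization (my own approximation of ggml's layout, not code from the PR; the real block stores its scale as fp16):

```cpp
#include <cstdint>
#include <cstdio>

// Simplified q4_0 block: one scale plus 32 4-bit quants packed two per byte.
struct block_q4_0_sketch {
    float        d;       // per-block scale (fp16 in ggml, float here for simplicity)
    std::uint8_t qs[16];  // 32 x 4-bit quants
};

// Unpack the low nibbles into out[0..15] and the high nibbles into out[16..31],
// mapping each 4-bit value from [0, 15] to [-8, 7] before scaling.
void dequantize_q4_0_sketch(const block_q4_0_sketch & b, float out[32]) {
    for (int j = 0; j < 16; ++j) {
        const int x0 = (b.qs[j] & 0x0F) - 8;  // low nibble
        const int x1 = (b.qs[j] >> 4)   - 8;  // high nibble
        out[j]      = x0 * b.d;
        out[j + 16] = x1 * b.d;
    }
}

int main() {
    block_q4_0_sketch b = { 0.5f, {} };
    b.qs[0] = 0x1F;  // low nibble 0xF -> +7, high nibble 0x1 -> -7
    float out[32];
    dequantize_q4_0_sketch(b, out);
    std::printf("%.1f %.1f\n", out[0], out[16]);  // prints "3.5 -3.5"
}
```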
This PR adds support for basic matrix multiplication using many of the quantization types supported by ggml. Due to alignment requirements for WebGPU buffers (see the padding sketch below), I have not yet implemented support for `mxfp4`, and WebGPU does not support `bf16` natively. Otherwise, all types that `test-backend-ops` covers are supported. Some types, such as `Q8_1`, are not tested by `test-backend-ops`, so I didn't enable them in `supports_op`, and I didn't implement other types that aren't tested yet. Also, not all multiplications involving `fp16` are supported by the CPU backend, so I didn't enable those either.

There is also an issue where some tests produce `NaN` on both the CPU and WebGPU side, for example https://github.com/reeselevine/llama.cpp/actions/runs/17083427966/job/48442585422#step:7:26905. I'm not sure what the best solution to this is; I noticed that the Metal `test-backend-ops` CI doesn't actually seem to run any `MUL_MAT` operation tests at all.

Finally, I moved `device_init` to `reg_get_device`, as I noticed this was necessary so that some objects are initialized by the time they are actually used.
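On the buffer alignment mentioned above (and the "Add padding to end of buffers on creation" commit): a minimal sketch, assuming a 4-byte granularity since WebGPU requires copy and storage-binding sizes to be multiples of 4; the helper name and the exact alignment constant used in the PR are my assumptions.

```cpp
#include <cstddef>

// Round a byte size up to the next multiple of `align` (a power of two).
constexpr std::size_t webgpu_pad_size(std::size_t size, std::size_t align = 4) {
    return (size + align - 1) & ~(align - 1);
}

static_assert(webgpu_pad_size(16) == 16, "already a multiple of 4");
static_assert(webgpu_pad_size(18) == 20, "an 18-byte quantized block rounds up to 20");
```

Padding the allocation at creation time is one way to keep copies and bindings valid for quantized tensors whose byte size is not already a multiple of the required granularity.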