Conversation

reeselevine (Collaborator) commented Aug 20, 2025

This PR adds support for basic matrix multiplication using many of the quantization types supported by ggml. Due to alignment requirements for WebGPU buffers, I have not yet implemented support for mxfp4, and WebGPU does not support bf16 natively. Otherwise, all types exercised by test-backend-ops are supported.

  • Since WGSL compilers like Dawn do not support any preprocessing natively, the Python script for generating shaders does some minor preprocessing. I'm hopeful this will extend relatively easily to support more operations/more complex generation, but I'm open to other ideas if anyone has any.
  • I noticed that some types, e.g., Q8_1, are not tested by test-backend-ops, so I didn't enable them in supports_op, and I didn't implement other types that aren't tested yet.
  • It doesn't look like combinations of quantization and fp16 are supported by the CPU backend, so I didn't enable those either.
  • In the CI, I sometimes see failures that seem unrelated to my WebGPU implementation, mostly on Metal backends, where the result of the matrix multiplication is NaN on both the CPU and WebGPU side; see https://github.com/reeselevine/llama.cpp/actions/runs/17083427966/job/48442585422#step:7:26905 for an example. I'm not sure what the best solution is; notably, the Metal test-backend-ops job doesn't actually seem to run any MUL_MAT operation tests at all.
  • This PR also includes some other minor modifications, like adding helper functions for tensor offset/alignment when binding buffers, and moving some initialization from device_init to reg_get_device, which is necessary so that some objects are initialized by the time they are actually used.
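The preprocessing bullet above can be illustrated with a small sketch. The directive syntax (`#ifdef`/`#endif` plus `{{NAME}}` placeholders) and the shader snippet here are hypothetical, not the actual generation script from this PR:

```python
import re

def preprocess_wgsl(source: str, defines: dict) -> str:
    """Minimal WGSL 'preprocessor': strips #ifdef/#endif blocks whose
    symbol is not defined, then expands {{NAME}} placeholders.
    Hypothetical directive syntax, for illustration only."""
    out_lines = []
    skipping = False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith("#ifdef"):
            skipping = stripped.split()[1] not in defines
            continue
        if stripped.startswith("#endif"):
            skipping = False
            continue
        if not skipping:
            out_lines.append(line)
    result = "\n".join(out_lines)
    # Expand {{NAME}} placeholders from the defines map.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(defines[m.group(1)]), result)

shader = """\
#ifdef USE_F16
alias elem_t = f16;
#endif
const BLOCK_SIZE: u32 = {{BLOCK_SIZE}}u;
"""
print(preprocess_wgsl(shader, {"USE_F16": 1, "BLOCK_SIZE": 64}))
```

The WGSL compiler then only ever sees fully expanded source, so no compiler-side preprocessing support is needed.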

github-actions bot added the python (python script changes) and ggml (changes relating to the ggml tensor library for machine learning) labels on Aug 20, 2025
slaren (Member) commented Aug 20, 2025

In the CI, I sometimes see failures that seem unrelated to my WebGPU implementation, mostly on Metal backends, where the result of the matrix multiplication is NaN on both the CPU and WebGPU side.

[NONE] NaN at index 136352 (WebGPU: WebGPU=nan CPU=nan)
[MUL_MAT] NaN at index 64 (WebGPU: WebGPU=nan CPU=nan)
MUL_MAT(type_a=f16,type_b=f32,m=129,n=1,k=1057,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=1): FAIL

The NaN present in a NONE node suggests that there was some issue initializing a tensor, this should never happen. If this consistently happens for f16, it might be an issue when converting f32 to f16 with this compiler. I would check that the ggml_compute_fp16_to_fp32 and ggml_compute_fp32_to_fp16 functions in ggml-impl.h work correctly with this compiler.
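A quick way to sanity-check such a conversion path, independent of the compiler-specific ggml code, is a round-trip test over representative values. This sketch uses Python's struct binary16 support as the reference conversion; it is not ggml's ggml_compute_fp16_to_fp32 / ggml_compute_fp32_to_fp16 implementation:

```python
import math
import struct

def f32_to_f16_bits(x: float) -> int:
    # Reference conversion via IEEE 754 binary16 ('e' format, Python 3.6+).
    return struct.unpack("<H", struct.pack("<e", x))[0]

def f16_bits_to_f32(h: int) -> float:
    return struct.unpack("<e", struct.pack("<H", h))[0]

# Values a correct converter must not turn into NaN or garbage:
# zeros, ordinary values, the half-precision maximum, and a subnormal.
for x in [0.0, -0.0, 1.0, -2.5, 65504.0, 6.1e-5, 1e-7]:
    y = f16_bits_to_f32(f32_to_f16_bits(x))
    assert math.isfinite(y), f"round-trip of {x} produced a non-finite value"
    # Half precision carries ~3 decimal digits; allow a coarse tolerance.
    assert abs(y - x) <= max(1e-3 * abs(x), 1e-4), (x, y)

# NaN must survive the round trip as NaN, not become a normal number.
assert math.isnan(f16_bits_to_f32(f32_to_f16_bits(float("nan"))))
```

Running the same loop against the suspect compiler's conversion would quickly show whether specific inputs come back as NaN.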

reeselevine (Collaborator, Author) commented Aug 20, 2025

@slaren thanks for the pointer to where things might be going wrong.

I guess my main question is whether the potential conversion failure needs to be debugged before this PR can be merged, or whether that can be delayed. I'm sure I could do some debugging, but floating point conversion is not my area of expertise 😅. The error is not very consistent either: it occurs on different specific tests, and my guess is that certain seeds for the random initialization trigger it.

To get this PR merged without debugging, what I would do is disable test-backend-ops on the WebGPU/Metal backend CI for now and file a bug. Note that currently the existing Metal backend CIs don't even run test-backend-ops matrix multiplication due to a lack of simdgroup support in the CI runners (https://github.com/ggml-org/llama.cpp/actions/runs/17088302845/job/48456894670?pr=15440#step:6:26375).* The WebGPU implementation is also tested in the CI on a Linux/Vulkan backend, where the conversion seems more stable.

* Another issue is that as the WebGPU implementation becomes more optimized, we will start using subgroup/simdgroup operations as well, in which case the CI won't be able to run the WebGPU backend either.

slaren (Member) commented Aug 20, 2025

I think I found the issue, I'll open a PR soon that should fix it.

While looking into it, I found that these statements cause buffer overflows:

device_ctx.device_desc = std::string(info.description.data);
GGML_LOG_INFO(
    "ggml_webgpu: adapter_info: vendor_id: %u | vendor: %s | architecture: %s | device_id: %u | name: %s | "
    "device_desc: %s\n",
    info.vendorID,
    info.vendor.data,
    info.architecture.data,
    info.deviceID,
    info.device.data,
    info.description.data);

This is likely because these strings are not null-terminated, and to convert them to std::string you should do something like this instead:

    device_ctx.device_desc = std::string(info.description.data, info.description.data + info.description.length);
    device_ctx.device_desc = info.description; // alternatively, rely on the conversion to std::string_view

You should be able to reproduce this by building with address sanitizer with the flag -DLLAMA_SANITIZE_ADDRESS=ON.

slaren (Member) commented Aug 20, 2025

The tests fail because ggml_backend_webgpu_buffer_set_tensor does something wrong with the last 1-3 bytes of the tensor, which results in corrupted tensors. There are also issues with the buffer allocation: when running the ops, the sizes of the tensors are padded to the alignment size, but the buffer size itself is not padded. This can result in a buffer overflow.
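Since WGSL storage buffers are typically declared as arrays of u32, a byte-granular memset has to read-modify-write the word containing the last 1-3 bytes instead of writing a whole u32. This Python sketch models that byte-masking logic; it is an illustration of the idea, not the PR's actual shader:

```python
def memset_bytes(words: list, offset: int, size: int, value: int) -> None:
    """Set `size` bytes starting at byte `offset` to `value` inside a
    buffer stored as little-endian u32 words, masking within each word
    so bytes outside the range are left untouched."""
    for i in range(offset, offset + size):
        word_idx, byte_idx = divmod(i, 4)
        shift = 8 * byte_idx
        mask = 0xFF << shift
        words[word_idx] = (words[word_idx] & ~mask) | ((value & 0xFF) << shift)

buf = [0xFFFFFFFF] * 2            # 8 bytes, all 0xFF
memset_bytes(buf, 0, 7, 0x00)     # zero only the first 7 bytes
assert buf[0] == 0x00000000
assert buf[1] == 0xFF000000       # final byte (byte 7) untouched
```

A shader that instead writes full u32 words for the tail clobbers, or loses, those trailing 1-3 bytes, which matches the corruption described above.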

Some lessons learned:

  • You cannot ignore the tests just because they are inconvenient
  • test_backend_ops should not trust the tensor type of the backend being tested

reeselevine (Collaborator, Author) commented Aug 21, 2025

@slaren thanks for the investigation! Sorry I didn't find the issues earlier; I think I was misled by the NaN appearing on the CPU side as well, and I'm still learning the codebase.

There was indeed a bug in the memset shader, I've verified that the updated code correctly sets the values of the last 1-3 bytes. I also updated the buffer allocation to align it to multiples of 4 bytes (WebGPU's binding size alignment), and fixed usages of wgpu::StringView to be converted to null-terminated strings when printed.
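The buffer-size fix described above amounts to the standard round-up-to-alignment bit trick; a minimal sketch, with an illustrative helper name:

```python
def pad_to_alignment(size: int, align: int = 4) -> int:
    """Round `size` up to the next multiple of `align` (a power of two),
    so binding ranges padded to the alignment never overrun the buffer."""
    assert align > 0 and (align & (align - 1)) == 0, "align must be a power of two"
    return (size + align - 1) & ~(align - 1)

assert pad_to_alignment(5) == 8       # 5 bytes -> next multiple of 4
assert pad_to_alignment(8) == 8       # already aligned, unchanged
assert pad_to_alignment(13, 16) == 16
```

Allocating `pad_to_alignment(nbytes)` instead of `nbytes` makes the buffer size consistent with the padded sizes used when binding tensors for the ops.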

There's also still the issue that once the WebGPU code starts to use more optimized matrix multiplication, the CI as set up now won't be able to run the more optimized code. But I'm glad it's catching bugs as the implementation progresses!

ggerganov (Member) left a comment

To improve the CI we can add a WebGPU node to ggml-ci.

reeselevine merged commit 4536363 into ggml-org:master on Aug 22, 2025, with 50 checks passed.
qnixsynapse pushed a commit to menloresearch/llama.cpp that referenced this pull request on Aug 25, 2025, with the following commit messages:
* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Work on templating for different types in shaders

* Work on shader type generation

* Working q4_0 mul_mat and some templating for different types

* Add q4_0_f16 matmul and fix device init

* Add matmul support for basic quantization types

* Add q2_k and q3_k quantization

* Add rest of k-quants

* Get first i-quant working

* Closer to supporting all i-quants

* Support rest of i-quants

* Cleanup code

* Fix python formatting

* debug

* Bugfix for memset

* Add padding to end of buffers on creation

* Simplify bit-shifting

* Update usage of StringView