ggml WebGPU: add support for quantization types #15440
Conversation
The NaN present in a …
@slaren thanks for the pointer to where things might be going wrong. I guess my main question is whether debugging the potential conversion failure needs to happen before this PR is merged, or whether it can be delayed. I'm sure I could do some debugging, but floating point conversion is not my area of expertise 😅. The error is not very consistent either; it seems to occur on different specific tests, so my guess is that some seeds for the random initialization trigger it (see the sketch after this comment).

* To get this PR merged without debugging, what I would do is disable …
* Another issue is that as the WebGPU implementation becomes more optimized, we will start using subgroup/simdgroup operations as well, in which case the CI won't be able to run the WebGPU backend either.
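As an aside on why these mismatches show up as hard failures: below is a minimal sketch (my own illustration, not code from `test-backend-ops`) showing that a `NaN` in either output poisons an element-wise error metric, because `NaN` never compares equal and propagates through sums.

```cpp
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // Pretend these are the CPU reference output and the WebGPU backend output.
    std::vector<float> cpu = {1.0f, std::nanf(""), 3.0f};
    std::vector<float> gpu = {1.0f, std::nanf(""), 3.0f};

    bool   any_nan = false;
    double err     = 0.0;
    for (std::size_t i = 0; i < cpu.size(); ++i) {
        any_nan = any_nan || std::isnan(cpu[i]) || std::isnan(gpu[i]);
        err    += (cpu[i] - gpu[i]) * (cpu[i] - gpu[i]);  // NaN - NaN is NaN
    }

    // Prints "any_nan=1 err=nan": even though both sides produced "the same"
    // NaN, any error metric computed over it is NaN, so the comparison fails.
    std::cout << "any_nan=" << any_nan << " err=" << err << "\n";
}
```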
I think I found the issue; I'll open a PR soon that should fix it. While looking into it, I found that these statements cause buffer overflows (llama.cpp/ggml/src/ggml-webgpu/ggml-webgpu.cpp, lines 1241 to 1251 at commit 6552e2e).
This is likely because these strings are not null-terminated, so converting them to `std::string` requires passing the length explicitly, e.g.:

```cpp
device_ctx.device_desc = std::string(info.description.data, info.description.data + info.description.length);
device_ctx.device_desc = info.description; // alternatively, rely on the conversion to std::string_view
```

You should be able to reproduce this by building with the address sanitizer flag enabled.
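To make the failure mode concrete, here is a small standalone sketch (the struct and names are my own stand-ins, not Dawn's API or the PR's code): constructing a `std::string` from a `{data, length}` view without the length scans for a terminating `'\0'` that may not exist, while the length-based constructors stay within bounds.

```cpp
#include <cstddef>
#include <iostream>
#include <string>

// Stand-in for a WebGPU-style string view: `data` is NOT guaranteed to be
// null-terminated, so only `length` bytes are valid.
struct StringViewSketch {
    const char * data;
    std::size_t  length;
};

int main() {
    const char backing[] = {'D', 'a', 'w', 'n'};  // no trailing '\0'
    StringViewSketch desc = { backing, sizeof(backing) };

    // BUG: std::string(desc.data) keeps reading until it finds a '\0',
    // which can run past the buffer (this is what ASan reports).
    // std::string bad(desc.data);

    // OK: pass the length (or the end pointer) explicitly.
    std::string ok1(desc.data, desc.length);
    std::string ok2(desc.data, desc.data + desc.length);

    std::cout << ok1 << " " << ok2 << "\n";  // prints "Dawn Dawn"
}
```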
The reason the tests fail is because …

Some lessons learned: …
@slaren thanks for the investigation! Sorry I didn't find the issues earlier; I think I was misled by the …. There was indeed a bug in the ….

There's also another issue still, which is that once the WebGPU code starts to use more optimized matrix multiplication, it seems like the CI as set up now won't be able to run the more optimized code. But I'm glad it's catching bugs as the implementation progresses!
To improve the CI we can add a WebGPU node to ggml-ci.
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get first i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
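The commits above mention getting q4_0 mul_mat working and simplifying the bit-shifting; as a rough illustration of the kind of per-block unpacking the quantized matmul shaders have to do, here is a hedged C++ sketch of q4_0 dequantization (my own approximation of ggml's layout, not code from the PR; the real block stores its scale as fp16):

```cpp
#include <cstdint>
#include <cstdio>

// Simplified q4_0 block: one scale plus 32 4-bit quants packed two per byte.
struct block_q4_0_sketch {
    float        d;       // per-block scale (fp16 in ggml, float here for simplicity)
    std::uint8_t qs[16];  // 32 x 4-bit quants
};

// Unpack the low nibbles into out[0..15] and the high nibbles into out[16..31],
// mapping each 4-bit value from [0, 15] to [-8, 7] before scaling.
void dequantize_q4_0_sketch(const block_q4_0_sketch & b, float out[32]) {
    for (int j = 0; j < 16; ++j) {
        const int x0 = (b.qs[j] & 0x0F) - 8;  // low nibble
        const int x1 = (b.qs[j] >> 4)   - 8;  // high nibble
        out[j]      = x0 * b.d;
        out[j + 16] = x1 * b.d;
    }
}

int main() {
    block_q4_0_sketch b = { 0.5f, {} };
    b.qs[0] = 0x1F;  // low nibble 0xF -> +7, high nibble 0x1 -> -7
    float out[32];
    dequantize_q4_0_sketch(b, out);
    std::printf("%.1f %.1f\n", out[0], out[16]);  // prints "3.5 -3.5"
}
```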
This PR adds support for basic matrix multiplication using many of the quantization types supported by ggml. Due to alignment requirements for WebGPU buffers (see the padding sketch below), I have not yet implemented support for `mxfp4`, and WebGPU does not support `bf16` natively. Otherwise, all types that `test-backend-ops` covers are supported. Some types, such as `Q8_1`, are not tested by `test-backend-ops`, so I didn't enable them in `supports_op`, and I didn't implement other types that aren't tested yet. Also, not all multiplications involving `fp16` are supported by the CPU backend, so I didn't enable those either.

There is also an issue where some tests produce `NaN` on both the CPU and WebGPU side, for example https://github.com/reeselevine/llama.cpp/actions/runs/17083427966/job/48442585422#step:7:26905. I'm not sure what the best solution to this is; I noticed that the Metal `test-backend-ops` CI doesn't actually seem to run any `MUL_MAT` operation tests at all.

Finally, I moved `device_init` to `reg_get_device`, as I noticed this was necessary so that some objects are initialized by the time they are actually used.
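On the buffer alignment mentioned above (and the "Add padding to end of buffers on creation" commit): a minimal sketch, assuming a 4-byte granularity since WebGPU requires copy and storage-binding sizes to be multiples of 4; the helper name and the exact alignment constant used in the PR are my assumptions.

```cpp
#include <cstddef>

// Round a byte size up to the next multiple of `align` (a power of two).
constexpr std::size_t webgpu_pad_size(std::size_t size, std::size_t align = 4) {
    return (size + align - 1) & ~(align - 1);
}

static_assert(webgpu_pad_size(16) == 16, "already a multiple of 4");
static_assert(webgpu_pad_size(18) == 20, "an 18-byte quantized block rounds up to 20");
```

Padding the allocation at creation time is one way to keep copies and bindings valid for quantized tensors whose byte size is not already a multiple of the required granularity.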