
Conversation

@reeselevine
Collaborator

@reeselevine reeselevine commented Sep 15, 2025

Continuing to add support for more operations. My current plan is to add support for enough operations to run some popular models (open for feedback on which models to focus on), and then go back and write more efficient code for the more important operations and dequantization.

In this PR:

  • Work on the shader generation process to handle different operations that have similar structures, e.g., binary-head.tmpl for ADD and MUL. Note that I'm kinda rolling my own preprocessing through embed-wgsl.py, since WGSL doesn't have any preprocessor support (a rough sketch of this kind of substitution is included after this list).
  • Support for most quantization types for GET_ROWS. A lot of the dequantization code is duplicated in this PR, although I have refactored the i-quant grids so that they are all in one place. I plan to refactor the dequantization code once more efficient dequantization (i.e., not a whole block per thread) is implemented, since the structure might change quite a bit; the second sketch after this list illustrates the per-block work each thread currently does.
  • As mentioned by @ngxson in ggml : add WebGPU backend #7773, some operations require in-place versions, which is slightly more complex in WebGPU since bindings can't alias. Right now, I do this by having two separate template files, e.g., mul.tmpl.wgsl and mul_in_place.tmpl.wgsl. While the code is basically the same, I want to note that test-backend-ops does not currently have support for testing the in-place versions. Happy to brainstorm what it would take to add support for testing them.
  • Metal backends might dynamically change the allowed threadgroup size based on shader complexity. For example, on the macOS Github action runner, using the maxComputeInvocationsPerWorkgroup or maxComputeWorkgroupSizeX reported by WebGPU (1024) causes the quantized versions of the GET_ROWS tests to fail, and Metal shader validation reported that the max threadgroup size was 704. Since the limits in WebGPU are static, they can't be queried on a per-shader basis. To allow the CI to pass, I have for now hardcoded the workgroup size to 288, and I've also opened an issue with WebGPU (Handling dynamic Metal maxThreadsPerThreadgroup gpuweb/gpuweb#5315) to discuss possible solutions.
  • I also ran into an issue where, on the Vulkan LLVMpipe Github action runner, a loop failed to execute more than 65535 iterations (whereas it needed to execute 78600 iterations). I believe this is just an issue with the LLVMpipe backend. To fix it, I ended up implementing vectorized f32 GET_ROWS, which is more efficient anyway, but it's something to note going forward.
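
As a rough illustration of the first bullet, here is a minimal, hypothetical sketch (in Python, standing in for what embed-wgsl.py might do; the template syntax, placeholder names, and file names here are made up, not the actual script) of the kind of placeholder substitution that lets one shared binary-op body be specialized into ADD and MUL shaders, since WGSL itself has no preprocessor:

```python
from pathlib import Path

# Shared skeleton for elementwise binary ops; {{...}} placeholders are
# filled in per operation. This stands in for something like binary-head.tmpl.
BINARY_TEMPLATE = """\
@compute @workgroup_size({{WG_SIZE}})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x < params.ne) {
        dst[gid.x] = {{OP_EXPR}};
    }
}
"""

# Per-operation expressions spliced into the shared body.
OPS = {
    "add": "src0[gid.x] + src1[gid.x]",
    "mul": "src0[gid.x] * src1[gid.x]",
}

def expand(template: str, repls: dict[str, str]) -> str:
    """Replace each {{NAME}} placeholder with its per-op snippet."""
    for key, value in repls.items():
        template = template.replace("{{" + key + "}}", value)
    return template

def generate(out_dir: Path, wg_size: int = 256) -> None:
    """Write one specialized .wgsl file per operation."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, expr in OPS.items():
        shader = expand(BINARY_TEMPLATE, {"OP_EXPR": expr, "WG_SIZE": str(wg_size)})
        (out_dir / f"{name}.wgsl").write_text(shader)

if __name__ == "__main__":
    generate(Path("generated"))
```

The same substitution step could in principle emit the in-place variants (e.g., mul_in_place) from the same body by binding src0 as read_write and writing back into it instead of dst, which is why the in-place template files end up nearly identical to the regular ones.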
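And for the GET_ROWS dequantization bullet, this is the q4_0 block math that a single thread currently handles in the "whole block per thread" approach, written out in Python purely as reference math (the actual WGSL shaders differ; the block layout follows ggml's q4_0: a float16 scale followed by 16 bytes of packed 4-bit quants):

```python
import struct

QK4_0 = 32  # weights per q4_0 block

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte q4_0 block: d * (nibble - 8) for all 32 weights."""
    assert len(block) == 2 + QK4_0 // 2
    (d,) = struct.unpack("<e", block[:2])  # little-endian float16 scale
    qs = block[2:]
    out = [0.0] * QK4_0
    for i in range(QK4_0 // 2):
        out[i]              = d * ((qs[i] & 0x0F) - 8)  # low nibble -> first half
        out[i + QK4_0 // 2] = d * ((qs[i] >> 4)  - 8)   # high nibble -> second half
    return out
```

Moving away from one block per thread (e.g., having several threads cooperate on a block, or vectorizing the nibble unpacking) would change how this loop is laid out, which is why the duplicated dequantization code is being left alone until that refactor.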

@github-actions github-actions bot added python python script changes ggml changes relating to the ggml tensor library for machine learning labels Sep 15, 2025
@ggerganov
Member

While the code is basically the same, I want to note that test-backend-ops does not currently have support for testing the in-place versions. Happy to brainstorm what it would take to add support for testing them.

I guess one option would be to add a single test that uses an explicit inplace operation (for example ggml_mul_inplace) and to implement inplace support in the WebGPU backend in such a way that it is generically reused across all operations that need it. The assumption is that if it works for one op and the same mechanism for aliasing the dst tensor is reused for all other ops, this should be enough to cover it and would avoid adding inplace tests for all other ops.

@github-actions github-actions bot added the testing Everything test related label Sep 17, 2025
@reeselevine
Collaborator Author

@ggerganov yeah that makes sense. For now I realized it's easy to add a single in-place test per operation, which should add minimal time to test-backend-ops. If that works for now, I can think about other ideas going forward.

@reeselevine reeselevine merged commit d304f45 into ggml-org:master Sep 17, 2025
50 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
…-org#16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas <[email protected]>