
Conversation

@reeselevine
Collaborator

@reeselevine reeselevine commented Sep 15, 2025

Continuing to add support for more operations. My current plan is to add support for enough operations to run some popular models (open for feedback on which models to focus on), and then go back and write more efficient code for the more important operations and dequantization.

In this PR:

  • Work on the shader generation process to handle different operations that have similar structures, e.g., binary-head.tmpl for ADD and MUL. Note that I'm kinda rolling my own preprocessing through embed-wgsl.py, since WGSL doesn't have any preprocessor support (a rough sketch of this kind of substitution is included after this list).
  • Support for most quantization types for GET_ROWS. A lot of the dequantization code is duplicated in this PR, although I have refactored the i-quant grids so that they are all in one place. I plan to refactor the dequantization code once more efficient dequantization (i.e., not a whole block per thread) is implemented, since the structure might change quite a bit; the second sketch after this list illustrates the per-block work each thread currently does.
  • As mentioned by @ngxson in ggml : add WebGPU backend #7773, some operations require in-place versions, which is slightly more complex in WebGPU since bindings can't alias. Right now, I do this by having two separate template files, e.g., mul.tmpl.wgsl and mul_in_place.tmpl.wgsl. While the code is basically the same, I want to note that test-backend-ops does not currently have support for testing the in-place versions. Happy to brainstorm what it would take to add support for testing them.
  • Metal backends might dynamically change the allowed threadgroup size based on shader complexity. For example, on the macOS Github action runner, using the maxComputeInvocationsPerWorkgroup or maxComputeWorkgroupSizeX reported by WebGPU (1024) causes the quantized versions of the GET_ROWS tests to fail, and Metal shader validation reported that the max threadgroup size was 704. Since the limits in WebGPU are static, they can't be queried on a per-shader basis. To allow the CI to pass, I have for now hardcoded the workgroup size to 288, and I've also opened an issue with WebGPU (Handling dynamic Metal maxThreadsPerThreadgroup gpuweb/gpuweb#5315) to discuss possible solutions.
  • I also ran into an issue where, on the Vulkan LLVMpipe Github action runner, a loop failed to execute more than 65535 iterations (whereas it needed to execute 78600 iterations). I believe this is just an issue with the LLVMpipe backend. To fix it, I ended up implementing vectorized f32 GET_ROWS, which is more efficient anyway, but it's something to note going forward.
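
As a rough illustration of the first bullet, here is a minimal, hypothetical sketch (in Python, standing in for what embed-wgsl.py might do; the template syntax, placeholder names, and file names here are made up, not the actual script) of the kind of placeholder substitution that lets one shared binary-op body be specialized into ADD and MUL shaders, since WGSL itself has no preprocessor:

```python
from pathlib import Path

# Shared skeleton for elementwise binary ops; {{...}} placeholders are
# filled in per operation. This stands in for something like binary-head.tmpl.
BINARY_TEMPLATE = """\
@compute @workgroup_size({{WG_SIZE}})
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x < params.ne) {
        dst[gid.x] = {{OP_EXPR}};
    }
}
"""

# Per-operation expressions spliced into the shared body.
OPS = {
    "add": "src0[gid.x] + src1[gid.x]",
    "mul": "src0[gid.x] * src1[gid.x]",
}

def expand(template: str, repls: dict[str, str]) -> str:
    """Replace each {{NAME}} placeholder with its per-op snippet."""
    for key, value in repls.items():
        template = template.replace("{{" + key + "}}", value)
    return template

def generate(out_dir: Path, wg_size: int = 256) -> None:
    """Write one specialized .wgsl file per operation."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, expr in OPS.items():
        shader = expand(BINARY_TEMPLATE, {"OP_EXPR": expr, "WG_SIZE": str(wg_size)})
        (out_dir / f"{name}.wgsl").write_text(shader)

if __name__ == "__main__":
    generate(Path("generated"))
```

The same substitution step could in principle emit the in-place variants (e.g., mul_in_place) from the same body by binding src0 as read_write and writing back into it instead of dst, which is why the in-place template files end up nearly identical to the regular ones.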
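And for the GET_ROWS dequantization bullet, this is the q4_0 block math that a single thread currently handles in the "whole block per thread" approach, written out in Python purely as reference math (the actual WGSL shaders differ; the block layout follows ggml's q4_0: a float16 scale followed by 16 bytes of packed 4-bit quants):

```python
import struct

QK4_0 = 32  # weights per q4_0 block

def dequantize_q4_0_block(block: bytes) -> list[float]:
    """Dequantize one 18-byte q4_0 block: d * (nibble - 8) for all 32 weights."""
    assert len(block) == 2 + QK4_0 // 2
    (d,) = struct.unpack("<e", block[:2])  # little-endian float16 scale
    qs = block[2:]
    out = [0.0] * QK4_0
    for i in range(QK4_0 // 2):
        out[i]              = d * ((qs[i] & 0x0F) - 8)  # low nibble -> first half
        out[i + QK4_0 // 2] = d * ((qs[i] >> 4)  - 8)   # high nibble -> second half
    return out
```

Moving away from one block per thread (e.g., having several threads cooperate on a block, or vectorizing the nibble unpacking) would change how this loop is laid out, which is why the duplicated dequantization code is being left alone until that refactor.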

@github-actions github-actions bot added python python script changes ggml changes relating to the ggml tensor library for machine learning labels Sep 15, 2025
@ggerganov
Member

While the code is basically the same, I want to note that test-backend-ops does not currently have support for testing the in-place versions. Happy to brainstorm what it would take to add support for testing them.

I guess one option would be to add a single test that uses an explicit inplace operation (for example ggml_mul_inplace) and to implement inplace support in the WebGPU backend in such a way that it is generically reused across all operations that need it. The assumption is that if it works for one op and the same mechanism for aliasing the dst tensor is reused for all other ops, this should be enough to cover it and would avoid adding inplace tests for all other ops.

@github-actions github-actions bot added the testing Everything test related label Sep 17, 2025
@reeselevine
Collaborator Author

@ggerganov yeah that makes sense. For now I realized it's easy to add a single in-place test per operation, which should add minimal time to test-backend-ops. If that works for now, I can think about other ideas going forward.

@reeselevine reeselevine merged commit d304f45 into ggml-org:master Sep 17, 2025
50 checks passed
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
…-org#16018)

* Add parameter buffer pool, batching of submissions, refactor command building/submission

* Add header for linux builds

* Free staged parameter buffers at once

* Format with clang-format

* Fix thread-safe implementation

* Use device implicit synchronization

* Update workflow to use custom release

* Remove testing branch workflow

* some f32 tests passing

* Disable set_rows until it's implemented

* f32 add all tests passing

* Begin work on set_rows

* Work on set rows

* Add error buffers for reporting unsupported SET_ROWS indices

* Remove extra comments

* Add templated addition, clean up code

* Get addition and multiplication working

* Implement rms_norm

* Add get_rows implementation

* Add new get_rows files

* Refactor use of wg size entry

* Fix compilation

* Try manually unrolled q4_0 quant

* Revert "Try manually unrolled q4_0 quant"

This reverts commit 77f8b96.

* Move to constant max wg size

* Check for tensor size in supports_op

* Vectorize f32 and change default workgroup size

* Move f32 get_rows from < 4 to % 4 != 0

* fix linter errors

* Add in-place tests

---------

Co-authored-by: Neha Abbas <[email protected]>