metal: optimise GGML_OP_SUM
#16559
Conversation
This implementation assumes that the input is contiguous:

llama.cpp/ggml/src/ggml-metal/ggml-metal-device.m Lines 659 to 664 in 9cc51d3

We should either update the requirement or support non-contiguous input. This is also wrong on master. Would need to add non-contiguous tests in `test-backend-ops`.
Likely to match the performance of `SUM_ROWS`.
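For context, a minimal sketch of the kind of contiguity requirement referenced above, assuming the public ggml API (the wrapper function is illustrative, not the actual `ggml-metal-device.m` code):

```cpp
#include "ggml.h"

// Illustrative stand-in for the kind of requirement referenced above: a sum
// kernel that indexes src[0] as a flat array must reject permuted/strided
// views, i.e. anything that is not fully contiguous in memory.
static bool metal_supports_sum(const struct ggml_tensor * op) {
    return ggml_is_contiguous(op->src[0]);
}
```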
This is a temporary fix for now, but I assume it's possible to sum non-contiguous inputs as well.
Yes, we have to pass the strides and use them in the kernel. Note that we are mainly interested in the "contiguous rows" case. The completely non-contiguous case is rare, and for now there is no need to support it.
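A minimal sketch of what such a stride-aware "contiguous rows" variant could look like, assuming one threadgroup per row, at most one 32-wide SIMD group per threadgroup, and illustrative argument names following ggml's `ne*`/`nb*` conventions (not the actual ggml-metal code):

```cpp
#include <metal_stdlib>
using namespace metal;

// Sketch: one threadgroup per row. Rows are internally contiguous, so only
// the per-row base pointer needs the byte stride (nb01, in ggml's nb*
// convention). Assumes ntg <= 32 so a single simd_sum covers the whole
// threadgroup; a full version would also reduce across SIMD groups.
kernel void kernel_sum_rows_strided(
        device const char  * src0 [[buffer(0)]],
        device       float * dst  [[buffer(1)]],
        constant   int64_t & ne00 [[buffer(2)]],   // elements per row
        constant  uint64_t & nb01 [[buffer(3)]],   // byte stride between rows
        uint   ig    [[threadgroup_position_in_grid]],
        ushort tpitg [[thread_position_in_threadgroup]],
        ushort ntg   [[threads_per_threadgroup]]) {
    device const float * row = (device const float *) (src0 + ig*nb01);

    float sumf = 0.0f;
    for (int64_t i = tpitg; i < ne00; i += ntg) {
        sumf += row[i];
    }

    sumf = simd_sum(sumf);
    if (tpitg == 0) {
        dst[ig] = sumf;
    }
}
```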
@cern1710 Could you update the CUDA backend accordingly? The CI is currently failing: https://github.com/ggml-org/llama.cpp/actions/runs/18467009826/job/52636768181?pr=16559#step:3:10730
Hi @ggerganov, thanks for letting me know, and yes, I've noticed the CI failing as well. I can't patch it myself at the moment, so it would be good if someone else could update the CUDA backend.
* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM

---------

Co-authored-by: Georgi Gerganov <[email protected]>
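A minimal sketch of how a non-contiguous input of the kind added by the "non-contiguous tests by permuting the input" commit can be constructed with the ggml API (the shapes and the helper are illustrative, not the actual test-backend-ops change):

```cpp
#include "ggml.h"

// Hypothetical helper: builds GGML_OP_SUM over a permuted view of a tensor.
// ggml_permute swaps the dimensions' strides without moving any data, so the
// view is no longer contiguous and exercises the non-contiguous code path.
static struct ggml_tensor * build_noncontig_sum(struct ggml_context * ctx) {
    struct ggml_tensor * a = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 10, 5, 4);
    struct ggml_tensor * v = ggml_permute(ctx, a, 1, 0, 2, 3); // swap dims 0 and 1
    return ggml_sum(ctx, v);
}
```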
* origin/master:
  Add server-driven parameter defaults and syncing (ggml-org#16515)
  metal: optimise `GGML_OP_SUM` (ggml-org#16559)
  server : fix img token logs (ggml-org#16595)
  llama-quant: add support for mmproj (ggml-org#16592)
  CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
  server : fix mtmd checkpoints (ggml-org#16591)
  metal : avoid using Metal's gpuAddress property (ggml-org#16576)
  vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
  CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
  vulkan: Support FA with K/V in F32 (ggml-org#16543)
  vulkan: Improve build time for MSVC (ggml-org#16545)
  CUDA: enable FA for FP32 KV cache (ggml-org#16546)
  CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
  CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
  cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
  server : dynamic token limit for prompt cache (ggml-org#16560)
This PR optimises the `GGML_OP_SUM` operation, as implemented in #16539. The original implementation performed the sum op on a single thread, roughly as sketched below.
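An illustrative reconstruction of that single-threaded pattern (the argument layout is an assumption, not the exact original kernel):

```cpp
#include <metal_stdlib>
using namespace metal;

// Illustrative reconstruction: the whole reduction runs on a single thread,
// so the rest of the GPU sits idle while one lane walks the entire tensor.
kernel void kernel_sum_naive(
        device const float * src0 [[buffer(0)]],
        device       float * dst  [[buffer(1)]],
        constant   int64_t & ne   [[buffer(2)]]) {  // total number of elements
    float sumf = 0.0f;
    for (int64_t i = 0; i < ne; ++i) {
        sumf += src0[i];
    }
    dst[0] = sumf;
}
```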
This was reflected in the `./test-backend-ops perf -o SUM` log.

Implementation
To fix this, I've modified the host-side code to launch `nth` threads in one threadgroup, so that each thread sums a strided chunk of the input (similar to `op_sum_rows`):

- `simd_sum(sumf)` is used to perform a partial sum within each SIMD group
- `simd_sum(v)` is then used to do a reduction across SIMD groups (only SIMD group 0 is involved here)
- the `nth` value is calculated similarly to `op_sum_rows`

A sketch of this two-level reduction is shown below.
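A minimal sketch of the two-level reduction, assuming 32-wide SIMD groups (as on Apple GPUs) and a threadgroup size that is a multiple of 32; the names and argument layout are illustrative, not the exact kernel from this PR:

```cpp
#include <metal_stdlib>
using namespace metal;

kernel void kernel_sum_sketch(
        device const float * src0  [[buffer(0)]],
        device       float * dst   [[buffer(1)]],
        constant   int64_t & ne    [[buffer(2)]],       // total element count
        threadgroup  float * shmem [[threadgroup(0)]],  // one slot per SIMD group
        ushort tpitg [[thread_position_in_threadgroup]],
        ushort ntg   [[threads_per_threadgroup]],
        ushort sgitg [[simdgroup_index_in_threadgroup]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    // each thread accumulates a strided chunk of the input
    float sumf = 0.0f;
    for (int64_t i = tpitg; i < ne; i += ntg) {
        sumf += src0[i];
    }

    // partial sum within each SIMD group
    sumf = simd_sum(sumf);
    if (tiisg == 0) {
        shmem[sgitg] = sumf;
    }

    threadgroup_barrier(mem_flags::mem_threadgroup);

    // final reduction across SIMD groups, done by SIMD group 0 only
    // (assumes ntg is a multiple of the 32-wide SIMD group)
    if (sgitg == 0) {
        float v = tiisg < ntg/32 ? shmem[tiisg] : 0.0f;
        v = simd_sum(v);
        if (tiisg == 0) {
            dst[0] = v;
        }
    }
}
```

Restricting the final reduction to SIMD group 0 keeps the kernel to a single threadgroup barrier.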
Re-running `./test-backend-ops perf -o SUM` confirms the improvement. Note that this is still quite a bit slower than `SUM_ROWS`, so there may be room for improvement in the current kernel implementation.