Conversation

@cern1710 (Contributor) commented Oct 12, 2025

Part of #14909:

This PR adds a Metal backend implementation of the OPT_STEP_ADAMW operator.

Implementation

  • Added the kernel_opt_step_adamw_f32 kernel in ggml-metal.metal (a rough sketch of its shape is shown below)
  • Added an argument struct, ggml_metal_kargs_opt_step_adamw, for the kernel parameters
  • The operator parameters and np are stored in the struct and passed to the kernel as constant arguments
  • The threadgroup size is calculated similarly to ggml_metal_op_upscale
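For illustration, this is roughly the shape of such a kernel (sketch only: the exact struct fields, buffer bindings, and parameter layout in the diff may differ; the update follows the usual AdamW form used by the ggml CPU backend):

// Sketch only: assumes the <metal_stdlib> preamble of ggml-metal.metal.
// The layout of the params buffer is an assumption, not the merged code.
typedef struct {
    int64_t np; // number of parameter elements
} ggml_metal_kargs_opt_step_adamw;

kernel void kernel_opt_step_adamw_f32(
        constant ggml_metal_kargs_opt_step_adamw & args,
        device       float * x,    // parameters
        device const float * g,    // gradients
        device       float * g_m,  // first moment
        device       float * g_v,  // second moment
        device const float * pars, // alpha, beta1, beta2, eps, wd, beta1h, beta2h
        uint gid [[thread_position_in_grid]]) {
    if (gid >= (uint) args.np) {
        return; // bounds check for the last threadgroup
    }

    const float alpha  = pars[0];
    const float beta1  = pars[1];
    const float beta2  = pars[2];
    const float eps    = pars[3];
    const float wd     = pars[4];
    const float beta1h = pars[5];
    const float beta2h = pars[6];

    const float gi  = g[gid];
    const float gmi = g_m[gid]*beta1 + gi*(1.0f - beta1);
    const float gvi = g_v[gid]*beta2 + gi*gi*(1.0f - beta2);

    g_m[gid] = gmi;
    g_v[gid] = gvi;

    const float mh = gmi*beta1h;
    const float vh = sqrt(gvi*beta2h) + eps;

    x[gid] = x[gid]*(1.0f - alpha*wd) - mh/vh;
}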

I haven't written test cases for this operator, as one already exists in test-backend-ops.cpp.

Additional changes

  • Because test-opt was failing, the GGML_OP_SUM operator is also implemented in Metal (currently unoptimised; a naive sketch is shown below)
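A naive, single-threaded version of such a sum kernel can look roughly like this (sketch only; the kernel name and bindings are illustrative rather than exactly what was merged):

// Sketch: serial sum by a single thread, correctness only (no simdgroup reduction yet).
kernel void kernel_sum_f32(
        constant int64_t   & ne,   // total number of elements in src0
        device const float * src0,
        device       float * dst,
        uint tpitg [[thread_position_in_threadgroup]]) {
    if (tpitg != 0) {
        return;
    }

    float sum = 0.0f;
    for (int64_t i = 0; i < ne; ++i) {
        sum += src0[i];
    }

    dst[0] = sum;
}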

@cern1710 requested a review from ggerganov as a code owner October 12, 2025 08:56
@github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels Oct 12, 2025
@cern1710 (Contributor, Author)

I've tried commenting out most of the code inside ggml_metal_op_opt_step_adamw() (L3410 - L3440) as well as the entire kernel, which did not resolve the CI failure. Below I've reproduced the same CI error on my own machine by running ./test-opt:

pre-allocated tensor (loss_sum) in a buffer (Metal) that cannot run the operation (SUM)
(lldb) process attach --pid 84566
Process 84566 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x000000018342d42c libsystem_kernel.dylib`__wait4 + 8
libsystem_kernel.dylib`__wait4:
->  0x18342d42c <+8>:  b.lo   0x18342d44c    ; <+40>
    0x18342d430 <+12>: pacibsp 
    0x18342d434 <+16>: stp    x29, x30, [sp, #-0x10]!
    0x18342d438 <+20>: mov    x29, sp
Executable module set to "/Users/samuel/Desktop/Personal/llama.cpp/build/bin/test-opt".
Architecture set to: arm64-apple-macosx-.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x000000018342d42c libsystem_kernel.dylib`__wait4 + 8
    frame #1: 0x000000010115146c libggml-base.dylib`ggml_abort + 156
    frame #2: 0x000000010116b724 libggml-base.dylib`ggml_backend_sched_backend_id_from_cur(ggml_backend_sched*, ggml_tensor*) + 896
    frame #3: 0x0000000101169304 libggml-base.dylib`ggml_backend_sched_split_graph + 444
    frame #4: 0x000000010116c2f0 libggml-base.dylib`ggml_backend_sched_alloc_graph + 96
    frame #5: 0x0000000101170d68 libggml-base.dylib`ggml_opt_alloc + 872
    frame #6: 0x0000000100f3397c test-opt`test_backend(ggml_backend_sched*, ggml_backend*, ggml_opt_optimizer_type) + 1772
    frame #7: 0x0000000100f32ff4 test-opt`main + 1300
    frame #8: 0x00000001830a9d54 dyld`start + 7184
(lldb) quit
[1]    84566 abort      ./test-opt

The culprit is this line, but I'm unsure whether it is related to my kernel or is a separate issue.

GGML_ABORT("pre-allocated tensor (%s) in a buffer (%s) that cannot run the operation (%s)", tensor->name, ggml_backend_buffer_name(buffer), ggml_op_name(tensor->op));

@ggerganov (Member)

We need to implement the GGML_OP_SUM operator in Metal as well.

@cern1710 (Contributor, Author)

I see, thanks for the follow-up. Would you like me to implement GGML_OP_SUM for Metal as part of this PR, or should that be done in a follow-up one?

@ggerganov (Member)

Yes, it should be in this PR so that the CI becomes green.

@cern1710 changed the title from "Metal: add opt_step_adamw" to "Metal: add opt_step_adamw and op_sum" Oct 12, 2025
@cern1710 (Contributor, Author) commented Oct 12, 2025

I have a working GGML_OP_SUM implementation for Metal that is very simple but fixes the CI abort.
Would you prefer that I continue optimising it, or leave it as is for now?

P.S. For an optimised implementation, would we be looking at reducing each SIMD group's lanes with simd_sum(), then using a shared-memory reduction to combine the per-simdgroup partials?
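Roughly along these lines (just a sketch of the idea, assuming F32 input, a 1D dispatch, and the 32-wide simdgroups of Apple GPUs; names are illustrative):

// Two-level reduction sketch: simd_sum() within each simdgroup, then a
// threadgroup-memory pass to combine the per-simdgroup partials.
kernel void kernel_sum_f32_simd(
        constant int64_t   & ne,
        device const float * src0,
        device       float * dst,
        threadgroup  float * shmem [[threadgroup(0)]],
        uint   tpitg [[thread_position_in_threadgroup]],
        uint   ntg   [[threads_per_threadgroup]],
        ushort sgitg [[simdgroup_index_in_threadgroup]],
        ushort tiisg [[thread_index_in_simdgroup]]) {
    // each thread accumulates a strided partial sum
    float sum = 0.0f;
    for (int64_t i = tpitg; i < ne; i += ntg) {
        sum += src0[i];
    }

    // reduce the 32 lanes of this simdgroup
    sum = simd_sum(sum);

    // lane 0 of each simdgroup publishes its partial
    if (tiisg == 0) {
        shmem[sgitg] = sum;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // the first simdgroup combines the partials and writes the result
    if (sgitg == 0) {
        float v = tiisg < (ntg + 31)/32 ? shmem[tiisg] : 0.0f;
        v = simd_sum(v);
        if (tiisg == 0) {
            dst[0] = v;
        }
    }
}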

@ggerganov (Member)

Yes, we can optimize it in a separate PR. For now just correctness is OK.

P.S. For an optimised implementation, would we be looking at reducing each SIMD group's lanes with simd_sum(), then using a shared-memory reduction to combine the per-simdgroup partials?

Yes, exactly.

@cern1710 (Contributor, Author) commented Oct 12, 2025

Right, this change fixed CI / ggml-ci-mac-metal, but somehow CI / macOS-latest-cmake-arm64 is still failing.

Is it possible that tying SUM to has_simdgroup_reduction is too restrictive?

        case GGML_OP_SUM:
        case GGML_OP_SUM_ROWS:
        case GGML_OP_MEAN:
        case GGML_OP_SOFT_MAX:
        case GGML_OP_GROUP_NORM:
            return has_simdgroup_reduction && ggml_is_contiguous_rows(op->src[0]);

I think it might be better to do something like this for now:

        case GGML_OP_SUM:
            return ggml_is_contiguous(op->src[0]) && op->src[0]->type == GGML_TYPE_F32;

@ggerganov (Member)

It might be better to add has_simdgroup_reduction to OPT_STEP_ADAMW so that both ops end up unsupported on the CI's virtualized device. This way GGML_OP_SUM keeps the simdgroup-reduction requirement, which we will need later for the optimization.
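In other words, something along these lines in the device-support switch (sketch only; the surrounding context may differ):

        case GGML_OP_OPT_STEP_ADAMW:
            // require simdgroup reduction as well, so the op is reported as
            // unsupported on the CI's virtualized device, same as GGML_OP_SUM
            return has_simdgroup_reduction;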

@cern1710 (Contributor, Author) commented Oct 12, 2025

Yes, that makes sense. I've made that change in the newest commit.

I also think the previous commit already passed the previously failing CI job.

@ggerganov merged commit a31cf36 into ggml-org:master Oct 12, 2025
70 checks passed
@cern1710 deleted the metal-opt-step-adamw branch October 12, 2025 18:46
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
* scaffold to support opt step adamw on metal (not written so far)

* add opt-step-adamw kernel for metal

* pass op->src[4] as a separate buffer to the pipeline

* add bounds check to opt-step-adamw kernel

* complete scaffold for GGML_OP_SUM

* naive GGML_OP_SUM kernel

* remove unwanted comment

* change OP_SUM capability gate

* Add has_simdgroup_reduction to both ops to pass CI