
Conversation

@ggerganov (Member) commented Sep 9, 2025

cont #15832

Opening an alternative implementation to avoid disturbing the ongoing review in #15832. This supports devices with discrete GPUs, while for Apple Silicon with unified memory we continue to allocate the buffers in host memory.

The main reason for the latter is that the CPU -> GPU copies have significant overhead for text generation (measured up to a 30% slowdown for some smaller models). The main overhead comes from the creation of the MTLCommandQueue and from [MTLCommandBuffer waitUntilCompleted]. Most of it can be eliminated by using a global MTLCommandQueue per device, but there is still a 1-2% slowdown and the implementation starts to deviate a bit from the design of the backend context. I've kept this implementation in a branch for reference: https://github.com/ggml-org/llama.cpp/tree/gg/metal-async-save-global-queue

Edit: Switched to a global MTLCommandQueue and added support for both Shared and Private buffer types. The latter is selected based on device support. See #15906 (comment).

Outdated

Keeping the memory buffers host-allocated (as we do on master) even results in a small tg performance bump when the backend is async. In the other PR, I think I was thrown off this approach by incorrectly advertising both buffer_type and buffer_from_ptr_type as is_host == true, whereas the correct approach is to advertise only buffer_from_ptr_type as is_host == true.

This is the relevant change:

[screenshot of the relevant diff]
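Since the screenshot is not reproduced here, the gist of the change is roughly the following (a sketch with approximate names, not the actual diff):

```objc
// default Metal buffer type: not advertised as host
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return false;
}

// buffer type for buffers wrapping an existing host pointer (e.g. mmap-ed weights):
// advertised as host, since the data genuinely lives in host memory
static bool ggml_backend_metal_buffer_from_ptr_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return true;
}
```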

With this implementation, the op offloading now works correctly, even when the allocated buffers are in host memory.

Here is a sample perf comparison:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2551.69 | 2575.80 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2494.55 | 2531.78 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2463.06 | 2505.31 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2420.55 | 2475.32 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2344.58 | 2407.00 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2213.52 | 2031.77 | 1.02 |
| gemma3 4B Q4_0 | tg64 | 123.36 | 124.56 | 1.01 |
| gemma3 4B Q4_0 | tg64@d512 | 122.02 | 122.82 | 1.01 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.61 | 117.05 | 1.00 |
| gemma3 4B Q4_0 | tg64@d2048 | 112.28 | 115.66 | 1.03 |
| gemma3 4B Q4_0 | tg64@d4096 | 111.70 | 115.28 | 1.01 |
| gemma3 4B Q4_0 | tg64@d8192 | 111.49 | 111.87 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2321.57 | 2354.81 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2276.46 | 2307.69 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2236.99 | 2269.06 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2147.54 | 2185.87 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1989.62 | 2040.79 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1738.36 | 1791.70 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.65 | 120.79 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.83 | 120.09 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.38 | 118.61 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 112.91 | 116.27 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 110.12 | 113.08 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.12 | 107.48 | 1.00 |

Using private GPU memory can be forced on Apple Silicon with GGML_METAL_PRIVATE_BUFFERS=1. This runs the less optimal implementation that creates a new MTLCommandQueue for each operation. The performance drops like this:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2554.71 | 2538.31 | 0.99 |
| gemma3 4B Q4_0 | pp2048@d512 | 2497.62 | 2491.58 | 1.00 |
| gemma3 4B Q4_0 | tg64 | 124.07 | 94.34 | 0.76 |
| gemma3 4B Q4_0 | tg64@d512 | 122.42 | 93.55 | 0.76 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2323.16 | 2324.01 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2268.54 | 2272.60 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.29 | 91.80 | 0.76 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.41 | 91.65 | 0.77 |

Private buffers will be enabled automatically for discrete GPUs.

@slaren (Member) commented Sep 9, 2025

Creating a different MTLCommandQueue for non-async ops is definitely overkill. The CUDA backend uses the special stream cudaStreamPerThread for these operations, which should be equivalent to having a queue per thread, but even a single queue per device should be good enough (assuming there are no thread-safety issues).
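In Metal terms, the single-queue idea could look roughly like this (a minimal sketch assuming a single device; the function name is made up):

```objc
#import <Metal/Metal.h>

// one MTLCommandQueue, created lazily and reused for the synchronous (non-async)
// tensor operations; MTLCommandQueue itself is thread-safe, so commands can be
// submitted from multiple threads without extra locking
static id<MTLCommandQueue> ggml_metal_global_queue(id<MTLDevice> device) {
    static id<MTLCommandQueue> queue = nil;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        queue = [device newCommandQueue];
    });
    return queue;
}
```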

@slaren (Member) commented Sep 9, 2025

> I think the correct approach is to advertise only buffer_from_ptr_type as is_host == true.

This will work because these buffers are only used to store weights, so there aren't any data races to worry about. But the fundamental issue is that the scheduler does not keep track of the dependencies between splits that are stored in compatible buffer types that do not require a copy. The scheduler needs to know if a split depends on outputs from a previous split and, if so, synchronize it.

OTOH, weights allocated in a Metal buffer should never be accessed from the CPU, so there is no benefit to tagging them as a host buffer either. Only tensors allocated in a compute buffer may benefit from being accessible by the CPU backend.

@ggerganov (Member, Author) commented:

Regarding the MTLCommandQueue, it should be thread-safe, so a single queue per device should be good. I'm just not sure it is worth enabling the "private buffer" path (with a single queue) for Apple Silicon, as it has a small performance hit and I don't think there are any benefits.

Regarding the is_host setting of the buffer types - IIUC, the default buffer_type should remain with is_host == false. The question is what to use for buffer_from_ptr_type. Do you mean that announcing it as "host" could potentially cause problems in some cases and it's better to set it to false too?

The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels on Sep 9, 2025.
@slaren (Member) commented Sep 9, 2025

> Do you mean that announcing it as "host" could potentially cause problems in some cases and it's better to set it to false too?

Not in llama.cpp as far as I can tell, since these buffers are only used for weights. But if someone used it to allocate a mutable tensor, say, a KV cache, and then this tensor is used in the next split by the CPU backend, the data race would still be there.

@slaren (Member) commented Sep 9, 2025

> Regarding the MTLCommandQueue, it should be thread-safe, so a single queue per device should be good. I'm just not sure it is worth enabling the "private buffer" path (with a single queue) for Apple Silicon, as it has a small performance hit and I don't think there are any benefits.

I don't think there is any benefit, other than keeping the code simpler with a single path. I am not sure where the performance hit comes from; I suppose a memcpy is faster than setting up a command buffer and launching a copy operation? If the buffer is made non-host, or ggml_backend_sched is patched as I suggested before, the data race issue would go away.

@ggerganov (Member, Author) commented:

> I am not sure where the performance hit comes from; I suppose a memcpy is faster than setting up a command buffer and launching a copy operation?

Hm, I just tried removing the extraneous enqueue and waitUntilScheduled calls from the "private buffers + global queue" branch as you noted earlier, and it seems the performance now matches the current implementation in this PR. Let me update this and do more detailed timing - there might not be any performance hit after all.

@ggerganov (Member, Author) commented Sep 9, 2025

Without the last commit e65c53e, the performance looks like this:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2555.19 | 2569.00 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2497.13 | 2520.72 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2469.35 | 2497.16 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2428.99 | 2464.25 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2362.31 | 2406.41 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2229.66 | 2290.82 | 1.03 |
| gemma3 4B Q4_0 | tg64 | 123.83 | 123.44 | 1.00 |
| gemma3 4B Q4_0 | tg64@d512 | 122.32 | 121.33 | 0.99 |
| gemma3 4B Q4_0 | tg64@d1024 | 117.09 | 115.77 | 0.99 |
| gemma3 4B Q4_0 | tg64@d2048 | 116.14 | 113.75 | 0.98 |
| gemma3 4B Q4_0 | tg64@d4096 | 115.26 | 112.44 | 0.98 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.27 | 108.52 | 0.97 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2339.53 | 2327.16 | 0.99 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2291.12 | 2309.62 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2244.38 | 2267.99 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2151.09 | 2183.18 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1999.84 | 2037.05 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1745.39 | 1787.78 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg64 | 121.43 | 119.94 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 120.09 | 118.21 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 119.02 | 117.09 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 116.03 | 114.41 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.72 | 111.15 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.40 | 104.32 | 0.97 |

There seems to be some degradation in TG with increasing context, likely because we now effectively perform two copies of the data in set_tensor.
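For context, here is one plausible shape of the two-copy vs. one-copy set_tensor path for a private destination buffer. This is a hypothetical sketch with illustrative names (device, queue, buf_dst, data, size, offs), not the actual code from these commits:

```objc
// path A (two copies): memcpy into a shared staging buffer, then blit into the private buffer
id<MTLBuffer> staging = [device newBufferWithLength:size options:MTLResourceStorageModeShared];
memcpy(staging.contents, data, size); // copy 1: host -> staging

// path B (one copy): wrap the caller's pointer directly and blit from it
// (newBufferWithBytesNoCopy requires page-aligned pointer/length, hence extra care is needed)
//id<MTLBuffer> staging = [device newBufferWithBytesNoCopy:(void *) data
//                                                  length:size_aligned
//                                                 options:MTLResourceStorageModeShared
//                                             deallocator:nil];

id<MTLCommandBuffer>      cmd = [queue commandBuffer];
id<MTLBlitCommandEncoder> enc = [cmd blitCommandEncoder];

[enc copyFromBuffer:staging sourceOffset:0
           toBuffer:buf_dst destinationOffset:offs size:size]; // copy 2: staging -> private
[enc endEncoding];

[cmd commit];
[cmd waitUntilCompleted]; // required for the blocking set_tensor semantics
```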

With e65c53e, we remove one of the copies and the performance recovers:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2503.35 | 2580.39 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d512 | 2498.42 | 2537.26 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2468.24 | 2516.98 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2431.50 | 2487.06 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2361.52 | 2442.99 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2227.24 | 2373.64 | 1.07 |
| gemma3 4B Q4_0 | tg64 | 124.48 | 123.12 | 0.99 |
| gemma3 4B Q4_0 | tg64@d512 | 122.89 | 121.57 | 0.99 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.91 | 116.45 | 1.00 |
| gemma3 4B Q4_0 | tg64@d2048 | 115.52 | 114.79 | 0.99 |
| gemma3 4B Q4_0 | tg64@d4096 | 115.21 | 115.44 | 1.00 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.18 | 113.20 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2339.56 | 2355.45 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2290.19 | 2314.41 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2240.93 | 2284.23 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2149.74 | 2214.77 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1998.52 | 2082.67 | 1.04 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1745.39 | 1887.92 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg64 | 121.26 | 120.18 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 120.95 | 118.73 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.84 | 118.18 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 116.08 | 115.03 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.83 | 112.51 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.10 | 107.25 | 1.00 |

But the code is no longer thread-safe (test-backend-ops crashes). Adding waitUntilCompleted at the end of set_tensor avoids the race, but the performance tanks significantly (more than 10%).

Wondering if we should accept the performance hit. Will continue tomorrow.

@slaren (Member) commented Sep 9, 2025

> Adding waitUntilCompleted at the end of set_tensor avoids the race, but the performance tanks significantly (more than 10%).

That's what the non-async functions are expected to do, but if the overhead of creating a command buffer is so high, I don't think this is an acceptable solution. If you want to support discrete GPUs, I think it would be better to create two different buffer types, and select the one to use depending on the device type.
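For example, something along these lines (a sketch; the accessor names for the two buffer types are made up):

```objc
// pick the default buffer type based on whether the device has unified memory
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_for_device(id<MTLDevice> device) {
    if (device.hasUnifiedMemory) {
        return ggml_backend_metal_buffer_type_shared();  // Apple Silicon: host-visible buffers
    }
    return ggml_backend_metal_buffer_type_private();     // discrete GPUs: GPU-only buffers
}
```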

Comment on lines 5952 to 5981
// note: for experimentation purposes, here we use a semaphore to wait for the copy to complete
//       this is an alternative to waitUntilCompleted, which should be faster, but it doesn't seem to make much difference
dispatch_semaphore_t completion_semaphore = dispatch_semaphore_create(0);

id<MTLCommandQueue>  queue   = ctx->queue;
id<MTLCommandBuffer> cmd_buf = [queue commandBufferWithUnretainedReferences];

{
    id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];

    [encoder copyFromBuffer:buf_src
               sourceOffset:0
                   toBuffer:buf_dst
          destinationOffset:buf_dst_offset
                       size:size];

    [encoder endEncoding];
}

[cmd_buf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // TODO: can check for errors here
    GGML_UNUSED(cb);

    dispatch_semaphore_signal(completion_semaphore);
}];

[cmd_buf commit];

dispatch_semaphore_wait(completion_semaphore, DISPATCH_TIME_FOREVER);
//[cmd_buf waitUntilCompleted];
@ggerganov (Member, Author) replied:
I found some docs online about using a semaphore + command buffer completion handler for better performance than waitUntilCompleted. Unfortunately, in this case it does not make a significant difference.

@ggerganov (Member, Author) commented Sep 10, 2025

In the latest commits, I've made the following changes:

  • The Metal backend now supports 3 buffer types (see the sketch after this list):
    • Shared (same as the default buffer_type on master; uses host memory)
    • Private (new, for dGPU support)
    • Mapped (same as buffer_from_ptr_type on master)
  • Simplified the synchronization logic as discussed in metal : make the backend async (#15832). Still have small concerns about the safety of the MTLHeaps in the memory pools, but it might be OK after all. In any case, removing the memory pools as discussed would be the best solution - added TODOs.
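Roughly, the three buffer types map to the following Metal allocations (an illustrative sketch; the actual implementation is in ggml-metal.m):

```objc
// Shared: host-visible memory, same role as the default buffer type on master
id<MTLBuffer> shared = [device newBufferWithLength:size
                                           options:MTLResourceStorageModeShared];

// Private: GPU-only memory, used for discrete GPU support
id<MTLBuffer> priv   = [device newBufferWithLength:size
                                           options:MTLResourceStorageModePrivate];

// Mapped: wraps an existing host allocation (e.g. an mmap-ed model file), same role
// as buffer_from_ptr_type on master; ptr and size must be page-aligned
id<MTLBuffer> mapped = [device newBufferWithBytesNoCopy:ptr
                                                 length:size
                                                options:MTLResourceStorageModeShared
                                            deallocator:nil];
```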

Perf is good. @slaren PTAL

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2558.03 | 2578.52 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2500.19 | 2533.51 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2470.93 | 2515.47 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2427.07 | 2478.36 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2361.99 | 2428.36 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2232.90 | 2313.88 | 1.04 |
| gemma3 4B Q4_0 | tg64 | 124.30 | 125.27 | 1.01 |
| gemma3 4B Q4_0 | tg64@d512 | 122.49 | 123.44 | 1.01 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.82 | 117.62 | 1.01 |
| gemma3 4B Q4_0 | tg64@d2048 | 115.66 | 115.97 | 1.00 |
| gemma3 4B Q4_0 | tg64@d4096 | 114.98 | 115.71 | 1.01 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.14 | 112.17 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2299.63 | 2312.23 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2275.59 | 2312.47 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2237.89 | 2275.07 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2146.30 | 2192.57 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1998.90 | 2045.94 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1743.98 | 1799.40 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.83 | 121.12 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.46 | 120.14 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.60 | 118.68 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 115.75 | 116.26 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.65 | 113.15 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.20 | 108.03 | 1.01 |

@slaren (Member) reviewed:

It might be a bit clearer if there were different functions for the private and shared buffers, but other than that it looks OK.

Some possible improvements:

  • When using a discrete GPU, I believe the "shared" buffers are essentially the same as what we call host buffers in CUDA or Vulkan
  • Thus, when using a discrete GPU, the shared buffer type could be used as the host buffer type
  • To do so, you would need to implement get_host_buffer_type and update the caps in ggml_backend_metal_device_get_props to set host_buffer (a rough sketch follows this list)
  • Additionally, for discrete GPUs mapped buffers don't really make sense; they could be disabled by returning NULL from ggml_backend_metal_device_buffer_mapped and setting buffer_from_host_ptr to false in ggml_backend_metal_device_get_props. This would prevent using a system memory buffer when mmap is enabled.
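A rough sketch of the suggested change against the ggml device interface (the props fields exist in ggml-backend.h; the helper for the shared type is hypothetical):

```c
static ggml_backend_buffer_type_t ggml_backend_metal_device_get_host_buffer_type(ggml_backend_dev_t dev) {
    GGML_UNUSED(dev);
    // on a discrete GPU, shared buffers behave like pinned host memory
    return ggml_backend_metal_buffer_type_shared();
}

static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    // ... existing name/description/memory fields ...
    props->caps.host_buffer          = true;  // advertise the host buffer type
    props->caps.buffer_from_host_ptr = false; // disable mapped buffers on discrete GPUs
}
```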

Comment on lines 6029 to 6036
struct ggml_backend_metal_buffer_context * ctx = (struct ggml_backend_metal_buffer_context *)buffer->context;

if (ctx->is_shared) {
    memcpy(dst->data, src->data, ggml_nbytes(src));
} else {
    // TODO: implement
    return false;
}
@slaren (Member) commented:

It's not necessary to implement this, ggml_backend_tensor_copy already calls ggml_backend_tensor_set if the source tensor is in a host buffer.

@slaren (Member) commented Sep 10, 2025

> when using a discrete GPU, the shared buffer type could be used as the host buffer type

For this to work as expected, it should return true from the is_host function, and the supports_buft function would need to return false. Without having a system to test that everything works as expected, this is probably too risky, though.
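In other words, something like this (a hypothetical sketch of the combination described above, not something implemented in this PR):

```c
static bool ggml_backend_metal_buffer_type_shared_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return true; // shared buffers are CPU-accessible, like pinned host memory
}

static bool ggml_backend_metal_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(dev);
    // do not claim the shared (host) type as a compute buffer type, so the scheduler
    // copies data into the backend's own buffers instead of scheduling ops on it directly
    return buft == ggml_backend_metal_buffer_type_private(); // hypothetical accessor
}
```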

@ggerganov (Member, Author) commented:

I'll see if my old Intel MacBook still works and try to implement the discrete GPU recommendations in a follow-up PR.

@ggerganov merged commit 0f0a3c2 into master on Sep 10, 2025 (54 of 55 checks passed).
@ggerganov deleted the gg/metal-async-v2 branch on September 10, 2025 at 14:52.