
Conversation

@ggerganov (Member) commented Sep 9, 2025

cont #15832

Opening an alternative implementation to avoid disturbing the ongoing review in #15832. This supports devices with discrete GPUs, while for Apple Silicon with unified memory we continue to allocate the buffers in host memory.

The main reason for the latter is that the CPU -> GPU copies have significant overhead for text generation (measured up to a 30% slowdown for some smaller models). The main overhead comes from the creation of the MTLCommandQueue and from [MTLCommandBuffer waitUntilCompleted]. Most of it can be eliminated by using a global MTLCommandQueue per device, but there is still a 1-2% slowdown and the implementation starts to deviate a bit from the design of the backend context. I've kept this implementation in a branch for reference: https://github.com/ggml-org/llama.cpp/tree/gg/metal-async-save-global-queue

Edit: Switched to a global MTLCommandQueue and added support for both Shared and Private buffer types. The latter is selected based on device support. See #15906 (comment).

Outdated

Keeping the memory buffers host-allocated (as we do on master) even results in a small tg performance bump when the backend is async. In the other PR, I think I was thrown off this approach by incorrectly advertising both buffer_type and buffer_from_ptr_type as is_host == true, whereas the correct approach is to advertise only buffer_from_ptr_type as is_host == true.

This is the relevant change:

[screenshot of the relevant diff]
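Since the screenshot is not reproduced here, the gist of the change is roughly the following (a sketch with approximate names, not the actual diff):

```objc
// default Metal buffer type: not advertised as host
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return false;
}

// buffer type for buffers wrapping an existing host pointer (e.g. mmap-ed weights):
// advertised as host, since the data genuinely lives in host memory
static bool ggml_backend_metal_buffer_from_ptr_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return true;
}
```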

With this implementation, the op offloading now works correctly, even when the allocated buffers are in host memory.

Here is a sample perf comparison:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2551.69 | 2575.80 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2494.55 | 2531.78 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2463.06 | 2505.31 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2420.55 | 2475.32 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2344.58 | 2407.00 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2213.52 | 2031.77 | 1.02 |
| gemma3 4B Q4_0 | tg64 | 123.36 | 124.56 | 1.01 |
| gemma3 4B Q4_0 | tg64@d512 | 122.02 | 122.82 | 1.01 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.61 | 117.05 | 1.00 |
| gemma3 4B Q4_0 | tg64@d2048 | 112.28 | 115.66 | 1.03 |
| gemma3 4B Q4_0 | tg64@d4096 | 111.70 | 115.28 | 1.01 |
| gemma3 4B Q4_0 | tg64@d8192 | 111.49 | 111.87 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2321.57 | 2354.81 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2276.46 | 2307.69 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2236.99 | 2269.06 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2147.54 | 2185.87 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1989.62 | 2040.79 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1738.36 | 1791.70 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.65 | 120.79 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.83 | 120.09 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.38 | 118.61 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 112.91 | 116.27 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 110.12 | 113.08 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.12 | 107.48 | 1.00 |

Using private GPU memory can be forced on Apple Silicon with GGML_METAL_PRIVATE_BUFFERS=1. This runs the less optimal implementation that creates a new MTLCommandQueue for each operation. The performance drops like this:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2554.71 | 2538.31 | 0.99 |
| gemma3 4B Q4_0 | pp2048@d512 | 2497.62 | 2491.58 | 1.00 |
| gemma3 4B Q4_0 | tg64 | 124.07 | 94.34 | 0.76 |
| gemma3 4B Q4_0 | tg64@d512 | 122.42 | 93.55 | 0.76 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2323.16 | 2324.01 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2268.54 | 2272.60 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.29 | 91.80 | 0.76 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.41 | 91.65 | 0.77 |

Private buffers will be enabled automatically for discrete GPUs.

@slaren (Member) commented Sep 9, 2025

Creating a different MTLCommandQueue for non-async ops is definitely overkill. The CUDA backend uses the special stream cudaStreamPerThread for these operations, which should be equivalent to having a queue per thread, but even a single queue per device should be good enough (assuming there are no thread-safety issues).
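In Metal terms, the single-queue idea could look roughly like this (a minimal sketch assuming a single device; the function name is made up):

```objc
#import <Metal/Metal.h>

// one MTLCommandQueue, created lazily and reused for the synchronous (non-async)
// tensor operations; MTLCommandQueue itself is thread-safe, so commands can be
// submitted from multiple threads without extra locking
static id<MTLCommandQueue> ggml_metal_global_queue(id<MTLDevice> device) {
    static id<MTLCommandQueue> queue = nil;
    static dispatch_once_t once;
    dispatch_once(&once, ^{
        queue = [device newCommandQueue];
    });
    return queue;
}
```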

@slaren (Member) commented Sep 9, 2025

> I think the correct approach is to advertise only buffer_from_ptr_type as is_host == true.

This will work because these buffers are only used to store weights, so there aren't any data races to worry about. But the fundamental issue is that the scheduler does not keep track of the dependencies between splits that are stored in compatible buffer types that do not require a copy. The scheduler needs to know if a split depends on outputs from a previous split and, if so, synchronize it.

OTOH, weights allocated in a Metal buffer should never be accessed from the CPU, so there is no benefit to tagging them as a host buffer either. Only tensors allocated in a compute buffer may benefit from being accessible by the CPU backend.

@ggerganov (Member, Author) commented:

Regarding the MTLCommandQueue, it should be thread-safe, so a single queue per device should be good. I'm just not sure it is worth enabling the "private buffer" path (with a single queue) for Apple Silicon, as it has a small performance hit and I don't think there are any benefits.

Regarding the is_host setting of the buffer types - IIUC, the default buffer_type should remain with is_host == false. The question is what to use for buffer_from_ptr_type. Do you mean that announcing it as "host" could potentially cause problems in some cases and it's better to set it to false too?

The github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Apple Metal (https://en.wikipedia.org/wiki/Metal_(API)) labels on Sep 9, 2025.
@slaren (Member) commented Sep 9, 2025

> Do you mean that announcing it as "host" could potentially cause problems in some cases and it's better to set it to false too?

Not in llama.cpp as far as I can tell, since these buffers are only used for weights. But if someone used it to allocate a mutable tensor, say, a KV cache, and then this tensor is used in the next split by the CPU backend, the data race would still be there.

@slaren (Member) commented Sep 9, 2025

> Regarding the MTLCommandQueue, it should be thread-safe, so a single queue per device should be good. I'm just not sure it is worth enabling the "private buffer" path (with a single queue) for Apple Silicon, as it has a small performance hit and I don't think there are any benefits.

I don't think there is any benefit, other than keeping the code simpler with a single path. I am not sure where the performance hit comes from; I suppose a memcpy is faster than setting up a command buffer and launching a copy operation? If the buffer is made non-host, or ggml_backend_sched is patched as I suggested before, the data race issue would go away.

@ggerganov (Member, Author) commented:

> I am not sure where the performance hit comes from; I suppose a memcpy is faster than setting up a command buffer and launching a copy operation?

Hm, I just tried removing the extraneous enqueue and waitUntilScheduled calls from the "private buffers + global queue" branch as you noted earlier, and it seems the performance now matches the current implementation in this PR. Let me update this and do more detailed timing - there might not be any performance hit after all.

@ggerganov (Member, Author) commented Sep 9, 2025

Without the last commit e65c53e, the performance looks like this:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2555.19 | 2569.00 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2497.13 | 2520.72 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2469.35 | 2497.16 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2428.99 | 2464.25 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2362.31 | 2406.41 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2229.66 | 2290.82 | 1.03 |
| gemma3 4B Q4_0 | tg64 | 123.83 | 123.44 | 1.00 |
| gemma3 4B Q4_0 | tg64@d512 | 122.32 | 121.33 | 0.99 |
| gemma3 4B Q4_0 | tg64@d1024 | 117.09 | 115.77 | 0.99 |
| gemma3 4B Q4_0 | tg64@d2048 | 116.14 | 113.75 | 0.98 |
| gemma3 4B Q4_0 | tg64@d4096 | 115.26 | 112.44 | 0.98 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.27 | 108.52 | 0.97 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2339.53 | 2327.16 | 0.99 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2291.12 | 2309.62 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2244.38 | 2267.99 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2151.09 | 2183.18 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1999.84 | 2037.05 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1745.39 | 1787.78 | 1.02 |
| gpt-oss 20B MXFP4 MoE | tg64 | 121.43 | 119.94 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 120.09 | 118.21 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 119.02 | 117.09 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 116.03 | 114.41 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.72 | 111.15 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.40 | 104.32 | 0.97 |

There seems to be some degradation in TG with increasing context, likely because we now effectively perform two copies of the data in set_tensor.
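For context, here is one plausible shape of the two-copy vs. one-copy set_tensor path for a private destination buffer. This is a hypothetical sketch with illustrative names (device, queue, buf_dst, data, size, offs), not the actual code from these commits:

```objc
// path A (two copies): memcpy into a shared staging buffer, then blit into the private buffer
id<MTLBuffer> staging = [device newBufferWithLength:size options:MTLResourceStorageModeShared];
memcpy(staging.contents, data, size); // copy 1: host -> staging

// path B (one copy): wrap the caller's pointer directly and blit from it
// (newBufferWithBytesNoCopy requires page-aligned pointer/length, hence extra care is needed)
//id<MTLBuffer> staging = [device newBufferWithBytesNoCopy:(void *) data
//                                                  length:size_aligned
//                                                 options:MTLResourceStorageModeShared
//                                             deallocator:nil];

id<MTLCommandBuffer>      cmd = [queue commandBuffer];
id<MTLBlitCommandEncoder> enc = [cmd blitCommandEncoder];

[enc copyFromBuffer:staging sourceOffset:0
           toBuffer:buf_dst destinationOffset:offs size:size]; // copy 2: staging -> private
[enc endEncoding];

[cmd commit];
[cmd waitUntilCompleted]; // required for the blocking set_tensor semantics
```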

With e65c53e, we remove one of the copies and the performance recovers:

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2503.35 | 2580.39 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d512 | 2498.42 | 2537.26 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2468.24 | 2516.98 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2431.50 | 2487.06 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2361.52 | 2442.99 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2227.24 | 2373.64 | 1.07 |
| gemma3 4B Q4_0 | tg64 | 124.48 | 123.12 | 0.99 |
| gemma3 4B Q4_0 | tg64@d512 | 122.89 | 121.57 | 0.99 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.91 | 116.45 | 1.00 |
| gemma3 4B Q4_0 | tg64@d2048 | 115.52 | 114.79 | 0.99 |
| gemma3 4B Q4_0 | tg64@d4096 | 115.21 | 115.44 | 1.00 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.18 | 113.20 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2339.56 | 2355.45 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2290.19 | 2314.41 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2240.93 | 2284.23 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2149.74 | 2214.77 | 1.03 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1998.52 | 2082.67 | 1.04 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1745.39 | 1887.92 | 1.05 |
| gpt-oss 20B MXFP4 MoE | tg64 | 121.26 | 120.18 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 120.95 | 118.73 | 0.98 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.84 | 118.18 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 116.08 | 115.03 | 0.99 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.83 | 112.51 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.10 | 107.25 | 1.00 |

But the code is no longer thread-safe (test-backend-ops crashes). Adding waitUntilCompleted at the end of set_tensor avoids the race, but the performance tanks significantly (more than 10%).

Wondering if we should accept the performance hit. Will continue tomorrow.

@slaren (Member) commented Sep 9, 2025

> Adding waitUntilCompleted at the end of set_tensor avoids the race, but the performance tanks significantly (more than 10%).

That's what the non-async functions are expected to do, but if the overhead of creating a command buffer is so high, I don't think this is an acceptable solution. If you want to support discrete GPUs, I think it would be better to create two different buffer types, and select the one to use depending on the device type.
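For example, something along these lines (a sketch; the accessor names for the two buffer types are made up):

```objc
// pick the default buffer type based on whether the device has unified memory
static ggml_backend_buffer_type_t ggml_backend_metal_buffer_type_for_device(id<MTLDevice> device) {
    if (device.hasUnifiedMemory) {
        return ggml_backend_metal_buffer_type_shared();  // Apple Silicon: host-visible buffers
    }
    return ggml_backend_metal_buffer_type_private();     // discrete GPUs: GPU-only buffers
}
```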

Comment on lines 5952 to 5981
// note: for experimentation purposes, here we use a semaphore to wait for the copy to complete
//       this is an alternative to waitUntilCompleted, which should be faster, but it doesn't seem to make much difference
dispatch_semaphore_t completion_semaphore = dispatch_semaphore_create(0);

id<MTLCommandQueue>  queue   = ctx->queue;
id<MTLCommandBuffer> cmd_buf = [queue commandBufferWithUnretainedReferences];

{
    id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];

    [encoder copyFromBuffer:buf_src
               sourceOffset:0
                   toBuffer:buf_dst
          destinationOffset:buf_dst_offset
                       size:size];

    [encoder endEncoding];
}

[cmd_buf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // TODO: can check for errors here
    GGML_UNUSED(cb);

    dispatch_semaphore_signal(completion_semaphore);
}];

[cmd_buf commit];

dispatch_semaphore_wait(completion_semaphore, DISPATCH_TIME_FOREVER);
//[cmd_buf waitUntilCompleted];
@ggerganov (Member, Author) replied:
I found some docs online about using a semaphore + command buffer completion handler for better performance than waitUntilCompleted. Unfortunately, in this case it does not make a significant difference.

@ggerganov (Member, Author) commented Sep 10, 2025

In the latest commits, I've made the following changes:

  • The Metal backend now supports 3 buffer types (see the sketch after this list):
    • Shared (same as the default buffer_type on master; uses host memory)
    • Private (new, for dGPU support)
    • Mapped (same as buffer_from_ptr_type on master)
  • Simplified the synchronization logic as discussed in metal : make the backend async (#15832). Still have small concerns about the safety of the MTLHeaps in the memory pools, but it might be OK after all. In any case, removing the memory pools as discussed would be the best solution - added TODOs.
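Roughly, the three buffer types map to the following Metal allocations (an illustrative sketch; the actual implementation is in ggml-metal.m):

```objc
// Shared: host-visible memory, same role as the default buffer type on master
id<MTLBuffer> shared = [device newBufferWithLength:size
                                           options:MTLResourceStorageModeShared];

// Private: GPU-only memory, used for discrete GPU support
id<MTLBuffer> priv   = [device newBufferWithLength:size
                                           options:MTLResourceStorageModePrivate];

// Mapped: wraps an existing host allocation (e.g. an mmap-ed model file), same role
// as buffer_from_ptr_type on master; ptr and size must be page-aligned
id<MTLBuffer> mapped = [device newBufferWithBytesNoCopy:ptr
                                                 length:size
                                                options:MTLResourceStorageModeShared
                                            deallocator:nil];
```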

Perf is good. @slaren PTAL

| Model | Test | t/s (master) | t/s (gg/metal-async-v2) | Speedup |
| --- | --- | --- | --- | --- |
| gemma3 4B Q4_0 | pp2048 | 2558.03 | 2578.52 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d512 | 2500.19 | 2533.51 | 1.01 |
| gemma3 4B Q4_0 | pp2048@d1024 | 2470.93 | 2515.47 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d2048 | 2427.07 | 2478.36 | 1.02 |
| gemma3 4B Q4_0 | pp2048@d4096 | 2361.99 | 2428.36 | 1.03 |
| gemma3 4B Q4_0 | pp2048@d8192 | 2232.90 | 2313.88 | 1.04 |
| gemma3 4B Q4_0 | tg64 | 124.30 | 125.27 | 1.01 |
| gemma3 4B Q4_0 | tg64@d512 | 122.49 | 123.44 | 1.01 |
| gemma3 4B Q4_0 | tg64@d1024 | 116.82 | 117.62 | 1.01 |
| gemma3 4B Q4_0 | tg64@d2048 | 115.66 | 115.97 | 1.00 |
| gemma3 4B Q4_0 | tg64@d4096 | 114.98 | 115.71 | 1.01 |
| gemma3 4B Q4_0 | tg64@d8192 | 112.14 | 112.17 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2299.63 | 2312.23 | 1.01 |
| gpt-oss 20B MXFP4 MoE | pp2048@d512 | 2275.59 | 2312.47 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d1024 | 2237.89 | 2275.07 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d2048 | 2146.30 | 2192.57 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d4096 | 1998.90 | 2045.94 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048@d8192 | 1743.98 | 1799.40 | 1.03 |
| gpt-oss 20B MXFP4 MoE | tg64 | 120.83 | 121.12 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d512 | 119.46 | 120.14 | 1.01 |
| gpt-oss 20B MXFP4 MoE | tg64@d1024 | 118.60 | 118.68 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d2048 | 115.75 | 116.26 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d4096 | 112.65 | 113.15 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg64@d8192 | 107.20 | 108.03 | 1.01 |

@slaren (Member) reviewed:

It might be a bit clearer if there were different functions for the private and shared buffers, but other than that it looks OK.

Some possible improvements:

  • When using a discrete GPU, I believe the "shared" buffers are essentially the same as what we call host buffers in CUDA or Vulkan
  • Thus, when using a discrete GPU, the shared buffer type could be used as the host buffer type
  • To do so, you would need to implement get_host_buffer_type and update the caps in ggml_backend_metal_device_get_props to set host_buffer (a rough sketch follows this list)
  • Additionally, for discrete GPUs mapped buffers don't really make sense; they could be disabled by returning NULL from ggml_backend_metal_device_buffer_mapped and setting buffer_from_host_ptr to false in ggml_backend_metal_device_get_props. This would prevent using a system memory buffer when mmap is enabled.
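A rough sketch of the suggested change against the ggml device interface (the props fields exist in ggml-backend.h; the helper for the shared type is hypothetical):

```c
static ggml_backend_buffer_type_t ggml_backend_metal_device_get_host_buffer_type(ggml_backend_dev_t dev) {
    GGML_UNUSED(dev);
    // on a discrete GPU, shared buffers behave like pinned host memory
    return ggml_backend_metal_buffer_type_shared();
}

static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    // ... existing name/description/memory fields ...
    props->caps.host_buffer          = true;  // advertise the host buffer type
    props->caps.buffer_from_host_ptr = false; // disable mapped buffers on discrete GPUs
}
```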

Comment on lines 6029 to 6036
struct ggml_backend_metal_buffer_context * ctx = (struct ggml_backend_metal_buffer_context *)buffer->context;

if (ctx->is_shared) {
    memcpy(dst->data, src->data, ggml_nbytes(src));
} else {
    // TODO: implement
    return false;
}
@slaren (Member) commented:

It's not necessary to implement this, ggml_backend_tensor_copy already calls ggml_backend_tensor_set if the source tensor is in a host buffer.

@slaren (Member) commented Sep 10, 2025

> when using a discrete GPU, the shared buffer type could be used as the host buffer type

For this to work as expected, it should return true from the is_host function, and the supports_buft function would need to return false. Without having a system to test that everything works as expected, this is probably too risky, though.
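In other words, something like this (a hypothetical sketch of the combination described above, not something implemented in this PR):

```c
static bool ggml_backend_metal_buffer_type_shared_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return true; // shared buffers are CPU-accessible, like pinned host memory
}

static bool ggml_backend_metal_device_supports_buft(ggml_backend_dev_t dev, ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(dev);
    // do not claim the shared (host) type as a compute buffer type, so the scheduler
    // copies data into the backend's own buffers instead of scheduling ops on it directly
    return buft == ggml_backend_metal_buffer_type_private(); // hypothetical accessor
}
```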

@ggerganov (Member, Author) commented:

I'll see if my old Intel MacBook still works and try to implement the discrete GPU recommendations in a follow-up PR.

@ggerganov merged commit 0f0a3c2 into master on Sep 10, 2025 (54 of 55 checks passed).
@ggerganov deleted the gg/metal-async-v2 branch on September 10, 2025 at 14:52.