metal : make the backend async v2 #15906
Conversation
> Creating a different …
So this will work because these buffers are only used to store weights, so there aren't any data races to worry about. But the fundamental issue is that the scheduler does not keep track of the dependencies between splits that are stored in compatible buffer types that do not require a copy. The scheduler needs to know if a split depends on outputs from a previous split, and if so, synchronize it. OTOH, weights allocated in a Metal buffer should never be accessed from the CPU, so there is no benefit to tagging them as a host buffer either. Only tensors allocated in a compute buffer may benefit from being accessible by the CPU backend.
> Regarding the …
Not in llama.cpp as far as I can tell, since these buffers are only used for weights. But if someone used it to allocate a mutable tensor, say, a KV cache, and then this tensor is used in the next split by the CPU backend, the data race would still be there.
I don't think there is any benefit, other than keeping the code simpler with a single path. I am not sure where the performance hit is; I suppose a …
Hm, I just tried to remove the extraneous …

So without the last commit:

There seems to be some degradation in TG with increasing context, likely because we now effectively perform 2 copies of the data. With the last commit:

But the code is no longer thread-safe. Wondering if we should accept the performance hit. Will continue tomorrow.
That's what the non-async functions are expected to do, but if the overhead of creating a command buffer is so high, I don't think this is an acceptable solution. If you want to support discrete GPUs, I think it would be better to create two different buffer types, and select the one to use depending on the device type.
ggml/src/ggml-metal/ggml-metal.m (Outdated)
```objc
// note: for experimentation purposes, here we use a semaphore to wait for the copy to complete
//       this is an alternative to waitUntilCompleted, which should be faster, but doesn't seem to make much difference
dispatch_semaphore_t completion_semaphore = dispatch_semaphore_create(0);

id<MTLCommandQueue>  queue   = ctx->queue;
id<MTLCommandBuffer> cmd_buf = [queue commandBufferWithUnretainedReferences];

{
    id<MTLBlitCommandEncoder> encoder = [cmd_buf blitCommandEncoder];

    [encoder copyFromBuffer:buf_src
               sourceOffset:0
                   toBuffer:buf_dst
          destinationOffset:buf_dst_offset
                       size:size];

    [encoder endEncoding];
}

[cmd_buf addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    // TODO: can check for errors here
    GGML_UNUSED(cb);

    dispatch_semaphore_signal(completion_semaphore);
}];

[cmd_buf commit];

dispatch_semaphore_wait(completion_semaphore, DISPATCH_TIME_FOREVER);
//[cmd_buf waitUntilCompleted];
```
I found some docs online about using a semaphore + command buffer completion handler for better performance than `waitUntilCompleted`. Unfortunately, in this case it does not make a significant difference.
In the latest commits, I've made the following changes:
Perf is good. @slaren PTAL
It might be a bit more clear if there are different functions for the private and shared buffers, but other than that it looks ok.
Some possible improvements:
- When using a discrete GPU, I believe the "shared" buffers are essentially the same as what we call host buffers in CUDA or Vulkan
- Thus, when using a discrete GPU, the shared buffer type could be used as the host buffer type
- To do so, you would need to implement `get_host_buffer_type` and update the caps in `ggml_backend_metal_device_get_props` to set `host_buffer`
- Additionally, for discrete GPUs mapped buffers don't really make sense, and could be disabled by returning `NULL` from `ggml_backend_metal_device_buffer_mapped` and setting `buffer_from_host_ptr` to false in `ggml_backend_metal_device_get_props`. This would prevent using a system memory buffer when mmap is enabled.
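For illustration, a minimal sketch of what these two changes could look like. The `ggml_backend_metal_device_get_mtl_device` and `ggml_backend_metal_buffer_mapped_impl` helpers, and the exact signatures, are assumptions here rather than actual llama.cpp code:

```objc
// sketch only: caps for discrete vs unified-memory devices
static void ggml_backend_metal_device_get_props(ggml_backend_dev_t dev, struct ggml_backend_dev_props * props) {
    // ... name, description, memory and type are filled in as before ...

    id<MTLDevice> device = ggml_backend_metal_device_get_mtl_device(dev); // assumed helper

    const bool is_unified = device.hasUnifiedMemory;

    // on a discrete GPU the "shared" buffer type is effectively a host buffer
    props->caps.host_buffer          = !is_unified;
    // mmap-backed (mapped) buffers only make sense with unified memory
    props->caps.buffer_from_host_ptr =  is_unified;
}

static ggml_backend_buffer_t ggml_backend_metal_device_buffer_mapped(
        ggml_backend_dev_t dev, void * ptr, size_t size, size_t max_tensor_size) {
    id<MTLDevice> device = ggml_backend_metal_device_get_mtl_device(dev); // assumed helper

    if (!device.hasUnifiedMemory) {
        // discrete GPU: mapped buffers don't make sense, let the caller fall back
        return NULL;
    }

    // unified memory: keep the existing mapped-buffer path (assumed helper name)
    return ggml_backend_metal_buffer_mapped_impl(dev, ptr, size, max_tensor_size);
}
```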
ggml/src/ggml-metal/ggml-metal.m (Outdated)
```objc
struct ggml_backend_metal_buffer_context * ctx = (struct ggml_backend_metal_buffer_context *)buffer->context;

if (ctx->is_shared) {
    memcpy(dst->data, src->data, ggml_nbytes(src));
} else {
    // TODO: implement
    return false;
}
```
It's not necessary to implement this, `ggml_backend_tensor_copy` already calls `ggml_backend_tensor_set` if the source tensor is in a host buffer.

For this to work as expected, it should return true from the …
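For reference, the fallback in `ggml_backend_tensor_copy` works roughly like this (paraphrased sketch, not the exact upstream implementation):

```objc
// paraphrased sketch of the ggml_backend_tensor_copy fallback
void ggml_backend_tensor_copy(struct ggml_tensor * src, struct ggml_tensor * dst) {
    if (src == dst) {
        return;
    }

    if (ggml_backend_buffer_is_host(src->buffer)) {
        // source is in a host buffer -> a regular tensor_set on the destination is enough
        ggml_backend_tensor_set(dst, src->data, 0, ggml_nbytes(src));
    } else if (ggml_backend_buffer_is_host(dst->buffer)) {
        // destination is in a host buffer -> read the source into it
        ggml_backend_tensor_get(src, dst->data, 0, ggml_nbytes(src));
    } else if (!ggml_backend_buffer_copy_tensor(src, dst)) {
        // no direct copy path between the two buffers:
        // the real implementation falls back to staging through CPU memory
    }
}
```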
I'll see if my old Intel MacBook still works and try to implement the discrete GPU recommendations in a follow-up PR.
cont #15832
Opening an alternative implementation to avoid disturbing the ongoing review in #15832. This supports devices with discrete GPUs, while for Apple Silicon with unified memory we continue to allocate the buffers in host memory.
The main reason for the latter is that the CPU -> GPU copies have significant overhead for text generation (measured up to 30% slowdown for some smaller models). The main overhead is from the creation of `MTLCommandQueue` and from `[MTLCommandBuffer waitUntilCompleted]`. Most of the overhead can be eliminated by using a global `MTLCommandQueue` per device, but there is still a 1-2% slowdown and the implementation starts to deviate a bit from the design of the backend context. I've kept this implementation in a branch for reference: https://github.com/ggml-org/llama.cpp/tree/gg/metal-async-save-global-queue

Edit: Switched to a global `MTLCommandQueue` and added support for both Shared and Private buffer types. The latter is selected based on device support. See #15906 (comment).

Outdated:
Keeping the memory buffers as host allocated (as we do on `master`) even results in a small tg performance bump when the backend is async. In the other PR, I think I was thrown off from this approach due to incorrectly advertising both the `buffer_type` and `buffer_from_ptr_type` as `is_host == true`, while I think the correct approach is to advertise only `buffer_from_ptr_type` as `is_host == true`.

This is the relevant change:
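Roughly, the change amounts to something like the following sketch (the callback names below are assumptions; see the PR diff for the actual change):

```objc
// sketch: the regular Metal buffer type no longer reports is_host,
// while the buffer type created from an existing host pointer still does
static bool ggml_backend_metal_buffer_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return false;
}

static bool ggml_backend_metal_buffer_from_ptr_type_is_host(ggml_backend_buffer_type_t buft) {
    GGML_UNUSED(buft);
    return true;
}
```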
With this implementation, the op offloading now works correctly, even when the allocated buffers are in host memory.
Here is a sample perf comparison:
Using private GPU memory can be forced on Apple Silicon with `GGML_METAL_PRIVATE_BUFFERS=1`. This runs the less-optimal implementation that creates a new `MTLCommandQueue` for each operation. The performance drops like this:

This will be automatically enabled for discrete GPUs.
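For example, a hypothetical invocation to test the private-buffer path would be `GGML_METAL_PRIVATE_BUFFERS=1 ./build/bin/llama-bench -m model.gguf` (the build path and model name here are placeholders).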