llama-model: fix inconsistent ctxs <-> bufs order #16581
Conversation
src/llama-model.cpp (Outdated)
            buf_map.emplace(idx, buf);
        }
    }
    pimpl->ctxs.emplace_back(ctx);
I don't think moving this here is good; it is too far removed from the place where the context is allocated, and I suspect it may cause a leak if there is an error in between.
I don't quite understand why the order is important to you. Can you elaborate on that?
In my prototype I used llama_model like this:
std::map<ggml_backend_buffer_type_t, size_t> llama_model::memory_breakdown() const {
    std::map<ggml_backend_buffer_type_t, size_t> ret;
    for (size_t i = 0; i < pimpl->bufs.size(); i++) {
        ggml_backend_buffer_t      buf  = pimpl->bufs[i].get();
        ggml_backend_buffer_type_t buft = ggml_backend_buffer_get_type(buf);
        if (hparams.no_alloc) {
            GGML_ASSERT(ggml_backend_buffer_get_base(buf) == nullptr);
            ret[buft] += ggml_backend_alloc_ctx_tensors_from_buft_size(pimpl->ctxs[i].get(), buft);
        } else {
            GGML_ASSERT(ggml_backend_buffer_get_base(buf) != nullptr);
            ret[buft] += ggml_backend_buffer_get_size(buf);
        }
    }
    return ret;
}
The code implicitly assumed that pimpl->bufs[i] would correspond to pimpl->ctxs[i]. This is not universally true, however. So when I moved from my single-GPU desktop to another machine with multiple GPUs I suddenly got unexpected results. In my opinion this is a footgun. I think it would be preferable to have a consistent order; I would also consider it fine to add a comment warning that the order is not guaranteed to be the same. From my end I can write code that will produce correct results regardless of how the vectors are ordered.
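To illustrate the mechanism (an editorial sketch, not code from the PR): a std::map keyed on raw pointers iterates in pointer order, so its iteration order depends on where the keys happen to live in memory and can differ between builds and machines:

#include <cstdio>
#include <map>

int main() {
    int a = 0, b = 0, c = 0; // the relative addresses of these are unspecified
    std::map<int *, char> m = {{&a, 'a'}, {&b, 'b'}, {&c, 'c'}};
    // Iteration visits keys sorted by pointer value, which need not match
    // insertion order and can vary from one run or machine to the next.
    for (const auto & [p, name] : m) {
        printf("%c at %p\n", name, (void *) p);
    }
    return 0;
}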
I don't really agree that this is a footgun - nowhere is it stated that the vectors of contexts and buffers are related to each other, and nothing warrants assuming that they are. I am not completely sure that this is a good approach, but if you want to make the association between context and buffer explicit, then the two vectors could be joined into a single vector, such as std::vector<std::pair<ggml_context_ptr, ggml_backend_buffer_ptr>>.
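For illustration, a minimal sketch of what that joined layout might look like (the member name ctxs_bufs is hypothetical, and the surrounding struct is abbreviated):

struct llama_model::impl {
    // ... other members ...

    // One element per context: the buffer allocated for that context sits
    // right next to it, so no index-based correspondence is needed.
    std::vector<std::pair<ggml_context_ptr, ggml_backend_buffer_ptr>> ctxs_bufs;
};

// Inside memory_breakdown(), the pairs could then be iterated directly:
for (const auto & [ctx, buf] : pimpl->ctxs_bufs) {
    ret[ggml_backend_buffer_get_type(buf.get())] += ggml_backend_buffer_get_size(buf.get());
}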
Force-pushed from d8e131f to b53509c.
Thank you for pointing out the potential memory leak, I changed the code accordingly.
 for (auto & it : ctx_map) {
     ggml_backend_buffer_type_t buft = it.first;
-    ggml_context * ctx = it.second;
+    ggml_context_ptr & ctx_ptr = it.second;
+    ggml_context * ctx = ctx_ptr.get();
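For context: ggml_context_ptr is the smart-pointer wrapper from ggml-cpp.h (paraphrased below), so holding the contexts in ctx_map as ggml_context_ptr means they are freed automatically even if an error occurs before ownership is transferred to pimpl->ctxs:

#include <memory>

// Paraphrased from ggml-cpp.h: a unique_ptr whose deleter calls ggml_free().
struct ggml_context_deleter {
    void operator()(ggml_context * ctx) { ggml_free(ctx); }
};
typedef std::unique_ptr<ggml_context, ggml_context_deleter> ggml_context_ptr;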
It may be a matter of preference, but you could use structured bindings such as:
for (auto & [buft, ctx_ptr] : ctx_map) {
    ggml_context * ctx = ctx_ptr.get();
Thank you, I just wasn't aware that that is valid syntax.
While working on code for automating memory allocation across devices I've noticed that the order of the vectors for ggml_context and ggml_backend_buffer_t in llama_model::impl can be inconsistent. To my understanding, the reason for this is that ctxs is filled in the order in which the ggml_context structs are created for the model tensors. However, the corresponding backend buffers are then created by iterating over ctx_map. A std::map is sorted by its keys, so ctx_map is sorted by the arrangement of the ggml_backend_buffer_type_t pointers in virtual memory. As a consequence, the order of bufs can be inconsistent with the order of ctxs.

The way I fixed it in this PR is to define a comparator for the map that sorts it based on the buffer type name, and to defer the population of ctxs until bufs is populated. To my understanding there should be no functional difference on the master branch (the order of some prints and of tensors_by_name would change), but I'm spinning this out into a standalone PR regardless, since it seems like an easy fix compared to the amount of time required for debugging.
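As a rough sketch of the approach described above (an editorial paraphrase, not the exact patch), the map can be keyed by the buffer type's name via a custom comparator, making the iteration order deterministic across machines:

#include <cstring>
#include <map>

// Sort buffer types by their human-readable name instead of by pointer
// value, so iterating ctx_map yields the same order on every machine.
struct buft_name_cmp {
    bool operator()(ggml_backend_buffer_type_t a, ggml_backend_buffer_type_t b) const {
        return std::strcmp(ggml_backend_buft_name(a), ggml_backend_buft_name(b)) < 0;
    }
};

std::map<ggml_backend_buffer_type_t, ggml_context_ptr, buft_name_cmp> ctx_map;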