
Conversation

@JohannesGaessler JohannesGaessler (Collaborator) commented Oct 14, 2025

While working on code for automating memory allocation across devices I've noticed that the order of the vectors for ggml_context and ggml_backend_buffer_t in llama_model::impl can be inconsistent. To my understanding, the reason is that ctxs is filled in the order in which the ggml_context structs are created for the model tensors, while the corresponding backend buffers are created by iterating over ctx_map. A std::map is sorted by its keys, and since the keys here are ggml_backend_buffer_type_t pointers, ctx_map is effectively sorted by the addresses of the buffer types in virtual memory. As a consequence, the order of bufs can be inconsistent with the order of ctxs.

The way I fixed it in this PR is to define a comparator for the map that sorts it by the buffer type name and to defer populating ctxs until bufs is populated. To my understanding there should be no functional difference relative to the master branch (the order of some prints and of tensors_by_name would change), but I'm spinning this out into a standalone PR regardless since it is an easy fix compared to the time it could otherwise cost in debugging.

@CISC CISC requested a review from slaren October 14, 2025 20:34
buf_map.emplace(idx, buf);
}
}
pimpl->ctxs.emplace_back(ctx);
Member

I don't think moving this here is good: it is too far removed from the place where the context is allocated, and I suspect it may cause a leak if an error occurs in between.

I don't quite understand why the order is important to you, can you elaborate on that?

Collaborator Author

In my prototype I used llama_model like this:

std::map<ggml_backend_buffer_type_t, size_t> llama_model::memory_breakdown() const {
    std::map<ggml_backend_buffer_type_t, size_t> ret;

    for (size_t i = 0; i < pimpl->bufs.size(); i++) {
        ggml_backend_buffer_t      buf  = pimpl->bufs[i].get();
        ggml_backend_buffer_type_t buft = ggml_backend_buffer_get_type(buf);

        if (hparams.no_alloc) {
            GGML_ASSERT(ggml_backend_buffer_get_base(buf) == nullptr);
            ret[buft] += ggml_backend_alloc_ctx_tensors_from_buft_size(pimpl->ctxs[i].get(), buft);
        } else {
            GGML_ASSERT(ggml_backend_buffer_get_base(buf) != nullptr);
            ret[buft] += ggml_backend_buffer_get_size(buf);
        }
    }

    return ret;
}

The code implicitly assumed that pimpl->bufs[i] would correspond to pimpl->ctxs[i]. However, this is not universally true, so when I moved from my single-GPU desktop to another machine with multiple GPUs I suddenly got unexpected results. In my opinion this is a footgun. I think it would be preferable to have a consistent order, though I would also consider it fine to add a comment warning that the orders are not guaranteed to match. On my end I can write code that produces correct results regardless of how the vectors are ordered.

Member

I don't really agree that this is a footgun: nowhere is it stated that the vectors of contexts and buffers are related to each other, and nothing warrants assuming that they are. I am not completely sure that this is a good approach, but if you want to make the association between context and buffer explicit, the two vectors could be joined into a single vector, such as std::vector<std::pair<ggml_context_ptr, ggml_backend_buffer_ptr>>.

@JohannesGaessler JohannesGaessler force-pushed the llama-model-fix-ctx-order branch from d8e131f to b53509c Compare October 15, 2025 13:33
@JohannesGaessler
Copy link
Collaborator Author

Thank you for pointing out the potential memory leak; I changed ctx_map to hold a unique pointer to the ggml_context, which is later moved into the vector. An error used to be raised when ctx->bufs was empty, but to my understanding this is redundant because the code already raises an error when the returned ggml_backend_buffer_t is null.

Comment on lines 6049 to +6052

     for (auto & it : ctx_map) {
-        ggml_backend_buffer_type_t buft = it.first;
-        ggml_context * ctx = it.second;
+        ggml_backend_buffer_type_t buft = it.first;
+        ggml_context_ptr & ctx_ptr = it.second;
+        ggml_context * ctx = ctx_ptr.get();
Member

It may be a matter of preference, but you could use structured bindings such as:

    for (auto & [buft, ctx_ptr] : ctx_map) {
        ggml_context * ctx = ctx_ptr.get();

Collaborator Author

Thank you, I just wasn't aware that this is valid syntax.
