llama: fix llama-model-saver #20503
Conversation
---
It would be useful to have a simple little CI that checks that KV values in […]
---
I agree. I'm thinking it would make sense to implement a roundtrip like: manual GGUF context -> […]
Force-pushed 64c9b8a to 60312f6
---
I took over the file pointer code from #20402 - I am taking responsibility for an eventual refactor of […]. Some models are still broken with […]
---
@ggerganov are there sources of nondeterminism on Macs in particular? Otherwise I don't understand why the CI fails specifically for them when the results are bit-for-bit identical everywhere else.
---
I noticed some instability as well when looking at the CIs - I think the webgpu failures are an issue with that backend. As for the Metal backend, I'll take a look now. Though I think I saw […]
---
But to answer the question - there are no known sources of non-determinism on Mac.
---
I can confirm, it's usually […]
---
@JohannesGaessler Some of the failing roundtrips are caused by #20943. There are a couple of remaining failures: […]
---
Sorry, this is the correct patch:

```diff
diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 490e8f336..80baa9009 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7607,14 +7607,15 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 buf_map.emplace(idx, buf);
             }
         }
-        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
-        for (auto & buf : buf_map) {
+        for (auto & buf : bufs) {
             // indicate that this buffer contains weights
             // this is used by ggml_backend_sched to improve op scheduling: ops that use a weight are preferably scheduled to the backend that contains the weight
-            ggml_backend_buffer_set_usage(buf.second, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+            ggml_backend_buffer_set_usage(buf.get(), GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
         }
+        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
+
         ctx_buf_maps.emplace_back(ctx, buf_map);
     }
```
---
Force-pushed 60312f6 to 55577aa
---
I'm not able to reproduce this failure: https://github.com/ggml-org/llama.cpp/actions/runs/23511432758/job/68433115543?pr=20503#step:3:1441
It would be useful to print the seed that […]
---
ggerganov left a comment:
@JohannesGaessler Good to merge?
---
From my end yes, but it says: […]
---
Is it expected behavior that llama-quantize now spams the output with messages like: […] Just noticed this when doing quantization of DeepSeek V3.2-Exp.
---
I didn't test that binary; this is definitely not intended.
This PR fixes `llama-model-saver` and makes the `--output` argument of `test-llama-archs` functional (the models themselves are still broken though, because they lack tokenizers).

The first issue fixed in this PR is that `llama-model-saver` is simply unmaintained: a lot of new KV values have been added since I implemented it, and those were not being saved correctly. I went through the KV values again, added the missing ones, and checked where the corresponding information can be extracted from.

The second issue fixed in this PR is that on master several archs have broken tensor names. Typically what happens is that in `llama_model::load_tensors` tensors are created without a corresponding entry in `llm_get_tensor_names`. As a consequence, `LLM_TN_IMPL::str` doesn't use the provided arguments to format the tensor name with e.g. the layer index, so you end up with multiple, different tensors that all have names like `blk.%d.attn_q`. Since a GGUF context is populated by tensor name, this leads to conflicts and the model cannot be saved correctly. To me it is not clear why we have `llm_get_tensor_names` in the first place. I think it would make more sense to just check in `LLM_TN_IMPL::str()` whether `suffix`, `bid`, and/or `xid` are set and to use them in those cases, and also to add a warning in cases where the tensor name template and the provided arguments don't match. I would implement this refactor in this PR.