
llama: fix llama-model-saver#20503

Merged
ggerganov merged 14 commits into ggml-org:master from JohannesGaessler:llama-fix-model-saver
Mar 25, 2026

Conversation

@JohannesGaessler
Contributor

This PR fixes llama-model-saver and makes the --output argument of test-llama-archs functional (the models themselves are still broken though because they lack tokenizers).

The first issue fixed in this PR is that llama-model-saver is simply unmaintained: many new KV values have been added since I implemented it, and those were not being saved correctly. I went through the KV values again, added the missing ones, and determined where the corresponding information can be extracted from.

The second issue fixed in this PR is that on master several archs have broken tensor names. Typically, tensors are created in llama_model::load_tensors without a corresponding entry in llm_get_tensor_names. As a consequence, LLM_TN_IMPL::str does not use the provided arguments to format the tensor name with e.g. the layer index, so you end up with multiple different tensors that all have names like blk.%d.attn_q. Since a GGUF context is populated by tensor name, this leads to conflicts and the model cannot be saved correctly. It is not clear to me why we have llm_get_tensor_names in the first place; I think it would make more sense to just check in LLM_TN_IMPL::str() whether suffix, bid, and/or xid are set and to use them in those cases, and to add a warning in cases where the tensor name template and the provided arguments don't match. I will implement this refactor in this PR.

@github-actions github-actions bot added the testing Everything test related label Mar 13, 2026
@CISC
Member

CISC commented Mar 13, 2026

It would be useful to have a simple little CI check that verifies that the KV values in llama-arch.h are handled in llama-model-saver whenever it is updated. Perhaps also check gguf-py to ensure everything is in sync.

@JohannesGaessler
Contributor Author

I agree. I'm thinking it would make sense to implement a roundtrip in test-llama-archs along the lines of: manual GGUF context -> llama_model -> tmpfile -> llama_model. #20402 could be related, I haven't reviewed it yet.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 23, 2026
@JohannesGaessler JohannesGaessler marked this pull request as ready for review March 23, 2026 22:05
@JohannesGaessler
Contributor Author

I took over the file pointer code from #20402 - I am taking responsibility for an eventual refactor of llama_model_loader w.r.t. loading a model from a real file vs. from a file pointer vs. from a user function. test-llama-archs now uses the new code to write the test models to a tmpfile, read them back, and assert that the resulting logits are bit-for-bit identical.

Some models are still broken with llama_model_saver, I intend to fix these eventually.

@JohannesGaessler
Contributor Author

@ggerganov are there sources of nondeterminism on Macs in particular? Otherwise I don't understand why specifically their CI fails when the results are bit-for-bit identical everywhere else.

@ggerganov
Member

I noticed some instability as well when looking at the CIs - I think the WebGPU failure is an issue with the backend.

But for the Metal backend I'll take a look now. Though I think I saw test-llama-archs failing with Vulkan/CUDA occasionally too.

@ggerganov
Member

But to answer the question - there are no known sources of non-determinism on Mac.

@CISC
Member

CISC commented Mar 24, 2026

But for the Metal backend I'll take a look now. Though I think I saw test-llama-archs failing with Vulkan/CUDA occasionally too.

I can confirm, it's usually gpt-oss, but sometimes also qwen2moe:
https://github.com/ggml-org/llama.cpp/actions/runs/23405172384/job/68082797215#step:3:1511

@ggerganov
Member

@JohannesGaessler Some of the failing roundtrips are caused by: #20943

There are a couple of remaining failures:

  • The CHAMELEON architecture is only partially implemented - I propose to simply disable it in test-llama-archs and we can completely remove it in a follow-up PR
  • The GET_ROWS operator for the token embeddings runs on different devices: with the in-memory model it runs on the GPU, but after reading the model back from the temp file it runs on the CPU. This sometimes causes very small numerical differences. I have a repro and will try to propose a fix.

@ggerganov

This comment was marked as outdated.

@ggerganov
Member

Sorry, this is the correct patch:

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 490e8f336..80baa9009 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7607,14 +7607,15 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 buf_map.emplace(idx, buf);
             }
         }
-        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
 
-        for (auto & buf : buf_map) {
+        for (auto & buf : bufs) {
             // indicate that this buffer contains weights
             // this is used by ggml_backend_sched to improve op scheduling: ops that use a weight are preferably scheduled to the backend that contains the weight
-            ggml_backend_buffer_set_usage(buf.second, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+            ggml_backend_buffer_set_usage(buf.get(), GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
         }
 
+        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
+
         ctx_buf_maps.emplace_back(ctx, buf_map);
     }
 

@ggerganov
Member

I'm not able to reproduce this failure:

https://github.com/ggml-org/llama.cpp/actions/runs/23511432758/job/68433115543?pr=20503#step:3:1441

It would be useful to print the seed that test-llama-archs uses, so I can try the same seed and see if it reproduces the problem.


@ggerganov ggerganov left a comment


@JohannesGaessler Good to merge?

@JohannesGaessler JohannesGaessler removed the request for review from CISC March 25, 2026 10:42
@JohannesGaessler
Contributor Author

From my end, yes, but it says:

Merging is blocked
At least 2 approving reviews are required by reviewers with write access.

@ggerganov ggerganov merged commit 36dafba into ggml-org:master Mar 25, 2026
48 checks passed
@fairydreaming
Collaborator

Is it expected behavior that llama-quantize now spams the output with messages like:

...
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
...

I just noticed this while quantizing DeepSeek V3.2-Exp.

@JohannesGaessler
Contributor Author

I didn't test that binary; this is definitely not intended.
