
llama: fix llama-model-saver#20503

Merged
ggerganov merged 14 commits into ggml-org:master from JohannesGaessler:llama-fix-model-saver
Mar 25, 2026

Conversation

@JohannesGaessler
Contributor

This PR fixes llama-model-saver and makes the --output argument of test-llama-archs functional (the models themselves are still broken though because they lack tokenizers).

The first issue fixed in this PR is that llama-model-saver is simply unmaintained: many new KV values have been added since I implemented it, and those were not being saved correctly. I went through the KV values again, added the missing ones, and determined where the corresponding information can be extracted from.

The second issue fixed in this PR is that on master several archs have broken tensor names. Typically, tensors are created in llama_model::load_tensors without a corresponding entry in llm_get_tensor_names. As a consequence, LLM_TN_IMPL::str does not use the provided arguments to format the tensor name with e.g. the layer index, so you end up with multiple different tensors that all have names like blk.%d.attn_q. Since a GGUF context is populated by tensor name, this leads to conflicts and the model cannot be saved correctly. It is not clear to me why we have llm_get_tensor_names in the first place; I think it would make more sense to just check in LLM_TN_IMPL::str() whether suffix, bid, and/or xid are set and to use them in those cases, and to add a warning in cases where the tensor name template and the provided arguments don't match. I will implement this refactor in this PR.

@github-actions github-actions bot added the testing Everything test related label Mar 13, 2026
@CISC
Member

CISC commented Mar 13, 2026

It would be useful to have a simple little CI check that verifies that the KV values in llama-arch.h are handled in llama-model-saver whenever it is updated. Perhaps also check gguf-py to ensure everything is in sync.

@JohannesGaessler
Contributor Author

I agree. I'm thinking it would make sense to implement a roundtrip in test-llama-archs along the lines of: manual GGUF context -> llama_model -> tmpfile -> llama_model. #20402 could be related, I haven't reviewed it yet.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Mar 23, 2026
@JohannesGaessler JohannesGaessler marked this pull request as ready for review March 23, 2026 22:05
@JohannesGaessler
Contributor Author

I took over the file pointer code from #20402 - I am taking responsibility for an eventual refactor of llama_model_loader w.r.t. loading a model from a real file vs. from a file pointer vs. from a user function. test-llama-archs now uses the new code to write the test models to a tmpfile, read them back, and assert that the resulting logits are bit-for-bit identical.

Some models are still broken with llama_model_saver, I intend to fix these eventually.

@JohannesGaessler
Contributor Author

@ggerganov are there sources of nondeterminism on Macs in particular? Otherwise I don't understand why specifically their CI fails when the results are bit-for-bit identical everywhere else.

@ggerganov
Member

I noticed some instability as well when looking at the CIs - I think the WebGPU failure is an issue with the backend.

But for the Metal backend I'll take a look now. Though I think I saw test-llama-archs failing with Vulkan/CUDA occasionally too.

@ggerganov
Member

But to answer the question - there are no known sources of non-determinism on Mac.

@CISC
Member

CISC commented Mar 24, 2026

But for the Metal backend I'll take a look now. Though I think I saw test-llama-archs failing with Vulkan/CUDA occasionally too.

I can confirm, it's usually gpt-oss, but sometimes also qwen2moe:
https://github.com/ggml-org/llama.cpp/actions/runs/23405172384/job/68082797215#step:3:1511

@ggerganov
Member

@JohannesGaessler Some of the failing roundtrips are caused by: #20943

There are a couple of remaining failures:

  • The CHAMELEON architecture is only partially implemented - I propose to simply disable it in test-llama-archs and we can completely remove it in a follow-up PR
  • The GET_ROWS operator for the token embeddings runs on different devices: with the in-memory model it runs on the GPU, but after reading the model back from the temp file it runs on the CPU. This sometimes causes very small numerical differences. I have a repro and will try to propose a fix.

@ggerganov

This comment was marked as outdated.

@ggerganov
Member

Sorry, this is the correct patch:

diff --git a/src/llama-model.cpp b/src/llama-model.cpp
index 490e8f336..80baa9009 100644
--- a/src/llama-model.cpp
+++ b/src/llama-model.cpp
@@ -7607,14 +7607,15 @@ bool llama_model::load_tensors(llama_model_loader & ml) {
                 buf_map.emplace(idx, buf);
             }
         }
-        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
 
-        for (auto & buf : buf_map) {
+        for (auto & buf : bufs) {
             // indicate that this buffer contains weights
             // this is used by ggml_backend_sched to improve op scheduling: ops that use a weight are preferably scheduled to the backend that contains the weight
-            ggml_backend_buffer_set_usage(buf.second, GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
+            ggml_backend_buffer_set_usage(buf.get(), GGML_BACKEND_BUFFER_USAGE_WEIGHTS);
         }
 
+        pimpl->ctxs_bufs.emplace_back(std::move(ctx_ptr), std::move(bufs));
+
         ctx_buf_maps.emplace_back(ctx, buf_map);
     }
 

@ggerganov
Member

I'm not able to reproduce this failure:

https://github.com/ggml-org/llama.cpp/actions/runs/23511432758/job/68433115543?pr=20503#step:3:1441

It would be useful to print the seed that test-llama-archs uses, so I can try the same seed and see if it reproduces the problem.


@ggerganov ggerganov left a comment


@JohannesGaessler Good to merge?

@JohannesGaessler JohannesGaessler removed the request for review from CISC March 25, 2026 10:42
@JohannesGaessler
Contributor Author

From my end, yes, but it says:

Merging is blocked
At least 2 approving reviews are required by reviewers with write access.

@ggerganov ggerganov merged commit 36dafba into ggml-org:master Mar 25, 2026
48 checks passed
@fairydreaming
Collaborator

Is it expected behavior that llama-quantize now spams the output with messages like:

...
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name position_embd with suffix=weight bid=-1 xid=-1
str: cannot properly format tensor name token_types with suffix=weight bid=-1 xid=-1
...

I just noticed this while quantizing DeepSeek V3.2-Exp.

@JohannesGaessler
Contributor Author

I didn't test that binary; this is definitely not intended.
