llama: store mrope data in KV cell #16825

ngxson · 2025-10-28T18:27:44Z

Supersede #16822

Fix #13694 (hopefully this time for real)

The idea is to store the M-RoPE (x,y,t) positions inside KV cells. This will allow the causal mask to be constructed correctly based on (x,y,t) positions.

The benefit is that this introduce no breaking changes, compared to other proposals.

This should now give the same output as #16822 across multiple values of -b

./build/bin/llama-mtmd-cli \
    -m "../models/Qwen2.5-VL-7B-Instruct-Q4_K_M.gguf" \
    --mmproj "../models/mmproj-Qwen2.5-VL-7B-Instruct-Q8_0.gguf" \
    --image "../models/0_bbox.png" \
    -p "Please first output bbox coordinates and colors of every rectangle in this image in JSON format, and then answer how many rectangles are there in the image." \
    --temp 0 -n 128

[
        {"bbox_2d": [168, 679, 462, 837], "color": "red"},
        {"bbox_2d": [312, 575, 480, 765], "color": "green"},
        {"bbox_2d": [601, 708, 672, 775], "color": "black"}
]

Image draw using this script: https://gist.github.com/ngxson/039024fb2bdaf2e3c15db702f9fddaff

TODO:

fix KV save/load with mrope --> follow-up PR
~~also fix the mtmd_image_tokens_get_n_pos~~

ngxson · 2025-10-28T18:28:54Z

@ggerganov Could you have a quick look to see if this is indeed better than the other proposals? Thanks!

ggerganov · 2025-10-28T20:47:20Z

Looking good on first look. Will take a detailed look tomorrow.

FMayran · 2025-10-28T22:21:44Z

I think @rujialiu and myself thought of this possibility, and it really is the best solutions, as it stores everything state-related in the kV cache. With the price of increased kV cell size. Perhaps we only need to store the positional encoding of the last inserted token of a sequence as a property of the kV cache, not one position per cell?
Edit : unless we later want to remove part of the kV cache, in which case we need the per-cell positional encoding of course

ngxson · 2025-10-28T23:52:36Z

Perhaps we only need to store the positional encoding of the last inserted token of a sequence as a property of the kV cache, not one position per cell?

This could potentially work, but it will make the code to be more error-prone and more complicated to understand.

Storing (x,y) position per-cell is a more robust solution, with a bit of memory cost . Even with 128k tokens, this only use an additional of 2 (number of int) * 4 (bytes per int) * (128*1024) = 1048576 bytes = 1 MB

rujialiu · 2025-10-29T02:00:00Z

I think @rujialiu and myself thought of this possibility, and it really is the best solutions, as it stores everything state-related in the kV cache. With the price of increased kV cell size. Perhaps we only need to store the positional encoding of the last inserted token of a sequence as a property of the kV cache, not one position per cell? Edit : unless we later want to remove part of the kV cache, in which case we need the per-cell positional encoding of course

Yes, I've thought of this as well, but I'm not confident enough to ensure nothing is broken 😄

Just one concern: I see in llama-batch, has_mrope() only checks pos.size() vs token.size() (I know, there's nothing else we can do, without some ugly tricks), so essentially we can only support "one kind of mrope". I don't know whether this will cause trouble later. That's why I didn't go further with this approach because I want to minimize the model dependent multi-modal logic inside llama-batch.

But overall, this is the best solution so far.

rujialiu · 2025-10-29T02:02:14Z

Storing (x,y) position per-cell is a more robust solution, with a bit of memory cost . Even with 128k tokens, this only use an additional of 2 (number of int) * 4 (bytes per int) * (128*1024) = 1048576 bytes = 1 MB

I would definitely buy robustness and code simplicity with 1MB/128k tokens 😄

rujialiu · 2025-10-29T02:15:52Z

Though it's not implemented yet (but people requested it #16186 ), should we (conceptually) check this approach can support Qwen3-Omni's mrope (called TM-RoPe in its technical report) without much additional effort?

ggerganov · 2025-10-29T07:31:05Z

src/llama-batch.h

+    bool has_mrope() const {
+        return data->pos.size() == data->token.size()*4;
+    }
+


I think we can make this multi-dimensional positional information more decoupled from the concept of rope:

diff --git a/src/llama-batch.h b/src/llama-batch.h index 34f964ef0..8a6c6daff 100644 --- a/src/llama-batch.h +++ b/src/llama-batch.h @@ -17,8 +17,13 @@ struct llama_ubatch { return b_equal_seqs != 0; } - bool has_mrope() const { - return data->pos.size() == data->token.size()*4; + // typical for M-RoPE cases: + // 0 - sequantial position of the tokens/embeddings in the sequence + // 1 - x position in the image + // 2 - y position in the image + // 3 - other + bool is_pos_2d() const { + return n_pos >= 3; } uint32_t b_equal_seqs; // note: this is a boolean, but we use an int32_t for alignment @@ -29,6 +34,7 @@ struct llama_ubatch { uint32_t n_seq_tokens; // tokens per sequence set uint32_t n_seqs; // sequence sets in the ubatch uint32_t n_seqs_unq; // unique sequence ids in the ubatch + uint32_t n_pos; // position inputs for each token/embedding // seq_id_unq: unique sequence ids in the ubatch // seq_idx: indices of the unique sequence ids in the ubatch in [0, n_seqs_unq) @@ -37,7 +43,7 @@ struct llama_ubatch { // // size | idx | val llama_token * token; // [n_tokens] | i | id, token float * embd; // [n_embd, n_tokens] | i | embd - llama_pos * pos; // [n_tokens] | i | pos + llama_pos * pos; // [n_tokens*n_pos] | i | pos int32_t * n_seq_id; // [n_tokens] | i | - llama_seq_id ** seq_id; // [n_tokens] | s | s0, s1, seq_id llama_seq_id * seq_id_unq; // [n_seqs_unq] | s | seq_id

ggerganov · 2025-10-29T07:34:47Z

src/llama-kv-cells.h

+struct llama_kv_pos_mrope {
+    llama_pos y = 0;
+    llama_pos x = 0;
+    // return true if this position is greater than the other position
+    bool is_gt(const llama_kv_pos_mrope & other) const {
+        return (y > other.y) || (y == other.y && x > other.x);
+    }
+};
+


Similarly, I think we can decouple the concept of M-RoPE here by declaring this struct to be more generic:

struct llama_kv_cell_ext { // 2D spatial positions, typically used for M-RoPE llama_pos x = 0; llama_pos y = 0; // ... maybe more data in the future };

ggerganov · 2025-10-29T07:35:25Z

src/llama-kv-cells.h

+    // stores addition info for M-RoPE positions
+    std::vector<llama_kv_pos_mrope> pos_mrope;
+


Suggested change

// stores addition info for M-RoPE positions

std::vector<llama_kv_pos_mrope> pos_mrope;

// stores extra optional cell info

std::vector<llama_kv_cell_ext> ext;

ggerganov · 2025-10-29T07:46:38Z

src/llama-kv-cache.cpp

+                llama_kv_pos_mrope p1_mrope;
+                if (ubatch->has_mrope()) {
+                    p1_mrope.y = ubatch->pos[i + ubatch->n_tokens];
+                    p1_mrope.x = ubatch->pos[i + ubatch->n_tokens*2];
+                }
+


I find it confusing to have the order of the positions as y, x. It's more canonical to have the dimensions ordered by increasing significance - x, y, z, .... This is also inline with the ggml convention for indexing.

I now notice that even the implementation of ggml_rope_multi uses this order. I would recommend to update this across the codebase for consistency. Even though it's a breaking change, it's better to do it now, before the mtmd stuff gets more adopted.

yes I agree that we should fix the ordering in ggml, I will make a PR for that

hmm on second thought, I think it cannot be ordered as x,y,z. This is because the full 4D position will be p,x,y,z with p the traditional LLM position

Because Qwen doesn't use the last z dim, so the ordering is currently p,y,x which is decreasing significant.

I think the better way is as you suggest above, decouple the logic into 2d_mrope to be more specific

Hm, not sure I follow. My point is that p,x,y,t is more consistent order compared to the current p,y,x,t.

broadbit-hu · 2025-10-29T08:16:56Z

@ngxson Are we still sure we need this hparams.image_size 1024-size limitation? Scaling distorts the image, not just the relative positions.

https://github.com/ggml-org/llama.cpp/blob/master/tools/mtmd/clip.cpp#L2416

                case PROJECTOR_TYPE_QWEN2VL:
                    {
                        // max image size = sqrt(max_pixels) = 3584
                        // ref: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/blob/main/preprocessor_config.json
                        // however, the model use unreasonable memory past 1024 size, we force it to 1024 otherwise it's unusable
                        // ref: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/discussions/10
                        hparams.image_size = 1024;
                        hparams.warmup_image_size = hparams.patch_size * 8;
                    } break;
                case PROJECTOR_TYPE_QWEN25VL:
                    {
                        // max image size = sqrt(max_pixels)
                        // https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/blob/main/preprocessor_config.json
                        // however, the model use unreasonable memory past 1024 size, we force it to 1024 otherwise it's unusable
                        // ref: https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/discussions/10
                        hparams.image_size = 1024;
                        hparams.warmup_image_size = hparams.patch_size * 8;
                        get_u32(KEY_WIN_ATTN_PATTERN, hparams.n_wa_pattern);
                    } break;

See the differences between hparams.image_size (64x2x14=) 1792 (left) and 1024 (right):

ngxson · 2025-10-29T09:08:44Z

@broadbit-hu the limit was added because some users reported out of memory issue. We will implement custom image size limit in near future, which should fix the issue.

broadbit-hu · 2025-10-29T09:22:17Z

@broadbit-hu the limit was added because some users reported out of memory issue. We will implement custom image size limit in near future, which should fix the issue.

@ngxson Yes, I also read the Hugging Face comment, so I ran tests with @FMayran 's llama.cpp version using various image_size values. If we want to process an image of a letter-sized A4 page with small printed text, a 1024-pixel height results in very poor quality. Increasing 1024 to 1792 significantly improved the results, and I didn't encounter any memory issues. I'll also run tests with even larger values.

Another problem is that 1024 is not divisible by 28 (Qwen 2.5 VL: 28, Qwen 3 VL: 32), so if we really want a smaller value, 1008 or 1120 would be better.

[EDIT] I performed tests with 1792x1792-sized images too and successfully reproduced the mentioned memory issue on both AMD GPUs and CPUs. I reduced the image_size value to 1568, and now the tests are running.

ngxson · 2025-10-29T10:45:20Z

The save/load seems to be tricky as apply_ubatch() call inside state_read_meta only take the 1-dim position input. I think it's safer to implement save/load of llama_kv_cell_ext in a follow-up PR

broadbit-hu · 2025-10-29T10:49:22Z

@ngxson I also reviewed your rectangles-test with my local FMayran's PR - compared to the original 1024x1024 resolution (right), manually scaling to 1008x1008 (left) significantly better results, just like 980x980 does. However, for this image, the 1120x1120 resolution does not improve anything but rather degrades positional accuracy.

Prompt:

Please first output bbox coordinates and colors of every rectangle in this image in JSON bbox2d format.

theo77186 · 2025-10-29T12:11:10Z

The whole 1024px limitation will be moot as soon as flash attention will be implemented for multimodal models (related: #16837), this will avoid excessive VRAM usage caused by large image sizes.

ngxson · 2025-10-29T13:06:13Z

For discussion regarding image sizes, please open a dedicated issue to discuss. It is not related to the current PR.

ggerganov

The save/load seems to be tricky as apply_ubatch() call inside state_read_meta only take the 1-dim position input. I think it's safer to implement save/load of llama_kv_cell_ext in a follow-up PR

Add a TODO with a reference to this PR to not forget about this.

src/llama-kv-cache.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

ggerganov · 2025-10-29T13:30:19Z

src/llama-kv-cache.cpp

+        // TODO: we cannot yet restore llama_kv_cell_ext as the apply_ubatch() does not support it yet
+        //       see: https://github.com/ggml-org/llama.cpp/pull/16825#issuecomment-3460868350
        apply_ubatch(sinfo, ubatch);


I don't understand this statement - the apply_ubatch() does handle ext:

llama.cpp/src/llama-kv-cache.cpp

Lines 902 to 910 in 5ec41a1

if (ubatch.is_pos_2d()) {

llama_kv_cell_ext ext {

/*.x =*/ ubatch.pos[i + ubatch.n_tokens*2],

/*.y =*/ ubatch.pos[i + ubatch.n_tokens],

};

cells.ext_set(idx, std::move(ext));

}

I mean the ext is constructed from pos, but ideally what I want is that apply_ubatch take the raw ext read from the save file.

The benefit is that when ext has more info than just x,y, then we won't need to update the save/load code again.

Another approach could be the other way: on saving the state, we "serialize" ext back into list of pos that can be later feed into ubatch. But IMO this is a bit hacky.

ggerganov

Just pushed a few more changes bed0f57

Should be good to merge now.

easyfab · 2025-10-30T18:42:32Z

Hi since this commit, with LightOnOCR-1B-1025-Q8_0.gguf and blank page I got infinite /n/n/n......
I saw it when ocr using python script and pdfs with blank page.
command line : llama-server -m ../models/LightOnOCR-1B-1025-Q8_0.gguf --mmproj ../models/mmproj/mmproj-LightOnOCR-1B-1025-Q8_0.gguf -c 8192 --jinja

And confirmation with webui:

llama: store mrope data in KV cell

bf7f924

ngxson mentioned this pull request Oct 28, 2025

llama: add linearly increasing token dim for M-RoPE #16822

Closed

correct x,y ordering

90353ea

github-actions bot added the examples label Oct 28, 2025

ggerganov reviewed Oct 29, 2025

View reviewed changes

broadbit-hu mentioned this pull request Oct 29, 2025

Feature Request: support qwen3-vl series #16207

Closed

4 tasks

ngxson added 2 commits October 29, 2025 11:16

address review comments

c3e1393

Merge branch 'master' into xsn/fix-mrope-causal

ebac831

ngxson marked this pull request as ready for review October 29, 2025 10:25

ngxson requested a review from ggerganov October 29, 2025 10:25

add consistency checks

9102a7c

ggerganov reviewed Oct 29, 2025

View reviewed changes

src/llama-kv-cache.cpp Show resolved Hide resolved

src/llama-kv-cache.cpp Outdated Show resolved Hide resolved

ngxson and others added 3 commits October 29, 2025 14:20

Update src/llama-kv-cache.cpp

f706358

Co-authored-by: Georgi Gerganov <[email protected]>

add TODO

18842b6

fix asan error

5ec41a1

ggerganov reviewed Oct 29, 2025

View reviewed changes

ngxson mentioned this pull request Oct 29, 2025

Improve pre-processed image size for QwenVL #16842

Open

kv-cells : improve ext handling

bed0f57

ggerganov approved these changes Oct 29, 2025

View reviewed changes

cont : fix headers

45d60e1

ngxson merged commit e3af556 into ggml-org:master Oct 29, 2025
62 of 63 checks passed

This was referenced Oct 29, 2025

llama: fix ASAN error with M-RoPE #16848

Merged

[model] add support for qwen3vl series #16780

Merged

ddh0 mentioned this pull request Oct 30, 2025

support GLM-4.5V vision model #16600

Draft

		// stores addition info for M-RoPE positions
		std::vector<llama_kv_pos_mrope> pos_mrope;


	if (ubatch.is_pos_2d()) {
	llama_kv_cell_ext ext {
	/.x =/ ubatch.pos[i + ubatch.n_tokens*2],
	/.y =/ ubatch.pos[i + ubatch.n_tokens],
	};
	cells.ext_set(idx, std::move(ext));
	}

llama: store mrope data in KV cell #16825

llama: store mrope data in KV cell #16825

Conversation

ngxson commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Oct 28, 2025

Uh oh!

ggerganov commented Oct 28, 2025

Uh oh!

FMayran commented Oct 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Oct 28, 2025

Uh oh!

rujialiu commented Oct 29, 2025

Uh oh!

rujialiu commented Oct 29, 2025

Uh oh!

rujialiu commented Oct 29, 2025

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

broadbit-hu commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Oct 29, 2025

Uh oh!

broadbit-hu commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ngxson commented Oct 29, 2025

Uh oh!

broadbit-hu commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

theo77186 commented Oct 29, 2025

Uh oh!

ngxson commented Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ggerganov Oct 29, 2025

Choose a reason for hiding this comment

Uh oh!

ngxson Oct 29, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ggerganov left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

easyfab commented Oct 30, 2025

Uh oh!

Reviewers

ngxson commented Oct 28, 2025 •

edited

Loading

FMayran commented Oct 28, 2025 •

edited

Loading

ngxson Oct 29, 2025 •

edited

Loading

broadbit-hu commented Oct 29, 2025 •

edited

Loading

broadbit-hu commented Oct 29, 2025 •

edited

Loading

broadbit-hu commented Oct 29, 2025 •

edited

Loading

ngxson commented Oct 29, 2025 •

edited

Loading

ngxson Oct 29, 2025 •

edited

Loading