# llama.cpp PR #16745: Fix Qwen2.5 VL Cache Causal Masking

## Overview
This PR fixes causal masking issues in Qwen2.5 Vision-Language models by tracking actual KV cache positions instead of assuming consecutive token positions. This resolves inference errors when processing vision embeddings with non-consecutive position IDs.

## Source
- **Upstream PR**: https://github.com/ggml-org/llama.cpp/pull/16745
- **Applied**: October 25, 2025
- **Branch**: 12_07_mio
- **Commit**: e1a3d8557

## Problem Statement
Qwen2.5 VL models use vision embeddings with **non-consecutive position IDs**:
- Text tokens: positions 0, 1, 2, 3, ...
- Vision embeddings: positions 100, 200, 300, ...
- Continuation: positions 4, 5, 6, ...

The old implementation assumed **consecutive positions** for causal masking, causing:
1. Incorrect attention masks for vision tokens
2. Model inference failures
3. Poor generation quality with vision inputs
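
To make the failure mode concrete, here is a minimal, hypothetical C++ sketch (not part of the PR) that applies the old consecutive-position assumption to a mixed text/vision position stream like the one above:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical position stream: two text tokens, two vision embeddings,
    // then text continuing -- the kind of layout described above.
    std::vector<int> pos = {0, 1, 100, 101, 2};

    // The old validation effectively required pos[i] == pos[i-1] + 1 for every token.
    for (size_t i = 1; i < pos.size(); ++i) {
        if (pos[i] != pos[i - 1] + 1) {
            std::printf("consecutive-position assumption breaks at token %zu (pos %d -> %d)\n",
                        i, pos[i - 1], pos[i]);
        }
    }
    return 0;
}
```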

## Changes Made

### 1. Batch Structure Enhancement
**File**: `llama/llama.cpp/src/llama-batch.h`

**Added to `llama_ubatch` struct**:
```cpp
int32_t * kv_position_of_token; // actual KV cache position for each token
```

**Added to `llama_ubatch::data_t` struct**:
```cpp
std::vector<int32_t> kv_position_of_token; // storage for KV positions
```

**Purpose**: Track the actual KV cache position for each token in the batch, independent of temporal position.
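
For orientation, here is a minimal sketch of how the augmented structures fit together; field names other than `kv_position_of_token` are simplified and the real `llama_ubatch` carries many more members:

```cpp
#include <cstdint>
#include <vector>

// Simplified sketch of the relationship, not the verbatim llama.cpp definition.
struct llama_ubatch_sketch {
    uint32_t  n_tokens = 0;
    int32_t * pos                  = nullptr; // temporal positions (may be non-consecutive)
    int32_t * kv_position_of_token = nullptr; // actual KV cache position per token (new field)

    // Backing storage; the raw pointer above aliases the vector owned here,
    // mirroring how the other per-token arrays are kept alive.
    struct data_t {
        std::vector<int32_t> pos;
        std::vector<int32_t> kv_position_of_token;
    };
};
```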

### 2. Batch Initialization
**File**: `llama/llama.cpp/src/llama-batch.cpp`

**Commented out strict position validation** (lines 259-289):
```cpp
// GGML_ASSERT(ubatch.n_tokens > 0);
// GGML_ASSERT(batch->pos[0] >= 0);
// for (int i = 1; i < ubatch.n_tokens; ++i) {
//     GGML_ASSERT(batch->pos[i] == batch->pos[i-1] + 1); // No longer required
// }
```

**Added kv_position_of_token initialization** in 3 locations:
1. Standard batch split (line ~175)
2. Equal split mode (line ~230)
3. Batch sequence processing (line ~315)

**Added code**:
```cpp
ubatch.kv_position_of_token = ubatch_data->kv_position_of_token.data();
```

**Rationale**: Vision embeddings can have non-consecutive positions, so the strict consecutive-position validation was too restrictive.
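
A minimal sketch of what that initialization amounts to, assuming the ubatch's backing storage owns the vector (type and variable names here are illustrative, not copied from the PR):

```cpp
#include <cstdint>
#include <vector>

// Illustrative stand-ins for the real batch types in llama-batch.cpp.
struct ubatch_data { std::vector<int32_t> kv_position_of_token; };
struct ubatch_view { int32_t * kv_position_of_token = nullptr; };

// What each of the three split paths effectively does for the new field.
static void init_kv_positions(ubatch_data & udata, ubatch_view & ubatch, uint32_t n_tokens) {
    udata.kv_position_of_token.assign(n_tokens, -1); // -1 = cache cell not assigned yet
    ubatch.kv_position_of_token = udata.kv_position_of_token.data();
    // The real KV cache positions are filled in later, when the cache places
    // the tokens (see section 3 below).
}
```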

### 3. KV Cache Causal Masking Rewrite
**File**: `llama/llama.cpp/src/llama-kv-cache.cpp`

**Function**: `llama_kv_cache_update_impl()`

**Old behavior**:
- Used token's temporal position for masking
- Assumed consecutive positions
- Couldn't handle vision embedding position jumps

**New behavior**:
- Builds `map_kv_to_batch` vector to track actual KV positions
- Updates `ubatch.kv_position_of_token[i]` with actual cache position
- Uses batch position indices for causal masking instead of temporal positions

**Key code**:
```cpp
// Build mapping from KV cache position to batch index
std::vector<int32_t> map_kv_to_batch(kv_self.size, -1);
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
        const llama_seq_id seq_id = ubatch.seq_id[i][s];
        // ... find cache position for this token ...
        ubatch.kv_position_of_token[i] = (int32_t)idx; // Store actual position
        map_kv_to_batch[idx] = (int32_t)i; // Map position to batch index
    }
}

// Causal masking using batch indices
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    if (has_mask) {
        int32_t pos_kv_i = ubatch.kv_position_of_token[i];
        for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
            const llama_seq_id seq_id = ubatch.seq_id[i][s];
            for (uint32_t j = 0; j < ubatch.n_tokens; ++j) {
                int32_t pos_kv_j = ubatch.kv_position_of_token[j];
                // Check whether token i can attend to token j, using KV cache positions
                ubatch.mask[i*ubatch.n_tokens + j] = (
                    ubatch_seq_id_cmp(ubatch, j, seq_id) &&
                    pos_kv_j <= pos_kv_i // Causal masking based on KV position
                );
            }
        }
    }
}
```

**Benefits**:
- Handles non-consecutive positions correctly
- Vision embeddings masked properly
- Preserves causal attention semantics

### 4. M-RoPE Position Calculation
**File**: `llama/llama.cpp/tools/mtmd/mtmd.cpp`

**Function**: `llama_mtmd_input_text_template::get_position()`

**Changed** (line 113):
```cpp
// Old: return 1; // Always returned 1 for images
// New:
return std::max(nx, ny); // Return max(width, height) for proper image dimensions
```

**Rationale**: Qwen VL uses image dimensions for RoPE position calculation. Returning 1 broke positional encoding for vision embeddings.
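
A hedged sketch of how that return value is meant to be consumed when advancing past an image chunk; the helper below is hypothetical, and only the max(width, height) rule comes from the PR:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical helper: how many temporal position slots one image occupies.
// nx and ny are the image's patch-grid width and height.
static int64_t image_temporal_extent(int64_t nx, int64_t ny) {
    return std::max(nx, ny); // Qwen2.5 VL expects max(width, height), not 1
}

int main() {
    int64_t n_past = 2;                      // e.g. two text tokens already decoded
    n_past += image_temporal_extent(16, 12); // a 16x12 patch grid advances positions by 16
    std::printf("text after the image resumes at position %lld\n", (long long) n_past);
    // With the old "return 1", every image advanced positions by a single slot,
    // so vision embeddings and the following text received overlapping RoPE positions.
    return 0;
}
```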

### 5. Documentation Update
**File**: `llama/llama.cpp/tools/mtmd/mtmd.h`

**Updated comment** (line 112):
```cpp
// Old comment: return temporal position (usually 1 for images)
// New comment:
// return temporal position for embeddings
// Note: Qwen VL models expect max(image_width, image_height) here
// to properly calculate M-RoPE positions for vision embeddings
```

## Technical Details

### Position Tracking Flow
1. **Batch Creation**: Initialize `kv_position_of_token` array
2. **KV Cache Update**:
   - Find actual cache position for each token
   - Store in `ubatch.kv_position_of_token[i]`
3. **Masking**:
   - Use `kv_position_of_token` for causal checks
   - Token i can attend to token j if pos_kv_j <= pos_kv_i

### Example: Vision Processing
**Input sequence**:
```
Token 0: "Describe"      -> pos=0,   kv_pos=0
Token 1: "this"          -> pos=1,   kv_pos=1
Token 2: <vision_emb_0>  -> pos=100, kv_pos=2  // Non-consecutive!
Token 3: <vision_emb_1>  -> pos=101, kv_pos=3
Token 4: "image"         -> pos=2,   kv_pos=4  // Position resets
```

**Causal mask** (kv_position_of_token based):
```
    0 1 2 3 4
0 [ T F F F F ]  Token 0 sees only itself
1 [ T T F F F ]  Token 1 sees 0,1
2 [ T T T F F ]  Vision 0 sees 0,1,itself
3 [ T T T T F ]  Vision 1 sees 0,1,2,itself
4 [ T T T T T ]  Token 4 sees all previous
```

Without this fix, vision tokens would have incorrect masks based on pos=100,101.
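
The mask above can be reproduced with a small stand-alone sketch of the masking rule (single sequence assumed; this is an illustration, not the PR code):

```cpp
#include <cstdio>
#include <vector>

int main() {
    // kv_position_of_token for the five tokens in the example above.
    std::vector<int> kv_pos = {0, 1, 2, 3, 4};
    // Temporal positions are listed only to stress that they are NOT consulted here.
    std::vector<int> pos = {0, 1, 100, 101, 2};
    (void) pos;

    const size_t n = kv_pos.size();
    for (size_t i = 0; i < n; ++i) {       // query token i
        std::printf("%zu [", i);
        for (size_t j = 0; j < n; ++j) {   // key token j
            // Causal rule: token i may attend to token j iff kv_pos[j] <= kv_pos[i].
            std::printf(" %c", kv_pos[j] <= kv_pos[i] ? 'T' : 'F');
        }
        std::printf(" ]\n");
    }
    return 0;
}
```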

### M-RoPE Position Fix
**Qwen VL M-RoPE** uses 3D positional encoding:
- **Temporal dimension**: Token sequence position
- **Height dimension**: For vision, use image height
- **Width dimension**: For vision, use image width

**Old code**: `return 1` made all vision embeddings have position=1
**New code**: `return max(nx, ny)` uses actual image dimensions
**Result**: Correct RoPE frequencies for vision embeddings
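
As a rough illustration of that scheme (simplified to the three dimensions listed above; the exact index layout inside llama.cpp may differ):

```cpp
#include <algorithm>
#include <cstdio>

// Simplified M-RoPE position assignment for one image placed at temporal
// position t_start, with a 3x2 patch grid (height=3, width=2).
int main() {
    const int t_start = 2, grid_h = 3, grid_w = 2;

    for (int h = 0; h < grid_h; ++h) {
        for (int w = 0; w < grid_w; ++w) {
            // All patches share the same temporal position; the height and
            // width dimensions carry the patch's row and column offsets.
            std::printf("patch(%d,%d): t=%d h=%d w=%d\n",
                        h, w, t_start, t_start + h, t_start + w);
        }
    }
    // The next text token resumes at t_start + max(grid_h, grid_w), which is
    // why get_position() must report max(nx, ny) rather than 1.
    std::printf("next text token: t=%d\n", t_start + std::max(grid_h, grid_w));
    return 0;
}
```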

## Benefits
1. **Correct Vision Processing**: Qwen VL models work properly
2. **Flexible Position IDs**: Supports non-consecutive positions
3. **Maintains Causality**: Attention masking still correct
4. **M-RoPE Fix**: Vision embeddings get proper positional encoding
5. **No Performance Impact**: Minimal computational overhead

## Testing Recommendations

### Basic Vision Test
```bash
# image paths are passed inside the prompt
ollama run qwen2.5vl:7b "Describe this image: ./test.jpg"
```

### Multi-Image Test
```bash
ollama run qwen2.5vl:7b "Compare these images: ./img1.jpg ./img2.jpg"
```

### Position Tracking Verification
Add debug logging:
```cpp
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    printf("Token %u: pos=%d, kv_pos=%d\n",
           i, ubatch.pos[i], ubatch.kv_position_of_token[i]);
}
```

### Expected Behavior
- No inference errors with vision inputs
- Coherent image descriptions
- Proper multi-image reasoning
- No position validation assertions

## Models Affected
- **Qwen2.5-VL** (all sizes: 3B, 7B, 32B, 72B)
- **Qwen-VL** (original)
- **Qwen2-VL**
- Any vision-language model using non-consecutive position IDs

## Known Limitations
- Assumes vision embeddings use higher position IDs than text
- M-RoPE calculation depends on correct image dimensions
- Batch size limited by KV cache size (standard limitation)

## Files Modified
```
llama/llama.cpp/src/llama-batch.h
llama/llama.cpp/src/llama-batch.cpp
llama/llama.cpp/src/llama-kv-cache.cpp
llama/llama.cpp/tools/mtmd/mtmd.cpp
llama/llama.cpp/tools/mtmd/mtmd.h
```

## Statistics
- **Files Changed**: 5
- **Insertions**: 111
- **Deletions**: 95
- **Net Change**: +16 lines

## Related Issues
- Fixes vision processing errors in Qwen VL models
- Resolves "position assertion failed" errors
- Improves multi-modal inference quality

## References
- **Upstream Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207
- **Qwen VL Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207#issuecomment-3443868720
- **Related Work**: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/