
Commit 1f1b508

docs: Add comprehensive documentation and patches for 3 PRs

- 0032: Intel GPU Level Zero memory detection (llama.cpp)
- 0033: Vulkan GPU ordering by device ID
- 0034: Qwen2.5 VL causal masking fix (llama.cpp)
- 0035: Intel GPU Level Zero integration (Ollama)

Includes full documentation in English with technical details, application instructions, and testing recommendations. Fixed indentation error in llm/memory.go.

1 parent: e1a3d85

10 files changed: +1289 / -12 lines

`Z_Iosu/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch`: 478 additions & 0 deletions (large diff not rendered)
Three binary files changed (6.3 KB, 33.7 KB, 108 KB; contents not shown).
New documentation file: 121 additions & 0 deletions
# PR #12654: Intel GPU Memory Detection using Level Zero Sysman API

## Overview
This PR adds support for detecting Intel GPU memory using the Level Zero System Management API (Sysman). This enhancement improves VRAM detection accuracy for Intel Arc and Flex GPUs when running with the Vulkan backend.

## Source
- **Upstream PR**: https://github.com/ollama/ollama/pull/12654
- **Applied**: October 25, 2025
- **Branch**: 12_07_mio
- **Commit**: 8a3856f41
## Problem Statement
Intel GPUs were not reporting accurate VRAM information through the Vulkan API alone. The Level Zero Sysman API provides more detailed and accurate memory information for Intel discrete GPUs.
## Changes Made

### 1. Dockerfile
- Added Intel oneAPI Level Zero runtime installation
- Copied Level Zero shared libraries to `/lib/ollama/level_zero/`
- Ensures runtime availability for Level Zero API calls

### 2. ggml CMake Build System
**File**: `ml/backend/ggml/ggml/src/CMakeLists.txt`
- Added `mem_l0_sysman.cpp` to the build sources
- Integrated Level Zero memory detection into the ggml build
### 3. ggml Implementation Header
**File**: `ml/backend/ggml/ggml/src/ggml-impl.h`
- Added Level Zero Sysman API function declarations
- Defined interface for GPU memory querying (see the sketch below):
  - `ggml_l0_sysman_init()` - Initialize Level Zero context
  - `ggml_l0_sysman_get_device_count()` - Get number of Intel GPUs
  - `ggml_l0_sysman_get_total_memory()` - Get total VRAM
  - `ggml_l0_sysman_get_free_memory()` - Get available VRAM
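The exact signatures are not reproduced in this summary; the following is a hypothetical sketch of how such declarations could look in `ggml-impl.h`, assuming integer device indices and byte counts:

```cpp
// Hypothetical sketch of the Level Zero Sysman interface declared in ggml-impl.h.
// The function names come from the list above; the signatures are assumptions,
// not the actual patch contents.
#pragma once
#include <stdbool.h>
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

// Initialize the Level Zero context; returns false if the runtime is unavailable.
bool ggml_l0_sysman_init(void);

// Number of Intel GPUs visible through Level Zero Sysman.
int ggml_l0_sysman_get_device_count(void);

// Total and currently free VRAM (in bytes) for the given device index.
// Both return false if the device index is invalid or the query fails.
bool ggml_l0_sysman_get_total_memory(int device, size_t * total);
bool ggml_l0_sysman_get_free_memory(int device, size_t * free);

#ifdef __cplusplus
}
#endif
```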
### 4. Level Zero Implementation
**File**: `ml/backend/ggml/ggml/src/mem_l0_sysman.cpp` (NEW, 21 KB)
- Complete implementation of Intel GPU memory detection
- Dynamic library loading for Windows and Linux (see the sketch below)
- Key features:
  - Fallback mechanism if Level Zero is unavailable
  - Multiple device support
  - Memory query caching for performance
  - Thread-safe initialization
  - Comprehensive error handling
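The patch's loader code is not reproduced here; the snippet below is a minimal sketch of the dynamic-loading-with-fallback pattern it describes. The standard Level Zero loader library names and the `zesInit` symbol are real; the wrapper function names are hypothetical.

```cpp
// Minimal sketch of dynamic loading with graceful fallback (not the patch's actual code).
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
static void * l0_dlopen()  { return (void *) LoadLibraryA("ze_loader.dll"); }
static void * l0_dlsym(void * h, const char * s) { return (void *) GetProcAddress((HMODULE) h, s); }
#else
#include <dlfcn.h>
static void * l0_dlopen()  { return dlopen("libze_loader.so.1", RTLD_NOW | RTLD_LOCAL); }
static void * l0_dlsym(void * h, const char * s) { return dlsym(h, s); }
#endif

// Returns true only if the Level Zero loader and the required entry point are present;
// callers fall back to Vulkan-only memory queries when this returns false.
// The library handle is intentionally kept open for later symbol lookups.
static bool l0_runtime_available() {
    void * handle = l0_dlopen();
    if (!handle) {
        fprintf(stderr, "Level Zero loader not found, falling back to Vulkan queries\n");
        return false;
    }
    return l0_dlsym(handle, "zesInit") != nullptr;
}

int main() {
    return l0_runtime_available() ? 0 : 1;
}
```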
### 5. Vulkan Backend Integration
**File**: `ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp`
- Enhanced GPU memory detection in `ggml_vk_init()`
- Prioritizes Level Zero data for Intel GPUs
- Falls back to Vulkan memory queries for non-Intel GPUs or when Level Zero is unavailable (see the sketch below)
- Improves accuracy of VRAM reporting
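How that priority order might look in code is sketched below. This is illustrative only: `is_intel_gpu`, `l0_query_memory`, and `vk_query_memory` are hypothetical stand-ins, not identifiers from the patch, and the stubs exist only so the sketch compiles.

```cpp
// Illustrative sketch of the "Level Zero first, Vulkan fallback" decision described above.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct gpu_mem_info { size_t total = 0; size_t free = 0; };

static bool is_intel_gpu(uint32_t vendor_id) { return vendor_id == 0x8086; } // Intel PCI vendor ID

// Stand-ins for the real Level Zero / Vulkan memory queries.
static bool l0_query_memory(int /*device*/, gpu_mem_info * out) { out->total = 16ull << 30; out->free = 12ull << 30; return true; }
static bool vk_query_memory(int /*device*/, gpu_mem_info * out) { out->total = 16ull << 30; out->free = 0;           return true; }

static bool query_device_memory(int device, uint32_t vendor_id, gpu_mem_info * out) {
    // Prefer Level Zero Sysman on Intel GPUs: it reports dedicated VRAM more accurately
    // than Vulkan heap sizes. Non-Intel devices, or Intel devices without a working
    // Level Zero runtime, use the Vulkan path.
    if (is_intel_gpu(vendor_id) && l0_query_memory(device, out)) {
        return true;
    }
    return vk_query_memory(device, out);
}

int main() {
    gpu_mem_info info;
    query_device_memory(0, 0x8086, &info);
    printf("total=%zu free=%zu\n", info.total, info.free);
}
```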
### 6. Patch Documentation
**File**: `llama/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch`
- Documents changes for the llama.cpp submodule
- Tracks Level Zero integration for future updates
## Technical Details

### Level Zero API Functions Used
- `zeInit()` - Initialize Level Zero driver
- `zeDriverGet()` - Enumerate drivers
- `zesDeviceGet()` - Get device handles
- `zesDeviceEnumMemoryModules()` - Enumerate memory modules
- `zesMemoryGetProperties()` - Get memory properties
- `zesMemoryGetState()` - Get current memory state
### Memory Detection Flow
1. Initialize the Level Zero driver context
2. Enumerate Intel GPU devices
3. For each device:
   - Query memory module properties
   - Get total memory capacity
   - Get current free memory
4. Cache results for subsequent queries
5. Fall back to Vulkan queries if Level Zero fails (a sketch of this sequence follows the list)
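The snippet below is a minimal standalone sketch of that query sequence using public Level Zero Sysman entry points. For brevity it uses the standalone Sysman initialization (`zesInit`) and links the loader directly; the patch instead loads the symbols listed above dynamically and adds caching and thread-safety, which are omitted here.

```cpp
// Standalone sketch of the memory-detection flow via the Level Zero Sysman API.
#include <level_zero/zes_api.h>
#include <cstdio>
#include <vector>

int main() {
    if (zesInit(0) != ZE_RESULT_SUCCESS) {
        printf("Level Zero Sysman unavailable, would fall back to Vulkan queries\n");
        return 1;
    }

    uint32_t n_drivers = 0;
    zesDriverGet(&n_drivers, nullptr);                        // 1. enumerate drivers
    std::vector<zes_driver_handle_t> drivers(n_drivers);
    zesDriverGet(&n_drivers, drivers.data());

    for (auto drv : drivers) {
        uint32_t n_devices = 0;
        zesDeviceGet(drv, &n_devices, nullptr);               // 2. enumerate devices
        std::vector<zes_device_handle_t> devices(n_devices);
        zesDeviceGet(drv, &n_devices, devices.data());

        for (auto dev : devices) {
            uint32_t n_mem = 0;
            zesDeviceEnumMemoryModules(dev, &n_mem, nullptr); // 3. memory modules per device
            std::vector<zes_mem_handle_t> mems(n_mem);
            zesDeviceEnumMemoryModules(dev, &n_mem, mems.data());

            uint64_t total = 0, free_mem = 0;
            for (auto mem : mems) {
                zes_mem_state_t state = {};
                state.stype = ZES_STRUCTURE_TYPE_MEM_STATE;
                if (zesMemoryGetState(mem, &state) == ZE_RESULT_SUCCESS) {
                    total    += state.size;                   // total capacity
                    free_mem += state.free;                   // currently free
                }
            }
            printf("device: total=%llu free=%llu bytes\n",
                   (unsigned long long) total, (unsigned long long) free_mem);
        }
    }
    return 0;
}
```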
## Benefits
- **Accurate VRAM Detection**: More reliable than Vulkan-only detection
- **Better Resource Management**: Ollama can make informed decisions about model loading
- **Intel GPU Support**: Improved support for Arc A-series and Flex GPUs
- **Cross-Platform**: Works on both Windows and Linux
- **Graceful Degradation**: Falls back to Vulkan if Level Zero is unavailable
## Testing Recommendations
1. Test with Intel Arc A770/A750/A380 GPUs
2. Test with Intel Flex 140/170 GPUs
3. Verify VRAM reporting accuracy: `ollama ps` should show the correct memory usage
4. Test multi-GPU scenarios with mixed Intel/NVIDIA/AMD GPUs
5. Verify fallback behavior when the Level Zero libraries are missing
## Dependencies
- Intel Level Zero runtime libraries (Linux: `level-zero`, Windows: bundled)
- Vulkan SDK (existing dependency)
- Compatible Intel GPU driver with Level Zero support

## Known Limitations
- Only detects discrete Intel GPUs (Arc/Flex series)
- Integrated GPUs (UHD/Iris Xe) may have limited Level Zero support
- Requires recent Intel GPU drivers (2023+)
## Files Modified
```
Dockerfile
ml/backend/ggml/ggml/src/CMakeLists.txt
ml/backend/ggml/ggml/src/ggml-impl.h
ml/backend/ggml/ggml/src/mem_l0_sysman.cpp (NEW)
ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp
llama/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch (NEW)
```

## Statistics
- **Files Changed**: 7
- **Insertions**: 893
- **Deletions**: 9
- **New Files**: 2

## Related Issues
- Improves accuracy of GPU memory detection for the Ollama scheduler
- Complements PR #12665 (GPU ordering) for better multi-GPU support
New documentation file: 253 additions & 0 deletions
# llama.cpp PR #16745: Fix Qwen2.5 VL Cache Causal Masking

## Overview
This PR fixes causal masking issues in Qwen2.5 Vision-Language models by tracking actual KV cache positions instead of assuming consecutive token positions. This resolves inference errors when processing vision embeddings with non-consecutive position IDs.

## Source
- **Upstream PR**: https://github.com/ggml-org/llama.cpp/pull/16745
- **Applied**: October 25, 2025
- **Branch**: 12_07_mio
- **Commit**: e1a3d8557
## Problem Statement
Qwen2.5 VL models use vision embeddings with **non-consecutive position IDs**:
- Text tokens: positions 0, 1, 2, 3, ...
- Vision embeddings: positions 100, 200, 300, ...
- Continuation: positions 4, 5, 6, ...

The old implementation assumed **consecutive positions** for causal masking, causing:
1. Incorrect attention masks for vision tokens
2. Model inference failures
3. Poor generation quality with vision inputs
## Changes Made

### 1. Batch Structure Enhancement
**File**: `llama/llama.cpp/src/llama-batch.h`

**Added to `llama_ubatch` struct**:
```cpp
int32_t * kv_position_of_token; // actual KV cache position for each token
```

**Added to `llama_ubatch::data_t` struct**:
```cpp
std::vector<int32_t> kv_position_of_token; // storage for KV positions
```

**Purpose**: Track the actual KV cache position for each token in the batch, independent of temporal position.
### 2. Batch Initialization
**File**: `llama/llama.cpp/src/llama-batch.cpp`

**Commented out strict position validation** (lines 259-289):
```cpp
// GGML_ASSERT(ubatch.n_tokens > 0);
// GGML_ASSERT(batch->pos[0] >= 0);
// for (int i = 1; i < ubatch.n_tokens; ++i) {
//     GGML_ASSERT(batch->pos[i] == batch->pos[i-1] + 1); // No longer required
// }
```

**Added `kv_position_of_token` initialization** in 3 locations:
1. Standard batch split (line ~175)
2. Equal split mode (line ~230)
3. Batch sequence processing (line ~315)

**Added code**:
```cpp
ubatch.kv_position_of_token = ubatch_data->kv_position_of_token.data();
```

**Rationale**: Vision embeddings can have non-consecutive positions, so the strict validation was too restrictive and had to be removed.
### 3. KV Cache Causal Masking Rewrite
**File**: `llama/llama.cpp/src/llama-kv-cache.cpp`

**Function**: `llama_kv_cache_update_impl()`

**Old behavior**:
- Used the token's temporal position for masking
- Assumed consecutive positions
- Couldn't handle vision embedding position jumps

**New behavior**:
- Builds a `map_kv_to_batch` vector to track actual KV positions
- Updates `ubatch.kv_position_of_token[i]` with the actual cache position
- Uses KV cache positions for causal masking instead of temporal positions

**Key code**:
```cpp
// Build mapping from KV cache position to batch index
std::vector<int32_t> map_kv_to_batch(kv_self.size, -1);
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
        const llama_seq_id seq_id = ubatch.seq_id[i][s];
        // ... find the cache position idx for this token ...
        ubatch.kv_position_of_token[i] = (int32_t) idx; // Store actual position
        map_kv_to_batch[idx] = (int32_t) i;             // Map position to batch index
    }
}

// Causal masking using KV cache positions
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    if (has_mask) {
        int32_t pos_kv_i = ubatch.kv_position_of_token[i];
        for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
            const llama_seq_id seq_id = ubatch.seq_id[i][s];
            for (uint32_t j = 0; j < ubatch.n_tokens; ++j) {
                int32_t pos_kv_j = ubatch.kv_position_of_token[j];
                // Token i may attend to token j only if j's KV position does not exceed i's
                ubatch.mask[i*ubatch.n_tokens + j] = (
                    ubatch_seq_id_cmp(ubatch, j, seq_id) &&
                    pos_kv_j <= pos_kv_i // Causal masking based on KV position
                );
            }
        }
    }
}
```

**Benefits**:
- Handles non-consecutive positions correctly
- Vision embeddings are masked properly
- Preserves causal attention semantics
### 4. M-RoPE Position Calculation
**File**: `llama/llama.cpp/tools/mtmd/mtmd.cpp`

**Function**: `llama_mtmd_input_text_template::get_position()`

**Changed** (line 113):
```cpp
// Old: return 1; // Always returned 1 for images
// New:
return std::max(nx, ny); // Return max(width, height) for proper image dimensions
```

**Rationale**: Qwen VL uses image dimensions for RoPE position calculation. Returning 1 broke positional encoding for vision embeddings.
### 5. Documentation Update
**File**: `llama/llama.cpp/tools/mtmd/mtmd.h`

**Updated comment** (line 112):
```cpp
// Old comment: return temporal position (usually 1 for images)
// New comment:
// return temporal position for embeddings
// Note: Qwen VL models expect max(image_width, image_height) here
// to properly calculate M-RoPE positions for vision embeddings
```
## Technical Details

### Position Tracking Flow
1. **Batch Creation**: Initialize the `kv_position_of_token` array
2. **KV Cache Update**:
   - Find the actual cache position for each token
   - Store it in `ubatch.kv_position_of_token[i]`
3. **Masking**:
   - Use `kv_position_of_token` for causal checks
   - Token i can attend to token j if `pos_kv_j <= pos_kv_i`
### Example: Vision Processing
**Input sequence**:
```
Token 0: "Describe"      -> pos=0,   kv_pos=0
Token 1: "this"          -> pos=1,   kv_pos=1
Token 2: <vision_emb_0>  -> pos=100, kv_pos=2   // Non-consecutive!
Token 3: <vision_emb_1>  -> pos=101, kv_pos=3
Token 4: "image"         -> pos=2,   kv_pos=4   // Position resets
```

**Causal mask** (based on `kv_position_of_token`):
```
      0 1 2 3 4
0  [ T F F F F ]   Token 0 sees only itself
1  [ T T F F F ]   Token 1 sees 0, 1
2  [ T T T F F ]   Vision 0 sees 0, 1, itself
3  [ T T T T F ]   Vision 1 sees 0, 1, 2, itself
4  [ T T T T T ]   Token 4 sees all previous tokens
```

Without this fix, vision tokens would have incorrect masks based on pos=100, 101. A small self-contained sketch that reproduces this mask appears below.
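The following minimal, standalone C++ sketch (not code from the patch) builds the mask for this 5-token example directly from the KV positions and reproduces the table above:

```cpp
// Standalone sketch: build a causal mask from KV cache positions rather than
// temporal positions. The token and position values mirror the example above;
// nothing here is taken from llama.cpp itself.
#include <cstdio>
#include <vector>

int main() {
    // Temporal positions (non-consecutive for the two vision embeddings).
    std::vector<int> pos    = {0, 1, 100, 101, 2};
    // Actual KV cache positions: simply the order tokens were inserted.
    std::vector<int> kv_pos = {0, 1, 2, 3, 4};
    const int n = (int) pos.size();

    std::vector<bool> mask(n * n, false);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // Token i may attend to token j if j entered the cache no later than i.
            mask[i * n + j] = kv_pos[j] <= kv_pos[i];
        }
    }

    for (int i = 0; i < n; ++i) {
        printf("%d [ ", i);
        for (int j = 0; j < n; ++j) {
            printf("%c ", mask[i * n + j] ? 'T' : 'F');
        }
        printf("]\n");
    }
    return 0;
}
```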
### M-RoPE Position Fix
**Qwen VL M-RoPE** uses 3D positional encoding:
- **Temporal dimension**: Token sequence position
- **Height dimension**: For vision, the image height
- **Width dimension**: For vision, the image width

**Old code**: `return 1` made all vision embeddings have position 1

**New code**: `return max(nx, ny)` uses the actual image dimensions

**Result**: Correct RoPE frequencies for vision embeddings
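As an illustrative example of why the return value matters (this is an interpretation of the change, not code from `mtmd.cpp`): if `nx` and `ny` are assumed to be the image grid dimensions in tokens, the returned value determines where the text that follows the image resumes.

```cpp
// Illustrative only: how the returned value could affect the temporal position of the
// token that follows an image. Assumes nx/ny are the image grid dimensions in tokens.
#include <algorithm>
#include <cstdio>

int get_position_old(int /*nx*/, int /*ny*/) { return 1; }         // old behavior
int get_position_new(int nx, int ny) { return std::max(nx, ny); }  // new behavior

int main() {
    const int nx = 24, ny = 17;   // hypothetical 24x17 token grid for one image
    const int img_start = 5;      // temporal position where the image begins

    // With the old code, text after the image resumed at 6 regardless of image size;
    // with the fix it resumes after the larger grid dimension, as the documentation
    // above says Qwen VL expects.
    printf("next text position (old): %d\n", img_start + get_position_old(nx, ny)); // 6
    printf("next text position (new): %d\n", img_start + get_position_new(nx, ny)); // 29
    return 0;
}
```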
## Benefits
1. **Correct Vision Processing**: Qwen VL models work properly
2. **Flexible Position IDs**: Supports non-consecutive positions
3. **Maintains Causality**: Attention masking remains correct
4. **M-RoPE Fix**: Vision embeddings get proper positional encoding
5. **No Performance Impact**: Minimal computational overhead
## Testing Recommendations

### Basic Vision Test
```bash
ollama run qwen2.5-vl:7b "Describe this image" --image test.jpg
```

### Multi-Image Test
```bash
ollama run qwen2.5-vl:7b "Compare these images" --image img1.jpg --image img2.jpg
```

### Position Tracking Verification
Add debug logging:
```cpp
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    printf("Token %d: pos=%d, kv_pos=%d\n",
           i, ubatch.pos[i], ubatch.kv_position_of_token[i]);
}
```

### Expected Behavior
- No inference errors with vision inputs
- Coherent image descriptions
- Proper multi-image reasoning
- No position validation assertion failures
## Models Affected
- **Qwen2.5-VL** (all sizes: 3B, 7B, 32B, 72B)
- **Qwen-VL** (original)
- **Qwen2-VL**
- Any vision-language model using non-consecutive position IDs

## Known Limitations
- Assumes vision embeddings use higher position IDs than text
- M-RoPE calculation depends on correct image dimensions
- Batch size limited by KV cache size (standard limitation)
## Files Modified
```
llama/llama.cpp/src/llama-batch.h
llama/llama.cpp/src/llama-batch.cpp
llama/llama.cpp/src/llama-kv-cache.cpp
llama/llama.cpp/tools/mtmd/mtmd.cpp
llama/llama.cpp/tools/mtmd/mtmd.h
```

## Statistics
- **Files Changed**: 5
- **Insertions**: 111
- **Deletions**: 95
- **Net Change**: +16 lines

## Related Issues
- Fixes vision processing errors in Qwen VL models
- Resolves "position assertion failed" errors
- Improves multi-modal inference quality

## References
- **Upstream Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207
- **Qwen VL Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207#issuecomment-3443868720
- **Related Work**: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
