
Commit 1f1b508

docs: Add comprehensive documentation and patches for 3 PRs

- 0032: Intel GPU Level Zero memory detection (llama.cpp)
- 0033: Vulkan GPU ordering by device ID
- 0034: Qwen2.5 VL causal masking fix (llama.cpp)
- 0035: Intel GPU Level Zero integration (Ollama)

Includes full documentation in English with technical details, application instructions, and testing recommendations. Fixed indentation error in llm/memory.go.

1 parent: e1a3d85

10 files changed: +1289 / -12 lines

`Z_Iosu/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch`: 478 additions & 0 deletions (large diff not rendered)
Three binary files changed (6.3 KB, 33.7 KB, 108 KB; contents not shown).
New documentation file: 121 additions & 0 deletions
# PR #12654: Intel GPU Memory Detection using Level Zero Sysman API

## Overview
This PR adds support for detecting Intel GPU memory using the Level Zero System Management API (Sysman). This enhancement improves VRAM detection accuracy for Intel Arc and Flex GPUs when running with the Vulkan backend.

## Source
- **Upstream PR**: https://github.com/ollama/ollama/pull/12654
- **Applied**: October 25, 2025
- **Branch**: 12_07_mio
- **Commit**: 8a3856f41
## Problem Statement
Intel GPUs were not reporting accurate VRAM information through the Vulkan API alone. The Level Zero Sysman API provides more detailed and accurate memory information for Intel discrete GPUs.
## Changes Made

### 1. Dockerfile
- Added Intel oneAPI Level Zero runtime installation
- Copied Level Zero shared libraries to `/lib/ollama/level_zero/`
- Ensures runtime availability for Level Zero API calls

### 2. ggml CMake Build System
**File**: `ml/backend/ggml/ggml/src/CMakeLists.txt`
- Added `mem_l0_sysman.cpp` to the build sources
- Integrated Level Zero memory detection into the ggml build
### 3. ggml Implementation Header
**File**: `ml/backend/ggml/ggml/src/ggml-impl.h`
- Added Level Zero Sysman API function declarations
- Defined interface for GPU memory querying (see the sketch below):
  - `ggml_l0_sysman_init()` - Initialize Level Zero context
  - `ggml_l0_sysman_get_device_count()` - Get number of Intel GPUs
  - `ggml_l0_sysman_get_total_memory()` - Get total VRAM
  - `ggml_l0_sysman_get_free_memory()` - Get available VRAM
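The exact signatures are not reproduced in this summary; the following is a hypothetical sketch of how such declarations could look in `ggml-impl.h`, assuming integer device indices and byte counts:

```cpp
// Hypothetical sketch of the Level Zero Sysman interface declared in ggml-impl.h.
// The function names come from the list above; the signatures are assumptions,
// not the actual patch contents.
#pragma once
#include <stdbool.h>
#include <stddef.h>

#ifdef __cplusplus
extern "C" {
#endif

// Initialize the Level Zero context; returns false if the runtime is unavailable.
bool ggml_l0_sysman_init(void);

// Number of Intel GPUs visible through Level Zero Sysman.
int ggml_l0_sysman_get_device_count(void);

// Total and currently free VRAM (in bytes) for the given device index.
// Both return false if the device index is invalid or the query fails.
bool ggml_l0_sysman_get_total_memory(int device, size_t * total);
bool ggml_l0_sysman_get_free_memory(int device, size_t * free);

#ifdef __cplusplus
}
#endif
```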
### 4. Level Zero Implementation
**File**: `ml/backend/ggml/ggml/src/mem_l0_sysman.cpp` (NEW, 21 KB)
- Complete implementation of Intel GPU memory detection
- Dynamic library loading for Windows and Linux (see the sketch below)
- Key features:
  - Fallback mechanism if Level Zero is unavailable
  - Multiple device support
  - Memory query caching for performance
  - Thread-safe initialization
  - Comprehensive error handling
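The patch's loader code is not reproduced here; the snippet below is a minimal sketch of the dynamic-loading-with-fallback pattern it describes. The standard Level Zero loader library names and the `zesInit` symbol are real; the wrapper function names are hypothetical.

```cpp
// Minimal sketch of dynamic loading with graceful fallback (not the patch's actual code).
#include <cstdio>
#ifdef _WIN32
#include <windows.h>
static void * l0_dlopen()  { return (void *) LoadLibraryA("ze_loader.dll"); }
static void * l0_dlsym(void * h, const char * s) { return (void *) GetProcAddress((HMODULE) h, s); }
#else
#include <dlfcn.h>
static void * l0_dlopen()  { return dlopen("libze_loader.so.1", RTLD_NOW | RTLD_LOCAL); }
static void * l0_dlsym(void * h, const char * s) { return dlsym(h, s); }
#endif

// Returns true only if the Level Zero loader and the required entry point are present;
// callers fall back to Vulkan-only memory queries when this returns false.
// The library handle is intentionally kept open for later symbol lookups.
static bool l0_runtime_available() {
    void * handle = l0_dlopen();
    if (!handle) {
        fprintf(stderr, "Level Zero loader not found, falling back to Vulkan queries\n");
        return false;
    }
    return l0_dlsym(handle, "zesInit") != nullptr;
}

int main() {
    return l0_runtime_available() ? 0 : 1;
}
```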
### 5. Vulkan Backend Integration
**File**: `ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp`
- Enhanced GPU memory detection in `ggml_vk_init()`
- Prioritizes Level Zero data for Intel GPUs
- Falls back to Vulkan memory queries for non-Intel GPUs or when Level Zero is unavailable (see the sketch below)
- Improves accuracy of VRAM reporting
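How that priority order might look in code is sketched below. This is illustrative only: `is_intel_gpu`, `l0_query_memory`, and `vk_query_memory` are hypothetical stand-ins, not identifiers from the patch, and the stubs exist only so the sketch compiles.

```cpp
// Illustrative sketch of the "Level Zero first, Vulkan fallback" decision described above.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct gpu_mem_info { size_t total = 0; size_t free = 0; };

static bool is_intel_gpu(uint32_t vendor_id) { return vendor_id == 0x8086; } // Intel PCI vendor ID

// Stand-ins for the real Level Zero / Vulkan memory queries.
static bool l0_query_memory(int /*device*/, gpu_mem_info * out) { out->total = 16ull << 30; out->free = 12ull << 30; return true; }
static bool vk_query_memory(int /*device*/, gpu_mem_info * out) { out->total = 16ull << 30; out->free = 0;           return true; }

static bool query_device_memory(int device, uint32_t vendor_id, gpu_mem_info * out) {
    // Prefer Level Zero Sysman on Intel GPUs: it reports dedicated VRAM more accurately
    // than Vulkan heap sizes. Non-Intel devices, or Intel devices without a working
    // Level Zero runtime, use the Vulkan path.
    if (is_intel_gpu(vendor_id) && l0_query_memory(device, out)) {
        return true;
    }
    return vk_query_memory(device, out);
}

int main() {
    gpu_mem_info info;
    query_device_memory(0, 0x8086, &info);
    printf("total=%zu free=%zu\n", info.total, info.free);
}
```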
### 6. Patch Documentation
**File**: `llama/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch`
- Documents changes for the llama.cpp submodule
- Tracks Level Zero integration for future updates
## Technical Details

### Level Zero API Functions Used
- `zeInit()` - Initialize Level Zero driver
- `zeDriverGet()` - Enumerate drivers
- `zesDeviceGet()` - Get device handles
- `zesDeviceEnumMemoryModules()` - Enumerate memory modules
- `zesMemoryGetProperties()` - Get memory properties
- `zesMemoryGetState()` - Get current memory state
### Memory Detection Flow
1. Initialize the Level Zero driver context
2. Enumerate Intel GPU devices
3. For each device:
   - Query memory module properties
   - Get total memory capacity
   - Get current free memory
4. Cache results for subsequent queries
5. Fall back to Vulkan queries if Level Zero fails (a sketch of this sequence follows the list)
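The snippet below is a minimal standalone sketch of that query sequence using public Level Zero Sysman entry points. For brevity it uses the standalone Sysman initialization (`zesInit`) and links the loader directly; the patch instead loads the symbols listed above dynamically and adds caching and thread-safety, which are omitted here.

```cpp
// Standalone sketch of the memory-detection flow via the Level Zero Sysman API.
#include <level_zero/zes_api.h>
#include <cstdio>
#include <vector>

int main() {
    if (zesInit(0) != ZE_RESULT_SUCCESS) {
        printf("Level Zero Sysman unavailable, would fall back to Vulkan queries\n");
        return 1;
    }

    uint32_t n_drivers = 0;
    zesDriverGet(&n_drivers, nullptr);                        // 1. enumerate drivers
    std::vector<zes_driver_handle_t> drivers(n_drivers);
    zesDriverGet(&n_drivers, drivers.data());

    for (auto drv : drivers) {
        uint32_t n_devices = 0;
        zesDeviceGet(drv, &n_devices, nullptr);               // 2. enumerate devices
        std::vector<zes_device_handle_t> devices(n_devices);
        zesDeviceGet(drv, &n_devices, devices.data());

        for (auto dev : devices) {
            uint32_t n_mem = 0;
            zesDeviceEnumMemoryModules(dev, &n_mem, nullptr); // 3. memory modules per device
            std::vector<zes_mem_handle_t> mems(n_mem);
            zesDeviceEnumMemoryModules(dev, &n_mem, mems.data());

            uint64_t total = 0, free_mem = 0;
            for (auto mem : mems) {
                zes_mem_state_t state = {};
                state.stype = ZES_STRUCTURE_TYPE_MEM_STATE;
                if (zesMemoryGetState(mem, &state) == ZE_RESULT_SUCCESS) {
                    total    += state.size;                   // total capacity
                    free_mem += state.free;                   // currently free
                }
            }
            printf("device: total=%llu free=%llu bytes\n",
                   (unsigned long long) total, (unsigned long long) free_mem);
        }
    }
    return 0;
}
```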
## Benefits
- **Accurate VRAM Detection**: More reliable than Vulkan-only detection
- **Better Resource Management**: Ollama can make informed decisions about model loading
- **Intel GPU Support**: Improved support for Arc A-series and Flex GPUs
- **Cross-Platform**: Works on both Windows and Linux
- **Graceful Degradation**: Falls back to Vulkan if Level Zero is unavailable
## Testing Recommendations
1. Test with Intel Arc A770/A750/A380 GPUs
2. Test with Intel Flex 140/170 GPUs
3. Verify VRAM reporting accuracy: `ollama ps` should show the correct memory usage
4. Test multi-GPU scenarios with mixed Intel/NVIDIA/AMD GPUs
5. Verify fallback behavior when the Level Zero libraries are missing
## Dependencies
- Intel Level Zero runtime libraries (Linux: `level-zero`, Windows: bundled)
- Vulkan SDK (existing dependency)
- Compatible Intel GPU driver with Level Zero support

## Known Limitations
- Only detects discrete Intel GPUs (Arc/Flex series)
- Integrated GPUs (UHD/Iris Xe) may have limited Level Zero support
- Requires recent Intel GPU drivers (2023+)
## Files Modified
```
Dockerfile
ml/backend/ggml/ggml/src/CMakeLists.txt
ml/backend/ggml/ggml/src/ggml-impl.h
ml/backend/ggml/ggml/src/mem_l0_sysman.cpp (NEW)
ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp
llama/patches/0032-Add-memory-detection-for-Intel-GPU-using-Level-Zero.patch (NEW)
```

## Statistics
- **Files Changed**: 7
- **Insertions**: 893
- **Deletions**: 9
- **New Files**: 2

## Related Issues
- Improves accuracy of GPU memory detection for the Ollama scheduler
- Complements PR #12665 (GPU ordering) for better multi-GPU support
New documentation file: 253 additions & 0 deletions
# llama.cpp PR #16745: Fix Qwen2.5 VL Cache Causal Masking

## Overview
This PR fixes causal masking issues in Qwen2.5 Vision-Language models by tracking actual KV cache positions instead of assuming consecutive token positions. This resolves inference errors when processing vision embeddings with non-consecutive position IDs.

## Source
- **Upstream PR**: https://github.com/ggml-org/llama.cpp/pull/16745
- **Applied**: October 25, 2025
- **Branch**: 12_07_mio
- **Commit**: e1a3d8557
## Problem Statement
Qwen2.5 VL models use vision embeddings with **non-consecutive position IDs**:
- Text tokens: positions 0, 1, 2, 3, ...
- Vision embeddings: positions 100, 200, 300, ...
- Continuation: positions 4, 5, 6, ...

The old implementation assumed **consecutive positions** for causal masking, causing:
1. Incorrect attention masks for vision tokens
2. Model inference failures
3. Poor generation quality with vision inputs
## Changes Made

### 1. Batch Structure Enhancement
**File**: `llama/llama.cpp/src/llama-batch.h`

**Added to `llama_ubatch` struct**:
```cpp
int32_t * kv_position_of_token; // actual KV cache position for each token
```

**Added to `llama_ubatch::data_t` struct**:
```cpp
std::vector<int32_t> kv_position_of_token; // storage for KV positions
```

**Purpose**: Track the actual KV cache position for each token in the batch, independent of temporal position.
### 2. Batch Initialization
**File**: `llama/llama.cpp/src/llama-batch.cpp`

**Commented out strict position validation** (lines 259-289):
```cpp
// GGML_ASSERT(ubatch.n_tokens > 0);
// GGML_ASSERT(batch->pos[0] >= 0);
// for (int i = 1; i < ubatch.n_tokens; ++i) {
//     GGML_ASSERT(batch->pos[i] == batch->pos[i-1] + 1); // No longer required
// }
```

**Added `kv_position_of_token` initialization** in 3 locations:
1. Standard batch split (line ~175)
2. Equal split mode (line ~230)
3. Batch sequence processing (line ~315)

**Added code**:
```cpp
ubatch.kv_position_of_token = ubatch_data->kv_position_of_token.data();
```

**Rationale**: Vision embeddings can have non-consecutive positions, so the strict validation was too restrictive and had to be removed.
### 3. KV Cache Causal Masking Rewrite
**File**: `llama/llama.cpp/src/llama-kv-cache.cpp`

**Function**: `llama_kv_cache_update_impl()`

**Old behavior**:
- Used the token's temporal position for masking
- Assumed consecutive positions
- Couldn't handle vision embedding position jumps

**New behavior**:
- Builds a `map_kv_to_batch` vector to track actual KV positions
- Updates `ubatch.kv_position_of_token[i]` with the actual cache position
- Uses KV cache positions for causal masking instead of temporal positions

**Key code**:
```cpp
// Build mapping from KV cache position to batch index
std::vector<int32_t> map_kv_to_batch(kv_self.size, -1);
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
        const llama_seq_id seq_id = ubatch.seq_id[i][s];
        // ... find the cache position idx for this token ...
        ubatch.kv_position_of_token[i] = (int32_t) idx; // Store actual position
        map_kv_to_batch[idx] = (int32_t) i;             // Map position to batch index
    }
}

// Causal masking using KV cache positions
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    if (has_mask) {
        int32_t pos_kv_i = ubatch.kv_position_of_token[i];
        for (int32_t s = 0; s < ubatch.n_seq_tokens[i]; ++s) {
            const llama_seq_id seq_id = ubatch.seq_id[i][s];
            for (uint32_t j = 0; j < ubatch.n_tokens; ++j) {
                int32_t pos_kv_j = ubatch.kv_position_of_token[j];
                // Token i may attend to token j only if j's KV position does not exceed i's
                ubatch.mask[i*ubatch.n_tokens + j] = (
                    ubatch_seq_id_cmp(ubatch, j, seq_id) &&
                    pos_kv_j <= pos_kv_i // Causal masking based on KV position
                );
            }
        }
    }
}
```

**Benefits**:
- Handles non-consecutive positions correctly
- Vision embeddings are masked properly
- Preserves causal attention semantics
### 4. M-RoPE Position Calculation
**File**: `llama/llama.cpp/tools/mtmd/mtmd.cpp`

**Function**: `llama_mtmd_input_text_template::get_position()`

**Changed** (line 113):
```cpp
// Old: return 1; // Always returned 1 for images
// New:
return std::max(nx, ny); // Return max(width, height) for proper image dimensions
```

**Rationale**: Qwen VL uses image dimensions for RoPE position calculation. Returning 1 broke positional encoding for vision embeddings.
### 5. Documentation Update
**File**: `llama/llama.cpp/tools/mtmd/mtmd.h`

**Updated comment** (line 112):
```cpp
// Old comment: return temporal position (usually 1 for images)
// New comment:
// return temporal position for embeddings
// Note: Qwen VL models expect max(image_width, image_height) here
// to properly calculate M-RoPE positions for vision embeddings
```
## Technical Details

### Position Tracking Flow
1. **Batch Creation**: Initialize the `kv_position_of_token` array
2. **KV Cache Update**:
   - Find the actual cache position for each token
   - Store it in `ubatch.kv_position_of_token[i]`
3. **Masking**:
   - Use `kv_position_of_token` for causal checks
   - Token i can attend to token j if `pos_kv_j <= pos_kv_i`
### Example: Vision Processing
**Input sequence**:
```
Token 0: "Describe"      -> pos=0,   kv_pos=0
Token 1: "this"          -> pos=1,   kv_pos=1
Token 2: <vision_emb_0>  -> pos=100, kv_pos=2   // Non-consecutive!
Token 3: <vision_emb_1>  -> pos=101, kv_pos=3
Token 4: "image"         -> pos=2,   kv_pos=4   // Position resets
```

**Causal mask** (based on `kv_position_of_token`):
```
      0 1 2 3 4
0  [ T F F F F ]   Token 0 sees only itself
1  [ T T F F F ]   Token 1 sees 0, 1
2  [ T T T F F ]   Vision 0 sees 0, 1, itself
3  [ T T T T F ]   Vision 1 sees 0, 1, 2, itself
4  [ T T T T T ]   Token 4 sees all previous tokens
```

Without this fix, vision tokens would have incorrect masks based on pos=100, 101. A small self-contained sketch that reproduces this mask appears below.
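The following minimal, standalone C++ sketch (not code from the patch) builds the mask for this 5-token example directly from the KV positions and reproduces the table above:

```cpp
// Standalone sketch: build a causal mask from KV cache positions rather than
// temporal positions. The token and position values mirror the example above;
// nothing here is taken from llama.cpp itself.
#include <cstdio>
#include <vector>

int main() {
    // Temporal positions (non-consecutive for the two vision embeddings).
    std::vector<int> pos    = {0, 1, 100, 101, 2};
    // Actual KV cache positions: simply the order tokens were inserted.
    std::vector<int> kv_pos = {0, 1, 2, 3, 4};
    const int n = (int) pos.size();

    std::vector<bool> mask(n * n, false);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            // Token i may attend to token j if j entered the cache no later than i.
            mask[i * n + j] = kv_pos[j] <= kv_pos[i];
        }
    }

    for (int i = 0; i < n; ++i) {
        printf("%d [ ", i);
        for (int j = 0; j < n; ++j) {
            printf("%c ", mask[i * n + j] ? 'T' : 'F');
        }
        printf("]\n");
    }
    return 0;
}
```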
### M-RoPE Position Fix
**Qwen VL M-RoPE** uses 3D positional encoding:
- **Temporal dimension**: Token sequence position
- **Height dimension**: For vision, the image height
- **Width dimension**: For vision, the image width

**Old code**: `return 1` made all vision embeddings have position 1

**New code**: `return max(nx, ny)` uses the actual image dimensions

**Result**: Correct RoPE frequencies for vision embeddings
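As an illustrative example of why the return value matters (this is an interpretation of the change, not code from `mtmd.cpp`): if `nx` and `ny` are assumed to be the image grid dimensions in tokens, the returned value determines where the text that follows the image resumes.

```cpp
// Illustrative only: how the returned value could affect the temporal position of the
// token that follows an image. Assumes nx/ny are the image grid dimensions in tokens.
#include <algorithm>
#include <cstdio>

int get_position_old(int /*nx*/, int /*ny*/) { return 1; }         // old behavior
int get_position_new(int nx, int ny) { return std::max(nx, ny); }  // new behavior

int main() {
    const int nx = 24, ny = 17;   // hypothetical 24x17 token grid for one image
    const int img_start = 5;      // temporal position where the image begins

    // With the old code, text after the image resumed at 6 regardless of image size;
    // with the fix it resumes after the larger grid dimension, as the documentation
    // above says Qwen VL expects.
    printf("next text position (old): %d\n", img_start + get_position_old(nx, ny)); // 6
    printf("next text position (new): %d\n", img_start + get_position_new(nx, ny)); // 29
    return 0;
}
```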
## Benefits
1. **Correct Vision Processing**: Qwen VL models work properly
2. **Flexible Position IDs**: Supports non-consecutive positions
3. **Maintains Causality**: Attention masking remains correct
4. **M-RoPE Fix**: Vision embeddings get proper positional encoding
5. **No Performance Impact**: Minimal computational overhead
## Testing Recommendations

### Basic Vision Test
```bash
ollama run qwen2.5-vl:7b "Describe this image" --image test.jpg
```

### Multi-Image Test
```bash
ollama run qwen2.5-vl:7b "Compare these images" --image img1.jpg --image img2.jpg
```

### Position Tracking Verification
Add debug logging:
```cpp
for (uint32_t i = 0; i < ubatch.n_tokens; ++i) {
    printf("Token %d: pos=%d, kv_pos=%d\n",
           i, ubatch.pos[i], ubatch.kv_position_of_token[i]);
}
```

### Expected Behavior
- No inference errors with vision inputs
- Coherent image descriptions
- Proper multi-image reasoning
- No position validation assertion failures
## Models Affected
- **Qwen2.5-VL** (all sizes: 3B, 7B, 32B, 72B)
- **Qwen-VL** (original)
- **Qwen2-VL**
- Any vision-language model using non-consecutive position IDs

## Known Limitations
- Assumes vision embeddings use higher position IDs than text
- M-RoPE calculation depends on correct image dimensions
- Batch size limited by KV cache size (standard limitation)
## Files Modified
```
llama/llama.cpp/src/llama-batch.h
llama/llama.cpp/src/llama-batch.cpp
llama/llama.cpp/src/llama-kv-cache.cpp
llama/llama.cpp/tools/mtmd/mtmd.cpp
llama/llama.cpp/tools/mtmd/mtmd.h
```

## Statistics
- **Files Changed**: 5
- **Insertions**: 111
- **Deletions**: 95
- **Net Change**: +16 lines

## Related Issues
- Fixes vision processing errors in Qwen VL models
- Resolves "position assertion failed" errors
- Improves multi-modal inference quality

## References
- **Upstream Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207
- **Qwen VL Discussion**: https://github.com/ggml-org/llama.cpp/issues/16207#issuecomment-3443868720
- **Related Work**: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
