Add Qwen3-VL and Qwen3-VL-MoE multimodal model support

This commit introduces comprehensive support for Qwen3-VL vision-language
models, covering both the dense variant and the Mixture-of-Experts (MoE)
architecture with DeepStack fusion.

## Overview

Qwen3-VL is Alibaba's family of advanced multimodal models capable of
understanding and reasoning about images alongside text. This implementation
enables running these models for vision-language tasks such as image
understanding, optical character recognition (OCR), visual question
answering, and document analysis.

## Architecture Implementation

### Core Architecture (llama-arch.cpp/h)
- **LLM_ARCH_QWEN3_VL**: Dense vision-language model architecture
- **LLM_ARCH_QWEN3_VL_MOE**: Mixture-of-Experts variant with expert routing
- Complete tensor mapping registration for both architectures
- Architecture-specific parameter handling and validation (see the sketch below)
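
The registration itself follows the existing llama-arch.cpp pattern. A heavily
abbreviated, illustrative sketch (the enum and table entries are simplified,
and the architecture name strings shown here are assumptions):

```
// Sketch only: mirrors the registration pattern in llama-arch.cpp.
#include <map>

enum llm_arch {
    // ... existing architectures ...
    LLM_ARCH_QWEN3_VL,      // dense vision-language variant
    LLM_ARCH_QWEN3_VL_MOE,  // Mixture-of-Experts variant
};

// Name table consulted when reading the GGUF "general.architecture" key.
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES_SKETCH = {
    { LLM_ARCH_QWEN3_VL,     "qwen3vl"    },
    { LLM_ARCH_QWEN3_VL_MOE, "qwen3vlmoe" },
};
```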

### Model Loading (llama-model.cpp)

**Hyperparameter Loading**
- QWEN3_VL: Standard dense model configuration
  * Uses the full n_embd dimension throughout
  * 36 layers for the 4B parameter variant
- QWEN3_VL_MOE: Expert-based configuration (sketched below)
  * 4x n_embd expansion (n_embd/4 per channel × 4 channels)
  * 48 layers (30B-A3B) or 94 layers (235B-A22B)
  * Expert feed-forward network dimensions
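
A condensed sketch of the per-architecture branching (the struct and field
names below are stand-ins for illustration, not the actual llama-model.cpp
symbols):

```
// Illustrative only: how the two variants differ at hyperparameter time.
struct vl_hparams_sketch {
    int n_layer        = 0;
    int n_embd         = 0;  // as stored in the GGUF metadata
    int n_deepstack_ch = 1;  // DeepStack channels (MoE variant only)
};

static void set_qwen3_vl_hparams(bool is_moe, vl_hparams_sketch & hp) {
    if (!is_moe) {
        hp.n_layer        = 36; // Qwen3-VL 4B dense
        hp.n_deepstack_ch = 1;  // full n_embd used directly
    } else {
        hp.n_layer        = 48; // 30B-A3B (94 for 235B-A22B)
        hp.n_deepstack_ch = 4;  // n_embd/4 per channel x 4 channels
    }
}
```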

**Multi-axis Rotary Position Embedding (M-RoPE)**
- Configured rope_sections = [24, 20, 20, 0]
  * Temporal dimension: 24 dims
  * Height dimension: 20 dims
  * Width dimension: 20 dims
  * Unused dimension: 0
- Enables spatial awareness for image patch processing (see the sketch below)
- Added debug logging to verify the M-RoPE configuration
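
To make the section layout concrete, the helper below maps a rotary dimension
index to its M-RoPE axis for rope_sections = [24, 20, 20, 0]. This is
illustrative arithmetic only, not the rope kernel used by the implementation:

```
#include <array>

// Which M-RoPE axis does a given rotary dimension belong to?
// sections = {temporal, height, width, unused} = {24, 20, 20, 0}.
enum mrope_axis { AXIS_TEMPORAL, AXIS_HEIGHT, AXIS_WIDTH, AXIS_UNUSED };

static mrope_axis mrope_axis_for_dim(int dim, const std::array<int, 4> & sections) {
    int limit = 0;
    for (int s = 0; s < 4; ++s) {
        limit += sections[s];
        if (dim < limit) {
            return static_cast<mrope_axis>(s);
        }
    }
    return AXIS_UNUSED;
}

// With [24, 20, 20, 0]: dims 0-23 rotate with the temporal position,
// dims 24-43 with the image row index, and dims 44-63 with the column index.
```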

**Tensor Initialization**
- QWEN3_VL follows the QWEN3 dense structure
  * Token embeddings, output projection
  * Per-layer: attention (Q/K/V/O), normalization, FFN
- QWEN3_VL_MOE adds expert-specific tensors (outlined below)
  * Expert gate networks for routing
  * Per-expert FFN weights (gate, down, up)
  * Shared and expert-specific parameters
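
For orientation, the per-layer tensor set for the MoE variant looks roughly
like the sketch below (descriptive field names, not the literal GGUF tensor
names; per-expert FFN weights are stored as 3D tensors indexed by expert):

```
// Sketch of what one MoE layer carries; `tensor` stands in for ggml_tensor.
struct tensor;

struct qwen3_vl_moe_layer_sketch {
    // attention
    tensor * wq, * wk, * wv, * wo;
    tensor * attn_norm, * attn_q_norm, * attn_k_norm;  // incl. Q/K normalization
    // expert routing
    tensor * ffn_gate_inp;                             // router: token -> expert scores
    // per-expert FFN weights, e.g. [n_embd, n_ff_exp, n_expert]
    tensor * ffn_gate_exps, * ffn_up_exps, * ffn_down_exps;
    tensor * ffn_norm;
};
```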

### Graph Building (llama-graph.cpp/h)

**DeepStack Architecture for MoE**
The Qwen3-VL-MoE variant implements a novel DeepStack fusion mechanism
(a sketch follows the list):

1. **Channel Splitting**: Vision embeddings are split into 3 processing channels
   - ds0, ds1, ds2 (DeepStack channels 0, 1, 2)
   - Each channel: n_embd/4 dimensions

2. **Per-layer Processing**: Independent expert selection per channel
   - Token-level expert routing
   - Gated mixture-of-experts computation
   - Q/K normalization before attention

3. **Fusion Layers**: Learned merging at early transformer layers
   - Fusion occurs at layers 0, 1, and 2
   - The DeepStack merger combines information across channels
   - Only active when vision embeddings are present (text-only safe)
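
A rough sketch of the split-and-fuse idea using ggml views (illustrative only,
not the literal llama-graph.cpp code; the channel layout, the helper name, and
the fusion-by-addition shown in the comment are assumptions made for the
example):

```
#include "ggml.h"

// inp_vision: [n_embd, n_vis_tokens], laid out as [main | ds0 | ds1 | ds2]
// along dim 0, where each block is n_embd_channel = n_embd/4 wide.
static struct ggml_tensor * deepstack_channel(
        struct ggml_context * ctx,
        struct ggml_tensor  * inp_vision,
        int                   channel,          // 0 = main, 1..3 = ds0..ds2
        int64_t               n_embd_channel) {
    return ggml_view_2d(ctx, inp_vision,
            n_embd_channel, inp_vision->ne[1],
            inp_vision->nb[1],
            channel * n_embd_channel * ggml_element_size(inp_vision));
}

// During graph build, fusion only happens at the early layers, and only when
// vision embeddings are present (text-only prompts skip this branch):
//
//   if (inp_vision != nullptr && il < 3) {
//       cur = ggml_add(ctx, cur,
//               deepstack_channel(ctx, inp_vision, il + 1, n_embd_channel));
//   }
```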

**Batch Processing**
- Enhanced position array handling for M-RoPE's multi-dimensional positions (see the sketch below)
- Proper ubatch preparation distinguishing vision vs. text tokens
- Conditional graph construction based on modality
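
As an illustration of the multi-dimensional positions the batch code has to
carry, the sketch below fills four position streams (temporal, height, width,
unused) for an image patch grid followed by text tokens. The layout and the
exact position bookkeeping are simplified assumptions:

```
#include <cstdint>
#include <vector>

// pos holds 4 streams of n_tokens positions each: [t | h | w | unused].
static void fill_mrope_positions_sketch(
        std::vector<int32_t> & pos,
        int n_patches_h, int n_patches_w, int n_text, int32_t start_pos) {
    const int n_img    = n_patches_h * n_patches_w;
    const int n_tokens = n_img + n_text;
    pos.assign(4 * n_tokens, 0);

    // image patches: shared temporal position, per-patch (row, column) positions
    for (int y = 0; y < n_patches_h; ++y) {
        for (int x = 0; x < n_patches_w; ++x) {
            const int i = y * n_patches_w + x;
            pos[0 * n_tokens + i] = start_pos; // temporal
            pos[1 * n_tokens + i] = y;         // height
            pos[2 * n_tokens + i] = x;         // width
        }
    }
    // text tokens: the three used axes advance together, like ordinary RoPE
    for (int t = 0; t < n_text; ++t) {
        const int i = n_img + t;
        const int32_t p = start_pos + 1 + t;
        pos[0 * n_tokens + i] = p;
        pos[1 * n_tokens + i] = p;
        pos[2 * n_tokens + i] = p;
    }
}
```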

### Vision Processing (clip.cpp/clip-impl.h)

**PROJECTOR_TYPE_QWEN3VLMOE**
- New projector type for the Qwen3-VL-MoE vision encoder (registration sketched below)
- Handles projection from the vision encoder into the language model's embedding space
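
Wiring in a projector type typically amounts to a new enum value plus the name
used when parsing the projector-type metadata; a simplified sketch (the name
string and helper below are assumptions):

```
// Sketch: clip.cpp keeps an enum of projector types plus a name lookup.
enum projector_type_sketch {
    // ... existing projector types ...
    PROJECTOR_TYPE_QWEN3VLMOE,
};

static const char * projector_type_name(projector_type_sketch t) {
    switch (t) {
        case PROJECTOR_TYPE_QWEN3VLMOE: return "qwen3vlmoe";
        default:                        return "unknown";
    }
}
```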

**DeepStack Merger Implementation**
The merger is a learnable 2-layer MLP with normalization:
```
Input (3 channels)
  → LayerNorm(norm_w, norm_b)
  → Linear(fc1_w, fc1_b)
  → GELU activation
  → Linear(fc2_w, fc2_b)
  → Output (fused representation)
```

Components (mapped to ggml ops in the sketch below):
- `norm_w`, `norm_b`: Layer normalization parameters
- `fc1_w`, `fc1_b`: First linear projection
- `fc2_w`, `fc2_b`: Second linear projection
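
A sketch of how that merger maps onto ggml operations, mirroring the diagram
above (illustrative, not the literal clip.cpp code; `eps` is the usual
layer-norm epsilon):

```
#include "ggml.h"

// DeepStack merger sketch: LayerNorm -> Linear -> GELU -> Linear.
static struct ggml_tensor * deepstack_merger_sketch(
        struct ggml_context * ctx,
        struct ggml_tensor  * x,  // [n_in, n_tokens] concatenated channels
        struct ggml_tensor  * norm_w, struct ggml_tensor * norm_b,
        struct ggml_tensor  * fc1_w,  struct ggml_tensor * fc1_b,
        struct ggml_tensor  * fc2_w,  struct ggml_tensor * fc2_b,
        float eps) {
    struct ggml_tensor * cur = ggml_norm(ctx, x, eps);
    cur = ggml_add(ctx, ggml_mul(ctx, cur, norm_w), norm_b);   // scale + shift
    cur = ggml_add(ctx, ggml_mul_mat(ctx, fc1_w, cur), fc1_b); // first projection
    cur = ggml_gelu(ctx, cur);
    cur = ggml_add(ctx, ggml_mul_mat(ctx, fc2_w, cur), fc2_b); // second projection
    return cur;                                                // fused representation
}
```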

**Spatial Operations**
- Fixed spatial merge for vision patch sequences
- Proper handling of patch grid dimensions
- Vision-text boundary management

**Safety Improvements**
- Removed invalid zero-tensor initialization for text-only inputs
- Conditional fusion: only runs when vision embeddings exist
- Prevents memory access violations during text-only inference

### Platform Support (llama-model-loader.cpp)

**Windows File Handle Limit**
- Increased the stdio stream limit to 2048 handles (from the default of 512)
- Critical for MoE models whose expert weights are split across many files
- Uses `_setmaxstdio()` on Windows (see the sketch below)
- Prevents "too many open files" errors during loading
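
A minimal sketch of the Windows-only guard (the wrapper function is just for
illustration; `_setmaxstdio()` is the MSVC CRT call, and 512 is its default
stream limit):

```
#ifdef _WIN32
#include <stdio.h>  // _setmaxstdio

// Raise the CRT stdio stream limit so MoE checkpoints split across many
// GGUF files can all be held open during loading.
static void raise_stdio_limit_sketch(void) {
    if (_setmaxstdio(2048) == -1) {
        // fall back to the default limit; smaller models may still load fine
    }
}
#endif
```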

### Reference Patches (llama/patches/)

Included for transparency and reproducibility:
- `0033-qwen3vl-base-architecture.patch`
- `0034-qwen3vl-deepstack-implementation.patch`
- `0035-qwen3vl-memory-fix.patch`
- `0036-qwen3vl-layer-norm-bias.patch`

## Technical Specifications

### Qwen3-VL (Dense)
- **Type**: Standard transformer with integrated vision encoder
- **Layers**: 36 (4B parameter model)
- **Embedding**: Full n_embd dimension
- **Position Encoding**: M-RoPE with 4 dimensional sections
- **Use Cases**: General vision-language understanding

### Qwen3-VL-MoE (Mixture of Experts)
- **Type**: Sparse MoE with DeepStack fusion
- **Layers**: 48 (30B-A3B: 30B total, ~3B active) or 94 (235B-A22B: 235B total, ~22B active)
- **Embedding**: 4-channel architecture (n_embd/4 per channel)
- **Experts**: Multiple expert networks per layer with learned routing
- **Fusion**: Early fusion at layers 0, 1, and 2
- **Use Cases**: High-quality vision understanding with improved efficiency

### DeepStack Fusion Mechanism

The multi-channel fusion enables:
1. **Parallel Processing**: Different aspects of the visual input are processed independently
2. **Early Integration**: Information is merged in the early transformer layers
3. **Adaptive Routing**: Expert selection per channel and token (see the routing sketch below)
4. **Efficiency**: Sparse activation patterns reduce computation
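
To ground the routing point, here is a small, self-contained sketch of
token-level top-k expert gating of the kind MoE layers use (plain C++,
independent of the actual ggml graph; softmax over the selected logits is one
common normalization choice):

```
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// For one token: pick the top-k experts by router logit and return
// (expert index, blend weight) pairs. Sparse activation means only these
// k experts run their FFN for this token. Assumes 0 < k <= n_expert.
static std::vector<std::pair<int, float>> route_token_sketch(
        const std::vector<float> & router_logits, int k) {
    std::vector<int> idx(router_logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
        [&](int a, int b) { return router_logits[a] > router_logits[b]; });

    // softmax over the selected logits so the expert outputs can be blended
    const float max_l = router_logits[idx[0]];
    float sum = 0.0f;
    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < k; ++i) {
        const float w = std::exp(router_logits[idx[i]] - max_l);
        sum += w;
        out.emplace_back(idx[i], w);
    }
    for (auto & e : out) {
        e.second /= sum;
    }
    return out;
}
```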

## Capabilities Enabled

This implementation supports:
- **Multimodal Chat**: Conversational AI with image understanding
- **Image Captioning**: Detailed image descriptions
- **Visual Question Answering**: Answering questions about image content
- **Optical Character Recognition**: Extracting text from images
- **Document Understanding**: Analyzing documents, tables, and charts
- **Image Analysis**: Detailed visual scene understanding

## References and Acknowledgments

This implementation is based on outstanding work by the community:

**Primary Source Repository**
- Branch: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
- Author: LETS-BEE

**Source Commits** (applied in llama/patches/):
1. Base Architecture
   https://github.com/LETS-BEE/llama.cpp/commit/99719122bf16db5db85f0c2d37c059a3aefd3eca

2. DeepStack Implementation
   https://github.com/LETS-BEE/llama.cpp/commit/b913e895a2189b9792da7709b36a36a1aed2feb9

3. Memory Access Fix
   https://github.com/LETS-BEE/llama.cpp/commit/de0e3d3c3ce4b394746ade9263736c8edb40260e

4. Layer Normalization Update
   https://github.com/LETS-BEE/llama.cpp/commit/e45aecb7b051d3c0fea968d64aadbeb0b777e4a1

**Related Discussions and Pull Requests**
- Upstream llama.cpp discussion:
  https://github.com/ggml-org/llama.cpp/issues/16207#issuecomment-3443868720
- Upstream llama.cpp PR:
  https://github.com/ggml-org/llama.cpp/pull/16745
- Related Ollama PR:
  https://github.com/ollama/ollama/pull/12665

**Additional Context**
- OCR-related discussion:
  https://github.com/ggml-org/llama.cpp/pull/16764

## Testing

Tested with:
- Qwen3-VL 4B models (dense)
- Qwen3-VL-MoE 30B-A3B models
- Various image understanding tasks
- Text-only and multimodal inference modes

## Future Work

Potential enhancements:
- Additional model size variants
- Performance optimizations for DeepStack fusion
- Extended M-RoPE configuration options
- Enhanced vision preprocessing pipelines

---

Special thanks to the llama.cpp community and all contributors who made
this multimodal vision-language support possible.