Commit 10f047b

Add Qwen3-VL and Qwen3-VL-MoE multimodal model support

1 parent 1f1b508, commit 10f047b

19 files changed: +14,833 -31 lines

COMMIT_MESSAGE_1.txt

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
Add Qwen3-VL and Qwen3-VL-MoE multimodal model support

This commit introduces comprehensive support for Qwen3-VL vision-language
models, including both the dense variant and the Mixture-of-Experts (MoE)
architecture with DeepStack fusion capabilities.

## Overview

Qwen3-VL is Alibaba's family of advanced multimodal models, capable of
understanding and reasoning about images alongside text. This implementation
enables running these models for various vision-language tasks, including
image understanding, optical character recognition (OCR), visual question
answering, and document analysis.

## Architecture Implementation

### Core Architecture (llama-arch.cpp/h)
- **LLM_ARCH_QWEN3_VL**: Dense vision-language model architecture
- **LLM_ARCH_QWEN3_VL_MOE**: Mixture-of-Experts variant with expert routing
- Complete tensor mapping registration for both architectures
- Architecture-specific parameter handling and validation

### Model Loading (llama-model.cpp)

**Hyperparameter Loading**
- QWEN3_VL: Standard dense model configuration
  * Uses the full n_embd dimension throughout
  * 36 layers for the 4B-parameter variant
- QWEN3_VL_MOE: Expert-based configuration
  * 4x n_embd expansion (n_embd/4 per channel × 4 channels)
  * 48 layers (30B-A3B) or 94 layers (235B-A22B)
  * Expert feed-forward network dimensions

**Multi-axis Rotary Position Embedding (M-RoPE)**
- Configured rope_sections = [24, 20, 20, 0] (this partitioning is sketched below)
  * Temporal dimension: 24 dims
  * Height dimension: 20 dims
  * Width dimension: 20 dims
  * Unused dimension: 0
- Enables spatial awareness for image patch processing
- Added debug logging for M-RoPE configuration verification

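To make the section layout concrete, here is a minimal, self-contained sketch of how rope_sections = [24, 20, 20, 0] could partition the rotary dimension pairs among the position axes. It is illustrative only (the real logic lives in ggml's M-RoPE kernels), and the 128-dim head size in the final comment is an assumption based on Qwen3's usual configuration:

```cpp
// Illustrative sketch only; not the ggml M-RoPE implementation.
#include <array>
#include <cstdio>

int main() {
    const std::array<int, 4> rope_sections = {24, 20, 20, 0};
    const char * axes[4] = {"temporal", "height", "width", "(unused)"};

    int offset = 0;
    for (int s = 0; s < 4; ++s) {
        if (rope_sections[s] == 0) continue;
        // Each section owns a contiguous block of rotary dim pairs; the
        // angle for that block is driven by the matching position axis
        // (t, h, or w of the image patch) instead of a flat token index.
        std::printf("%-8s -> rotary pairs [%2d, %2d)\n",
                    axes[s], offset, offset + rope_sections[s]);
        offset += rope_sections[s];
    }
    // 24 + 20 + 20 = 64 rotary pairs, consistent with a 128-dim head
    // (rotary dims are applied in pairs) -- an assumption of this sketch.
    std::printf("total rotary pairs: %d\n", offset);
    return 0;
}
```
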
**Tensor Initialization**
- QWEN3_VL follows QWEN3 dense structure
  * Token embeddings, output projection
  * Per-layer: attention (Q/K/V/O), normalization, FFN
- QWEN3_VL_MOE includes expert-specific tensors
  * Expert gate networks for routing
  * Per-expert FFN weights (gate, down, up)
  * Shared and expert-specific parameters

### Graph Building (llama-graph.cpp/h)

**DeepStack Architecture for MoE**
The Qwen3-VL-MoE variant implements a novel DeepStack fusion mechanism
(a code sketch of the channel split follows this list):

1. **Channel Splitting**: Vision embeddings are split into 3 processing channels
   - ds0, ds1, ds2 (DeepStack channels 0, 1, 2)
   - Each channel: n_embd/4 dimensions

2. **Per-layer Processing**: Independent expert selection per channel
   - Token-level expert routing
   - Gated mixture-of-experts computation
   - Q/K normalization before attention

3. **Fusion Layers**: Learned merging at early transformer layers
   - Fusion occurs at layers 0, 1, and 2
   - The DeepStack merger combines information across channels
   - Only active when vision embeddings are present (text-only safe)

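For intuition, a hypothetical sketch of the channel split in step 1. It assumes each token's vision features arrive as four contiguous chunks `[main | ds0 | ds1 | ds2]` of n_embd/4 each (the 4-channel layout described above); that contiguity is this sketch's assumption, not necessarily the actual memory layout in llama.cpp:

```cpp
#include <cstddef>
#include <vector>

struct DeepStackView {
    const float * main;  // fed to the model's embedding input
    const float * ds[3]; // DeepStack channels, fused at layers 0, 1, 2
};

// Split one token's vision features into main + DeepStack channel views.
DeepStackView split_channels(const std::vector<float> & embd,
                             std::size_t token, std::size_t n_embd) {
    const std::size_t chunk = n_embd / 4; // n_embd/4 dims per channel
    const float * base = embd.data() + token * n_embd;
    return DeepStackView{
        base,
        { base + 1 * chunk, base + 2 * chunk, base + 3 * chunk },
    };
}
```
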
**Batch Processing**
- Enhanced position array handling for M-RoPE multi-dimensional positions
- Proper ubatch preparation distinguishing vision vs text tokens
- Conditional graph construction based on modality

### Vision Processing (clip.cpp/clip-impl.h)

**PROJECTOR_TYPE_QWEN3VLMOE**
- New projector type for the Qwen3-VL-MoE vision encoder
- Handles projection from the vision encoder into the language model space

**DeepStack Merger Implementation**
The merger is a learnable 2-layer MLP with normalization (a code sketch follows):

```
Input (3 channels)
  → LayerNorm(norm_w, norm_b)
  → Linear(fc1_w, fc1_b)
  → GELU activation
  → Linear(fc2_w, fc2_b)
  → Output (fused representation)
```

Components:
- `norm_w`, `norm_b`: Layer normalization parameters
- `fc1_w`, `fc1_b`: First linear projection
- `fc2_w`, `fc2_b`: Second linear projection

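The same pipeline written out as plain CPU math. This is a minimal sketch under stated assumptions (row-major weights, ggml's tanh-approximated GELU); the real merger is built as a ggml compute graph, not like this:

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // Mat[out][in], row-major

static Vec layer_norm(const Vec & x, const Vec & w, const Vec & b,
                      float eps = 1e-6f) {
    float mean = 0.0f, var = 0.0f;
    for (float v : x) mean += v;
    mean /= x.size();
    for (float v : x) var += (v - mean) * (v - mean);
    var /= x.size();
    Vec y(x.size());
    for (size_t i = 0; i < x.size(); ++i)
        y[i] = (x[i] - mean) / std::sqrt(var + eps) * w[i] + b[i];
    return y;
}

static Vec linear(const Vec & x, const Mat & w, const Vec & b) {
    Vec y(b); // start from the bias
    for (size_t o = 0; o < w.size(); ++o)
        for (size_t i = 0; i < x.size(); ++i)
            y[o] += w[o][i] * x[i];
    return y;
}

static float gelu(float x) { // tanh approximation, as used by ggml
    return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}

// LayerNorm -> fc1 -> GELU -> fc2, mirroring the pipeline above.
Vec deepstack_merge(const Vec & channels_concat,
                    const Vec & norm_w, const Vec & norm_b,
                    const Mat & fc1_w, const Vec & fc1_b,
                    const Mat & fc2_w, const Vec & fc2_b) {
    Vec h = layer_norm(channels_concat, norm_w, norm_b);
    h = linear(h, fc1_w, fc1_b);
    for (float & v : h) v = gelu(v);
    return linear(h, fc2_w, fc2_b);
}
```
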
**Spatial Operations**
- Fixed spatial merge for vision patch sequences (illustrated below)
- Proper handling of patch grid dimensions
- Vision-text boundary management

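As a rough illustration of what a spatial merge does (a hypothetical helper, not the clip.cpp code), the following concatenates each 2×2 window of patch embeddings into one token, shrinking the grid 4×. A merge factor of 2 is assumed here, as in earlier Qwen-VL models:

```cpp
#include <vector>

// Merge each 2x2 block of an (h x w) patch grid into a single token by
// concatenating the four patch embeddings. Assumes h and w are even.
std::vector<std::vector<float>> spatial_merge_2x2(
        const std::vector<std::vector<float>> & patches, int h, int w) {
    std::vector<std::vector<float>> merged;
    for (int y = 0; y + 1 < h; y += 2) {
        for (int x = 0; x + 1 < w; x += 2) {
            std::vector<float> tok;
            for (int dy = 0; dy < 2; ++dy)
                for (int dx = 0; dx < 2; ++dx) {
                    const auto & p = patches[(y + dy) * w + (x + dx)];
                    tok.insert(tok.end(), p.begin(), p.end());
                }
            merged.push_back(std::move(tok)); // 4x the embedding width
        }
    }
    return merged; // (h/2) * (w/2) tokens
}
```
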
**Safety Improvements**
- Removed invalid zero-tensor initialization for text-only inputs
- Conditional fusion: runs only when vision embeddings exist (schematic below)
- Prevents memory access violations in text-only inference

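The text-only fix boils down to a guard of roughly this shape. All names here are hypothetical; the sketch only shows the control flow, where fusion is skipped entirely instead of feeding a zero placeholder tensor into the graph:

```cpp
struct ggml_tensor; // opaque here

// Hypothetical helper: the actual fusion op built into the graph.
ggml_tensor * deepstack_fuse(ggml_tensor * cur, ggml_tensor * vis, int il);

ggml_tensor * maybe_fuse(ggml_tensor * cur, ggml_tensor * vision_embd,
                         long n_vision_tokens, int il) {
    // Fuse only at layers 0-2 and only when vision embeddings exist;
    // text-only batches pass through untouched.
    if (il <= 2 && vision_embd != nullptr && n_vision_tokens > 0) {
        return deepstack_fuse(cur, vision_embd, il);
    }
    return cur;
}
```
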
### Platform Support (llama-model-loader.cpp)

**Windows File Handle Limit**
- Increased the stdio limit to 2048 handles (from the MSVC default of 512)
- Critical for MoE models with many expert weight files
- Uses `_setmaxstdio()` on Windows (see the sketch below)
- Prevents "too many open files" errors during loading

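A sketch of the kind of call involved. `_setmaxstdio()` is a real MSVC CRT function (default limit 512, hard cap 8192); the helper name and warning text here are our own:

```cpp
#include <cstdio>   // std::fprintf
#ifdef _WIN32
#include <stdio.h>  // _setmaxstdio (MSVC CRT)
#endif

// Raise the C runtime's limit on simultaneously open FILE* streams.
// Returns the new limit, or -1 on failure (matching _setmaxstdio).
static int raise_stdio_limit(int new_max = 2048) {
#ifdef _WIN32
    const int res = _setmaxstdio(new_max);
    if (res == -1) {
        std::fprintf(stderr, "warning: _setmaxstdio(%d) failed\n", new_max);
    }
    return res;
#else
    (void) new_max;
    return -1; // not needed outside Windows
#endif
}
```
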
### Reference Patches (llama/patches/)

Included for transparency and reproducibility:
- `0033-qwen3vl-base-architecture.patch`
- `0034-qwen3vl-deepstack-implementation.patch`
- `0035-qwen3vl-memory-fix.patch`
- `0036-qwen3vl-layer-norm-bias.patch`

## Technical Specifications

### Qwen3-VL (Dense)
- **Type**: Standard transformer with integrated vision encoder
- **Layers**: 36 (4B-parameter model)
- **Embedding**: Full n_embd dimension
- **Position Encoding**: M-RoPE with 4 dimensional sections
- **Use Cases**: General vision-language understanding

### Qwen3-VL-MoE (Mixture of Experts)
- **Type**: Sparse MoE with DeepStack fusion
- **Layers**: 48 (30B-A3B: 3B active) or 94 (235B-A22B: 22B active)
- **Embedding**: 4-channel architecture (n_embd/4 per channel)
- **Experts**: Multiple expert networks per layer with learned routing
- **Fusion**: Early fusion at layers 0, 1, and 2
- **Use Cases**: High-quality vision understanding at improved efficiency

### DeepStack Fusion Mechanism

The multi-channel fusion enables (a routing sketch follows this list):
1. **Parallel Processing**: Different aspects of vision are processed independently
2. **Early Integration**: Information is merged in early transformer layers
3. **Adaptive Routing**: Expert selection per channel and token
4. **Efficiency**: Sparse activation patterns reduce computation

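For intuition about the token-level routing mentioned in point 3, here is a schematic top-k gated MoE step: score experts with the gate network, keep the top-k, softmax their scores, and mix the selected experts' FFN outputs. This is not the llama-graph.cpp code, and normalizing probabilities over only the selected experts is an assumption of this sketch:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Mix expert outputs for one token using top-k gating.
std::vector<float> moe_forward(
        const std::vector<float> & gate_logits,             // [n_expert]
        const std::vector<std::vector<float>> & expert_out, // [n_expert][n_embd]
        int top_k) {
    const int n_expert = (int) gate_logits.size();
    std::vector<int> idx(n_expert);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
        [&](int a, int b) { return gate_logits[a] > gate_logits[b]; });

    // Softmax over the selected experts' logits only.
    const float maxl = gate_logits[idx[0]];
    float sum = 0.0f;
    std::vector<float> w(top_k);
    for (int k = 0; k < top_k; ++k) {
        w[k] = std::exp(gate_logits[idx[k]] - maxl);
        sum += w[k];
    }

    std::vector<float> out(expert_out[0].size(), 0.0f);
    for (int k = 0; k < top_k; ++k) {
        const float wk = w[k] / sum;
        for (size_t i = 0; i < out.size(); ++i)
            out[i] += wk * expert_out[idx[k]][i];
    }
    return out;
}
```
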
## Capabilities Enabled

This implementation supports:
- **Multimodal Chat**: Conversational AI with image understanding
- **Image Captioning**: Detailed image descriptions
- **Visual Question Answering**: Answer questions about image content
- **Optical Character Recognition**: Extract text from images
- **Document Understanding**: Analyze documents, tables, charts
- **Image Analysis**: Detailed visual scene understanding

## References and Acknowledgments

This implementation is based on the outstanding work by the community:

**Primary Source Repository**
- Branch: https://github.com/LETS-BEE/llama.cpp/commits/qwen3vl/
- Author: LETS-BEE

**Source Commits** (applied in llama/patches/):
1. Base Architecture:
   https://github.com/LETS-BEE/llama.cpp/commit/99719122bf16db5db85f0c2d37c059a3aefd3eca
2. DeepStack Implementation:
   https://github.com/LETS-BEE/llama.cpp/commit/b913e895a2189b9792da7709b36a36a1aed2feb9
3. Memory Access Fix:
   https://github.com/LETS-BEE/llama.cpp/commit/de0e3d3c3ce4b394746ade9263736c8edb40260e
4. Layer Normalization Update:
   https://github.com/LETS-BEE/llama.cpp/commit/e45aecb7b051d3c0fea968d64aadbeb0b777e4a1

**Related Discussions and Pull Requests**
- Upstream llama.cpp discussion:
  https://github.com/ggml-org/llama.cpp/issues/16207#issuecomment-3443868720
- Upstream llama.cpp PR:
  https://github.com/ggml-org/llama.cpp/pull/16745
- Related Ollama PR:
  https://github.com/ollama/ollama/pull/12665

**Additional Context**
- OCR-related discussion:
  https://github.com/ggml-org/llama.cpp/pull/16764

## Testing

Tested with:
- Qwen3-VL 4B parameter models (dense)
- Qwen3-VL-MoE 30B-A3B models (MoE)
- Various image understanding tasks
- Text-only and multimodal inference modes

## Future Work

Potential enhancements:
- Additional model size variants
- Performance optimizations for DeepStack fusion
- Extended M-RoPE configuration options
- Enhanced vision preprocessing pipelines

---

Special thanks to the llama.cpp community and all contributors who made
this multimodal vision-language support possible.

COMMIT_MESSAGE_2.txt

Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
Add local development patches and scripts for Qwen3-VL

This commit contains local development resources for working with the
Qwen3-VL implementation. These files are for internal use and testing.

## Added Files

### Patch Files (Z_Iosu/patches/)
- original_99719122b.patch: Base Qwen3-VL architecture
- original_b913e895a.patch: DeepStack implementation
- original_de0e3d3c3.patch: Memory access fixes
- original_e45aecb7b.patch: Layer normalization updates
- README_QWEN3VL.md: Detailed documentation
- mirar.md: Reference links and documentation

### Verification Scripts (Z_Iosu/scripts/)
- verify_all_patches.ps1: Comprehensive patch verification
- verify_each_patch_detailed.ps1: Individual patch checking
- check_patches.ps1: Quick status check

These resources are maintained locally for development and are not
intended for upstream contribution.

llama/llama.cpp/src/llama-arch.cpp

Lines changed: 45 additions & 0 deletions
@@ -31,6 +31,8 @@ static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
     { LLM_ARCH_QWEN2VL,        "qwen2vl"    },
     { LLM_ARCH_QWEN3,          "qwen3"      },
     { LLM_ARCH_QWEN3MOE,       "qwen3moe"   },
+    { LLM_ARCH_QWEN3_VL,       "qwen3vl"    },
+    { LLM_ARCH_QWEN3_VL_MOE,   "qwen3vlmoe" },
     { LLM_ARCH_PHI2,           "phi2"       },
     { LLM_ARCH_PHI3,           "phi3"       },
     { LLM_ARCH_PHIMOE,         "phimoe"     },
@@ -773,6 +775,49 @@ static const std::map<llm_arch, std::map<llm_tensor, const char *>> LLM_TENSOR_N
         { LLM_TENSOR_FFN_UP_EXPS,   "blk.%d.ffn_up_exps" },
     },
 },
+{
+    LLM_ARCH_QWEN3_VL,
+    {
+        { LLM_TENSOR_TOKEN_EMBD,    "token_embd" },
+        { LLM_TENSOR_OUTPUT_NORM,   "output_norm" },
+        { LLM_TENSOR_OUTPUT,        "output" },
+        { LLM_TENSOR_ATTN_NORM,     "blk.%d.attn_norm" },
+        { LLM_TENSOR_ATTN_Q,        "blk.%d.attn_q" },
+        { LLM_TENSOR_ATTN_Q_NORM,   "blk.%d.attn_q_norm" },
+        { LLM_TENSOR_ATTN_K,        "blk.%d.attn_k" },
+        { LLM_TENSOR_ATTN_K_NORM,   "blk.%d.attn_k_norm" },
+        { LLM_TENSOR_ATTN_V,        "blk.%d.attn_v" },
+        { LLM_TENSOR_ATTN_OUT,      "blk.%d.attn_output" },
+        { LLM_TENSOR_FFN_NORM,      "blk.%d.ffn_norm" },
+        { LLM_TENSOR_FFN_GATE,      "blk.%d.ffn_gate" },
+        { LLM_TENSOR_FFN_DOWN,      "blk.%d.ffn_down" },
+        { LLM_TENSOR_FFN_UP,        "blk.%d.ffn_up" },
+    },
+},
+{
+    LLM_ARCH_QWEN3_VL_MOE,
+    {
+        { LLM_TENSOR_TOKEN_EMBD,            "token_embd" },
+        { LLM_TENSOR_OUTPUT_NORM,           "output_norm" },
+        { LLM_TENSOR_OUTPUT,                "output" },
+        { LLM_TENSOR_ATTN_NORM,             "blk.%d.attn_norm" },
+        { LLM_TENSOR_ATTN_Q,                "blk.%d.attn_q" },
+        { LLM_TENSOR_ATTN_Q_NORM,           "blk.%d.attn_q_norm" },
+        { LLM_TENSOR_ATTN_K,                "blk.%d.attn_k" },
+        { LLM_TENSOR_ATTN_K_NORM,           "blk.%d.attn_k_norm" },
+        { LLM_TENSOR_ATTN_V,                "blk.%d.attn_v" },
+        { LLM_TENSOR_ATTN_OUT,              "blk.%d.attn_output" },
+        { LLM_TENSOR_FFN_NORM,              "blk.%d.ffn_norm" },
+        { LLM_TENSOR_FFN_GATE_INP,          "blk.%d.ffn_gate_inp" },
+        { LLM_TENSOR_FFN_GATE_EXPS,         "blk.%d.ffn_gate_exps" },
+        { LLM_TENSOR_FFN_DOWN_EXPS,         "blk.%d.ffn_down_exps" },
+        { LLM_TENSOR_FFN_UP_EXPS,           "blk.%d.ffn_up_exps" },
+        { LLM_TENSOR_FFN_GATE_INP_SHEXP,    "blk.%d.ffn_gate_inp_shexp" },
+        { LLM_TENSOR_FFN_GATE_SHEXP,        "blk.%d.ffn_gate_shexp" },
+        { LLM_TENSOR_FFN_DOWN_SHEXP,        "blk.%d.ffn_down_shexp" },
+        { LLM_TENSOR_FFN_UP_SHEXP,          "blk.%d.ffn_up_shexp" },
+    },
+},
 {
     LLM_ARCH_PHI2,
     {
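For readers new to these tables: the `%d` placeholder in each mapping is filled with the block (layer) index when per-layer tensor names are resolved at load time, roughly like this (simplified; the real lookup goes through llama.cpp's name-formatting helpers):

```cpp
#include <cstdio>

int main() {
    char name[64];
    const int il = 7; // layer index
    std::snprintf(name, sizeof(name), "blk.%d.ffn_gate_exps", il);
    std::printf("%s\n", name); // -> blk.7.ffn_gate_exps
    return 0;
}
```
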

llama/llama.cpp/src/llama-arch.h

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,8 @@ enum llm_arch {
     LLM_ARCH_QWEN2VL,
     LLM_ARCH_QWEN3,
     LLM_ARCH_QWEN3MOE,
+    LLM_ARCH_QWEN3_VL,
+    LLM_ARCH_QWEN3_VL_MOE,
     LLM_ARCH_PHI2,
     LLM_ARCH_PHI3,
     LLM_ARCH_PHIMOE,

llama/llama.cpp/src/llama-batch.cpp

Lines changed: 1 addition & 1 deletion
@@ -658,7 +658,7 @@ llama_ubatch llama_batch_allocr::ubatch_add(const std::vector<int32_t> & idxs, u
 
     auto udata = std::make_shared<llama_ubatch::data_t>();
 
-    const int32_t n_pos_cur = batch.embd ? n_pos_per_embd : 1;
+    const int32_t n_pos_cur = batch.embd ? (n_pos_per_embd + 1) : 1;
 
     const int64_t n_embd_all = batch.embd ? (int64_t) n_tokens*n_embd : 0;
     const int64_t n_pos_all  = (int64_t) n_tokens*n_pos_cur;
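
For orientation: with M-RoPE, `n_pos_per_embd` is 4 rather than 1, and positions are stored as parallel streams of `n_tokens` entries each. The diff does not state why embedding batches reserve the extra `+ 1` stream; the layout below is our assumption, not taken from the patch:

```cpp
// Assumed layout sketch -- NOT taken from the patch. With M-RoPE,
// n_pos_per_embd = 4 (temporal, height, width, unused section), so an
// embedding batch would reserve n_pos_cur = 5 streams of n_tokens each:
//
//   pos[0*n_tokens .. 1*n_tokens)   temporal component
//   pos[1*n_tokens .. 2*n_tokens)   height component
//   pos[2*n_tokens .. 3*n_tokens)   width component
//   pos[3*n_tokens .. 4*n_tokens)   unused M-RoPE section
//   pos[4*n_tokens .. 5*n_tokens)   extra stream added by this change
//
// Text-only batches keep a single flat position stream (n_pos_cur = 1).
```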
