Commit 8bca133 (1 parent: 30903fc)

* add qwen3-vlm doc

File tree: 5 files changed, +512 −0 lines


docs/doc/en/mllm/vlm_qwen3.md

Lines changed: 255 additions & 0 deletions
---
title: MaixPy MaixCAM Running VLM Qwen3-VL Visual Language Model
update:
  - date: 2025-11-27
    author: lxowalle
    version: 1.0.0
    content: Added Qwen3-VL code and documentation
---

## Supported Devices

| Device   | Supported |
| -------- | --------- |
| MaixCAM2 | ✅        |
| MaixCAM  | ❌        |

## Introduction to Qwen3-VL

`Qwen3-VL` is a visual language model from the Qwen series. Compared to the previous generation, it offers superior text comprehension and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamic understanding, and stronger agent interaction capabilities.

[Qwen3-VL-2B](https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct) has been ported to MaixPy.

## Using Qwen3-VL in MaixPy MaixCAM

### Model and Download Address

MaixPy currently supports `Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448`. Because the model is large, you need to download it yourself and save it to the `/root/models` directory.

> !!! IMPORTANT !!! The model MUST be saved under the `/root/models` directory, otherwise it cannot be loaded. For example, the save path should be `/root/models/sipeed/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2`.

* **2B**:
  * Memory Requirement: 2 GiB CMM memory. Please refer to the Memory Usage Documentation for an explanation of memory use.
  * Download Address: [Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2](https://huggingface.co/sipeed/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2)

### Download Method

Make sure the download tool is installed:

```shell
pip install huggingface_hub
```

Within China, you can install it from a domestic PyPI mirror:

```shell
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple huggingface_hub
```

If you are in China, you can also set a domestic mirror to speed up the download.

Linux/MacOS:

```shell
export HF_ENDPOINT=https://hf-mirror.com
```

Windows:

* CMD terminal: `set HF_ENDPOINT=https://hf-mirror.com`
* PowerShell: `$env:HF_ENDPOINT = "https://hf-mirror.com"`

Then download:

```shell
huggingface-cli download sipeed/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2 --local-dir Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2
```

### Running the Model

```python
from maix import app, nn, err, image, display, time

model = "/root/models/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2/model.mud"
disp = display.Display()

qwen3_vl = nn.Qwen3VL(model)

in_w = qwen3_vl.input_width()
in_h = qwen3_vl.input_height()
in_fmt = qwen3_vl.input_format()
print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}")

# Print each new piece of the reply as it is generated
def on_reply(obj, resp):
    print(resp.msg_new, end="")

qwen3_vl.set_system_prompt("You are Qwen3VL. You are a helpful vision-to-text assistant.")
qwen3_vl.set_reply_callback(on_reply)

# Load and set the image; if its size does not match the model input,
# it will be resized automatically first
img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt)
qwen3_vl.set_image(img, fit=image.Fit.FIT_CONTAIN)
disp.show(img)

# Wait until the model is ready
while not app.need_exit():
    print('waiting for the model to be ready')
    if qwen3_vl.is_ready():
        break
    time.sleep(1)

def example1():
    print('')
    # Set the prompt (Chinese: "Describe what is in the picture")
    msg = "请描述图中有什么"
    print(">>", msg)
    resp = qwen3_vl.send(msg)
    err.check_raise(resp.err_code)

def example2():
    print('')
    msg = "Describe the picture"
    print(">>", msg)
    resp = qwen3_vl.send(msg)
    err.check_raise(resp.err_code)

example1()
example2()

del qwen3_vl  # The VLM object must be released
```

Result:

```
>> 请描述图中有什么
好的,这是一张在城市街道上拍摄的照片。以下是图中包含的详细信息:

这张照片的主体是一位站在红车前的女性,背景是城市街道和建筑。

- **前景中的女性**:一位女性,她站在画面的前景中央。她有深色的头发,穿着一件深色的外套。她正看着镜头,似乎正准备拍照。

- **背景中的红车**:在女性的后方,是一辆红色的“大众”(Volkswagen)汽车的前部。这辆车停在一条小巷或路边,它的前脸部分被遮挡,但可以清楚地看到它红色的车身和前大灯。

- **背景中的建筑物**:在车辆后方是几座多层的建筑,看起来是城市中的居民楼或办公楼。这些建筑的外立面是浅色的,带有许多窗户,窗户的大小和排列方式不同。

- **照片中的细节**:在图像的右上角,可以看到一辆黑色的汽车的后视镜,这可能是一辆停在远处的车辆。在图像的左上角,有一座建筑的窗户上有一个明显的“N”字母,这可能是一个窗户的标识或装饰。

- **拍摄视角和构图**:这是一张从较低角度拍摄的风景照。摄影师可能使用了广角镜头,使建筑和街道的细节被放大。整体构图具有一定的对称性,以车辆和建筑为中心。

总的来说,这张照片展示了一个城市街景,其中一位女性在一辆红色的大众汽车前,而背景则是一排具有多个窗户的建筑。

>> Describe the picture
A woman stands on the pavement in front of a red double-decker bus in a city, likely London, given the distinctive bus and the architecture. She is wearing a black jacket and is looking towards the camera. The bus is parked on a street with a white painted line marking the curb. The background consists of buildings with classic architecture.
```

Here, an image is loaded from the filesystem and the model is asked to describe what is in it. Note that this model does not support conversational context: each call to `send` starts a completely new conversation and does not remember the content of previous `send` calls.

Additionally, the default model expects an input image resolution of `448 x 448`. When you call `set_image` with an image of a different resolution, `img.resize` is called automatically to scale it, using the method specified by `fit`. For example, `image.Fit.FIT_CONTAIN` keeps the original aspect ratio during scaling and fills the surrounding area with black when the input aspect ratio does not match the expected resolution.
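The `FIT_CONTAIN` behavior described above can be sketched in plain Python. The helper below is hypothetical (not part of the MaixPy API); it only illustrates the scaled size and black padding that letterboxing an image into the model's `448 x 448` input would produce:

```python
def fit_contain(src_w, src_h, dst_w=448, dst_h=448):
    """Compute the letterboxed size and padding for FIT_CONTAIN-style scaling."""
    scale = min(dst_w / src_w, dst_h / src_h)   # keep the original aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x = (dst_w - new_w) // 2                # black border left/right
    pad_y = (dst_h - new_h) // 2                # black border top/bottom
    return new_w, new_h, pad_x, pad_y

# A 640x480 photo scales to 448x336 with 56-pixel black bars top and bottom
print(fit_contain(640, 480))  # → (448, 336, 0, 56)
```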
`set_system_prompt` sets the system prompt, which you can tune to improve accuracy in your application scenario.

Note: in the model name `Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448`, `P320` means that the `system prompt` and `user prompt` together may contain at most `320 tokens`, and `CTX448` means that the `system prompt`, `user prompt`, and the model's reply combined may total at most `448 tokens`.
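These two limits can be checked with simple arithmetic. The sketch below is a hypothetical helper (the token counts in the usage line are illustrative); it computes how many reply tokens a given prompt leaves available:

```python
P_LIMIT = 320    # max tokens for system prompt + user prompt (P320)
CTX_LIMIT = 448  # max tokens for prompts + reply combined (CTX448)

def reply_budget(system_tokens, user_tokens):
    """Return the max reply length in tokens, or raise if the prompt is too long."""
    prompt = system_tokens + user_tokens
    if prompt > P_LIMIT:
        raise ValueError(f"prompt uses {prompt} tokens, limit is {P_LIMIT}")
    return CTX_LIMIT - prompt

# A 20-token system prompt plus a 12-token question leaves 416 tokens for the reply
print(reply_budget(20, 12))  # → 416
```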

### Calling the Model with HTTP

```python
from maix import app, nn, err, image, display, time
import requests
import json

model = "/root/models/Qwen3-VL-2B-Instruct-GPTQ-Int4-AX630C-P320-CTX448-maixcam2/model.mud"
disp = display.Display()

qwen3_vl = nn.Qwen3VL(model)

in_w = qwen3_vl.input_width()
in_h = qwen3_vl.input_height()
in_fmt = qwen3_vl.input_format()
print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}")

def on_reply(obj, resp):
    print(resp.msg_new, end="")

qwen3_vl.set_system_prompt("You are Qwen3VL. You are a helpful vision-to-text assistant.")
qwen3_vl.set_reply_callback(on_reply)

# Load and set the image; if its size does not match the model input,
# it will be resized automatically first
img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt)
qwen3_vl.set_image(img, fit=image.Fit.FIT_CONTAIN)
disp.show(img)

# Wait until the model is ready
while not app.need_exit():
    print('waiting for the model to be ready')
    if qwen3_vl.is_ready():
        break
    time.sleep(1)

def example3():
    print('')
    url = "http://127.0.0.1:12346"
    headers = {
        "Content-Type": "application/json",
    }

    stream = True
    data = {
        "model": "AXERA-TECH/Qwen3-VL-2B-Instruct-GPTQ-Int4",
        "stream": stream,
        "temperature": 0.7,
        "repetition_penalty": 1,
        "top-p": 0.8,
        "top-k": 20,
        "messages": [{
            "role": "user",
            "content": [{
                "type": "text",
                "text": "What is your name?"
            }, {
                "type": "image_url",
                "image_url": "images/demo.jpg"
            }]
        }]
    }
    response = requests.post(url + '/v1/chat/completions', headers=headers, json=data, stream=stream)

    if not stream:
        print(response.status_code)
        print(response.text)
    else:
        if response.status_code == 200:
            # Parse the server-sent event stream: each line looks like "data: {...}"
            for line in response.iter_lines():
                if line:
                    line = line.decode('utf-8')
                    if line.startswith('data: '):
                        data_str = line[6:]
                        if data_str.strip() == '[DONE]':
                            print("\nStreaming finished")
                            break
                        try:
                            chunk = json.loads(data_str)
                            if 'choices' in chunk and len(chunk['choices']) > 0:
                                delta = chunk['choices'][0].get('delta', {})
                                if 'content' in delta:
                                    print(delta['content'], end='', flush=True)
                        except json.JSONDecodeError:
                            continue
        else:
            print(f"Request failed: {response.status_code}")
            print(response.text)

example3()

del qwen3_vl  # The VLM object must be released
```

Result:

```
I am an AI assistant without a name. I am a virtual assistant capable of helping you answer questions, provide information, and engage in beneficial discussions.
Streaming finished
```

Qwen3-VL exposes an OpenAI-style interface, so you can obtain the model's output via HTTP streaming.
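If you set `stream = False` in the example above, the server should return a single JSON body rather than an event stream. A small sketch of pulling the reply text out of such a body (the field layout follows the standard OpenAI chat-completions shape that the server mimics; this parsing has not been verified against this particular server):

```python
import json

def extract_reply(body: str) -> str:
    """Extract the assistant's text from an OpenAI-style chat-completion JSON body."""
    obj = json.loads(body)
    return obj["choices"][0]["message"]["content"]

# Example body in the standard chat-completions shape (illustrative data)
body = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "I am an AI assistant."}}]
})
print(extract_reply(body))  # → I am an AI assistant.
```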

## Custom Quantized Model

The model provided above is a quantized model for MaixCAM2. If you want to quantize your own model, refer to:

* [Pulsar2 Documentation](https://pulsar2-docs.readthedocs.io/zh-cn/latest/appendix/build_llm.html)
* Original model: https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct

docs/doc/en/sidebar.yaml

Lines changed: 2 additions & 0 deletions

```diff
@@ -122,6 +122,8 @@ items:
           label: DeepSeek LLM
         - file: mllm/vlm_internvl.md
           label: InternVL Vision-Language Model
+        - file: mllm/vlm_qwen3.md
+          label: Qwen3-VL Vision-Language Model

       - label: AI Model Convertion and Port
         items:
```
