Commit ad1ef84

* add smolvlm doc
1 parent 9001667 commit ad1ef84

File tree

5 files changed

+223
-0
lines changed

docs/doc/en/mllm/vlm_smolvlm.md

Lines changed: 91 additions & 0 deletions
@@ -0,0 +1,91 @@
1+
---
2+
title: MaixPy MaixCAM Running SmolVLM Visual Language Model
3+
update:
4+
- date: 2025-12-03
5+
author: lxowalle
6+
version: 1.0.0
7+
content: Added SmolVLM code and documentation
8+
---
9+
10+
## Supported Devices
11+
12+
| Device | Supported |
13+
| -------- | --------- |
14+
| MaixCAM2 | Yes |
15+
| MaixCAM | No |
16+
17+
## Introduction to SmolVLM
18+
19+
A VLM (Vision-Language Model) takes text and an image as input and produces text as output, for example a description of what the image contains. In effect, it lets the AI "see."
20+
SmolVLM currently supports English only.
21+
22+
## Using SmolVLM in MaixPy MaixCAM
23+
24+
### Model and Download Address
25+
26+
If the `SmolVLM` model is not present in the default `/root/models` directory, you need to download it manually.
27+
* Memory requirement: 300 MB of CMM memory. For details, see [the memory usage documentation](../pro/memory.md).
28+
29+
* Download link: https://huggingface.co/sipeed/smolvlm-256m-instruct-maixcam2
30+
31+
Download the model using the method described in [the Qwen documentation](./llm_qwen.md).
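As a rough sketch of the "check, then download" step described above: the helper below tests whether the `.mud` file is present and, if not, offers one possible way to fetch the repository. The helper names `model_present` and `fetch_model` are hypothetical, and the use of the `huggingface_hub` package is an assumption; the Qwen documentation may describe a different download procedure, which takes precedence.

```python
import os

# Paths and repository name taken from the example and download link above.
MODEL_DIR = "/root/models/smolvlm-256m-instruct-maixcam2"
REPO_ID = "sipeed/smolvlm-256m-instruct-maixcam2"


def model_present(model_dir=MODEL_DIR, mud_name="model.mud"):
    """Return True if the .mud description file exists in model_dir."""
    return os.path.isfile(os.path.join(model_dir, mud_name))


def fetch_model(model_dir=MODEL_DIR, repo_id=REPO_ID):
    """Download the whole model repository into model_dir.

    Assumption: the huggingface_hub package is installed and the machine
    has network access. It is imported lazily so the presence check above
    works without it.
    """
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=model_dir)


if __name__ == "__main__":
    if model_present():
        print("model found")
    else:
        print("model missing; download it before running the example")
```

If the download is done on a PC rather than on the device, the resulting directory still has to be copied to `/root/models` on the device.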
32+
33+
### Running the Model
34+
35+
```python
36+
from maix import nn, err, log, sys, image, display
37+
38+
model = "/root/models/smolvlm-256m-instruct-maixcam2/model.mud"
39+
log.set_log_level(log.LogLevel.LEVEL_ERROR, color = False)
40+
disp = display.Display()
41+
42+
smolvlm = nn.SmolVLM(model)
43+
in_w = smolvlm.input_width()
44+
in_h = smolvlm.input_height()
45+
in_fmt = smolvlm.input_format()
46+
print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}")
47+
48+
def on_reply(obj, resp):
49+
print(resp.msg_new, end="")
50+
51+
smolvlm.set_system_prompt("You are a helpful assistant.")
52+
smolvlm.set_reply_callback(on_reply)
53+
54+
# load and set image
55+
img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt)
56+
smolvlm.set_image(img, fit=image.Fit.FIT_CONTAIN) # if the size does not match, the image is resized automatically first
57+
disp.show(img)
58+
59+
msg = "Describe the picture"
60+
print(">>", msg)
61+
resp = smolvlm.send(msg)
62+
err.check_raise(resp.err_code)
63+
```
64+
65+
Output:
66+
```
67+
>> Describe the picture
68+
The image depicts a prominent bus stop, specifically in the middle, where a young woman is captured and standing on the sidewalk. The bus, which appears to be a double-decker bus, is prominently displayed in the center of the image. The bus is red with bold white text and design elements on its side. The text on the bus reads "THING'S GET MORE EXCITING."
69+
70+
Below this text is a small image of the bus logo. The bus is parked next to another bus, both in a city background. The background on which the bus is parked is not clearly discernible due to the perspective, but it looks urban due to the buildings and street signs visible.
71+
72+
The woman in the image is looking towards the bus on the street, possibly waiting to board or simply admiring the scene. She is wearing a black coat, and her hair is short and dark. The bus itself has a red roof, and its windows are visible. The bus’s front is also visible, but it is not as prominent as the bus’s front side.
73+
74+
In the background, there are buildings and a large glass window. The sky is not visible, but it is bright, as indicated by the light reflection on the windows. The street is wide and seems to be a busy urban street, possibly with cars and other vehicles.
75+
76+
The bus stop itself seems to be in an area that is busy. There are traffic signs visible, and the sidewalk looks well-maintained. The street is wide enough for a bus to pass by at a distance, though it is not very wide. The overall environment appears modern and functional.
77+
78+
This vivid depiction of the bus stop and the surrounding environment provides a clear and detailed view of the scene.
79+
```
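The `on_reply` callback in the example prints each `resp.msg_new` fragment as it arrives. If the full reply is also needed as a single string, the fragments can be accumulated. The sketch below shows the idea with a hypothetical `ReplyCollector` class and a `FakeResp` stand-in so that it runs without the device; passing a callable object rather than a plain function to `set_reply_callback` is an assumption.

```python
class ReplyCollector:
    """Collects streamed text fragments into one string."""

    def __init__(self):
        self.parts = []

    def __call__(self, obj, resp):
        # resp.msg_new holds the newly generated fragment, as in on_reply.
        self.parts.append(resp.msg_new)
        print(resp.msg_new, end="", flush=True)

    def text(self):
        """Return everything received so far as a single string."""
        return "".join(self.parts)


class FakeResp:
    """Stand-in for the response object, for illustration only."""

    def __init__(self, msg_new):
        self.msg_new = msg_new


collector = ReplyCollector()
for chunk in ["The image ", "shows ", "a bus."]:
    collector(None, FakeResp(chunk))
print()
```

On the device, `collector` would take the place of `on_reply`, and `collector.text()` would be read after `send` returns.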
80+
81+
The default model accepts a `512×512` input image. If the image passed to `set_image` has a different resolution, `set_image` automatically calls `img.resize` to scale it, using the method selected by the `fit` parameter. For example, `image.Fit.FIT_CONTAIN` preserves the original aspect ratio and fills the remaining area with black when the aspect ratio differs from the required resolution.
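To make the aspect-preserving behaviour concrete, the sketch below computes the scaled size and black-bar padding for a given source image. It is plain Python arithmetic, not MaixPy's implementation; the function name `contain_geometry` is hypothetical, and centring the image within the padding is an assumption.

```python
def contain_geometry(src_w, src_h, dst_w=512, dst_h=512):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving fit."""
    # Scale by the smaller ratio so the whole image fits inside the target.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    # Assumption: the leftover space is split evenly on both sides.
    pad_x = (dst_w - new_w) // 2
    pad_y = (dst_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y


# A 640x480 image is scaled to 512x384 with 64-pixel bars above and below.
print(contain_geometry(640, 480))  # (512, 384, 0, 64)
```

An image that is already `512×512` yields zero padding, which matches the case where no resize is needed.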
82+
83+
### Modifying Parameters
84+
Some model parameters can be modified. Refer to [the Qwen documentation](./llm_qwen.md) for details.
85+
86+
## Custom Quantized Model
87+
88+
The model provided above is a quantized model for MaixCAM2. If you want to quantize your own model, refer to:
89+
90+
* [Pulsar2 Documentation](https://pulsar2-docs.readthedocs.io/zh-cn/latest/appendix/build_llm.html)
91+
* Original model: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct

docs/doc/en/sidebar.yaml

Lines changed: 2 additions & 0 deletions
@@ -126,6 +126,8 @@ items:
126 126
label: Qwen3-VL Vision-Language Model
127 127
- file: mllm/lm_lora_sdv1_5.md
128 128
label: LCM-LoRA-SDv1-5 Model
129+
- file: mllm/vlm_smolvlm.md
130+
label: SmolVLM Vision-Language Model
129 131

130 132
- label: AI Model Convertion and Port
131 133
items:

docs/doc/zh/mllm/vlm_smolvlm.md

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
1+
---
2+
title: MaixPy MaixCAM Running SmolVLM Visual Language Model
3+
update:
4+
- date: 2025-12-03
5+
author: lxowalle
6+
version: 1.0.0
7+
content: Added SmolVLM code and documentation
8+
---
9+
10+
## Supported Devices
11+
12+
| Device | Supported |
13+
| -------- | --------- |
14+
| MaixCAM2 | Yes |
15+
| MaixCAM | No |
16+
17+
18+
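## Introduction to SmolVLM
19+
20+
A VLM (Vision-Language Model) takes text and an image as input and produces text as output, for example a description of what the image contains. In effect, the AI has learned to read images.
21+
SmolVLM currently supports English only.
22+
23+
## Using SmolVLM in MaixPy MaixCAM
24+
25+
### Model and Download Address
26+
27+
If the `SmolVLM` model is not present in the system's default `/root/models` directory, you need to download it yourself.
28+
29+
* Memory requirement: 300 MB of CMM memory. For an explanation of memory, see [the memory usage documentation](../pro/memory.md)
30+
* Download link: https://huggingface.co/sipeed/smolvlm-256m-instruct-maixcam2
31+
32+
For the download method, refer to the one described in [the Qwen documentation](./llm_qwen.md).
33+
34+
### Running the Model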
35+
36+
```python
37+
from maix import nn, err, log, sys, image, display
38+
39+
model = "/root/models/smolvlm-256m-instruct-maixcam2/model.mud"
40+
log.set_log_level(log.LogLevel.LEVEL_ERROR, color = False)
41+
disp = display.Display()
42+
43+
smolvlm = nn.SmolVLM(model)
44+
in_w = smolvlm.input_width()
45+
in_h = smolvlm.input_height()
46+
in_fmt = smolvlm.input_format()
47+
print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}")
48+
49+
def on_reply(obj, resp):
50+
print(resp.msg_new, end="")
51+
52+
smolvlm.set_system_prompt("You are a helpful assistant.")
53+
smolvlm.set_reply_callback(on_reply)
54+
55+
# load and set image
56+
img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt)
57+
smolvlm.set_image(img, fit=image.Fit.FIT_CONTAIN) # if the size does not match, the image is resized automatically first
58+
disp.show(img)
59+
60+
msg = "Describe the picture"
61+
print(">>", msg)
62+
resp = smolvlm.send(msg)
63+
err.check_raise(resp.err_code)
64+
```
65+
66+
Output:
67+
```
68+
>> Describe the picture
69+
The image depicts a prominent bus stop, specifically in the middle, where a young woman is captured and standing on the sidewalk. The bus, which appears to be a double-decker bus, is prominently displayed in the center of the image. The bus is red with bold white text and design elements on its side. The text on the bus reads "THING'S GET MORE EXCITING."
70+
71+
Below this text is a small image of the bus logo. The bus is parked next to another bus, both in a city background. The background on which the bus is parked is not clearly discernible due to the perspective, but it looks urban due to the buildings and street signs visible.
72+
73+
The woman in the image is looking towards the bus on the street, possibly waiting to board or simply admiring the scene. She is wearing a black coat, and her hair is short and dark. The bus itself has a red roof, and its windows are visible. The bus’s front is also visible, but it is not as prominent as the bus’s front side.
74+
75+
In the background, there are buildings and a large glass window. The sky is not visible, but it is bright, as indicated by the light reflection on the windows. The street is wide and seems to be a busy urban street, possibly with cars and other vehicles.
76+
77+
The bus stop itself seems to be in an area that is busy. There are traffic signs visible, and the sidewalk looks well-maintained. The street is wide enough for a bus to pass by at a distance, though it is not very wide. The overall environment appears modern and functional.
78+
79+
This vivid depiction of the bus stop and the surrounding environment provides a clear and detailed view of the scene.
80+
```
81+
82+
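In addition, the default model accepts an image input resolution of `512x512`. If the image passed to `set_image` has a different resolution, `set_image` automatically calls `img.resize` to scale it, using the method specified by `fit`. For example, `image.Fit.FIT_CONTAIN` scales the image while keeping its original aspect ratio when that ratio differs from the expected one, and fills the surrounding blank area with black.
83+
84+
### Modifying Parameters
85+
86+
Some model parameters can be modified. Refer to [the Qwen documentation](./llm_qwen.md).
87+
88+
## Custom Quantized Model
89+
90+
The model provided above is quantized for MaixCAM2. If you need to quantize a model yourself, refer to:
91+
* [Pulsar2 documentation](https://pulsar2-docs.readthedocs.io/zh-cn/latest/appendix/build_llm.html)
92+
* Original model: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct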

docs/doc/zh/sidebar.yaml

Lines changed: 2 additions & 0 deletions
@@ -127,6 +127,8 @@ items:
127 127
label: Qwen3-VL Vision-Language Model
128 128
- file: mllm/lm_lora_sdv1_5.md
129 129
label: LCM-LoRA-SDv1-5 Model
130+
- file: mllm/vlm_smolvlm.md
131+
label: SmolVLM Vision-Language Model
130 132

131 133
- label: AI Model Conversion and Porting
132 134
items:

examples/mllm/vlm/vlm_smolvlm.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
1+
'''
2+
SmolVLM VLM example.
3+
Supported devices: MaixCAM2
4+
Unsupported devices: MaixCAM
5+
Models:
6+
- https://huggingface.co/sipeed/smolvlm-256m-instruct-maixcam2
7+
'''
8+
from maix import nn, err, log, sys, image, display
9+
10+
model = "/root/models/smolvlm-256m-instruct-maixcam2/model.mud"
11+
log.set_log_level(log.LogLevel.LEVEL_ERROR, color = False)
12+
disp = display.Display()
13+
14+
smolvlm = nn.SmolVLM(model)
15+
in_w = smolvlm.input_width()
16+
in_h = smolvlm.input_height()
17+
in_fmt = smolvlm.input_format()
18+
print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}")
19+
20+
def on_reply(obj, resp):
21+
print(resp.msg_new, end="")
22+
23+
smolvlm.set_system_prompt("You are a helpful assistant.")
24+
smolvlm.set_reply_callback(on_reply)
25+
26+
# load and set image
27+
img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt)
28+
smolvlm.set_image(img, fit=image.Fit.FIT_CONTAIN) # if the size does not match, the image is resized automatically first
29+
disp.show(img)
30+
31+
msg = "Describe the picture"
32+
print(">>", msg)
33+
resp = smolvlm.send(msg)
34+
err.check_raise(resp.err_code)
35+
36+
