| 1 | +--- |
| 2 | +title: MaixPy MaixCAM Running SmolVLM Visual Language Model |
| 3 | +update: |
| 4 | +  - date: 2025-12-03 |
| 5 | +    author: lxowalle |
| 6 | +    version: 1.0.0 |
| 7 | +    content: Added SmolVLM code and documentation |
| 8 | +--- |
| 9 | + |
| 10 | +## Supported Devices |
| 11 | + |
| 12 | +| Device | Supported | |
| 13 | +| -------- | --------- | |
| 14 | +| MaixCAM2 | ✅ | |
| 15 | +| MaixCAM | ❌ | |
| 16 | + |
| 17 | +## Introduction to SmolVLM |
| 18 | + |
| 19 | +A VLM (Vision-Language Model) takes text and image input and produces text output, for example a description of what an image shows. In effect, it lets the AI "see." |
| 20 | +SmolVLM currently supports English only. |
| 21 | + |
| 22 | +## Using SmolVLM in MaixPy MaixCAM |
| 23 | + |
| 24 | +### Model and Download Address |
| 25 | + |
| 26 | +If the `SmolVLM` model is not present in the default `/root/models` directory, you need to download it manually. |
| 27 | + * Memory requirement: 300 MB of CMM memory. For more information, see [the memory usage documentation](../pro/memory.md). |
| 28 | + |
| 29 | + * Download link: https://huggingface.co/sipeed/smolvlm-256m-instruct-maixcam2 |
| 30 | + |
| 31 | +The download method is the same as the one described in [the Qwen documentation](./llm_qwen.md). |
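
Before loading the model, it can help to confirm that the model file is actually on the device. The following is a minimal sketch using only the Python standard library; the path is the one used in the example in this document, and the helper name `model_is_present` is hypothetical, not part of MaixPy.

```python
import os

# Path used by the example in this document; adjust it if your model lives elsewhere.
DEFAULT_MODEL = "/root/models/smolvlm-256m-instruct-maixcam2/model.mud"


def model_is_present(path: str = DEFAULT_MODEL) -> bool:
    """Return True if the model description file exists and is a regular file."""
    return os.path.isfile(path)


if __name__ == "__main__":
    if model_is_present():
        print("model found:", DEFAULT_MODEL)
    else:
        print("model missing, download it first:", DEFAULT_MODEL)
```

If the check reports that the model is missing, download it from the link above before running the example.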
| 32 | + |
| 33 | +### Running the Model |
| 34 | + |
| 35 | +```python |
| 36 | +from maix import nn, err, log, sys, image, display |
| 37 | + |
| 38 | +model = "/root/models/smolvlm-256m-instruct-maixcam2/model.mud" |
| 39 | +log.set_log_level(log.LogLevel.LEVEL_ERROR, color = False) |
| 40 | +disp = display.Display() |
| 41 | + |
| 42 | +smolvlm = nn.SmolVLM(model) |
| 43 | +in_w = smolvlm.input_width() |
| 44 | +in_h = smolvlm.input_height() |
| 45 | +in_fmt = smolvlm.input_format() |
| 46 | +print(f"input size: {in_w}x{in_h}, format: {image.format_name(in_fmt)}") |
| 47 | + |
| 48 | +def on_reply(obj, resp): |
| 48 | +    print(resp.msg_new, end="") |
| 50 | + |
| 51 | +smolvlm.set_system_prompt("You are a helpful assistant.") |
| 52 | +smolvlm.set_reply_callback(on_reply) |
| 53 | + |
| 54 | +# load and set image |
| 55 | +img = image.load("/maixapp/share/picture/2024.1.1/ssd_car.jpg", format=in_fmt) |
| 56 | +smolvlm.set_image(img, fit=image.Fit.FIT_CONTAIN) # if the size does not match, the image is resized automatically first |
| 57 | +disp.show(img) |
| 58 | + |
| 59 | +msg = "Describe the picture" |
| 60 | +print(">>", msg) |
| 61 | +resp = smolvlm.send(msg) |
| 62 | +err.check_raise(resp.err_code) |
| 63 | +``` |
| 64 | + |
| 65 | +Output: |
| 66 | +``` |
| 67 | +>> Describe the picture |
| 68 | +The image depicts a prominent bus stop, specifically in the middle, where a young woman is captured and standing on the sidewalk. The bus, which appears to be a double-decker bus, is prominently displayed in the center of the image. The bus is red with bold white text and design elements on its side. The text on the bus reads "THING'S GET MORE EXCITING." |
| 69 | + |
| 70 | +Below this text is a small image of the bus logo. The bus is parked next to another bus, both in a city background. The background on which the bus is parked is not clearly discernible due to the perspective, but it looks urban due to the buildings and street signs visible. |
| 71 | + |
| 72 | +The woman in the image is looking towards the bus on the street, possibly waiting to board or simply admiring the scene. She is wearing a black coat, and her hair is short and dark. The bus itself has a red roof, and its windows are visible. The bus’s front is also visible, but it is not as prominent as the bus’s front side. |
| 73 | + |
| 74 | +In the background, there are buildings and a large glass window. The sky is not visible, but it is bright, as indicated by the light reflection on the windows. The street is wide and seems to be a busy urban street, possibly with cars and other vehicles. |
| 75 | + |
| 76 | +The bus stop itself seems to be in an area that is busy. There are traffic signs visible, and the sidewalk looks well-maintained. The street is wide enough for a bus to pass by at a distance, though it is not very wide. The overall environment appears modern and functional. |
| 77 | + |
| 78 | +This vivid depiction of the bus stop and the surrounding environment provides a clear and detailed view of the scene. |
| 79 | +``` |
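
The reply arrives in pieces through the callback, each piece in `resp.msg_new`. If you want the complete text as one string as well as the streamed printout, one option is to collect the pieces as they arrive. The sketch below is illustrative only: `ReplyCollector` is a hypothetical helper, the response objects are simulated with `SimpleNamespace` so the code runs without a device, and whether `set_reply_callback` accepts an arbitrary callable object such as this one is an assumption to verify on your firmware.

```python
from types import SimpleNamespace


class ReplyCollector:
    """Callback that prints each streamed piece and keeps it for later use."""

    def __init__(self):
        self.chunks = []

    def __call__(self, obj, resp):
        # Same signature as on_reply(obj, resp) in the example above.
        self.chunks.append(resp.msg_new)
        print(resp.msg_new, end="")

    def text(self):
        # Join all pieces received so far into the full reply.
        return "".join(self.chunks)


collector = ReplyCollector()

# Simulated stream standing in for the model's replies (not real device output).
for piece in ["The image ", "depicts ", "a bus."]:
    collector(None, SimpleNamespace(msg_new=piece))
print()

full_text = collector.text()
```

On the device, the intended use would be to pass `collector` to `smolvlm.set_reply_callback` in place of `on_reply`, then read `collector.text()` after `send` returns.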
| 80 | + |
| 81 | +The default model accepts an input resolution of `512×512`. If the image passed to `set_image` has a different resolution, `set_image` automatically calls `img.resize` to scale it, and the `fit` parameter controls the scaling method. For example, `image.Fit.FIT_CONTAIN` preserves the original aspect ratio and pads the remaining area with black when the aspect ratio differs from the required one. |
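
To make the "contain" behavior concrete, the sketch below computes the scaled size and the padding for a given source image. It is plain arithmetic, not MaixPy code: the function name `contain_geometry` is hypothetical, and centered padding with rounding to the nearest pixel is an assumption, so the exact pixel values used by `img.resize` may differ.

```python
def contain_geometry(src_w, src_h, dst_w=512, dst_h=512):
    """Return (new_w, new_h, pad_x, pad_y) for an aspect-preserving fit.

    The image is scaled by the largest factor that still fits inside the
    target, then centered; pad_x and pad_y are the left and top margins
    (assumed to be filled with black).
    """
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = max(1, round(src_w * scale))
    new_h = max(1, round(src_h * scale))
    pad_x = (dst_w - new_w) // 2
    pad_y = (dst_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y


# A 640x480 image is scaled to 512x384 and padded by 64 rows at the top
# (and the remaining 64 rows at the bottom).
print(contain_geometry(640, 480))   # (512, 384, 0, 64)

# An image that already matches the model input needs no scaling or padding.
print(contain_geometry(512, 512))   # (512, 512, 0, 0)
```

The aspect ratio is unchanged in both cases; only the unused area of the `512×512` input is padded.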
| 82 | + |
| 83 | +## Modifying Parameters |
| 84 | +Some model parameters can be modified. Refer to [the Qwen documentation](./llm_qwen.md) for details. |
| 85 | + |
| 86 | +## Custom Quantized Model |
| 87 | + |
| 88 | +The model provided above is a quantized model for MaixCAM2. If you want to quantize your own model, refer to: |
| 89 | + |
| 90 | +* [Pulsar2 Documentation](https://pulsar2-docs.readthedocs.io/zh-cn/latest/appendix/build_llm.html) |
| 91 | +* Original model: https://huggingface.co/HuggingFaceTB/SmolVLM-256M-Instruct |