mtmd : support SmolVLM (version 1 and 2) #13050
Conversation
```python
# Fall back to default values when config.json omits these vision hparams
if self.hparams["model_type"] == "smolvlm_vision":
    self.hparams["hidden_size"] = self.hparams.get("hidden_size", 1152)
    self.hparams["num_attention_heads"] = self.hparams.get("num_attention_heads", 16)
    self.hparams["intermediate_size"] = self.hparams.get("intermediate_size", 3072)
    self.hparams["num_hidden_layers"] = self.hparams.get("num_hidden_layers", 12)
```
@compilade Recently I have seen many models with missing keys in config.json. Just wondering, should we at some point use AutoConfig in load_hparams to prevent this from happening?
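(Illustration only, not from the thread: a minimal sketch of what an AutoConfig-based fallback in load_hparams could look like, assuming the converter currently reads config.json directly and that depending on transformers here is acceptable. Names and behaviour below are assumptions, not the actual implementation.)

```python
from pathlib import Path

from transformers import AutoConfig


def load_hparams(dir_model: Path) -> dict:
    # Hypothetical sketch: let AutoConfig resolve architecture defaults that some
    # repos omit from config.json, then flatten the config back to a plain dict
    # so the rest of the conversion code keeps working with dictionary lookups.
    config = AutoConfig.from_pretrained(dir_model)
    return config.to_dict()
```

The trade-off is that AutoConfig only knows about architectures registered in the installed transformers version, so a plain-JSON fallback would still be needed for models it does not recognize.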
I just wanted to say, you are a true legend. Your contributions over the last few months have been nothing short of amazing; you are a machine and a great addition to the project, especially in terms of vision support. Thank you so much! <3
Thank you so much!
@compilade I'm merging this PR so that I can continue working on other models. Feel free to continue the discussion (not urgent though), thanks! 🤗
* mtmd : support SmolVLM (version 1 and 2)
* correct chat template
* fix n_patches
* scale_factor is an int
* add more models to test
First of all, I'd like to thank llama.cpp for supporting SmolVLM, as I have been trying to deploy it to mobile devices recently. However, when I tried the latest version of llama.cpp, it seems that it doesn't support image captioning: the text generated by the following command is just a mess. Did I miss something? The output looks like this:
@wxcchdStar try the 500M, it works much better. I don't know why the 256M doesn't give a meaningful response.
@ngxson Thank you for all the work you've done recently on adding multimodal support! I'm trying to use SmolVLM-500M from the pre-quantized link you sent above (q8 model, f16 mmproj), using pre-compiled version b5266 on Windows. When I run the following command, the model doesn't output any tokens. Here is the full output from the CLI:

Strangely enough, if I reduce my prompt to something smaller, I see the warnings about the incorrect tokenizer config and the possible template bug. Is this an error on my part, or is it a problem with the current implementation of libmtmd? Any help would be appreciated, thank you!
Tbh I'm not even sure this is a problem with the model itself, given that SmolVLM is very small (500M params in this case), which may make it hard to prompt. Maybe you can try the f16 version of the model? For a model this small, quantization has a very big impact on quality. And finally, it may be better to use the bigger 2.2B model; it should give a much better result.
Here I am again~ I ran the SmolVLM-256M-Instruct and SmolVLM-500M-Instruct models using the Transformers library, and the results look pretty good (though there are some minor errors). Then I ran SmolVLM-256M-Instruct-GGUF (q8 and f16) and SmolVLM-500M-Instruct-GGUF (q8 and f16) using llama-mtmd-cli, and the inference results for the same image were significantly worse. Comparing the inference results of the two libraries (Transformers and llama.cpp), it seems that the issue is not related to model size or precision, but rather to the model reconstruction.

The results from running with Transformers are as follows:

The inference results from llama.cpp are as follows:

So I began to dig into the source code and found that the most problematic area is the image encoder. The encoding architecture of SmolVLM is as follows:

However, in clip.cpp it does not seem to be aligned with SmolVLM:
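(Aside, not part of the original comment: the architecture diagram and outputs were not preserved in this transcript. One hedged way to inspect the reference encoder when comparing it against clip.cpp is to print the Transformers module tree; the checkpoint name below is an assumption.)

```python
import torch
from transformers import AutoModelForVision2Seq

# Dump the reference module tree: the vision tower, the pixel-shuffle connector and
# the text model all show up in the printout, which makes it easier to match layer
# names and shapes against what clip.cpp builds from the GGUF.
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-256M-Instruct",  # assumed checkpoint
    torch_dtype=torch.float32,              # full precision for a fair comparison
)
print(model)
```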
@wxcchdStar Ok, thanks for the interesting finding. Could you first check whether text-only inference works correctly? Yes, I think I may have missed some details in the vision encoder, but first I just want to make sure that the text model is correct.
Re. the placement of the image before or after the prompt, I think this is not actually something we can fix right now. With the integration of …
@ngxson Sure. I compared Transformers and llama.cpp, and both of their text models use LlamaModel. So I checked the llm_build_llama function in llama_model.cpp, and I think it looks fine. (I'm still learning the ggml code and can't fully understand it yet.) SmolVLM's text model in transformers:
Tbh what you show me doesn't make much sense. I can see that the diff is probably that SmolVLM uses … If you want to test the text model, just try asking it a question. Simple.
@ngxson OK. Below are the results of my tests on the SmolVLM-256M model with llama.cpp:

Below are the results of my tests on the same model with Transformers:
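(Aside, not part of the original exchange: the actual outputs were not preserved here. A text-only sanity check against the Transformers implementation could look like the sketch below; the checkpoint name, prompt and generation settings are assumptions.)

```python
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id)

# Text-only chat: no image in the message, so only the language model is exercised,
# which is what needs to be compared against llama.cpp's text-only output.
messages = [{"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, return_tensors="pt")

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```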
@ngxson It seems that I have resolved the issue with the performance of SmolVLM-256M by making just two modifications:
Ollama is not working with this model:
https://huggingface.co/ggml-org/SmolVLM-500M-Instruct-GGUF?local-app=ollama
Add support for SmolVLM model:
Pre-quantized GGUFs are available on https://huggingface.co/ggml-org
To try the pre-quantized model:
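(The original command block is not preserved in this transcript. As a rough sketch, the pre-quantized files could be fetched with huggingface_hub and then passed to llama-mtmd-cli; the repo id below is taken from the 500M link earlier in the thread.)

```python
from huggingface_hub import snapshot_download

# Download the text-model and mmproj GGUFs from the ggml-org repo; the resulting
# local files can then be passed to llama-mtmd-cli for inference.
local_dir = snapshot_download(
    repo_id="ggml-org/SmolVLM-500M-Instruct-GGUF",
    allow_patterns=["*.gguf"],
)
print("GGUF files downloaded to:", local_dir)
```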
To convert the GGUF yourself (both text and mmproj model), use the convert_hf_to_gguf.py script.

Personal opinion: the model is very small but optimized for vision tasks (OCR, object detection, etc.). It could be a fun project to use this model in an AI camera home surveillance system.