
Conversation

@ngxson
Collaborator

@ngxson ngxson commented Apr 21, 2025

Add support for SmolVLM models (version 1 and 2):

Pre-quantized GGUFs are available on https://huggingface.co/ggml-org

To try the pre-quantized models:

llama-mtmd-cli -hf ggml-org/SmolVLM-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM-256M-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM-500M-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-2.2B-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-256M-Video-Instruct-GGUF
llama-mtmd-cli -hf ggml-org/SmolVLM2-500M-Video-Instruct-GGUF

To convert the GGUFs yourself (both the text and mmproj models), use the convert_hf_to_gguf.py script:

cd SmolVLM2-2.2B-Instruct

# convert text model
python ../llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 .
# output file: model.gguf

# convert vision model
python ../llama.cpp/convert_hf_to_gguf.py --outfile model.gguf --outtype f16 --mmproj .
# output file: mmproj-model.gguf

Personal opinion: the model is very small but optimized for vision tasks (OCR, object detection, etc.). It could be a fun project to use this model in an AI camera home surveillance system.

@github-actions github-actions bot added examples python python script changes labels Apr 21, 2025
@ngxson ngxson marked this pull request as ready for review April 21, 2025 19:21
@ngxson ngxson requested review from compilade and ggerganov April 21, 2025 19:21
Comment on lines +1894 to +1898
if self.hparams["model_type"] == "smolvlm_vision":
self.hparams["hidden_size"] = self.hparams.get("hidden_size", 1152)
self.hparams["num_attention_heads"] = self.hparams.get("num_attention_heads", 16)
self.hparams["intermediate_size"] = self.hparams.get("intermediate_size", 3072)
self.hparams["num_hidden_layers"] = self.hparams.get("num_hidden_layers", 12)
Collaborator Author

@ngxson ngxson Apr 21, 2025


@compilade recently I have seen many models with missing keys in config.json. Just wondering, at some point should we use AutoConfig in load_hparams to prevent this from happening?
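
For illustration, a minimal sketch of what I mean (assuming the usual transformers AutoConfig API; load_hparams_with_autoconfig is a hypothetical helper, not existing code):

    from transformers import AutoConfig

    def load_hparams_with_autoconfig(model_dir: str) -> dict:
        # AutoConfig backfills class defaults for keys missing from config.json,
        # so hard-coded fallbacks like the ones above would no longer be needed
        cfg = AutoConfig.from_pretrained(model_dir)
        return cfg.to_dict()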

@Dampfinchen

I just wanted to say, you are a true legend. Your contributions over the last few months have been nothing short of amazing; you are a machine and a great addition to the project, especially in terms of vision support.

Thank you so much! <3

@andimarafioti

Thank you so much!

@ngxson
Collaborator Author

ngxson commented Apr 22, 2025

@compilade I'm merging this PR so that I can continue working with other models. Feel free to continue the discussion (not urgent though), thanks! 🤗

@ngxson ngxson merged commit dc39a5e into ggml-org:master Apr 22, 2025
51 checks passed
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Apr 28, 2025
* mtmd : support SmolVLM (version 1 and 2)

* correct chat template

* fix n_patches

* scale_factor is an int

* add more models to test
@wxcchdStar

wxcchdStar commented Apr 29, 2025

First of all, I'd like to thank llama.cpp for supporting SmolVLM, as I have been trying to deploy it to mobile devices recently. However, with the latest version of llama.cpp, image captioning doesn't seem to work: the text generated by the following command is just a mess.

Did I miss something?

./build/bin/llama-mtmd-cli -m SmolVLM-256M-Instruct/model.gguf --mmproj SmolVLM-256M-Instruct/mmproj-model.gguf --image <image_path> -p "Can you describe this image?" -n 100

The output looks like this:


The text:

The text is as follows:

The text is as follows:

The text is in a language that is not specified in the given text.

The text is in a language that is not specified in the given text.

The text is a question that is asking for the translation of the given text.

The text is in a language that is not specified in the given text.

@ngxson
Collaborator Author

ngxson commented Apr 29, 2025

@wxcchdStar try the 500M, it works much better. I don't know why the 256M doesn't give a meaningful response.

@LukeSutor

@ngxson Thank you for all the work you've done recently on adding multimodal support!

I'm trying to use SmolVLM 500M from the pre-quantized link you sent above (q8 model, f16 mmproj), using pre-compiled version b5266 on Windows. When I run the following command, the model doesn't output any tokens:

./llama-mtmd-cli.exe -m text-model.gguf --mmproj mmproj-model.gguf --image <image_path> --chat-template smolvlm --ctx-size 8192 -p "You are a computer screen analysis expert. What is shown in this image?"

Here is the full output from the CLI:

build: 5265 (2f567611) with MSVC 19.43.34808.0 for x64
llama_model_loader: loaded meta data with 75 key-value pairs and 291 tensors from C:/Users/Luke/AppData/Roaming/com.tauri.dev/models/vlm/text-model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = SmolVLM2 500M Video Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Video-Instruct
llama_model_loader: - kv   4:                           general.basename str              = SmolVLM2
llama_model_loader: - kv   5:                         general.size_label str              = 500M
llama_model_loader: - kv   6:                            general.license str              = apache-2.0
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = SmolVLM 500M Instruct    
llama_model_loader: - kv   9:          general.base_model.0.organization str              = HuggingFaceTB
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/HuggingFaceTB/...
llama_model_loader: - kv  11:                      general.dataset.count u32              = 12
llama_model_loader: - kv  12:                     general.dataset.0.name str              = The_Cauldron
llama_model_loader: - kv  13:             general.dataset.0.organization str              = HuggingFaceM4
llama_model_loader: - kv  14:                 general.dataset.0.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  15:                     general.dataset.1.name str              = Docmatix
llama_model_loader: - kv  16:             general.dataset.1.organization str              = HuggingFaceM4
llama_model_loader: - kv  17:                 general.dataset.1.repo_url str              = https://huggingface.co/HuggingFaceM4/...
llama_model_loader: - kv  18:                     general.dataset.2.name str              = LLaVA OneVision Data     
llama_model_loader: - kv  19:             general.dataset.2.organization str              = Lmms Lab
llama_model_loader: - kv  20:                 general.dataset.2.repo_url str              = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv  21:                     general.dataset.3.name str              = M4 Instruct Data
llama_model_loader: - kv  22:             general.dataset.3.organization str              = Lmms Lab
llama_model_loader: - kv  23:                 general.dataset.3.repo_url str              = https://huggingface.co/lmms-lab/M4-In...
llama_model_loader: - kv  24:                     general.dataset.4.name str              = Finevideo
llama_model_loader: - kv  25:             general.dataset.4.organization str              = HuggingFaceFV
llama_model_loader: - kv  26:                 general.dataset.4.repo_url str              = https://huggingface.co/HuggingFaceFV/...
llama_model_loader: - kv  27:                     general.dataset.5.name str              = MAmmoTH VL Instruct 12M  
llama_model_loader: - kv  28:             general.dataset.5.organization str              = MAmmoTH VL
llama_model_loader: - kv  29:                 general.dataset.5.repo_url str              = https://huggingface.co/MAmmoTH-VL/MAm...
llama_model_loader: - kv  30:                     general.dataset.6.name str              = LLaVA Video 178K
llama_model_loader: - kv  31:             general.dataset.6.organization str              = Lmms Lab
llama_model_loader: - kv  32:                 general.dataset.6.repo_url str              = https://huggingface.co/lmms-lab/LLaVA...
llama_model_loader: - kv  33:                     general.dataset.7.name str              = Video STaR
llama_model_loader: - kv  34:             general.dataset.7.organization str              = Orrzohar
llama_model_loader: - kv  35:                 general.dataset.7.repo_url str              = https://huggingface.co/orrzohar/Video...
llama_model_loader: - kv  36:                     general.dataset.8.name str              = Vript
llama_model_loader: - kv  37:             general.dataset.8.organization str              = Mutonix
llama_model_loader: - kv  38:                 general.dataset.8.repo_url str              = https://huggingface.co/Mutonix/Vript
llama_model_loader: - kv  39:                     general.dataset.9.name str              = VISTA 400K
llama_model_loader: - kv  40:             general.dataset.9.organization str              = TIGER Lab
llama_model_loader: - kv  41:                 general.dataset.9.repo_url str              = https://huggingface.co/TIGER-Lab/VIST...
llama_model_loader: - kv  42:                    general.dataset.10.name str              = MovieChat 1K_train       
llama_model_loader: - kv  43:            general.dataset.10.organization str              = Enxin
llama_model_loader: - kv  44:                general.dataset.10.repo_url str              = https://huggingface.co/Enxin/MovieCha...
llama_model_loader: - kv  45:                    general.dataset.11.name str              = ShareGPT4Video
llama_model_loader: - kv  46:            general.dataset.11.organization str              = ShareGPT4Video
llama_model_loader: - kv  47:                general.dataset.11.repo_url str              = https://huggingface.co/ShareGPT4Video...
llama_model_loader: - kv  48:                               general.tags arr[str,1]       = ["image-text-to-text"]   
llama_model_loader: - kv  49:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  50:                          llama.block_count u32              = 32
llama_model_loader: - kv  51:                       llama.context_length u32              = 8192
llama_model_loader: - kv  52:                     llama.embedding_length u32              = 960
llama_model_loader: - kv  53:                  llama.feed_forward_length u32              = 2560
llama_model_loader: - kv  54:                 llama.attention.head_count u32              = 15
llama_model_loader: - kv  55:              llama.attention.head_count_kv u32              = 5
llama_model_loader: - kv  56:                       llama.rope.freq_base f32              = 100000.000000
llama_model_loader: - kv  57:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  58:                 llama.attention.key_length u32              = 64
llama_model_loader: - kv  59:               llama.attention.value_length u32              = 64
llama_model_loader: - kv  60:                           llama.vocab_size u32              = 49280
llama_model_loader: - kv  61:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv  62:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  63:                         tokenizer.ggml.pre str              = smollm
llama_model_loader: - kv  64:                      tokenizer.ggml.tokens arr[str,49280]   = ["<|endoftext|>", "<|im_start|>", "<|...
llama_model_loader: - kv  65:                  tokenizer.ggml.token_type arr[i32,49280]   = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  66:                      tokenizer.ggml.merges arr[str,48900]   = ["Ġ t", "Ġ a", "i n", "h e", "Ġ Ġ...
llama_model_loader: - kv  67:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  68:                tokenizer.ggml.eos_token_id u32              = 49279
llama_model_loader: - kv  69:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  70:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  71:                    tokenizer.chat_template str              = <|im_start|>{% for message in message...
llama_model_loader: - kv  72:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  73:               general.quantization_version u32              = 2
llama_model_loader: - kv  74:                          general.file_type u32              = 7
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q8_0:  226 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 414.86 MiB (8.50 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 145
load: token to piece cache size = 0.3199 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 960
print_info: n_layer          = 32
print_info: n_head           = 15
print_info: n_head_kv        = 5
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 320
print_info: n_embd_v_gqa     = 320
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2560
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 100000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 409.25 M
print_info: general.name     = SmolVLM2 500M Video Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 49280
print_info: n_merges         = 48900
print_info: BOS token        = 1 '<|im_start|>'
print_info: EOS token        = 49279 '<end_of_utterance>'
print_info: EOT token        = 0 '<|endoftext|>'
print_info: UNK token        = 0 '<|endoftext|>'
print_info: PAD token        = 2 '<|im_end|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM REP token    = 4 '<reponame>'
print_info: EOG token        = 0 '<|endoftext|>'
print_info: EOG token        = 2 '<|im_end|>'
print_info: EOG token        = 4 '<reponame>'
print_info: EOG token        = 49279 '<end_of_utterance>'
print_info: max token length = 162
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   414.86 MiB
...............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 100000.0
llama_context: freq_scale    = 1
llama_context:        CPU  output buffer size =     0.19 MiB
llama_kv_cache_unified: kv_size = 8192, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32    
llama_kv_cache_unified:        CPU KV buffer size =   320.00 MiB
llama_kv_cache_unified: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_context:        CPU compute buffer size =   263.51 MiB
llama_context: graph nodes  = 1094
llama_context: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
Failed to infer a tool call example (possible template bug)
mtmd_cli_context: chat template example:
<|im_start|>You are a helpful assistant

User: Hello<end_of_utterance>
Assistant: Hi there<end_of_utterance>
User: How are you?<end_of_utterance>
Assistant:
clip_ctx: CLIP using CPU backend
clip_model_loader: model name:   SmolVLM2 500M Video Instruct
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    198
clip_model_loader: n_kv:         66

load_hparams: projector:          idefics3
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  4
load_hparams: n_wa_pattern:       0
load_hparams: use_silu:           0
load_hparams: use_gelu:           1
load_hparams: model size:         190.22 MiB
load_hparams: metadata size:      0.07 MiB
alloc_compute_meta:        CPU compute buffer size =    63.00 MiB
main: loading model: C:/Users/Luke/AppData/Roaming/com.tauri.dev/models/vlm/text-model.gguf
encoding image or slice...
image/slice encoded in 1141 ms
decoding image batch 1/1, n_tokens_batch = 256
image decoded (batch 1/1) in 673 ms




llama_perf_context_print:        load time =     198.28 ms
llama_perf_context_print: prompt eval time =    1887.93 ms /   283 tokens (    6.67 ms per token,   149.90 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    2116.36 ms /   284 tokens

Strangely enough, if I reduce my prompt to something smaller like -p "What is shown in this image?", I get correctly generated tokens.

I see the warnings about the incorrect tokenizer config and the possible template bug. Is this an error on my part or is this a problem with the current implementation of libmtmd? Any help would be appreciated, thank you!

@ngxson
Collaborator Author

ngxson commented May 2, 2025

Tbh I'm not even sure if this is a problem with the model itself; given that SmolVLM is very small (500M params in this case), it may simply be hard to prompt.

Maybe you can try the f16 version of the model? For a model this small, quantization has a very big impact on quality. And finally, it may be better to use the bigger 2.2B model; it should give a much better result.

@wxcchdStar

wxcchdStar commented May 4, 2025

Tbh I'm not even sure if this is a problem with the model itself; given that SmolVLM is very small (500M params in this case), it may simply be hard to prompt.

Maybe you can try the f16 version of the model? For a model this small, quantization has a very big impact on quality. And finally, it may be better to use the bigger 2.2B model; it should give a much better result.

Here I am again~.

I ran the SmolVLM-256M-Instruct and SmolVLM-500M-Instruct models using the Transformers library, and the results look pretty good (though there are some minor errors). Then I ran SmolVLM-256M-Instruct-GGUF (q8 and f16) and SmolVLM-500M-Instruct-GGUF (q8 and f16) using llama-mtmd-cli, and the inference results for the same image were significantly worse.

By comparing the inference results of the two libraries (Transformers and llama.cpp), it seems that the issue is not related to model size or precision, but rather to how the model is reconstructed in llama.cpp.

The results from running with Transformers are as follows:

The image features a young woman walking down a path in a park. She is wearing a yellow suit, which includes a jacket and a skirt, and accessorizes with a necklace. The woman is also wearing a pair of heels, adding elegance to her outfit.

The park setting is lush and green, with various trees and bushes surrounding the path. The woman is the central focus of the image, and her outfit and the park environment create a visually appealing scene.

The inference results from llama.cpp are as follows:

- the
- The
- the
- the
-
- the
- the
- the
- the
- the
- the
- the
- the
- the

So, I began to delve into the source code and found that the most problematic area was the image encoder.

The encoding architecture of SmolVLM is as follows:

    (vision_model): Idefics3VisionTransformer(
      (embeddings): Idefics3VisionEmbeddings(
        (patch_embedding): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16), padding=valid)
        (position_embedding): Embedding(1024, 768)
      )
      (encoder): Idefics3Encoder(
        (layers): ModuleList(
          (0-11): 12 x Idefics3EncoderLayer(
            (self_attn): Idefics3VisionAttention(
              (k_proj): Linear(in_features=768, out_features=768, bias=True)
              (v_proj): Linear(in_features=768, out_features=768, bias=True)
              (q_proj): Linear(in_features=768, out_features=768, bias=True)
              (out_proj): Linear(in_features=768, out_features=768, bias=True)
            )
            (layer_norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
            (mlp): Idefics3VisionMLP(
              (activation_fn): PytorchGELUTanh()
              (fc1): Linear(in_features=768, out_features=3072, bias=True)
              (fc2): Linear(in_features=3072, out_features=768, bias=True)
            )
            (layer_norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
          )
        )
      )
      (post_layernorm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
    )

However, the corresponding graph in clip.cpp does not seem to be aligned with SmolVLM:

    // copy from clip_image_build_graph_siglip:
    // input raw
    struct ggml_tensor * inp_raw = ggml_new_tensor_3d(ctx0, GGML_TYPE_F32, image_size_width, image_size_height, 3);
    ggml_set_name(inp_raw, "inp_raw");
    ggml_set_input(inp_raw);

    struct ggml_tensor * inp = ggml_conv_2d(ctx0, model.patch_embeddings_0, inp_raw, patch_size, patch_size, 0, 0, 1, 1);
    inp = ggml_reshape_2d(ctx0, inp, num_patches, hidden_size);
    inp = ggml_cont(ctx0, ggml_transpose(ctx0, inp));
    inp = ggml_add(ctx0, inp, model.patch_bias);

    // position embeddings
    struct ggml_tensor * embeddings = ggml_add(ctx0, inp, model.position_embeddings);
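
For a side-by-side comparison, here is a rough PyTorch sketch (illustrative only, using SmolVLM-256M's vision dimensions; not actual clip.cpp code) of what the quoted ggml snippet computes for the patch and position embeddings:

    import torch

    hidden_size, patch_size, image_size = 768, 16, 512       # SmolVLM-256M vision config
    n_patches = (image_size // patch_size) ** 2               # 32 * 32 = 1024

    pixels      = torch.randn(1, 3, image_size, image_size)   # inp_raw
    patch_embed = torch.nn.Conv2d(3, hidden_size, patch_size, stride=patch_size)
    pos_embed   = torch.nn.Embedding(n_patches, hidden_size)

    x = patch_embed(pixels)                        # ggml_conv_2d (the Conv2d bias plays the role of patch_bias)
    x = x.flatten(2).transpose(1, 2)               # reshape + transpose -> [1, 1024, 768]
    x = x + pos_embed(torch.arange(n_patches))     # add position embeddings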

@wxcchdStar

wxcchdStar commented May 4, 2025

Additionally, I found a small detail: the placement of <image> also affects the inference results. If I place it at the beginning (in transformers, it is also placed at the beginning), the inference quality is slightly better.


The prompt produced by transformers is "<|im_start|>User:<image>Can you describe this image?<end_of_utterance>\nAssistant:"
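
For reference, a minimal transformers sketch (assuming the HuggingFaceTB/SmolVLM-256M-Instruct processor) that produces the prompt quoted above, with <image> placed before the text:

    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": "Can you describe this image?"}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    print(prompt)  # <|im_start|>User:<image>Can you describe this image?<end_of_utterance>\nAssistant: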

Although the performance is still somewhat behind that of transformers, it is better than the previous nonsensical outputs.

The image is a photograph of a photograph of a photograph taken in full color, and the image is of an outdoor scene. The photograph is of a natural landscape with green foliage. The photo is of a scene taken in natural light with natural sunlight. The photograph is taken at the same angle and in the same shot, there is no text written, nor any text in the image. The photograph is clear and well lit. The background is a natural landscape. The photo is of a landscape scene with greenery and the sky, with a natural light and natural shadows. 

The photo is taken in the natural light and natural shadows and natural colors. The natural light creates an impression of a natural setting. The photo is taken at a high angle and the subject is looking downward.

or

The image contains a title and a description of the image.

Here is the image:

The image contains a title and a description of the image.

The title of the image is "The image contains a title and a description of the image".

The image consists of a photograph.

In the photograph there is a person. The person is standing in the middle of the image, they have a smile. 
The person has a white shirt and they are wearing a white coat.

@ngxson
Collaborator Author

ngxson commented May 4, 2025

@wxcchdStar Ok, thanks for the interesting findings. Could you first check whether text-only inference works correctly?

Yes, I think I may have missed some details in the vision encoder, but first I just want to make sure that the text model is correct.

@ngxson
Collaborator Author

ngxson commented May 4, 2025

Re. the placement of the image before or after the prompt, I think this is not something we can fix right now. With the integration of libmtmd into llama-server coming soon, you will have more options for formatting the prompt.

@wxcchdStar

@ngxson Sure. I compared Transformers and llama.cpp, and both of their text models use LlamaModel. So I checked the llm_build_llama function in llama-model.cpp, and I think it looks fine. (I'm still learning the ggml code and can't fully understand it yet.)

SmolVLM's text model in transformers:

    (text_model): LlamaModel(
      (embed_tokens): Embedding(49280, 576, padding_idx=2)
      (rotary_emb): LlamaRotaryEmbedding()
      (layers): ModuleList(
        (0-29): 30 x LlamaDecoderLayer(
          (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
          (self_attn): LlamaAttention(
            (q_proj): Linear(in_features=576, out_features=576, bias=False)
            (k_proj): Linear(in_features=576, out_features=192, bias=False)
            (v_proj): Linear(in_features=576, out_features=192, bias=False)
            (o_proj): Linear(in_features=576, out_features=576, bias=False)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
          (mlp): LlamaMLP(
            (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
            (up_proj): Linear(in_features=576, out_features=1536, bias=False)
            (down_proj): Linear(in_features=1536, out_features=576, bias=False)
            (act_fn): SiLU()
          )
        )
      )
      (norm): LlamaRMSNorm((576,), eps=1e-05)
    )

@ngxson
Collaborator Author

ngxson commented May 5, 2025

Tbh what you show me doesn't tell much. I can see that the difference is probably that SmolVLM uses PytorchGELUTanh while we are using normal GELU. Don't count on that module list too much, because it doesn't tell you exactly how one module is linked to another.

If you want to test the text model, just try asking it questions. Simple.
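
For what it's worth, a quick standalone check (just the textbook formulas, not llama.cpp code) of how the tanh approximation used by PytorchGELUTanh differs from exact GELU:

    import math

    def gelu_exact(x: float) -> float:
        return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_tanh(x: float) -> float:
        # tanh approximation used by PytorchGELUTanh
        return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

    for v in (-2.0, -0.5, 0.5, 2.0):
        print(v, gelu_exact(v), gelu_tanh(v))   # differences stay around 1e-3 or below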

@wxcchdStar

wxcchdStar commented May 5, 2025

@ngxson OK.

Below are the results of my tests on the SmolVLM-256M model with llama.cpp:

> where is china?
 China is a country located in East Asia. It is a federal republic with a parliamentary system of government. The capital of China is Beijing.

> what is LLM?
 LLM stands for LLM for the Word Problem.

> what is VLM?
 VLM is a type of linear programming problem that is used to solve linear programming problems. It is a type of linear programming problem where the objective function is the sum of the values of all variables, and the constraints are the values of all variables.
 
 > 1 + 1 = ?
 1 + 1 = 2

Below are the results of my tests on the SmolVLM-256M model with Transformers:

User: where is china?
Assistant: China is a country located in East Asia. It is a federal republic with a single currency, the renminbi. The country is known for its diverse culture, diverse population, and rich history.

User: what is LLM?
Assistant: LLM stands for Lazy Luminar. It is a library of books that allows users to access a vast collection of books, including fiction, history, novels, and scientific research. The library is designed to be user-friendly and accessible, with a focus on providing a wide range of books to suit different preferences and reading levels.

User: what is VLM?
Assistant: VLM stands for Varying Linear Models.

User: 1 + 1 = ?
Assistant: 1 + 1 = 2

@wxcchdStar

@ngxson It seems that I have resolved the performance issue with SmolVLM-256M by making just two modifications:

  1. The first modification is in the clip_n_patches_by_img function in clip.cpp. I changed the value of n_patches to 64 (which is taken from the image_seq_len field in processor_config.json). Previously, it was 256 (calculated as 1024 / 4). See the sketch below.

  2. The second modification is what I mentioned earlier: placing <image> at the beginning.

[screenshots of the corresponding clip.cpp and mtmd-cli.cpp changes]
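
A minimal sketch of the patch-count arithmetic behind the first fix (assuming SmolVLM-256M's 512-pixel input, 16-pixel patches, and the proj_scale_factor of 4 shown in the log earlier; the real change lives in clip.cpp):

    image_size   = 512
    patch_size   = 16
    scale_factor = 4                                  # idefics3 pixel-shuffle factor

    n_patches = (image_size // patch_size) ** 2       # 32 * 32 = 1024
    n_tokens  = n_patches // (scale_factor ** 2)      # 1024 / 16 = 64, matches image_seq_len
    # dividing by scale_factor only (1024 / 4 = 256) gives the old, wrong token count
    print(n_tokens)                                   # 64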

