yzma uses models in the GGUF format supported by llama.cpp. You can find many models in GGUF format on Hugging Face (over 163k at last count):
https://huggingface.co/models?library=gguf&sort=trending
Here are just a few examples of yzma with various well-known language models.
Don't forget to set your YZMA_LIB env variable to the directory with your llama.cpp library files!
Vision language models are a type of generative models that take image and text inputs, and generate text outputs.
https://huggingface.co/bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF
Good quality, default size for most use cases, recommended.
yzma model get -u https://huggingface.co/bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen_Qwen3-VL-2B-Instruct-Q4_K_M.gguf
Or for a smaller model with decent quality, smaller than Q4_K_S with similar performance, recommended.
yzma model get -u https://huggingface.co/bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF/resolve/main/Qwen_Qwen3-VL-2B-Instruct-IQ4_XS.gguf
In either case, you will need the projector:
yzma model get -u https://huggingface.co/bartowski/Qwen_Qwen3-VL-2B-Instruct-GGUF/resolve/main/mmproj-Qwen_Qwen3-VL-2B-Instruct-f16.gguf
go run ./examples/vlm/ -model ~/models/Qwen_Qwen3-VL-2B-Instruct-Q4_K_M.gguf -mmproj ~/models/mmproj-Qwen_Qwen3-VL-2B-Instruct-f16.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"
One of the smallest VLM with good performance on smaller devices.
Fast, recommended.
yzma model get -u https://huggingface.co/mradermacher/SmolVLM2-500M-Video-Instruct-i1-GGUF/resolve/main/SmolVLM2-500M-Video-Instruct.i1-Q4_K_M.gguf
Optimal size/speed/quality.
yzma model get -u https://huggingface.co/mradermacher/SmolVLM2-500M-Video-Instruct-i1-GGUF/resolve/main/SmolVLM2-500M-Video-Instruct.i1-Q4_K_S.gguf
In either case, you will need the projector:
yzma model get -u https://huggingface.co/ggml-org/SmolVLM2-500M-Video-Instruct-GGUF/resolve/main/mmproj-SmolVLM2-500M-Video-Instruct-Q8_0.gguf
https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-GGUF
yzma model get -u https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-GGUF/resolve/main/LFM2.5-VL-1.6B-Q8_0.gguf
Smaller model
yzma model get -u https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-GGUF/resolve/main/LFM2.5-VL-1.6B-Q4_0.gguf
In either case, you will need the projector:
yzma model get -u https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B-GGUF/resolve/main/mmproj-LFM2.5-VL-1.6b-Q8_0.gguf
go run ./examples/vlm/ -model ~/models/LFM2.5-VL-1.6B-Q4_0.gguf -mmproj ~/models/mmproj-LFM2.5-VL-1.6b-Q8_0.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"
https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF
yzma model get -u https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF/resolve/main/Qwen2.5-VL-3B-Instruct-Q8_0.gguf
yzma model get -u https://huggingface.co/ggml-org/Qwen2.5-VL-3B-Instruct-GGUF/resolve/main/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf
go run ./examples/vlm/ -model ~/models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf -mmproj ~/models/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"
https://huggingface.co/ggml-org/moondream2-20250414-GGUF
yzma model get -u https://huggingface.co/ggml-org/moondream2-20250414-GGUF/resolve/main/moondream2-text-model-f16_ct-vicuna.gguf
yzma model get -u https://huggingface.co/ggml-org/moondream2-20250414-GGUF/resolve/main/moondream2-mmproj-f16-20250414.gguf
go run ./examples/vlm/ -model ~/models/moondream2-text-model-f16_ct-vicuna.gguf -mmproj ~/models/moondream2-mmproj-f16-20250414.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"
https://huggingface.co/Qwen/Qwen3-4B-GGUF
yzma model get -u https://huggingface.co/Qwen/Qwen3-4B-GGUF/blob/main/Qwen3-4B-Q4_K_M.gguf
go run ./examples/chat/ -model ~/models/Qwen3-4B-Q4_K_M.gguf -temp=0.6 -n=512
https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF
yzma model get -u https://huggingface.co/bartowski/Qwen_Qwen3-0.6B-GGUF/resolve/main/Qwen_Qwen3-0.6B-Q4_K_M.gguf
go run ./examples/chat/ -model ~/models/Qwen_Qwen3-0.6B-Q4_K_M.gguf -temp=0.6 -n=512
https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF
yzma model get -u https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF/resolve/main/qwen2.5-0.5b-instruct-q4_k_m.gguf
go run ./examples/chat/ -model ~/models/qwen2.5-0.5b-instruct-q4_k_m.gguf -temp=0.6 -n=512
https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF
yzma model get -u https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/blob/main/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
go run ./examples/chat/ -model ~/models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -c 2048 -temp 0.7 -n 512 -sys "You are a helpful robot companion."
https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF
yzma model get -u https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
go run ./examples/chat/ -model ~/models/gemma-3-1b-it-Q4_K_M.gguf
https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF
yzma model get -u https://huggingface.co/bartowski/SmolLM2-135M-Instruct-GGUF/resolve/main/SmolLM2-135M-Instruct-Q4_K_M.gguf
go run ./examples/chat/ -model ~/models/SmolLM2-135M-Instruct-Q4_K_M.gguf -c 2048 -temp 0.8 -n 48 -sys "You are a helpful robot companion."
A vision-language-action model (VLA) is a class of multimodal foundation models that integrates vision, language and actions. Given an input image (or video) and a text instruction, a VLA outputs actions that can be executed to accomplish the requested task.
https://huggingface.co/mradermacher/Pelican1.0-VL-3B-i1-GGUF
Fast, recommended:
yzma model get -u https://huggingface.co/mradermacher/Pelican1.0-VL-3B-i1-GGUF/resolve/main/Pelican1.0-VL-3B.i1-Q4_K_M.gguf
or alternate with lower quality:
yzma model get -u https://huggingface.co/mradermacher/Pelican1.0-VL-3B-i1-GGUF/resolve/main/Pelican1.0-VL-3B.i1-IQ3_XXS.gguf
In either case, download the projector file:
yzma model get -u https://huggingface.co/mradermacher/Pelican1.0-VL-3B-GGUF/resolve/main/Pelican1.0-VL-3B.mmproj-Q8_0.gguf
$ go run ./examples/vlm/ -model ~/models/Pelican1.0-VL-3B.i1-Q4_K_M.gguf --mmproj ~/models/Pelican1.0-VL-3B.mmproj-Q8_0.gguf -p "What is in this picture? Provide a description, bounding box, and estimated distance for the llama in json format." -sys "You are a helpful robotic drone camera currently in flight." -image ./images/domestic_llama.jpg
{
"description": "The image shows a fluffy white llama standing on a green grassy area with a dirt path nearby. The llama has a curly coat and appears to be in a fenced-in area with trees and some buildings in the background.",
"bounding_box_2d": [40, 155, 635, 811],
"estimated_distance": "The llama is approximately 1 meter away from the camera."
}https://huggingface.co/mradermacher/InternVLA-M1-GGUF
yzma model get -u https://huggingface.co/mradermacher/InternVLA-M1-GGUF/resolve/main/InternVLA-M1.Q8_0.gguf
yzma model get -u https://huggingface.co/mradermacher/InternVLA-M1-GGUF/resolve/main/InternVLA-M1.mmproj-Q8_0.gguf
go run ./examples/vlm/ -model ~/models/InternVLA-M1.Q8_0.gguf --mmproj ~/models/InternVLA-M1.mmproj-Q8_0.gguf -p "What is in this picture? Provide a description, bounding box, and estimated distance for the llama in json format." -sys "You are a helpful robotic drone camera currently in flight." -image ./images/domestic_llama.jpg
{"label": "llama", "bbox_2d": [43, 352, 647, 822], "distance": 10.0}https://huggingface.co/mradermacher/SpaceQwen2.5-VL-3B-Instruct-GGUF
yzma model get -u https://huggingface.co/mradermacher/SpaceQwen2.5-VL-3B-Instruct-i1-GGUF/resolve/main/SpaceQwen2.5-VL-3B-Instruct.i1-Q4_K_M.gguf
yzma model get -u https://huggingface.co/remyxai/SpaceQwen2.5-VL-3B-Instruct/resolve/main/spaceqwen2.5-vl-3b-instruct-vision.gguf
$ go run ./examples/vlm/ -model ~/models/SpaceQwen2.5-VL-3B-Instruct.i1-Q4_K_M.gguf --mmproj ~/models/spaceqwen2.5-vl-3b-instruct-vision.gguf -p "What is in this picture? Provide a description, bounding box, and estimated distance for the llama in json format." -sys "You are a helpful robotic drone camera currently in flight." -image ./images/domestic_llama.jpg
{
"bbox_2d": [40, 20, 67, 35],
"label": "llama",
"estimated_distance": "1.5 meters"
}