
Conversation

@firecoperana
Collaborator

This PR adds vision support for llama-server (ggml-org/llama.cpp#12898). Both llama-server and mtmd are now up to mainline PR #16275 (9/26/2025).

Updated the webui to support adding pictures and files, plus other minor UI changes. Tested with both the current webui and the new llama.cpp webui (launched via --webui llamacpp) using Qwen2.5-VL-7B-Instruct-Q8_0.gguf and mmproj-F16.gguf; both work fine.

Note that when using --mmproj for a vision model, context shift is disabled.
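
For example, a typical launch would look like this (model paths here are just placeholders; any vision model with a matching mmproj file should work the same way):

./llama-server -m Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj mmproj-F16.gguf --host 0.0.0.0 --port 8080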

Other changes:

  1. Fix KV shift for qwen2vl (llama : fix KV shift for qwen2vl, ggml-org/llama.cpp#13870)
  2. Fix the discrepancy in how slot.id is used in llama-server. Before the fix, we had to use slot.id+1 when porting mainline code that uses slot.id, which was very easy to miss. After the fix, slot.id can be used directly instead of slot.id+1. (server : remove hack for extra parallel slot, ggml-org/llama.cpp#10187)
  3. Add --no-context-shift parameter to disable context shift
  4. Simplify the handle_completions, handle_completions_oai and handle_chat_completions functions by routing them through a common handle_completions_impl function that handles the inference task (sketched below). In my tests the results are the same as before, but it would be best to have more people test this.
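
Roughly, the new structure looks like this. This is only a simplified sketch of the idea, not the actual server code; the stand-in types and the completion_type enum are illustrative, and only the handler names come from the PR:

// Shared path: build the inference task, run it, and format the result.
// (Stand-in types; the real code uses httplib and nlohmann::json.)
#include <string>

struct json_value    { std::string text; };                 // placeholder for nlohmann::json
struct http_request  { json_value body; };                  // placeholder for httplib::Request
struct http_response { std::string body; int status = 0; }; // placeholder for httplib::Response

enum class completion_type { NON_OAI, OAI_COMPLETION, OAI_CHAT };

static void handle_completions_impl(completion_type type,
                                    const json_value & data,
                                    http_response & res) {
    // ... create the task, post it to the server queue, wait for the results,
    //     and format them as plain or OpenAI-style output depending on `type` ...
    res.status = 200;
    res.body   = (type == completion_type::NON_OAI) ? data.text
                                                    : "{ /* OpenAI-style payload */ }";
}

static void handle_completions(const http_request & req, http_response & res) {
    handle_completions_impl(completion_type::NON_OAI, req.body, res);
}

static void handle_completions_oai(const http_request & req, http_response & res) {
    handle_completions_impl(completion_type::OAI_COMPLETION, req.body, res);
}

static void handle_chat_completions(const http_request & req, http_response & res) {
    // In the real server the chat template is applied before delegating.
    handle_completions_impl(completion_type::OAI_CHAT, req.body, res);
}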

ikawrakow merged commit 15159a8 into main on Nov 5, 2025
{
const auto& chunk = slot.prompt_tokens.find_chunk(slot.n_past);
slot.cache_tokens.push_back(chunk.get()); // copy
fprintf(stdout, slot.cache_tokens.detokenize(ctx, true).c_str());
Owner

Missed this one in the review. What is the intent here? fprintf needs to have a format string for this to work.
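
For reference, a format-string-safe version of that call would be (shown only to illustrate the point, since this is debug output anyway):

fprintf(stdout, "%s\n", slot.cache_tokens.detokenize(ctx, true).c_str());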

Collaborator Author

Debug purpose. Should be removed.

}
};
const auto& prompt = data.at("prompt");
fprintf(stdout, prompt.get<std::string>().c_str());
Owner

What is the intent here? fprintf needs to have a format string.

Collaborator Author

Debug purpose. Should be removed.

Collaborator Author

I will fix this in my next PR to address #904

@MrHills-2

MrHills-2 commented Nov 5, 2025

I get an error when sending images (text works fine)

add_text: <|vision_start|>
image_tokens->nx = 13
image_tokens->ny = 18
batch_f32 size = 1
add_text: <|vision_end|>
add_text: <|im_end|>
<|im_start|>assistant
n_pos: 1
VERB [ get_new_id] new task id | tid="139746469257216" timestamp=1762337477 new_id=376
VERB [ add_waiting_task_id] waiting for task id | tid="139746469257216" timestamp=1762337477 id_task=376
VERB [ start_loop] new task may arrive | tid="139756608065536" timestamp=1762337477
VERB [ start_loop] callback_new_task | tid="139756608065536" timestamp=1762337477 id_task=376
VERB [ get_available_slot] selected slot by lcp similarity | tid="139756608065536" timestamp=1762337477 id_slot=0 max_lcp_len=53753 similarity=1.0
INFO [ launch_slot_with_task] slot is processing task | tid="139756608065536" timestamp=1762337477 id_slot=0 id_task=376
VERB [ start_loop] update_multitasks | tid="139756608065536" timestamp=1762337477
VERB [ start_loop] callback_update_slots | tid="139756608065536" timestamp=1762337477
VERB [ update_slots] posting NEXT_RESPONSE | tid="139756608065536" timestamp=1762337477
VERB [ post] new task id | tid="139756608065536" timestamp=1762337477 new_id=377
VERB [ update_slots] tokenizing prompt | tid="139756608065536" timestamp=1762337477 id_slot=0 id_task=376
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
./Qwen3_235B-IQ2M: line 4: 313100 Aborted (core dumped) ./llama-server -m ../../models/Qwen3-VL-235B-IQ2_M.gguf --mmproj ../../models/Qwen3-VL-235B-A22B-Instruct.mmproj-Q8_0.gguf -ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-5]).ffn.*=CPU" -c 32768 -b 8192 -ub 4096 -v -ctk q8_0 -ctv q5_1 --threads 7 -ngl 95 -mla 2 -sp --host 0.0.0.0 --port 8080

@Ph0rk0z

Ph0rk0z commented Nov 5, 2025

Thanks! time to download qwen 235b finally.

@firecoperana
Collaborator Author

I get an error when sending images (text works fine)


What is the resolution of the image you used, and does it happen with all images? I tried test-1.jpeg inside the examples\mtmd folder and it's working. I used Qwen3-VL-235B-A22B-Instruct-UD-Q2_K_XL.gguf from unsloth and their mmproj-F16.gguf for the mmproj.

@MrHills-2

I'm using the IQ2_M from https://huggingface.co/mradermacher/Qwen3-VL-235B-A22B-Instruct-i1-GGUF with the f16 mmproj from https://huggingface.co/mradermacher/Qwen3-VL-235B-A22B-Instruct-GGUF.

Using the latest SillyTavern with chat completion. I tried the same image you mentioned (test-1.jpeg), but it still doesn't work.

@Thireus
Contributor

Thireus commented Nov 5, 2025

Which build or commit are you using?

@MrHills-2

Current main 320fc60

@firecoperana
Collaborator Author

Does adding --jinja help? Can you also try the built-in webui?

@MrHills-2

--jinja does not help, and the built-in webui gives me the same error. I also downloaded a different GGUF (from unsloth) and it gives the same error.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025
@Ph0rk0z

Ph0rk0z commented Nov 6, 2025

[image: qwenvl]

In my case it's working but I need to try jinja or see what the template is doing as it's a bit cracked out. Goes 19t/s too.

@MrHills-2

[image: qwenvl]

In my case it's working but I need to try jinja or see what the template is doing as it's a bit cracked out. Goes 19t/s too.

Can you tell me what your build arguments are and what arguments you use for llama-server? I tried Qwen3-VL 235B and also Qwen3-VL 30B; nothing works.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 7, 2025
@Ph0rk0z

Ph0rk0z commented Nov 7, 2025

Sure

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-server \
    -m /Qwen3-VL-235B-A22B-Instruct-UD-Q4_K_XL-00001-of-00003.gguf \
    --mmproj /mmproj-F16.gguf \
    -t 48 \
    -c 32768 \
    --host 192.168.1.211 \
    --numa distribute \
    -ngl 95 \
    -ctk q8_0 \
    -ctv q8_0 \
    --jinja \
    -v \
    -rtr \
    -amb 512 \
    -ub 1024 \
    -mqkv \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*.=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29)\.ffn_.*.=CUDA1" \
    -ot "blk\.(30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45)\.ffn_.*.=CUDA2" \
    -ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59|60)\.ffn_.*.=CUDA3" \
    -ot "\.ffn_.*_exps.=CPU"

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 10, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 10, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 13, 2025
firecoperana deleted the fcp/mtmd_llama_server branch on November 14, 2025 at 18:29