
Conversation

@firecoperana
Collaborator

This PR adds vision support for llama-server (ggml-org/llama.cpp#12898). Both llama-server and mtmd are now up to mainline PR #16275 (9/26/2025).

Updated the webui to support adding pictures and files, plus other minor UI changes. Tested with both the current webui and the new llama.cpp webui (launched via --webui llamacpp) using Qwen2.5-VL-7B-Instruct-Q8_0.gguf and mmproj-F16.gguf; both work fine.

Note that when using --mmproj for a vision model, context shift is disabled.
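
For example, a typical launch would look like this (model paths here are just placeholders; any vision model with a matching mmproj file should work the same way):

./llama-server -m Qwen2.5-VL-7B-Instruct-Q8_0.gguf --mmproj mmproj-F16.gguf --host 0.0.0.0 --port 8080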

Other changes:

  1. Fix KV shift for qwen2vl (llama : fix KV shift for qwen2vl, ggml-org/llama.cpp#13870)
  2. Fix the discrepancy in how slot.id is used in llama-server. Before the fix, we had to use slot.id+1 when porting mainline code that uses slot.id, which was very easy to miss. After the fix, slot.id can be used directly instead of slot.id+1. (server : remove hack for extra parallel slot, ggml-org/llama.cpp#10187)
  3. Add --no-context-shift parameter to disable context shift
  4. Simplify the handle_completions, handle_completions_oai and handle_chat_completions functions by routing them through a common handle_completions_impl function that handles the inference task (sketched below). In my tests the results are the same as before, but it would be best to have more people test this.
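
Roughly, the new structure looks like this. This is only a simplified sketch of the idea, not the actual server code; the stand-in types and the completion_type enum are illustrative, and only the handler names come from the PR:

// Shared path: build the inference task, run it, and format the result.
// (Stand-in types; the real code uses httplib and nlohmann::json.)
#include <string>

struct json_value    { std::string text; };                 // placeholder for nlohmann::json
struct http_request  { json_value body; };                  // placeholder for httplib::Request
struct http_response { std::string body; int status = 0; }; // placeholder for httplib::Response

enum class completion_type { NON_OAI, OAI_COMPLETION, OAI_CHAT };

static void handle_completions_impl(completion_type type,
                                    const json_value & data,
                                    http_response & res) {
    // ... create the task, post it to the server queue, wait for the results,
    //     and format them as plain or OpenAI-style output depending on `type` ...
    res.status = 200;
    res.body   = (type == completion_type::NON_OAI) ? data.text
                                                    : "{ /* OpenAI-style payload */ }";
}

static void handle_completions(const http_request & req, http_response & res) {
    handle_completions_impl(completion_type::NON_OAI, req.body, res);
}

static void handle_completions_oai(const http_request & req, http_response & res) {
    handle_completions_impl(completion_type::OAI_COMPLETION, req.body, res);
}

static void handle_chat_completions(const http_request & req, http_response & res) {
    // In the real server the chat template is applied before delegating.
    handle_completions_impl(completion_type::OAI_CHAT, req.body, res);
}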

ikawrakow merged commit 15159a8 into main on Nov 5, 2025
{
const auto& chunk = slot.prompt_tokens.find_chunk(slot.n_past);
slot.cache_tokens.push_back(chunk.get()); // copy
fprintf(stdout, slot.cache_tokens.detokenize(ctx, true).c_str());
Owner

Missed this one in the review. What is the intent here? fprintf needs to have a format string for this to work.
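
For reference, a format-string-safe version of that call would be (shown only to illustrate the point, since this is debug output anyway):

fprintf(stdout, "%s\n", slot.cache_tokens.detokenize(ctx, true).c_str());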

Collaborator Author

Debug purpose. Should be removed.

}
};
const auto& prompt = data.at("prompt");
fprintf(stdout, prompt.get<std::string>().c_str());
Owner

What is the intent here? fprintf needs to have a format string.

Collaborator Author

Debug purpose. Should be removed.

Collaborator Author

I will fix this in my next PR to address #904

@MrHills-2

MrHills-2 commented Nov 5, 2025

I get an error when sending images (text works fine)

add_text: <|vision_start|>
image_tokens->nx = 13
image_tokens->ny = 18
batch_f32 size = 1
add_text: <|vision_end|>
add_text: <|im_end|>
<|im_start|>assistant
n_pos: 1
VERB [ get_new_id] new task id | tid="139746469257216" timestamp=1762337477 new_id=376
VERB [ add_waiting_task_id] waiting for task id | tid="139746469257216" timestamp=1762337477 id_task=376
VERB [ start_loop] new task may arrive | tid="139756608065536" timestamp=1762337477
VERB [ start_loop] callback_new_task | tid="139756608065536" timestamp=1762337477 id_task=376
VERB [ get_available_slot] selected slot by lcp similarity | tid="139756608065536" timestamp=1762337477 id_slot=0 max_lcp_len=53753 similarity=1.0
INFO [ launch_slot_with_task] slot is processing task | tid="139756608065536" timestamp=1762337477 id_slot=0 id_task=376
VERB [ start_loop] update_multitasks | tid="139756608065536" timestamp=1762337477
VERB [ start_loop] callback_update_slots | tid="139756608065536" timestamp=1762337477
VERB [ update_slots] posting NEXT_RESPONSE | tid="139756608065536" timestamp=1762337477
VERB [ post] new task id | tid="139756608065536" timestamp=1762337477 new_id=377
VERB [ update_slots] tokenizing prompt | tid="139756608065536" timestamp=1762337477 id_slot=0 id_task=376
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 151936)
./Qwen3_235B-IQ2M: line 4: 313100 Aborted (core dumped) ./llama-server -m ../../models/Qwen3-VL-235B-IQ2_M.gguf --mmproj ../../models/Qwen3-VL-235B-A22B-Instruct.mmproj-Q8_0.gguf -ot "blk.(?:[0-9]|[1-5][0-9]|[6][0-5]).ffn.*=CPU" -c 32768 -b 8192 -ub 4096 -v -ctk q8_0 -ctv q5_1 --threads 7 -ngl 95 -mla 2 -sp --host 0.0.0.0 --port 8080

@Ph0rk0z

Ph0rk0z commented Nov 5, 2025

Thanks! time to download qwen 235b finally.

@firecoperana
Collaborator Author

I get an error when sending images (text works fine)


What is the resolution of the image you used, and does it happen with all images? I tried test-1.jpeg inside the examples\mtmd folder and it's working. I used Qwen3-VL-235B-A22B-Instruct-UD-Q2_K_XL.gguf from unsloth and their mmproj-F16.gguf for the mmproj.

@MrHills-2

I'm using the IQ2_M from https://huggingface.co/mradermacher/Qwen3-VL-235B-A22B-Instruct-i1-GGUF with the f16 mmproj from https://huggingface.co/mradermacher/Qwen3-VL-235B-A22B-Instruct-GGUF.

Using the latest SillyTavern with chat completion. I tried the same image you mentioned (test-1.jpeg), but it still doesn't work.

@Thireus
Contributor

Thireus commented Nov 5, 2025

Which build or commit are you using?

@MrHills-2

Current main 320fc60

@firecoperana
Collaborator Author

Does adding --jinja help? Can you also try the built-in webui?

@MrHills-2

--jinja does not help, and the built-in webui gives me the same error. I also downloaded a different GGUF (from unsloth) and it gives the same error.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 6, 2025
@Ph0rk0z

Ph0rk0z commented Nov 6, 2025

[image: qwenvl]

In my case it's working but I need to try jinja or see what the template is doing as it's a bit cracked out. Goes 19t/s too.

@MrHills-2

[image: qwenvl]

In my case it's working but I need to try jinja or see what the template is doing as it's a bit cracked out. Goes 19t/s too.

Can you tell me what your build arguments are and what arguments you use for llama-server? I tried Qwen3-VL 235B and also Qwen3-VL 30B; nothing works.

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 7, 2025
@Ph0rk0z

Ph0rk0z commented Nov 7, 2025

Sure

CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-server \
    -m /Qwen3-VL-235B-A22B-Instruct-UD-Q4_K_XL-00001-of-00003.gguf \
    --mmproj /mmproj-F16.gguf \
    -t 48 \
    -c 32768 \
    --host 192.168.1.211 \
    --numa distribute \
    -ngl 95 \
    -ctk q8_0 \
    -ctv q8_0 \
    --jinja \
    -v \
    -rtr \
    -amb 512 \
    -ub 1024 \
    -mqkv \
    -ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13)\.ffn_.*.=CUDA0" \
    -ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29)\.ffn_.*.=CUDA1" \
    -ot "blk\.(30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45)\.ffn_.*.=CUDA2" \
    -ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59|60)\.ffn_.*.=CUDA3" \
    -ot "\.ffn_.*_exps.=CPU"

Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 10, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 10, 2025
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Nov 13, 2025
firecoperana deleted the fcp/mtmd_llama_server branch on November 14, 2025 at 18:29