@ngxson
Collaborator

@ngxson ngxson commented Oct 30, 2025

Fix #16842 Fix #13077

Implement "smart resize", i.e. max_pixels / min_pixels, for preprocessing. This lets us limit the input by number of tokens instead of imposing a hard limit on image height/width. It is particularly useful for an image that has a large height but a small width.
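
As a rough illustration of the min_pixels / max_pixels idea, the sketch below is modeled on the `smart_resize` logic used in Qwen2-VL-style reference preprocessing, not on the C++ code in this PR. The `factor` of 32 (patch size 16 times a 2x2 merge) and the default pixel budgets are assumptions picked for the example.

```python
import math

def smart_resize(height, width, factor=32, min_pixels=64 * 64, max_pixels=1024 * 1024):
    """Return (new_height, new_width), both multiples of `factor`,
    with the total area steered into [min_pixels, max_pixels]."""
    # Round each side to the nearest multiple of `factor` (at least one block).
    h_bar = max(factor, round(height / factor) * factor)
    w_bar = max(factor, round(width / factor) * factor)
    if h_bar * w_bar > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = max(factor, math.floor(height / beta / factor) * factor)
        w_bar = max(factor, math.floor(width / beta / factor) * factor)
    elif h_bar * w_bar < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

# A tall, narrow image: there is no per-side cap, only the total pixel budget.
h, w = smart_resize(8000, 400)
n_tokens = (h // 32) * (w // 32)  # one token per 32x32 block (assumed)
print(h, w, n_tokens)
```

Because the budget is expressed in pixels, it maps directly to a token budget: with one token per `factor` x `factor` block, `max_pixels / factor**2` bounds the number of image tokens regardless of aspect ratio.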

The preprocessing for QwenVL models is also improved. We now pad only the bottom-right corner of the image, so the x/y coordinates of elements stay unchanged. This should improve the accuracy of bbox applications.
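
A minimal sketch of why bottom-right padding preserves coordinates, using plain Python lists in place of real image buffers. The helper names and the centered-padding variant shown for contrast are illustrative assumptions, not code from clip.cpp.

```python
def ceil_to_multiple(value, multiple):
    return -(-value // multiple) * multiple

def pad_bottom_right(pixels, align, fill=0):
    """Pad a row-major 2D list so both sides are multiples of `align`.
    The original content stays anchored at (0, 0)."""
    h, w = len(pixels), len(pixels[0])
    new_h, new_w = ceil_to_multiple(h, align), ceil_to_multiple(w, align)
    padded = [row + [fill] * (new_w - w) for row in pixels]
    padded += [[fill] * new_w for _ in range(new_h - h)]
    return padded

def pad_centered(pixels, align, fill=0):
    """Same target size, but with the content centered (for contrast)."""
    h, w = len(pixels), len(pixels[0])
    new_h, new_w = ceil_to_multiple(h, align), ceil_to_multiple(w, align)
    x_off, y_off = (new_w - w) // 2, (new_h - h) // 2
    padded = [[fill] * new_w for _ in range(new_h)]
    for y, row in enumerate(pixels):
        padded[y + y_off][x_off:x_off + w] = row
    return padded, x_off, y_off

image = [[0] * 5 for _ in range(3)]  # 5 wide, 3 tall
image[1][2] = 9                      # marker pixel at (x=2, y=1)

br = pad_bottom_right(image, 4)
centered, x_off, y_off = pad_centered(image, 4)
print(br[1][2])                                   # marker still at (2, 1)
print(centered[1 + y_off][2 + x_off], (x_off, y_off))  # marker shifted by the offsets
```

With bottom-right padding, a bounding box predicted on the padded image can be used on the original image as-is; with centered padding, every coordinate would first need the offsets subtracted.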

@ngxson ngxson linked an issue Oct 30, 2025 that may be closed by this pull request
@ngxson ngxson marked this pull request as ready for review October 31, 2025 15:29
@ngxson
Collaborator Author

ngxson commented Oct 31, 2025

All tests passed

[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli THUDM/glm-edge-v-5b-gguf:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/Llava-v1.5-7B-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M
[vision] OK:   llama-mtmd-cli ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-V-2_6-gguf:Q2_K
[vision] OK:   llama-mtmd-cli openbmb/MiniCPM-o-2_6-gguf:Q4_0
[vision] OK:   llama-mtmd-cli bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL2_5-1B-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/LFM2-VL-450M-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/granite-docling-258M-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/pixtral-12b-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen3-VL-2B-Instruct-GGUF:Q8_0
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M
[audio]  OK:   llama-mtmd-cli ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M
[vision] OK:   llama-mtmd-cli ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S

@ggerganov
Member

The result could probably be better with a Q8_0 or F16 language model? (Please test if you can)

What command and image to use for this test?

@ngxson
Collaborator Author

ngxson commented Oct 31, 2025

@ggerganov You can use the command here: #16825

The script for overlaying the bboxes (for visual inspection) is also included: https://gist.github.com/ngxson/039024fb2bdaf2e3c15db702f9fddaff

@tarruda

tarruda commented Oct 31, 2025

In my Qwen3-VL-8b testing, it seems this branch decreases the accuracy: #16842 (comment)

const clip_image_size scaled_size = img_tool::calc_size_preserved_ratio(
    original_size,
    1, // no need to align here since we will composite onto canvas
    std::min(canvas.nx, canvas.ny)); // fit into the canvas
Member


Should this be:

Suggested change
-    std::min(canvas.nx, canvas.ny)); // fit into the canvas
+    std::max(canvas.nx, canvas.ny)); // fit into the canvas
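
To make the min-versus-max question concrete, here is a small sketch under a stated assumption: that the third argument acts as a cap on the longer side of the scaled image. `scale_longest_side` and `fit_inside` are hypothetical helpers written for this illustration; they do not reproduce `img_tool::calc_size_preserved_ratio`, whose exact semantics are not shown in this thread.

```python
def scale_longest_side(w, h, limit):
    """Hypothetical: scale (w, h) so the longer side equals `limit`."""
    if w >= h:
        return limit, max(1, h * limit // w)
    return max(1, w * limit // h), limit

def fit_inside(w, h, canvas_w, canvas_h):
    """Scale (w, h) by the largest ratio that still fits the canvas."""
    if w * canvas_h >= h * canvas_w:  # width is the binding constraint
        return canvas_w, max(1, h * canvas_w // w)
    return max(1, w * canvas_h // h), canvas_h

canvas_w, canvas_h = 640, 320

# Square image: capping at max(canvas) overflows the canvas height.
print(scale_longest_side(100, 100, min(canvas_w, canvas_h)))  # (320, 320)
print(scale_longest_side(100, 100, max(canvas_w, canvas_h)))  # (640, 640)
print(fit_inside(100, 100, canvas_w, canvas_h))               # (320, 320)

# Image with the canvas's aspect ratio: capping at min(canvas) underfills.
print(scale_longest_side(1000, 500, min(canvas_w, canvas_h)))  # (320, 160)
print(scale_longest_side(1000, 500, max(canvas_w, canvas_h)))  # (640, 320)
print(fit_inside(1000, 500, canvas_w, canvas_h))               # (640, 320)
```

Under that assumption, `min` is always safe but may leave the canvas underused, while `max` fills the canvas only when the image is at least as elongated as the canvas. For a square canvas the two choices coincide.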

@ggerganov
Member

Is this missing a factor of 2?

diff --git a/tools/mtmd/clip.cpp b/tools/mtmd/clip.cpp
index 355b42435..62e64040e 100644
--- a/tools/mtmd/clip.cpp
+++ b/tools/mtmd/clip.cpp
@@ -4268,7 +4268,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
     const auto & params = ctx->model.hparams;
     const int n_total = clip_n_output_tokens(ctx, img);
     if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
-        return img->nx / (params.patch_size * 2) + (int)(img->nx % params.patch_size > 0);
+        return img->nx / (params.patch_size * 2) + (int)(img->nx % (params.patch_size * 2) > 0);
     }
     return n_total;
 }
@@ -4276,7 +4276,7 @@ int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 *
 int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
     const auto & params = ctx->model.hparams;
     if (ctx->proj_type() == PROJECTOR_TYPE_QWEN2VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN25VL || ctx->proj_type() == PROJECTOR_TYPE_QWEN3VL) {
-        return img->ny / (params.patch_size * 2) + (int)(img->ny % params.patch_size > 0);
+        return img->ny / (params.patch_size * 2) + (int)(img->ny % (params.patch_size * 2) > 0);
     }
     return 1;
 }

@ngxson
Collaborator Author

ngxson commented Oct 31, 2025

The right-hand term in the code above doesn't have * 2 because it is only there for alignment. We no longer need that term at all, as the image is now always aligned.

The token count should also be correct:

def align_next_multiple(value, multiple):
  return ((value + multiple - 1) // multiple) * multiple

w = 320
h = 240
patch_size = 16
n_tok_x = align_next_multiple(w, patch_size*2) // (patch_size*2)
n_tok_y = align_next_multiple(h, patch_size*2) // (patch_size*2)
print(f"n_tok_x: {n_tok_x}, n_tok_y: {n_tok_y}, n_tok: {n_tok_x * n_tok_y}")

# print: n_tok_x: 10, n_tok_y: 8, n_tok: 80
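
The following sketch puts the two formulas from the diff above next to the ceil-division count from this snippet. The function names are made up for the comparison; the arithmetic is copied from the `-` and `+` lines of the diff. It suggests that the two forms agree whenever the size is already a multiple of patch_size * 2 and differ only for unaligned sizes.

```python
def n_tok_original(size, patch_size):
    # '-' lines of the diff: remainder taken modulo patch_size.
    return size // (patch_size * 2) + int(size % patch_size > 0)

def n_tok_suggested(size, patch_size):
    # '+' lines of the diff: remainder taken modulo patch_size * 2.
    return size // (patch_size * 2) + int(size % (patch_size * 2) > 0)

def align_next_multiple(value, multiple):
    return ((value + multiple - 1) // multiple) * multiple

patch_size = 16
block = patch_size * 2

# Sizes already aligned to patch_size * 2: both formulas give size // block.
for size in range(block, 2049, block):
    assert n_tok_original(size, patch_size) == n_tok_suggested(size, patch_size) == size // block

# An unaligned size (240 is a multiple of 16 but not of 32): the results differ.
size = 240
print(n_tok_original(size, patch_size))            # 7
print(n_tok_suggested(size, patch_size))           # 8
print(align_next_multiple(size, block) // block)   # 8
```

So if preprocessing guarantees alignment to patch_size * 2, the extra term is always zero and either form returns the same count; the difference only matters if an unaligned image ever reaches these functions.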

@ngxson
Collaborator Author

ngxson commented Oct 31, 2025

@tarruda The last commit should give fairly accurate results even with small input images. Please give it a try.

image

@tarruda

tarruda commented Nov 1, 2025

Looking better now, @ngxson:

qwen3-vl-32b-after

Compared with my previous run, it is a bit worse on this invoice example because the "Quantity" column boxes seem a bit off to the right. It is better on another example I have that is more complex than this invoice (I can't share it because it is a customer's screenshot, but it is basically an AS400 terminal with a lot more text).

Here's a computer desktop example:

image

And a nature picture:

image

Overall, the bounding boxes on this branch look solid, but I can't say for sure whether this is better or worse than what's on master, because both performed similarly on the few examples I tested.

Development

Successfully merging this pull request may close these issues.

Improve pre-processed image size for QwenVL
Refactor: (clip.cpp) identify and regroup pre-processing strategies
