Conversation

@mattjcly (Contributor) commented Apr 14, 2025

There was no easy API to get n_patches for a given clip_image_u8 after the refactor (it used to be possible by directly accessing the u8 image's dims and creating an f32 to put it through clip_n_patches_by_img).

This PR adds a clip_n_patches_by_img_u8 function that allows you to do just that, and refactors the implementation so that the same clip_n_patches_by_img_dims is called for both f32 and u8 (and by some other functions that previously had to create images with empty buffers just to get n_patches, which currently only depends on the image dimensions).
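As a rough illustration of the delegation described above (the struct bodies and the helper's arithmetic are simplified stand-ins, not the real clip.cpp implementations; only the function names come from the PR):

```cpp
#include <cassert>

// Stand-ins for the hidden clip_image_* structs; the real ones also carry
// pixel buffers, but n_patches only depends on the dimensions.
struct clip_image_u8  { int nx; int ny; };
struct clip_image_f32 { int nx; int ny; };

// Shared dims-based implementation (name taken from the PR; the body here
// is a simplified stand-in that just counts whole patches per axis).
static int clip_n_patches_by_img_dims(int nx, int ny, int patch_size) {
    return (nx / patch_size) * (ny / patch_size);
}

// Both public entry points forward to the same dims-based helper, so no
// empty-buffer images need to be constructed just to get a count.
static int clip_n_patches_by_img(const clip_image_f32 * img, int patch_size) {
    return clip_n_patches_by_img_dims(img->nx, img->ny, patch_size);
}
static int clip_n_patches_by_img_u8(const clip_image_u8 * img, int patch_size) {
    return clip_n_patches_by_img_dims(img->nx, img->ny, patch_size);
}
```

The point of the refactor is that the u8 and f32 paths can never drift apart, because both are thin forwarders over the one dims-based function.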

I chose not to expose clip_n_patches_by_img_dims, both to keep the API addition minor and because the underlying precision could potentially impact n_patches at some point in the future.

@mattjcly mattjcly changed the title Matt/n patches for u8 llava: n_patches for clip_image_u8 Apr 14, 2025
@mattjcly (Contributor Author):

I am unsure how my changes could have impacted build-linux-cross, but I am happy to investigate and make any necessary changes.

Comment on lines 62 to 63
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
Collaborator:

I think these 2 API calls should be regrouped under prefix clip_img_*_get_n_output_tokens

Suggested change
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
CLIP_API int clip_img_f32_get_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_img_u8_get_n_output_tokens (const struct clip_ctx * ctx, struct clip_image_u8 * img);

Collaborator:

The old clip_n_patches_by_img can be marked as deprecated (we can add a simple comment for now and add a proper __attribute__ in the future).

Contributor Author:

Mm, okay - I just deleted it because it seemed like lots of breaking changes were already occurring as part of the refactor. But I can certainly re-add it and mark it deprecated if you'd prefer.

@ngxson (Collaborator) commented Apr 14, 2025

On second thought, I think it would be better to know in which situation you need this API. Could you provide an example (on which models) where this API can be useful?

The problem is that now this API looks very confusing. If someone comes from transformers library looking at these 3 API calls, it will be unclear when to use which one.

And the worst case is that someone will try to use clip_img_u8_get_n_output_tokens on an image that has not yet been preprocessed (i.e. not yet downscaled or sliced), which may give a wrong result. This can happen with llava-uhd, where this API will give you the fixed number of patches without taking into account the fact that the image must be sliced. So I think at least clip_img_u8_get_n_output_tokens is redundant (because a u8 image will eventually be converted to f32, right?)

clip_img_f32_get_n_output_tokens, as I imagine it, could be useful for models having a variable number of patches based on image size (like minicpmv or qwen2-vl) - could you confirm this?

@mattjcly (Contributor Author) commented Apr 14, 2025

clip_img_f32_get_n_output_tokens as I imagine, could be useful for models having variable number of patches based on image size (like minicpmv or qwen2-vl), could you confirm this?

Yes, 100%. If you want to, say, ensure you don't overflow the loaded context length, it's important to be able to check this ahead of time so you can properly manage the context and avoid crashes.
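The pre-flight check being described might look something like this (all names here are illustrative, not real llama.cpp API; the image token count would come from the proposed n_output_tokens function):

```cpp
#include <cassert>

// Illustrative pre-flight check: refuse to evaluate an image whose output
// tokens would overflow the remaining context (all names are hypothetical).
static bool image_fits_in_context(int n_ctx, int n_used, int n_img_tokens) {
    return n_used + n_img_tokens <= n_ctx;
}
```

Without an API that reports the image token count up front, a caller cannot perform this check before committing to the evaluation.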

The problem is that now this API looks very confusing. If someone comes from the transformers library looking at these 3 API calls, it will be unclear when to use which one.

And the worst case is that someone will try to use clip_img_u8_get_n_output_tokens on an image not yet preprocessed (i.e. image not yet downscaled or being sliced), which may give a wrong result. So, I think at least the clip_img_u8_get_n_output_tokens is redundant (because u8 image will eventually be converted to f32, right?)

Totally hear you on the redundancy/confusion. The use case (whether or not it is the best way to use the APIs) was that, before the clip_image_* structs became hidden from the public API, you could use clip_image_load_from_bytes to load an image into a clip_image_u8 (and only a clip_image_u8; there was no API to load into an f32):

clip_image_u8 * img = clip_image_u8_init();
if (!clip_image_load_from_bytes(image_bytes, image_bytes_length, img)) {
    clip_image_u8_free(img);
    LOG_ERR("%s: can't load image from bytes, is it a valid image?", __func__);
    return NULL;
}

And then, when you want to get the number of output tokens the loaded image would be embedded into, you could do:

    clip_image_f32 f32_image;
    f32_image.nx = img->nx;
    f32_image.ny = img->ny;
    return clip_n_patches_by_img(clip_ctx, &f32_image);

After clip_image_f32 became hidden, this is no longer possible AFAICT.

This worked just fine because there was no dependency on the buffer precision type within the clip_n_patches_by_img function:

int clip_n_patches_by_img(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
    const auto & params = ctx->vision_model.hparams;

    int n_patches = (params.image_size / params.patch_size) * (params.image_size / params.patch_size);

    if (ctx->proj_type == PROJECTOR_TYPE_LDP || ctx->proj_type == PROJECTOR_TYPE_LDPV2 || ctx->proj_type == PROJECTOR_TYPE_GLM_EDGE) {
        n_patches /= 4;
    } else if (ctx->proj_type == PROJECTOR_TYPE_RESAMPLER) {
        if (ctx->minicpmv_version == 2) {
            n_patches = 96;
        } else if (ctx->minicpmv_version == 3) {
            n_patches = 64;
        } else if (ctx->minicpmv_version == 4) {
            n_patches = 64;
        }
    } else if (ctx->proj_type == PROJECTOR_TYPE_MERGER) {
        int patch_size = params.patch_size * 2;
        int x_patch = img->nx / patch_size + (int)(img->nx % patch_size > 0);
        int y_patch = img->ny / patch_size + (int)(img->ny % patch_size > 0);
        n_patches = x_patch * y_patch;
    } else if (ctx->proj_type == PROJECTOR_TYPE_GEMMA3) {
        n_patches = 256;
    }

    return n_patches;
}

One option maybe worth considering: it might be clearer for now to just make clip_img_n_output_tokens_by_dim (which I created in this PR but only put in clip.cpp) public and remove the u8 function, so that it's clear it has no dependency on the buffer type. Then, if a distinction between u8 and f32 must be made in the future, a new API can be created.

So the new API surface area could be something like:

// deprecated, use n_output_tokens functions
CLIP_API int clip_n_patches        (const struct clip_ctx * ctx);
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);

CLIP_API int clip_n_output_tokens (const struct clip_ctx * ctx);
CLIP_API int clip_img_n_output_tokens (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_img_n_output_tokens_by_dim (const struct clip_ctx * ctx, int x, int y);

In this case, maybe even mark clip_img_n_output_tokens as deprecated too, because internally it would just call clip_img_n_output_tokens_by_dim.
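The comment-based deprecation suggested earlier in the review, combined with the proposed API surface, could look roughly like this (the clip_ctx body and all function bodies here are illustrative stand-ins; the real clip_ctx is opaque and the real implementations live in clip.cpp):

```cpp
#include <cassert>

// Minimal stand-in for clip_ctx; the real struct is hidden behind clip.h.
struct clip_ctx { int image_size; int patch_size; };

// New API: depends only on the dimensions passed in.
static int clip_img_n_output_tokens_by_dim(const clip_ctx * ctx, int x, int y) {
    (void) x; (void) y; // a fixed-grid model ignores the actual dims
    int g = ctx->image_size / ctx->patch_size;
    return g * g;
}

// DEPRECATED: use clip_img_n_output_tokens_by_dim instead.
// (A proper __attribute__((deprecated)) can be added later, per the review.)
static int clip_n_patches(const clip_ctx * ctx) {
    return clip_img_n_output_tokens_by_dim(ctx, ctx->image_size, ctx->image_size);
}
```

Keeping the old name as a thin forwarder means existing callers keep working while the comment steers new code toward the dims-based function.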

@mattjcly mattjcly requested a review from ngxson April 14, 2025 18:08
@mattjcly (Contributor Author):

Or, of course, I'm totally open to any other ideas you may have on the proper flow from loading an image to getting the number of image output tokens.

@ngxson (Collaborator) commented Apr 14, 2025

The use case (whether or not it is the best way to use the APIs) was that, before the clip_image_* structs became hidden from the public API, you could use clip_image_load_from_bytes to load an image into a clip_image_u8 (and only a clip_image_u8; there was no API to load into an f32):

Thanks for the clarification. Indeed, one of the main reasons why I no longer allow modifying clip_image_f32 freely (in your case, you want to set img_f32.nx/ny manually) was that this could be used in a wrong way.

Indeed, I think one misconception here is that a u8 image corresponds to exactly one f32 image, but this is not true for models using slices like llava-uhd or minicpm-v. These models can produce multiple f32 images (multiple slices) from one u8 image. In your case, the logic only returns the number of tokens of the first slice, which is not correct.

One option maybe worth considering: Maybe it'd be clearer for now to just make the clip_img_n_output_tokens_by_dim

I prefer this way, but the actual implementation will not be as straightforward as you might think. In the case of llava-uhd, you will need to call the slicing logic to determine the number of slices and their respective sizes.

Indeed, I had already planned to work on this refactoring soon; basically, we should separate the image manipulation into 2 dedicated parts:

  • One part decides how to slice or resize the image. Input is the image size and output is a list of slice coordinates.
  • The other part manipulates the image. Input is the u8 image plus the coordinates that we want to slice.

So with this separation, clip_img_n_output_tokens_by_dim only needs the image size, and it will output the exact number of tokens, even when the image has multiple slices.
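The two-part separation described above could be sketched like this (the slicing policy here is a deliberately naive fixed-size tiling for illustration; real llava-uhd slicing chooses a grid based on aspect ratio, and all names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical slice descriptor: where one slice sits in the source u8 image.
struct slice_coord { int x, y, w, h; };

// Phase 1 (planning): input is only the image size, output is the list of
// slice coordinates. No pixel data is touched here.
static std::vector<slice_coord> plan_slices(int nx, int ny, int slice_size) {
    std::vector<slice_coord> out;
    for (int y = 0; y < ny; y += slice_size) {
        for (int x = 0; x < nx; x += slice_size) {
            out.push_back({x, y,
                           std::min(slice_size, nx - x),
                           std::min(slice_size, ny - y)});
        }
    }
    return out;
}

// Because the plan is dims-only, a token count can account for every slice
// without any image buffer (phase 2 would do the actual cropping/resizing).
static int n_output_tokens_by_dim(int nx, int ny, int slice_size, int tokens_per_slice) {
    return (int) plan_slices(nx, ny, slice_size).size() * tokens_per_slice;
}
```

With this split, the earlier pitfall disappears: the count is computed from the same plan the preprocessor will execute, so a multi-slice image can no longer be under-counted as a single slice.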

I will work on this in the coming days and will tag you on the PR.


NOTE: You may also need to take into account wrapper tokens like <image> and <slice>, which are currently not handled by clip.cpp but are supported by my new mtmd_tokenize.

@mattjcly (Contributor Author):

Sounds great. Thanks for your time here @ngxson. Will close this; it can be used for reference however needed 👍
