Conversation

@mattjcly (Contributor) commented Apr 14, 2025

There was no easy API to get n_patches for a given clip_image_u8 after the refactor (it used to be possible by directly accessing the u8 image's dims and creating an f32 to put it through clip_n_patches_by_img).

This PR adds a clip_n_patches_by_img_u8 function that allows you to do just that, and refactors the implementation so that the same clip_n_patches_by_img_dims is called for both f32 and u8 (and by some other functions that previously had to create images with empty buffers just to get n_patches, which currently only depends on the image dimensions).
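As a rough illustration of the delegation described above (the struct bodies and the helper's arithmetic are simplified stand-ins, not the real clip.cpp implementations; only the function names come from the PR):

```cpp
#include <cassert>

// Stand-ins for the hidden clip_image_* structs; the real ones also carry
// pixel buffers, but n_patches only depends on the dimensions.
struct clip_image_u8  { int nx; int ny; };
struct clip_image_f32 { int nx; int ny; };

// Shared dims-based implementation (name taken from the PR; the body here
// is a simplified stand-in that just counts whole patches per axis).
static int clip_n_patches_by_img_dims(int nx, int ny, int patch_size) {
    return (nx / patch_size) * (ny / patch_size);
}

// Both public entry points forward to the same dims-based helper, so no
// empty-buffer images need to be constructed just to get a count.
static int clip_n_patches_by_img(const clip_image_f32 * img, int patch_size) {
    return clip_n_patches_by_img_dims(img->nx, img->ny, patch_size);
}
static int clip_n_patches_by_img_u8(const clip_image_u8 * img, int patch_size) {
    return clip_n_patches_by_img_dims(img->nx, img->ny, patch_size);
}
```

The point of the refactor is that the u8 and f32 paths can never drift apart, because both are thin forwarders over the one dims-based function.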

I chose not to expose clip_n_patches_by_img_dims, both to keep the API addition minor and because the underlying precision could potentially impact n_patches at some point in the future.

@mattjcly mattjcly changed the title Matt/n patches for u8 llava: n_patches for clip_image_u8 Apr 14, 2025
@mattjcly (Contributor Author):

I am unsure how my changes could have impacted build-linux-cross, but I am happy to investigate and make any necessary changes.

Comment on lines 62 to 63
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
Collaborator:

I think these 2 API calls should be regrouped under prefix clip_img_*_get_n_output_tokens

Suggested change
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
CLIP_API int clip_img_f32_get_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_img_u8_get_n_output_tokens (const struct clip_ctx * ctx, struct clip_image_u8 * img);

Collaborator:

The old clip_n_patches_by_img can be marked as deprecated (we can add a simple comment for now and add a proper __attribute__ in the future).

Contributor Author:

Mm, okay - I just deleted it because it seemed like lots of breaking changes were already occurring as part of the refactor. But I can certainly re-add it and mark it deprecated if you'd prefer.

@ngxson (Collaborator) commented Apr 14, 2025

On second thought, I think it would be better to know in which situation you need this API. Could you provide an example (on which models) where this API can be useful?

The problem is that now this API looks very confusing. If someone comes from transformers library looking at these 3 API calls, it will be unclear when to use which one.

And the worst case is that someone will try to use clip_img_u8_get_n_output_tokens on an image that has not yet been preprocessed (i.e. not yet downscaled or sliced), which may give a wrong result. This can happen with llava-uhd, where this API will give you the fixed number of patches without taking into account the fact that the image must be sliced. So I think at least clip_img_u8_get_n_output_tokens is redundant (because a u8 image will eventually be converted to f32, right?)

clip_img_f32_get_n_output_tokens, as I imagine it, could be useful for models having a variable number of patches based on image size (like minicpmv or qwen2-vl) - could you confirm this?

@mattjcly (Contributor Author) commented Apr 14, 2025

clip_img_f32_get_n_output_tokens as I imagine, could be useful for models having variable number of patches based on image size (like minicpmv or qwen2-vl), could you confirm this?

Yes, 100%. If you want to, say, ensure you don't overflow the loaded context length, it's important to be able to check this ahead of time so you can properly manage the context and avoid crashes.
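The pre-flight check being described might look something like this (all names here are illustrative, not real llama.cpp API; the image token count would come from the proposed n_output_tokens function):

```cpp
#include <cassert>

// Illustrative pre-flight check: refuse to evaluate an image whose output
// tokens would overflow the remaining context (all names are hypothetical).
static bool image_fits_in_context(int n_ctx, int n_used, int n_img_tokens) {
    return n_used + n_img_tokens <= n_ctx;
}
```

Without an API that reports the image token count up front, a caller cannot perform this check before committing to the evaluation.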

The problem is that now this API looks very confusing. If someone comes from the transformers library looking at these 3 API calls, it will be unclear when to use which one.

And the worst case is that someone will try to use clip_img_u8_get_n_output_tokens on an image not yet preprocessed (i.e. image not yet downscaled or being sliced), which may give a wrong result. So, I think at least the clip_img_u8_get_n_output_tokens is redundant (because u8 image will eventually be converted to f32, right?)

Totally hear you on the redundancy/confusion. The use case (whether or not it is the best way to use the APIs) was that, before the clip_image_* structs became hidden from the public API, you could use clip_image_load_from_bytes to load an image into a clip_image_u8 (and only a clip_image_u8; there was no API to load into an f32):

clip_image_u8 * img = clip_image_u8_init();
if (!clip_image_load_from_bytes(image_bytes, image_bytes_length, img)) {
    clip_image_u8_free(img);
    LOG_ERR("%s: can't load image from bytes, is it a valid image?", __func__);
    return NULL;
}

And then, when you want to get the number of output tokens the loaded image would be embedded into, you could do:

    clip_image_f32 f32_image;
    f32_image.nx = img->nx;
    f32_image.ny = img->ny;
    return clip_n_patches_by_img(clip_ctx, &f32_image);

After clip_image_f32 became hidden, this is no longer possible AFAICT.

This worked just fine because there was no dependency on the buffer precision type within the clip_n_patches_by_img function:

int clip_n_patches_by_img(const struct clip_ctx * ctx, struct clip_image_f32 * img) {
    const auto & params = ctx->vision_model.hparams;

    int n_patches = (params.image_size / params.patch_size) * (params.image_size / params.patch_size);

    if (ctx->proj_type == PROJECTOR_TYPE_LDP || ctx->proj_type == PROJECTOR_TYPE_LDPV2 || ctx->proj_type == PROJECTOR_TYPE_GLM_EDGE) {
        n_patches /= 4;
    } else if (ctx->proj_type == PROJECTOR_TYPE_RESAMPLER) {
        if (ctx->minicpmv_version == 2) {
            n_patches = 96;
        } else if (ctx->minicpmv_version == 3) {
            n_patches = 64;
        } else if (ctx->minicpmv_version == 4) {
            n_patches = 64;
        }
    } else if (ctx->proj_type == PROJECTOR_TYPE_MERGER) {
        int patch_size = params.patch_size * 2;
        int x_patch = img->nx / patch_size + (int)(img->nx % patch_size > 0);
        int y_patch = img->ny / patch_size + (int)(img->ny % patch_size > 0);
        n_patches = x_patch * y_patch;
    } else if (ctx->proj_type == PROJECTOR_TYPE_GEMMA3) {
        n_patches = 256;
    }

    return n_patches;
}

One option maybe worth considering: it might be clearer for now to just make clip_img_n_output_tokens_by_dim (which I created in this PR but only put in clip.cpp) public and remove the u8 function, so that it's clear it has no dependency on the buffer type. Then, if a distinction between u8 and f32 must be made in the future, a new API can be created.

So the new API surface area could be something like:

// deprecated, use n_output_tokens functions
CLIP_API int clip_n_patches        (const struct clip_ctx * ctx);
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);

CLIP_API int clip_n_output_tokens (const struct clip_ctx * ctx);
CLIP_API int clip_img_n_output_tokens (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_img_n_output_tokens_by_dim (const struct clip_ctx * ctx, int x, int y);

In this case, maybe even mark clip_img_n_output_tokens as deprecated too, because internally it would just call clip_img_n_output_tokens_by_dim.
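The comment-based deprecation suggested earlier in the review, combined with the proposed API surface, could look roughly like this (the clip_ctx body and all function bodies here are illustrative stand-ins; the real clip_ctx is opaque and the real implementations live in clip.cpp):

```cpp
#include <cassert>

// Minimal stand-in for clip_ctx; the real struct is hidden behind clip.h.
struct clip_ctx { int image_size; int patch_size; };

// New API: depends only on the dimensions passed in.
static int clip_img_n_output_tokens_by_dim(const clip_ctx * ctx, int x, int y) {
    (void) x; (void) y; // a fixed-grid model ignores the actual dims
    int g = ctx->image_size / ctx->patch_size;
    return g * g;
}

// DEPRECATED: use clip_img_n_output_tokens_by_dim instead.
// (A proper __attribute__((deprecated)) can be added later, per the review.)
static int clip_n_patches(const clip_ctx * ctx) {
    return clip_img_n_output_tokens_by_dim(ctx, ctx->image_size, ctx->image_size);
}
```

Keeping the old name as a thin forwarder means existing callers keep working while the comment steers new code toward the dims-based function.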

@mattjcly mattjcly requested a review from ngxson April 14, 2025 18:08
@mattjcly (Contributor Author):

Or, of course, I'm totally open to any other ideas you may have on the proper flow from loading an image to getting the number of image output tokens.

@ngxson (Collaborator) commented Apr 14, 2025

The use case (whether or not it is the best way to use the APIs) was that, before the clip_image_* structs became hidden from the public API, you could use clip_image_load_from_bytes to load an image into a clip_image_u8 (and only a clip_image_u8; there was no API to load into an f32):

Thanks for the clarification. Indeed, one of the main reasons why I no longer allow modifying clip_image_f32 freely (in your case, you want to set img_f32.nx/ny manually) was that this could be used in a wrong way.

Indeed, I think one misconception here is that a u8 image corresponds to exactly one f32 image, but this is not true for models using slices like llava-uhd or minicpm-v. These models can produce multiple f32 images (multiple slices) from one u8 image. In your case, the logic only returns the number of tokens of the first slice, which is not correct.

One option maybe worth considering: Maybe it'd be clearer for now to just make the clip_img_n_output_tokens_by_dim

I prefer this way, but the actual implementation will not be as straightforward as you might think. In the case of llava-uhd, you will need to call the slicing logic to determine the number of slices and their respective sizes.

Indeed, I had already planned to work on this refactoring soon; basically, we should separate the image manipulation into 2 dedicated parts:

  • One part decides how to slice or resize the image. Input is the image size and output is a list of slice coordinates.
  • The other part manipulates the image. Input is the u8 image plus the coordinates that we want to slice.

So with this separation, clip_img_n_output_tokens_by_dim only needs the image size, and it will output the exact number of tokens, even when the image has multiple slices.
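The two-part separation described above could be sketched like this (the slicing policy here is a deliberately naive fixed-size tiling for illustration; real llava-uhd slicing chooses a grid based on aspect ratio, and all names are hypothetical):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical slice descriptor: where one slice sits in the source u8 image.
struct slice_coord { int x, y, w, h; };

// Phase 1 (planning): input is only the image size, output is the list of
// slice coordinates. No pixel data is touched here.
static std::vector<slice_coord> plan_slices(int nx, int ny, int slice_size) {
    std::vector<slice_coord> out;
    for (int y = 0; y < ny; y += slice_size) {
        for (int x = 0; x < nx; x += slice_size) {
            out.push_back({x, y,
                           std::min(slice_size, nx - x),
                           std::min(slice_size, ny - y)});
        }
    }
    return out;
}

// Because the plan is dims-only, a token count can account for every slice
// without any image buffer (phase 2 would do the actual cropping/resizing).
static int n_output_tokens_by_dim(int nx, int ny, int slice_size, int tokens_per_slice) {
    return (int) plan_slices(nx, ny, slice_size).size() * tokens_per_slice;
}
```

With this split, the earlier pitfall disappears: the count is computed from the same plan the preprocessor will execute, so a multi-slice image can no longer be under-counted as a single slice.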

I will work on this in the coming days and will tag you on the PR.


NOTE: You may also need to take into account wrapper tokens like <image> and <slice>, which are currently not handled by clip.cpp but are supported by my new mtmd_tokenize.

@mattjcly (Contributor Author):

Sounds great. Thanks for your time here @ngxson. Will close this; it can be used for reference however needed 👍
