llava: n_patches for clip_image_u8 #12944
I am unsure how my changes could have impacted build-linux-cross, but I'm happy to investigate and make any necessary changes.
examples/llava/clip.h
CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
I think these 2 API calls should be regrouped under prefix clip_img_*_get_n_output_tokens
- CLIP_API int clip_n_patches_by_img (const struct clip_ctx * ctx, struct clip_image_f32 * img);
- CLIP_API int clip_n_patches_by_img_u8 (const struct clip_ctx * ctx, struct clip_image_u8 * img);
+ CLIP_API int clip_img_f32_get_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
+ CLIP_API int clip_img_u8_get_n_output_tokens (const struct clip_ctx * ctx, struct clip_image_u8 * img);
The old clip_n_patches_by_img can be marked as deprecated (we can add a simple comment for now and will add proper __attribute__ in the future)
Hmm, okay. I just deleted it because it seemed like lots of breaking changes were already occurring as part of the refactor. But I can certainly re-add it and mark it deprecated if you'd prefer.
On second thought, I think it would be better to know in which situation you need this API. Could you provide an example (on which models) where this API can be useful? The problem is that now this API looks very confusing. If someone comes from And the worst case is that someone will try to use
Yes, 100%. If you want to, say, ensure you don't overflow the loaded ctx length, it's important to be able to check this ahead of time so you can manage the context properly and avoid crashes.
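A minimal sketch of the kind of ahead-of-time check described here, assuming the caller already knows the context size, the number of positions used so far, and the image's output token count. The function name and parameters are hypothetical and are not part of clip.h or llama.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not a llama.cpp API): returns true when an image that
 * embeds to n_image_tokens tokens still fits in a context of n_ctx tokens,
 * of which n_past are already occupied. */
bool example_image_fits_in_ctx(int n_ctx, int n_past, int n_image_tokens) {
    /* Negative counts would indicate a caller bug. */
    assert(n_ctx >= 0 && n_past >= 0 && n_image_tokens >= 0);
    /* Both operands are non-negative, so the subtraction cannot overflow. */
    return n_image_tokens <= n_ctx - n_past;
}
```

A caller could run this before evaluating the image embedding and, if it returns false, trim the context or reject the image instead of crashing mid-decode.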
Totally hear you on the redundancy/confusion. The use case (whether the best way to use the APIs or not) was that pre llama.cpp/examples/llava/llava.cpp Lines 505 to 510 in d6d2c2a
And then you want to get the number of output tokens the loaded image would be embedded to, you could do: After This worked just fine because there was no dependency on buffer precision type within the llama.cpp/examples/llava/clip.cpp Lines 2303 to 2330 in d6d2c2a
One option maybe worth considering: Maybe it'd be clearer for now to just make the So the new api surface area could be something like: In this case maybe even mark
Or of course, I'm totally open to any other ideas you may have on the proper flow from load image -> get num image output tokens.
Thanks for the clarification. Indeed, one of the main reasons why I no longer allow modifying Indeed, I think one misconception here is that a u8 image corresponds to exactly one f32 image, but this is not true for models using slices like llava-uhd or minicpm-v. These models can produce multiple f32 images (multiple slices) from one u8. In your case, the logic only returns the number of tokens of the first slice, which is not correct.
I prefer this way, but the actual implementation will not be as straightforward as you thought. In the case of llava-uhd, you will need to call the slicing logic to determine the number of slices and their respective sizes. Indeed, I already planned to work on this refactoring soon; basically, we should separate the image manipulation into 2 dedicated parts:
So with this separation, I will work on this in the coming days and will tag you on the PR. NOTE: You may also need to take into account wrapper tokens like
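The point about slices can be illustrated with a small sketch. It assumes each slice contributes one token per patch_size x patch_size patch and ignores wrapper tokens and any projector-specific reduction; the struct and function names are hypothetical stand-ins, not the real clip.cpp types or slicing logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one preprocessed f32 slice (dimensions only). */
struct example_slice {
    int nx;
    int ny;
};

/* Tokens for one slice, assuming one token per patch_size x patch_size patch. */
int example_slice_n_tokens(struct example_slice s, int patch_size) {
    assert(patch_size > 0);
    return (s.nx / patch_size) * (s.ny / patch_size);
}

/* Total over every slice produced from one u8 image. Counting only
 * slices[0] would under-report whenever n_slices > 1. */
int example_total_n_tokens(const struct example_slice *slices, size_t n_slices,
                           int patch_size) {
    int total = 0;
    for (size_t i = 0; i < n_slices; i++) {
        total += example_slice_n_tokens(slices[i], patch_size);
    }
    return total;
}
```

With three 336x336 slices and a patch size of 14, the first slice alone gives 576 tokens while the full image needs 1728, which is the gap the reviewer is warning about.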
Sounds great. Thanks for your time here @ngxson. Will close this; it can be used for reference however needed 👍
There was no easy API to get n_patches for a given clip_image_u8 (after the refactor; it used to be possible by directly accessing the image u8 dims and creating an f32 to put through clip_n_patches_by_img).
This PR adds a clip_n_patches_by_img_u8 function that allows you to do just that, and refactors the implementation to call the same clip_n_patches_by_img_dims for both f32 and u8 (and for some other functions that had to create images with empty buffers just to get n_patches, which right now only depends on the image dimensions).
Chose not to expose clip_n_patches_by_img_dims, to make only a minor addition to the API and because potentially at some point in the future the underlying precision could impact n_patches?
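The refactor described above can be sketched roughly as follows: one internal helper that depends only on dimensions, with thin wrappers per buffer type. All types and names are simplified stand-ins rather than the actual clip.cpp code, and the patch arithmetic is an assumption that holds only when one u8 image maps to exactly one f32 image.

```c
#include <assert.h>

/* Simplified stand-ins for the two image types; only the dimensions matter. */
struct example_image_u8  { int nx; int ny; unsigned char *buf; };
struct example_image_f32 { int nx; int ny; float *buf; };

/* Internal helper: the patch count depends only on the image dimensions. */
int example_n_patches_by_dims(int nx, int ny, int patch_size) {
    assert(patch_size > 0);
    return (nx / patch_size) * (ny / patch_size);
}

/* Thin public wrappers; neither one reads the pixel buffer. */
int example_n_patches_by_img_f32(const struct example_image_f32 *img, int patch_size) {
    return example_n_patches_by_dims(img->nx, img->ny, patch_size);
}

int example_n_patches_by_img_u8(const struct example_image_u8 *img, int patch_size) {
    return example_n_patches_by_dims(img->nx, img->ny, patch_size);
}
```

Keeping the dims helper internal means callers never need to allocate an empty buffer just to ask for a count, and the wrappers can diverge later if precision ever starts to matter.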