mtmd : support Kimi VL model #15458
Conversation
tools/mtmd/clip.cpp

```cpp
cur = ggml_permute(ctx0, cur, 0, 2, 1, 3);
cur = ggml_cont_2d(ctx0, cur, cur->ne[0], cur->ne[1] * cur->ne[2]);
cur = build_pixel_shuffle(cur, scale_factor);
```
In the HF checkpoint, the block is called `PixelUnshuffleBlock`. I followed that when I added the comment above: `// pixel unshuffle block`. The function name is `build_pixel_shuffle`, though. Should it be `shuffle` or `unshuffle`? :) I'm personally fine with either, as long as we are consistent.
In SmolVLM (SigLIP 1) it's called "shuffle", in SigLIP 2 it's "unshuffle", and here in Kimi-VL it's called "merge_patches".
I think a more appropriate name would be `merge_patches_permute`, since it relies on permutation; that differentiates it from the patch merging done via `pool_avg_2d` (used by Gemma 3).
It should work now. No idea why Kimi-VL-A3B-Instruct outputs gibberish even with text-only input, so I removed the GGUF
```diff
-    if (!pos_embd || height * width == pos_embd->ne[1]) {
+    GGML_ASSERT(pos_embd);
+    if (height == n_per_side && width == n_per_side) {
```
@tdakhran FYI, I attempted to fix a potential bug here. For example, if the input shape is `[n_embd, 256]` and `h = 8, w = 32`, then `8 * 32 == 256`, which skips `ggml_interpolate` even when it is needed.
Thank you @ngxson, I missed this logic.
Thank you for the new models and the fixes to existing ones, @ngxson!
I would like to report that Kimi-VL-A3B-Instruct does not generate gibberish outputs in my implementation.
* convert : fix tensor naming conflict for llama 4 vision
* convert ok
* support kimi vision model
* clean up
* fix style
* fix calc number of output tokens
* refactor resize_position_embeddings
* add test case
* rename build fn
* correct a small bug
@foldl feel free to open a PR to fix it
I can also report that
Fix #14318
Support https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506 (and newer "Thinking" variants) with dynamic resolution
GGUF: https://huggingface.co/ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF
A large part of the code is copied from the LFM2 implementation, huge kudos to @tdakhran for the LFM2 code 😄
NOTE: Kimi-VL-A3B-Instruct generates gibberish output even in text-only mode. I have no idea why; if someone knows, please comment.