- 
                Notifications
    You must be signed in to change notification settings 
- Fork 13.4k
clip : refactor graph builder #13321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clip : refactor graph builder #13321
Conversation
| Thanks @mattjcly from LM Studio for testing this PR, we confirmed that this produce exactly the same results as  I'm merging this PR once the CI is green. | 
| I benchmarked a few VL models, and they all perform well except for Qwen-VL v2 (2B and 8B). Currently, all accuracy scores are zero for both classification and extraction tasks. I tested with llama-mtmd-cli on this branch and can reproduce the problem on Windows using the CUDA backend. It works fine on CPU. To reproduce, simply use the attached image with the prompt “Give me the number.” Every run with the new CLIP yields an incorrect result (for example, “The number is 3.”). With the previous version, it worked correctly 100% of the time. List of tested models: 
 Let me know if you need any further information or testing from my end. | 
| @lcarrere ok thanks for testing. So if I understand correctly, you're getting this problem only with qwen2 VL, while the other models are running correctly, right? | 
| I tested on my side but it always give me the correct answer: 
 Running on Metal backend, mac M3 Max (Same result with  Which commit or build number on  | 
| Just to be sure, I updated this PR to latest master, feel free to give it a try | 
| I tested with the master to which I added your new clip.cpp. | 
| Ok hopefully 56b41af resolves the problem. I forgot that only qwen 2.5 uses RMS norm, the older version use the "normal" norm | 
| I believe you got it! llama.cpp is OK and all my unit test are now green with all backends. Thanks a lot @ngxson | 
| OK, so everything I'm testing directly with llama.cpp is working on CUDA and other backends. | 
| ok me again. @ngxson please bear with me. I though the master was merged on this PR but it is actually not, it is only on the fork. In other words: -> OK git checkout master -> KO | 
| Ok thanks for testing, no problem, I'll investigate this more. This is kinda expected because before this PR, qwen2 use the same graph builder with llava, while in this PR I merge qwen 2 + 2.5 vl into once since they all use M-RoPE. Probably just missing some nodes | 
| Ok seems like I miss one spot, hopefully this fixes the problem: 37e24b1 | 
| Thanks, I just hope I'm not wasting your time.... I dig a little further. I've updated everything from the master branch (#13344). And I set the temperature on 0 to get greedy sampling as I've observed a fairly higher perplexity with the new clip and my code base, which was obviously more noticeable with smaller model and leading to random conclusions... See the screenshot: 
 tested model: qwen 2VL 2B. I hope this helps! | 
| 
 on it... | 
| okay this seems to work!! Thanks again Xuan | 
| OK, all regressive tests are OK. Same accuracy / same speed. Thanks again for your hard work!! | 
| Perfect, thanks again for testing! | 
| Okay, on my Todo list. | 


Motivation:
In this PR:
struct clip_graphwhich contains various graph builder, for exampleclip_graph::build_llava()clip_graph::build_attnand reuse it in the arch-specific graphs --> we can experiment with flash attn support a next PRclip_graph::build_qwen2vl()TODO:
remove the--> not doing it here, but will be a dedicated PRload_image_size, this is a non-thread-safe hack introduced by minicpmvTest: