Skip every other token during llava-cli? #7591
nkasmanoff started this conversation in Ideas · Replies: 1 comment
-
Hey @ggerganov, just want to re-surface this in case you missed it. I know a lot of VLMs are adding pooling layers or resamplers to help with this, but I feel like making the option model-agnostic like this one could make it a lot easier to test. I am happy to give it a try, but would appreciate any pointers you can think of for updating the basic implementation above to use alternating tokens rather than, say, the first half.
-
Hi, I am wondering if this is something that's possible to do (and if so, where) in llava-cli.
On limited-resource compute, e.g. a Raspberry Pi, it takes quite a while for the model to start generating a response, because there are so many image tokens that must be passed into the context before any output is produced.
While this will undoubtedly harm performance, something I am keen to try is reducing the number of image tokens that get sent.
To make this easy to experiment with, I was thinking about slicing the array and taking every Nth token, or some other variant, until finding what works best.
I'm coming from a Python background where this is something very easy to update on say PyTorch, but I am not sure where to start here.
It appears possible to do this, but so far I have only figured out how to slice off a contiguous portion of the image embeddings, rather than take every other (alternating) one.
From this function
https://github.com/ggerganov/llama.cpp/blob/d041d2ceaaf50e058622d92921b3e680ffa4e9e7/examples/llava/llava.cpp#L318
Update it to:
and the number of image tokens processed gets cut in half. Is there a way to easily do this for every other, or every Nth, token instead?
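For reference, here is a minimal sketch of how that slicing could look on the C++ side. This is a hypothetical helper, not something that exists in llama.cpp; it assumes the `llava_image_embed` layout from `examples/llava/llava.h`, i.e. a flat `float` buffer of `n_image_pos` rows, each `n_embd` floats wide, which `llava_eval_image_embed` walks row by row:

```cpp
// Hypothetical helper (not part of llama.cpp): build a reduced llava_image_embed
// that keeps only every Nth image token, assuming the struct layout from
// examples/llava/llava.h (flat float buffer of n_image_pos rows x n_embd floats).
#include <cstdlib>
#include <cstring>

#include "llava.h"

static struct llava_image_embed * llava_image_embed_take_every_nth(
        const struct llava_image_embed * src, int n_embd, int stride) {
    // number of rows kept when taking indices 0, stride, 2*stride, ...
    const int n_kept = (src->n_image_pos + stride - 1) / stride;

    struct llava_image_embed * dst =
        (struct llava_image_embed *) malloc(sizeof(struct llava_image_embed));
    dst->embed       = (float *) malloc(sizeof(float) * n_kept * n_embd);
    dst->n_image_pos = n_kept;

    for (int i = 0, j = 0; i < src->n_image_pos; i += stride, ++j) {
        // copy source row i into destination row j
        memcpy(dst->embed + (size_t) j * n_embd,
               src->embed + (size_t) i * n_embd,
               sizeof(float) * n_embd);
    }
    return dst;
}
```

Called with `n_embd = llama_n_embd(llama_get_model(ctx_llama))` and `stride = 2`, and with the result passed to `llava_eval_image_embed` in place of the original embed, this would drop every other image token without touching the eval loop itself. Since both buffers are allocated with `malloc`, `llava_image_embed_free` should still work for cleanup, though I haven't tested that.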