Skip every other token during llava-cli? #7591
nkasmanoff started this conversation in Ideas · Replies: 1 comment
-
Hey @ggerganov, just want to re-surface this in case you missed it. I know a lot of VLMs are adding pooling layers or resamplers to help with this, but I feel like making the option model-agnostic like this one could make it a lot easier to test. I am happy to give it a try, but would appreciate any pointers you can think of for updating the basic implementation above to use alternating tokens rather than, say, the first half.
-
Hi, I am wondering if this is something that's possible to do (and if so, where) in llava-cli.
On limited-resource compute, e.g. a Raspberry Pi, it takes quite a while for the model to start generating a response, because there are so many image tokens that must be passed into the context before any output is produced.
While this will undoubtedly harm performance, something I am keen to try is reducing the number of image tokens that get sent.
To make this easy to experiment with, I was thinking about slicing the array and taking every Nth token, or some other variant, until finding what works best.
I'm coming from a Python background where this is something very easy to update on say PyTorch, but I am not sure where to start here.
It appears possible to do this, but so far I have only figured out how to slice off a contiguous portion of the image embeddings, rather than take every other (alternating) one.
From this function
https://github.com/ggerganov/llama.cpp/blob/d041d2ceaaf50e058622d92921b3e680ffa4e9e7/examples/llava/llava.cpp#L318
Update it to:
and the number of image tokens processed gets cut in half. Is there a way to easily do this for every other, or every Nth, token instead?
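For reference, here is a minimal sketch of how that slicing could look on the C++ side. This is a hypothetical helper, not something that exists in llama.cpp; it assumes the `llava_image_embed` layout from `examples/llava/llava.h`, i.e. a flat `float` buffer of `n_image_pos` rows, each `n_embd` floats wide, which `llava_eval_image_embed` walks row by row:

```cpp
// Hypothetical helper (not part of llama.cpp): build a reduced llava_image_embed
// that keeps only every Nth image token, assuming the struct layout from
// examples/llava/llava.h (flat float buffer of n_image_pos rows x n_embd floats).
#include <cstdlib>
#include <cstring>

#include "llava.h"

static struct llava_image_embed * llava_image_embed_take_every_nth(
        const struct llava_image_embed * src, int n_embd, int stride) {
    // number of rows kept when taking indices 0, stride, 2*stride, ...
    const int n_kept = (src->n_image_pos + stride - 1) / stride;

    struct llava_image_embed * dst =
        (struct llava_image_embed *) malloc(sizeof(struct llava_image_embed));
    dst->embed       = (float *) malloc(sizeof(float) * n_kept * n_embd);
    dst->n_image_pos = n_kept;

    for (int i = 0, j = 0; i < src->n_image_pos; i += stride, ++j) {
        // copy source row i into destination row j
        memcpy(dst->embed + (size_t) j * n_embd,
               src->embed + (size_t) i * n_embd,
               sizeof(float) * n_embd);
    }
    return dst;
}
```

Called with `n_embd = llama_n_embd(llama_get_model(ctx_llama))` and `stride = 2`, and with the result passed to `llava_eval_image_embed` in place of the original embed, this would drop every other image token without touching the eval loop itself. Since both buffers are allocated with `malloc`, `llava_image_embed_free` should still work for cleanup, though I haven't tested that.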