How to clear out llama_context state to be able to keep passing new inputs to LLaVa? #3620

trzy · 2023-10-13T22:13:58Z

trzy
Oct 13, 2023

Hi,

I've wrapped llama.cpp's llava example in a web server so that I can send multiple requests without having to incur the overhead of starting up the app each time. However, I'm not sure how to reset the model state to pass in new requests. I currently free and re-create llama_context on each inference request but this is still a fairly heavyweight operation. Surely there is a way to clear out the context without having to reallocate all of the memory, load the Metal shader again (on macOS), etc.?

I'm having trouble following the interactive llama code but will keep digging. In the meantime, any pointers or explanation of what might need to be done would be greatly appreciated!

My code is here: https://github.com/trzy/llava-cpp-server/blob/main/llava_server.cpp

Note that run_llava_thread() calls perform_inference(), which has to create a new llama_context each time. This is what I'm hoping to streamline.

Thank you,

Bart

Answered by ggerganov

Oct 15, 2023

Can't guarantee it will work, but I think you just have to call llama_kv_cache_tokens_rm(ctx, -1, -1); before every new input


ianscrivener · 2023-10-13T22:28:26Z

ianscrivener
Oct 13, 2023

The first release of LLaVA doesn't seem to support interactive mode... it processes one prompt and finishes. My guess is that LLaVA interactive mode will become functional in later releases... and I'm keenly awaiting that. Also, a LLaVA server is on the ToDo list.

1 reply

trzy Oct 13, 2023
Author

Hah! They're using the same server lib I added. I obviously have not been keeping up with this project :D
Nevertheless, rather than wait, I'd like to see if I can get it working in my repo in any way. I guess looking at llama.cpp's interactive mode would offer some hints as to how to accomplish it?

ggerganov · 2023-10-15T10:15:55Z

ggerganov
Oct 15, 2023
Maintainer

Can't guarantee it will work, but I think you just have to call llama_kv_cache_tokens_rm(ctx, -1, -1); before every new input

1 reply

trzy Oct 15, 2023
Author

That seems to work! Thanks a lot! If no one else is looking at web server integration I could take a pass at it but not sure what other modes besides a straightforward inference (prompt, system prompt, image) would be needed/supported by LLaVA.

monatis · 2023-10-15T22:53:32Z

monatis
Oct 15, 2023
Collaborator

#3589 also includes an attempt to support LLaVA inference in the server, but the main focus of that PR is different, so I'm not sure when it will land in master. However, I think it offers the right approach to this:

Determine a placeholder tag that can be used to mark the position of image embeddings in the whole prompt.
The client should pass a base64-encoded image in the request in addition to the prompt text.
The server should check for an image_data key in the JSON and base64-decode it if it's available.
Then, eval the part before the image tag as a string, eval the image embeddings, eval the part after the image tag as a string again.

This is versatile and flexible enough to allow all sorts of experiments with LLaVA, e.g., image-only input, text-only input, image + text input, text + image + text input, placing the image anywhere you want, etc., making it possible to converse with LLaVA over several turns.
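
As a rough illustration of the steps listed above, here is a minimal, self-contained sketch of the two pieces that do not depend on llama.cpp itself: splitting the prompt at a placeholder tag and base64-decoding the image payload. The type and function names (`PromptParts`, `split_at_image_tag`, `base64_decode`) and the example tag are hypothetical and are not taken from #3589; the actual eval calls for the text and the image embeddings are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: the text on either side of the image placeholder.
struct PromptParts {
    std::string before;   // text to eval before the image embeddings
    std::string after;    // text to eval after the image embeddings
    bool has_image_slot;  // false when the tag does not appear in the prompt
};

// Splits `prompt` at the first occurrence of `tag`.
// The tag itself is an assumption; the thread does not fix a particular one.
inline PromptParts split_at_image_tag(const std::string& prompt, const std::string& tag) {
    assert(!tag.empty());
    const std::size_t pos = prompt.find(tag);
    if (pos == std::string::npos) {
        return {prompt, "", false};  // text-only request
    }
    return {prompt.substr(0, pos), prompt.substr(pos + tag.size()), true};
}

// Decodes standard base64 (RFC 4648 alphabet). Stops at '=' padding and
// skips characters outside the alphabet, such as line breaks.
inline std::vector<std::uint8_t> base64_decode(const std::string& in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;  // holds the most recent 6-bit groups
    int bits = 0;              // number of not-yet-emitted bits in `buffer`
    for (char c : in) {
        if (c == '=') break;
        const std::size_t idx = alphabet.find(c);
        if (idx == std::string::npos) continue;
        buffer = (buffer << 6) | static_cast<std::uint32_t>(idx);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}
```

With something like this, a request handler would eval `before` as text, then the embeddings computed from the decoded image bytes, then `after` as text. When `has_image_slot` is false, the whole prompt is evaluated as text only.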

0 replies

syntheticgio · 2025-02-28T14:25:50Z

syntheticgio
Feb 28, 2025

For anyone else that stumbles upon this and finds that llama_kv_cache_tokens_rm(ctx, -1, -1); no longer seems to exist, it looks like it was changed to llama_kv_cache_seq_rm(ctx_, -1, -1, -1); according to this PR: #3843

2 replies

solix Feb 28, 2025

Hi @syntheticgio, thanks for pointing out that llama_kv_cache_seq_rm(ctx_, -1, -1, -1) replaced llama_kv_cache_tokens_rm in PR #3843! I’m using the llama_cpp Python bindings (version 0.2.9) and trying to clear the KV cache with this function. I’ve attempted to access it via llama_cpp.lib, but it doesn’t seem to expose llama_kv_cache_seq_rm directly. Do you know how to properly import the lib module or call this function from Python? Any guidance on accessing these C functions through the bindings would be super helpful. Thanks a lot!

syntheticgio Feb 28, 2025

I'm guessing a bit since I'm not working in python, but I'd imagine that this would work:

(this might be what you've already tried, not sure)

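import llama_cpp

# llama_context_default_params() only returns parameters; a context has to be created from a loaded model.
# "model.gguf" is a placeholder path, and the exact function names depend on the bindings version.
model = llama_cpp.llama_load_model_from_file(b"model.gguf", llama_cpp.llama_model_default_params())
ctx = llama_cpp.llama_new_context_with_model(model, llama_cpp.llama_context_default_params())
llama_cpp.llama_kv_cache_seq_rm(ctx, -1, -1, -1)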

This is ignoring other things like setting up the backend, but sounds like you probably have all of that working. All of the cbindings should be in llama_cpp, as far as I understand.

vista497 · 2025-03-18T18:50:56Z

vista497
Mar 18, 2025

Is there a way to refill the cache with new data after clearing it (like it's done in Ollama, where the entire message history is passed with each request)? Is this even possible when using llama.cpp directly in a C++ project? Currently, I tried clearing the cache using llama_kv_cache_clear and refilling it through llama_decode, but this function is very resource-intensive.
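
One common way to reduce that cost, sketched here under assumptions rather than as a confirmed answer from this thread, is to avoid clearing the whole cache: compare the tokens already in the cache with the tokens of the new request, keep the shared prefix, and decode only the remainder. The helper names and the `token_t` alias below are hypothetical stand-ins, and the sketch covers only the bookkeeping, not the llama.cpp calls.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for llama.cpp's token type (an assumption for this sketch).
using token_t = std::int32_t;

// Returns how many leading tokens `cached` and `incoming` have in common.
inline std::size_t common_prefix_length(const std::vector<token_t>& cached,
                                        const std::vector<token_t>& incoming) {
    std::size_t n = 0;
    while (n < cached.size() && n < incoming.size() && cached[n] == incoming[n]) {
        ++n;
    }
    return n;
}

// Returns the tokens that still have to be decoded once the shared prefix is kept.
inline std::vector<token_t> tokens_to_decode(const std::vector<token_t>& cached,
                                             const std::vector<token_t>& incoming) {
    const std::size_t keep = common_prefix_length(cached, incoming);
    return std::vector<token_t>(incoming.begin() + static_cast<std::ptrdiff_t>(keep),
                                incoming.end());
}
```

With the prefix length in hand, the idea would be to remove only the cache entries from that position onward and pass just the remaining tokens to llama_decode. Which removal call and arguments do that depends on the llama.cpp version in use. When the message history grows by appending, most of each request is shared prefix, so little has to be recomputed.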

0 replies
