--prompt-cache FILE meaning #8538
-
Yes and no. It's not that simple, otherwise the file would be much smaller and loading would be slower. But you can still imagine it as such and not be too far from what it actually does. It includes metadata about the model (including the RNG state), the tokens themselves, as well as the KV cache and the logits of the last eval (these last two take up most of the space).
It simply means that you should not rely on the restore being perfect. In theory it should make the inner state exactly the same as it was before saving, although there might be bugs (for example, RNG save and restore has been broken for a while). Also note that the format of the session files can change between versions, which means it might not always be possible to load them and get back the original saved state (I think you'll get an error in that case anyway).
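For a concrete picture of where those pieces come from, here is a minimal sketch of the save path using the llama.cpp C API. To be clear about what is assumed: this is not the exact code of examples/main, the function names follow llama.h of roughly the time of this thread (some have been renamed since), and model.gguf, prompt.cache and the prompt string are placeholders. Evaluating the prompt once fills the KV cache and the last logits inside the context, and llama_state_save_file then writes that state out together with the token list.

```c
// save_session.c - hedged sketch: evaluate a prompt once, then write a
// --prompt-cache style session file with llama_state_save_file().
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    const char * model_path = "model.gguf";    // placeholder
    const char * cache_path = "prompt.cache";  // placeholder
    const char * prompt     = "You are a helpful assistant.";

    llama_backend_init();

    struct llama_model * model = llama_load_model_from_file(model_path, llama_model_default_params());
    if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

    struct llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // tokenize the prompt (add BOS, do not parse special tokens)
    llama_token tokens[512];
    int n_tokens = llama_tokenize(model, prompt, (int) strlen(prompt),
                                  tokens, 512, /*add_special=*/true, /*parse_special=*/false);
    if (n_tokens < 0) { fprintf(stderr, "prompt too long\n"); return 1; }

    // one forward pass over the whole prompt: this is what fills the KV cache
    // and the logits that the session file later captures
    if (llama_decode(ctx, llama_batch_get_one(tokens, n_tokens, 0, 0)) != 0) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }

    // serialize the context state (metadata/RNG, KV cache, logits) plus the token list
    if (!llama_state_save_file(ctx, cache_path, tokens, (size_t) n_tokens)) {
        fprintf(stderr, "failed to save session file\n");
        return 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```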
-
First of all, thank you for your reply. Let me recap to check whether I've got it. In the prompt-cache file I will find the following: metadata about the model (including the RNG state), the prompt tokens, the KV cache, and the logits of the last eval. Is my understanding correct?
Even though I understand the general LLM architecture (see the figure), for simplicity I always think of the LLM as one big multidimensional matrix (even if I know there are actually multiple of them connected in cascade with attention layers and activation functions), i.e. a simple function that, for a given input, produces an output. So I assume this prompt cache doesn't alter the model itself; it simply speeds up its pre- and post-processing. Is my understanding correct? If not, which layer of the LLM architecture does the prompt cache impact?
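To make the "does it alter the model itself?" part concrete, here is a rough sketch of the resume path, assuming the C API of roughly the same era (function names may differ in newer versions; model.gguf, prompt.cache and the new text are placeholders). The weights come from the GGUF file and are never written back; loading the session file only restores per-context state (mainly the per-layer KV cache of the attention blocks, plus the logits and RNG), so a follow-up turn only needs a forward pass over the new tokens.

```c
// resume_session.c - hedged sketch: restore a saved session, then decode only
// the tokens that come after the cached prefix. The weights loaded from the
// GGUF file are untouched by the session file.
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    const char * model_path = "model.gguf";    // placeholder
    const char * cache_path = "prompt.cache";  // placeholder
    const char * new_text   = " And the user asks a new question.";

    llama_backend_init();
    struct llama_model   * model = llama_load_model_from_file(model_path, llama_model_default_params());
    struct llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // restore the context state (KV cache, logits, RNG) plus the cached token list;
    // the context must be compatible with the one that saved the file
    llama_token cached[2048];
    size_t n_cached = 0;
    if (!llama_state_load_file(ctx, cache_path, cached, 2048, &n_cached)) {
        fprintf(stderr, "failed to load session file\n");
        return 1;
    }

    // tokenize only the new text and evaluate it on top of the cached prefix;
    // the prompt tokens already in the KV cache are not re-evaluated
    llama_token extra[512];
    int n_extra = llama_tokenize(model, new_text, (int) strlen(new_text),
                                 extra, 512, /*add_special=*/false, /*parse_special=*/false);
    if (n_extra < 0) { fprintf(stderr, "new text too long\n"); return 1; }

    if (llama_decode(ctx, llama_batch_get_one(extra, n_extra, (llama_pos) n_cached, 0)) != 0) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }

    // ... sampling the next token would start from llama_get_logits(ctx) here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```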
-
Hi Team,
I am new to llama.cpp and I would like to understand the meaning of the option in the subject.
I read the doc here:
https://github.com/ggerganov/llama.cpp/tree/master/examples/main#prompt-caching
In particular, regarding the sentence describing what the option does (caching the model state after the initial prompt): can I assume the binary stored in the file is a simple representation of the tokens in the prompt?
When you say:
"you are not guaranteed to get the same sequence of tokens as the original generation"
does it mean it stores a sort of summary of the prompt? I think that, if my assumption is correct, at the next question this question plus the old chat history kept in the cache is provided as input to the model. Is my understanding correct?
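On the last point, a hedged illustration of how the cached tokens get reused: this mirrors what examples/main does as far as I can tell, but it is not its exact code, and the token IDs below are made up. The file stores the actual token IDs rather than a summary, so on the next run the newly tokenized prompt (old history plus the new question) is compared against the stored list, and only the part after the longest matching prefix has to be evaluated again.

```c
// hedged illustration (not the exact examples/main code) of how a cached
// token list can be reused: keep the longest matching prefix, re-evaluate
// only what comes after it.
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef int32_t llama_token;  // same underlying type as in llama.h

// number of leading tokens shared by the cached session and the new prompt
static size_t count_matching_prefix(const llama_token * cached, size_t n_cached,
                                    const llama_token * prompt, size_t n_prompt) {
    size_t n = 0;
    while (n < n_cached && n < n_prompt && cached[n] == prompt[n]) {
        n++;
    }
    return n;
}

int main(void) {
    // made-up token IDs: the old chat history, then the same history plus a new question
    const llama_token cached[] = {1, 15043, 29892, 920, 526, 366};            // saved last run
    const llama_token prompt[] = {1, 15043, 29892, 920, 526, 366, 29973, 13}; // this run

    size_t n_reuse = count_matching_prefix(cached, 6, prompt, 8);
    printf("reusing %zu cached tokens, re-evaluating %zu new ones\n",
           n_reuse, (size_t) 8 - n_reuse);
    return 0;
}
```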