--prompt-cache FILE meaning #8538
-
Yes and no. It's not that simple, otherwise the file would be much smaller and loading would be slower. But you can still imagine it as such and not be too far from what it actually does. It includes metadata about the model (including the RNG state), the tokens themselves, as well as the KV cache and the logits of the last eval (these last two take up most of the space).
It simply means that you should not rely on the restore being perfect. In theory it should make the inner state exactly the same as it was before saving, although there might be bugs (for example, RNG save and restore has been broken for a while). Also note that the format of the session files can change between versions, which means it might not always be possible to load them and get back the original saved state (I think you'll get an error in that case anyway).
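For a concrete picture of where those pieces come from, here is a minimal sketch of the save path using the llama.cpp C API. To be clear about what is assumed: this is not the exact code of examples/main, the function names follow llama.h of roughly the time of this thread (some have been renamed since), and model.gguf, prompt.cache and the prompt string are placeholders. Evaluating the prompt once fills the KV cache and the last logits inside the context, and llama_state_save_file then writes that state out together with the token list.

```c
// save_session.c - hedged sketch: evaluate a prompt once, then write a
// --prompt-cache style session file with llama_state_save_file().
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    const char * model_path = "model.gguf";    // placeholder
    const char * cache_path = "prompt.cache";  // placeholder
    const char * prompt     = "You are a helpful assistant.";

    llama_backend_init();

    struct llama_model * model = llama_load_model_from_file(model_path, llama_model_default_params());
    if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

    struct llama_context * ctx = llama_new_context_with_model(model, llama_context_default_params());

    // tokenize the prompt (add BOS, do not parse special tokens)
    llama_token tokens[512];
    int n_tokens = llama_tokenize(model, prompt, (int) strlen(prompt),
                                  tokens, 512, /*add_special=*/true, /*parse_special=*/false);
    if (n_tokens < 0) { fprintf(stderr, "prompt too long\n"); return 1; }

    // one forward pass over the whole prompt: this is what fills the KV cache
    // and the logits that the session file later captures
    if (llama_decode(ctx, llama_batch_get_one(tokens, n_tokens, 0, 0)) != 0) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }

    // serialize the context state (metadata/RNG, KV cache, logits) plus the token list
    if (!llama_state_save_file(ctx, cache_path, tokens, (size_t) n_tokens)) {
        fprintf(stderr, "failed to save session file\n");
        return 1;
    }

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```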
-
First of all, thank you for your reply. Let me recap to check whether I've got it. In the prompt-cache file I will find the following: metadata about the model (including the RNG state), the prompt tokens, the KV cache, and the logits of the last eval. Is my understanding correct?
Even though I understand the general LLM architecture (see the figure), for simplicity I always think of the LLM as one big multidimensional matrix (even if I know there are actually multiple of them connected in cascade with attention layers and activation functions), i.e. a simple function that, for a given input, produces an output. So I assume this prompt cache doesn't alter the model itself; it simply speeds up its pre- and post-processing. Is my understanding correct? If not, which layer of the LLM architecture does the prompt cache impact?
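To make the "does it alter the model itself?" part concrete, here is a rough sketch of the resume path, assuming the C API of roughly the same era (function names may differ in newer versions; model.gguf, prompt.cache and the new text are placeholders). The weights come from the GGUF file and are never written back; loading the session file only restores per-context state (mainly the per-layer KV cache of the attention blocks, plus the logits and RNG), so a follow-up turn only needs a forward pass over the new tokens.

```c
// resume_session.c - hedged sketch: restore a saved session, then decode only
// the tokens that come after the cached prefix. The weights loaded from the
// GGUF file are untouched by the session file.
#include "llama.h"
#include <stdio.h>
#include <string.h>

int main(void) {
    const char * model_path = "model.gguf";    // placeholder
    const char * cache_path = "prompt.cache";  // placeholder
    const char * new_text   = " And the user asks a new question.";

    llama_backend_init();
    struct llama_model   * model = llama_load_model_from_file(model_path, llama_model_default_params());
    struct llama_context * ctx   = llama_new_context_with_model(model, llama_context_default_params());

    // restore the context state (KV cache, logits, RNG) plus the cached token list;
    // the context must be compatible with the one that saved the file
    llama_token cached[2048];
    size_t n_cached = 0;
    if (!llama_state_load_file(ctx, cache_path, cached, 2048, &n_cached)) {
        fprintf(stderr, "failed to load session file\n");
        return 1;
    }

    // tokenize only the new text and evaluate it on top of the cached prefix;
    // the prompt tokens already in the KV cache are not re-evaluated
    llama_token extra[512];
    int n_extra = llama_tokenize(model, new_text, (int) strlen(new_text),
                                 extra, 512, /*add_special=*/false, /*parse_special=*/false);
    if (n_extra < 0) { fprintf(stderr, "new text too long\n"); return 1; }

    if (llama_decode(ctx, llama_batch_get_one(extra, n_extra, (llama_pos) n_cached, 0)) != 0) {
        fprintf(stderr, "decode failed\n");
        return 1;
    }

    // ... sampling the next token would start from llama_get_logits(ctx) here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```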
-
Hi Team,
I am new to llama.cpp and I would like to understand the meaning of the option in the subject.
I read the doc here:
https://github.com/ggerganov/llama.cpp/tree/master/examples/main#prompt-caching
In particular, regarding the sentence describing what the option does (caching the model state after the initial prompt): can I assume the binary stored in the file is a simple representation of the tokens in the prompt?
When you say:
"you are not guaranteed to get the same sequence of tokens as the original generation"
does it mean it stores a sort of summary of the prompt? I think that, if my assumption is correct, at the next question this question plus the old chat history kept in the cache is provided as input to the model. Is my understanding correct?
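On the last point, a hedged illustration of how the cached tokens get reused: this mirrors what examples/main does as far as I can tell, but it is not its exact code, and the token IDs below are made up. The file stores the actual token IDs rather than a summary, so on the next run the newly tokenized prompt (old history plus the new question) is compared against the stored list, and only the part after the longest matching prefix has to be evaluated again.

```c
// hedged illustration (not the exact examples/main code) of how a cached
// token list can be reused: keep the longest matching prefix, re-evaluate
// only what comes after it.
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef int32_t llama_token;  // same underlying type as in llama.h

// number of leading tokens shared by the cached session and the new prompt
static size_t count_matching_prefix(const llama_token * cached, size_t n_cached,
                                    const llama_token * prompt, size_t n_prompt) {
    size_t n = 0;
    while (n < n_cached && n < n_prompt && cached[n] == prompt[n]) {
        n++;
    }
    return n;
}

int main(void) {
    // made-up token IDs: the old chat history, then the same history plus a new question
    const llama_token cached[] = {1, 15043, 29892, 920, 526, 366};            // saved last run
    const llama_token prompt[] = {1, 15043, 29892, 920, 526, 366, 29973, 13}; // this run

    size_t n_reuse = count_matching_prefix(cached, 6, prompt, 8);
    printf("reusing %zu cached tokens, re-evaluating %zu new ones\n",
           n_reuse, (size_t) 8 - n_reuse);
    return 0;
}
```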