Tips for improving prompt processing speed. #1699
mercurial-moon started this conversation in General
Replies: 1 comment
-
Yes, prompts get cached if they have the same prefix. If you send two prompts in sequence, and the second one starts with the same text as the first, then only the new part at the end needs to be processed. Also, for faster speeds, try running it in CUDA mode with 0 layers offloaded.
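To make the prefix-caching behavior concrete, here is a minimal sketch against the KoboldAI-compatible `/api/v1/generate` endpoint that koboldcpp exposes. It assumes a local instance on the default port 5001; the prompt strings and the `STATIC_PREFIX`/`generate` names are illustrative, not part of koboldcpp itself:

```python
import requests

# Assumed: local koboldcpp instance on its default port, exposing the
# standard KoboldAI-compatible generate endpoint.
API_URL = "http://localhost:5001/api/v1/generate"

# Illustrative shared prefix; in practice this is your long static text.
STATIC_PREFIX = "Long instructions and reference material that never change...\n"

def generate(prompt, max_length=100):
    # koboldcpp keeps the processed prompt cached, so a later request whose
    # prompt starts with the same text only re-evaluates the new tail.
    resp = requests.post(API_URL, json={"prompt": prompt, "max_length": max_length})
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

# First request: the full prompt (prefix + question) is processed and cached.
print(generate(STATIC_PREFIX + "Question 1: ...\n"))

# Second request: the shared prefix is reused from the cache, so only
# "Question 2: ..." has to be processed before generation starts.
print(generate(STATIC_PREFIX + "Question 2: ...\n"))
```

On the CUDA point: if I have the flags right, launching koboldcpp with `--usecublas --gpulayers 0` keeps all the weights in system RAM (so generation speed is unchanged) but uses the GPU for the batched matrix multiplies during prompt processing, which is where most of the time goes on long prompts.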
-
Hi, I have a system with an Nvidia GPU with low VRAM (2GB), 24GB of DDR4 system RAM, and a decent Intel 11th Gen i5 CPU.
I tested with a large initial prompt of around 8000 tokens, and it takes around 40 minutes to process the prompt before generating the reply at around 2-3 tok/sec. The model used was Gemma 3 27B, 4-bit quantized (IQuants).
The kobold settings were: CPU mode, no offload, no flash attention, context shift on, high priority off, KV cache not quantized, and max context set to around 14000. I was running the kobold.exe binary (the CUDA one), and the prompt was sent via the koboldcpp API.
Is there any way to improve prompt processing time? Would the next prompt take less time than this due to context shift?
Is context shift another name for prompt caching?
My 8000-token prompt is mostly static: around 7500 tokens stay the same and only about 500 tokens change across requests. Can I preprocess those 7500 tokens so they aren't re-evaluated on every request?
I'm thinking of something along these lines: I send the 7500-token static part to kobold using some special API command, and kobold processes and caches it. Then I send the dynamic portion of the prompt, kobold processes the remaining 500 tokens, and it starts generating the reply.
On subsequent queries I send a new dynamic portion, but kobold still reuses the precalculated static part.
Is something like this possible? (A sketch of one way to approximate this follows.)
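Assuming the prefix caching described in the reply above, something close to this is possible without any special API command: warm the cache by sending the static part once with a tiny generation length, then always append the dynamic part after it. This is a hedged sketch, not confirmed koboldcpp behavior; `STATIC_PART`, `warm_cache`, and `query` are made-up names, and whether a `max_length=1` call populates the cache is an assumption:

```python
import requests

API_URL = "http://localhost:5001/api/v1/generate"  # assumed local instance

STATIC_PART = "...7500 tokens of fixed context..."  # hypothetical placeholder

def warm_cache():
    # Process the static prefix once; max_length=1 keeps generation trivial.
    # If prefix caching works as described, the static tokens are now cached.
    requests.post(API_URL, json={"prompt": STATIC_PART, "max_length": 1})

def query(dynamic_part, max_length=200):
    # Append the changing ~500 tokens AFTER the static part so the cached
    # prefix still matches; only the dynamic tail gets re-evaluated.
    resp = requests.post(API_URL, json={
        "prompt": STATIC_PART + dynamic_part,
        "max_length": max_length,
    })
    resp.raise_for_status()
    return resp.json()["results"][0]["text"]

warm_cache()
print(query("\nUser: first request\nAssistant:"))
print(query("\nUser: second request\nAssistant:"))
```

The key constraint is that the changing text must sit at the end of the prompt: if anything near the start changes between requests, the cached prefix no longer matches and the whole prompt is reprocessed.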