Slow KV cache rm operation #586
jneloexpirements asked this question in Q&A · Unanswered · 0 replies
Is this related to #451?

I am running DeepSeek-R1-V3-0324-IQ4_K_R4 (ubergarm's Q4 quant). Token generation is decent: I have seen 12 t/s at zero context depth, dropping to around 66% of that at longer depths. My hardware is an Intel Xeon QYFS (ES), 512 GB of DDR5-4800 RAM, and an RTX PRO 6000.
I run the command below (for real use I swap sweep-bench for server with host/port). It puts VRAM usage at 90376 of 97887 MiB.
Raw prompt processing looks normal in sweep-bench, not irregularly slow (in this run and in past ones), and I can tolerate the TG. But my real use cases are RAG-heavy: I feed it long documents, chat about them for a while, and use web search, and I like to flip back and forth between conversations. Each switch means waiting 2-5 minutes for KV cache removal. In one case KV removal took around 3 minutes, which is IMO far too slow, and it stays slow whether I set it to 8192, 4096, 2048, or any other value.
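Since the slow step happens when switching conversations, one thing worth checking (an assumption on my part, not something confirmed for this fork) is whether the server is reusing the KV cache across requests at all: llama.cpp-style servers accept a `cache_prompt` field in the `/completion` request body so that a matching prompt prefix is kept rather than recomputed. A minimal sketch, with host, port, and endpoint assumed:

```shell
# Hedged sketch: host, port, and endpoint path are assumptions about
# the server build in use. "cache_prompt": true asks llama.cpp-style
# servers to reuse the matching KV-cache prefix between requests
# instead of reprocessing the whole prompt.
REQ='{"prompt": "long RAG context + question", "n_predict": 64, "cache_prompt": true}'
# Actual call (commented out; requires a running server):
# curl -s http://127.0.0.1:8080/completion -d "$REQ"
echo "$REQ" | python3 -m json.tool > /dev/null && echo "request body is valid JSON"
```

If the prefix is already being reused, slow switches would point at the cache-removal path itself rather than at prompt reprocessing.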
Does `ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument` have anything to do with that? How can I fix it? Any help is appreciated so that I can mitigate these before-generation slowdowns.
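On the pinned-memory error itself, two cheap checks I would try first (these are guesses, not a confirmed diagnosis): a low locked-memory ulimit can make page-locked host allocations fail on some setups, and llama.cpp-derived CUDA builds have historically honored `GGML_CUDA_NO_PINNED` to skip pinned host buffers entirely, which at least tells you whether pinning is involved:

```shell
# Guess 1: check the locked-memory limit; "unlimited" is the safe value.
ulimit -l

# Guess 2: rerun with pinned host memory disabled and see whether the
# error (and any slowdown) changes. GGML_CUDA_NO_PINNED is honored by
# upstream llama.cpp CUDA builds; whether this fork keeps it is an
# assumption, and the binary name below is a placeholder.
# GGML_CUDA_NO_PINNED=1 ./llama-sweep-bench ...
echo "locked-memory limit: $(ulimit -l)"
```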