Can only use up to 85% of total VRAM during generation with --gpulayers #200
VraethrDalkr asked this question in Q&A (unanswered)
Replies: 1 comment
-
I'm just wondering if it's only me. With the CUDA-only version I was able to fill nearly 100% of my VRAM when using --gpulayers, but with OpenCL I can only go as far as about 85%. It's not a big deal, but it's nice to shave off every possible millisecond per token when generating. As an example, for a 13B q5_1 model, with CUDA I was able to offload 25 layers, while with OpenCL I cannot offload more than 23. I have a 3070 Ti Laptop GPU with 8 GB of VRAM. I just wanted to point this out since it may be a bug, or simply a limitation of my GPU or of OpenCL. Thanks :)
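A rough sketch of the arithmetic may make the 25-vs-23-layer gap easier to reason about. Everything below is an assumption for illustration only (the model file size, layer count, and reserved working space are hypothetical numbers, not values reported by koboldcpp): a backend that keeps back a little more VRAM for its own buffers ends up fitting a couple of layers fewer.

```python
# Back-of-envelope layer calculator. All sizes are illustrative assumptions,
# not values printed by koboldcpp or measured from either backend.
def max_offload_layers(vram_gb, model_size_gb, n_layers, reserved_gb):
    """Estimate how many layers fit in VRAM after reserving working space.

    vram_gb:       total VRAM on the card
    model_size_gb: size of the quantized model file (stand-in for weight size)
    n_layers:      number of transformer layers in the model
    reserved_gb:   VRAM held back for context/scratch buffers (the part that
                   could differ between backends)
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return int(usable_gb // per_layer_gb)

# Hypothetical numbers: a ~9.8 GB 13B q5_1 file with 40 layers on an 8 GB
# card. Reserving ~2.0 GB vs ~2.5 GB of working space already shifts the
# answer by a couple of layers.
print(max_offload_layers(8.0, 9.8, 40, reserved_gb=2.0))  # 24
print(max_offload_layers(8.0, 9.8, 40, reserved_gb=2.5))  # 22
```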
-
About 67% of my 4 GB GTX 1650 is the most I can use if I want to run ~2,040-token prompts. I can get to about 88% (3633/4096 MB), but in that case I have to keep the prompt pretty small. In my case, I assume CLBlast is dynamically allocating extra VRAM for prompt ingestion (because of the larger prompt), and that causes me to error out, even when I've set --contextsize 2048 as an advanced command-line argument. Because CLBlast and --gpulayers both apparently draw from the same VRAM pot, I had assumed this was a situation with no elegant solution. But maybe we're talking apples and oranges.
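To make the shared-pot point concrete, here is a minimal sketch under stated assumptions: the fixed overhead and per-token buffer size below are invented numbers, and this is not how koboldcpp or CLBlast actually account for VRAM, just an illustration of why a larger prompt leaves fewer megabytes for offloaded layers.

```python
# Minimal sketch of the "same VRAM pot" idea. The buffer sizes are made-up
# assumptions, not measurements of what CLBlast allocates.
def vram_left_for_layers(total_mb, prompt_tokens,
                         base_overhead_mb=300, per_token_kb=512):
    """Estimate VRAM left for offloaded layers once a prompt-ingestion
    buffer (which grows with the prompt) has taken its share."""
    prompt_buffer_mb = base_overhead_mb + prompt_tokens * per_token_kb / 1024
    return max(total_mb - prompt_buffer_mb, 0)

# On a hypothetical 4096 MB card, a ~2,040-token prompt leaves noticeably
# less room for layers than a short one, which is why a layer count that
# works with small prompts can error out with large ones.
print(vram_left_for_layers(4096, prompt_tokens=256))   # 3668.0 MB
print(vram_left_for_layers(4096, prompt_tokens=2040))  # 2776.0 MB
```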