Can only use up to 85% of total VRAM during generation with --gpulayers #200
VraethrDalkr asked this question in Q&A (unanswered)
Replies: 1 comment
-
I'm just wondering if it's only me. With the CUDA-only version I was able to fill nearly 100% of my VRAM when using --gpulayers, but with OpenCL I can only go as far as about 85%. It's not a big deal, but it's nice to shave off every possible millisecond per token when generating. As an example, for a 13B q5_1 model, with CUDA I was able to offload 25 layers, while with OpenCL I cannot offload more than 23. I have a 3070 Ti Laptop GPU with 8 GB of VRAM. I just wanted to point this out since it may be a bug, or simply a limitation of my GPU or of OpenCL. Thanks :)
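A rough sketch of the arithmetic may make the 25-vs-23-layer gap easier to reason about. Everything below is an assumption for illustration only (the model file size, layer count, and reserved working space are hypothetical numbers, not values reported by koboldcpp): a backend that keeps back a little more VRAM for its own buffers ends up fitting a couple of layers fewer.

```python
# Back-of-envelope layer calculator. All sizes are illustrative assumptions,
# not values printed by koboldcpp or measured from either backend.
def max_offload_layers(vram_gb, model_size_gb, n_layers, reserved_gb):
    """Estimate how many layers fit in VRAM after reserving working space.

    vram_gb:       total VRAM on the card
    model_size_gb: size of the quantized model file (stand-in for weight size)
    n_layers:      number of transformer layers in the model
    reserved_gb:   VRAM held back for context/scratch buffers (the part that
                   could differ between backends)
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserved_gb, 0.0)
    return int(usable_gb // per_layer_gb)

# Hypothetical numbers: a ~9.8 GB 13B q5_1 file with 40 layers on an 8 GB
# card. Reserving ~2.0 GB vs ~2.5 GB of working space already shifts the
# answer by a couple of layers.
print(max_offload_layers(8.0, 9.8, 40, reserved_gb=2.0))  # 24
print(max_offload_layers(8.0, 9.8, 40, reserved_gb=2.5))  # 22
```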
-
About 67% of my 4 GB GTX 1650 is the most I can use if I want to run ~2,040-token prompts. I can get to about 88% (3633/4096 MB), but in that case I have to keep the prompt pretty small. In my case, I assume CLBlast is dynamically allocating extra VRAM for prompt ingestion (because of the larger prompt), and that causes me to error out, even when I've set --contextsize 2048 as an advanced command-line argument. Because CLBlast and --gpulayers both apparently draw from the same VRAM pot, I had assumed this was a situation with no elegant solution. But maybe we're talking apples and oranges.
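To make the shared-pot point concrete, here is a minimal sketch under stated assumptions: the fixed overhead and per-token buffer size below are invented numbers, and this is not how koboldcpp or CLBlast actually account for VRAM, just an illustration of why a larger prompt leaves fewer megabytes for offloaded layers.

```python
# Minimal sketch of the "same VRAM pot" idea. The buffer sizes are made-up
# assumptions, not measurements of what CLBlast allocates.
def vram_left_for_layers(total_mb, prompt_tokens,
                         base_overhead_mb=300, per_token_kb=512):
    """Estimate VRAM left for offloaded layers once a prompt-ingestion
    buffer (which grows with the prompt) has taken its share."""
    prompt_buffer_mb = base_overhead_mb + prompt_tokens * per_token_kb / 1024
    return max(total_mb - prompt_buffer_mb, 0)

# On a hypothetical 4096 MB card, a ~2,040-token prompt leaves noticeably
# less room for layers than a short one, which is why a layer count that
# works with small prompts can error out with large ones.
print(vram_left_for_layers(4096, prompt_tokens=256))   # 3668.0 MB
print(vram_left_for_layers(4096, prompt_tokens=2040))  # 2776.0 MB
```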