Replies: 4 comments 8 replies
-
Unfortunately not likely in the immediate future, as this is a CUDA-specific implementation which will not work on other GPUs, and it requires huge (300 MB+) libraries to be bundled for it to work, which goes against the lightweight and portable approach of koboldcpp. But it's potentially possible in future if someone gets around to implementing the necessary kernels in OpenCL, or a more portable solution is found.
-
This is now added as a special edition: https://github.com/LostRuins/koboldcpp/releases/tag/koboldcpp-1.22-CUDA-ONLY. Don't get too used to it though; the long-term goal is still to use CLBlast and keep koboldcpp lean and clean.
-
Hey, does this feature work if one only uses the --gpulayers CMD line argument, or does some other CMD line argument need to be invoked as well? I experimented with the build quite a bit, but despite seeing my 4 GB of VRAM fill up (I maxed out at --gpulayers 8), my RAM was still maxed out as usual and token generation speed was the same both with and without --stream. I'm on CUDA 11.8. It all appears to be 'working' fine, I'm just not seeing any benefit in performance or in RAM savings. I inferred that "GPU offloading" implied the change would speed up token generation and possibly save RAM, but I couldn't figure out if that was correct. I just thought it was odd that nothing appeared to functionally change for me at all. Any direction is greatly appreciated!
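
For context, this is roughly the invocation being described above, expressed as a small Python launcher sketch. The --gpulayers and --stream flags come straight from the comment; the script name, model path, and positional argument order are assumptions and may differ between koboldcpp versions, so treat this as illustrative rather than authoritative.

```python
import subprocess

# Minimal sketch: launch koboldcpp with 8 transformer layers offloaded to VRAM.
# "koboldcpp.py" and "model.ggml.bin" are placeholder names; --gpulayers and
# --stream are the flags discussed in this thread.
subprocess.run(
    [
        "python", "koboldcpp.py",
        "model.ggml.bin",      # placeholder model path
        "--gpulayers", "8",    # number of layers to offload to the GPU
        "--stream",            # enable token streaming in the web UI
    ],
    check=True,
)
```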
-
If you have multiple GPUs, will koboldcpp split the LLM model to combine the GPU memory?
-
Will GPU acceleration (below) be integrated into Koboldcpp? If so, when is it likely? Very exciting stuff.
ggml-org#1375