Replies: 4 comments 8 replies
-
Unfortunately not likely in the immediate future, as this is a CUDA-specific implementation which will not work on other GPUs, and it requires huge (300 MB+) libraries to be bundled for it to work, which goes against the lightweight and portable approach of koboldcpp. But it's potentially possible in future if someone gets around to implementing the necessary kernels in OpenCL, or a more portable solution is found.
-
This is now added as a special edition: https://github.com/LostRuins/koboldcpp/releases/tag/koboldcpp-1.22-CUDA-ONLY. Don't get too used to it though; the long-term goal is still to use CLBlast and keep koboldcpp lean and clean.
-
Hey, does this feature work if one only uses the --gpulayers CMD line argument, or does some other CMD line argument need to be invoked as well? I experimented with the build quite a bit, but despite seeing my 4 GB of VRAM fill up (I maxed out at --gpulayers 8), my RAM was still maxed out as usual and token generation speed was the same both with and without --stream. I'm on CUDA 11.8. It all appears to be 'working' fine, I'm just not seeing any benefit in performance or in RAM savings. I inferred that "GPU offloading" implied the change would speed up token generation and possibly save RAM, but I couldn't figure out if that was correct. I just thought it was odd that nothing appeared to functionally change for me at all. Any direction is greatly appreciated!
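
For context, this is roughly the invocation being described above, expressed as a small Python launcher sketch. The --gpulayers and --stream flags come straight from the comment; the script name, model path, and positional argument order are assumptions and may differ between koboldcpp versions, so treat this as illustrative rather than authoritative.

```python
import subprocess

# Minimal sketch: launch koboldcpp with 8 transformer layers offloaded to VRAM.
# "koboldcpp.py" and "model.ggml.bin" are placeholder names; --gpulayers and
# --stream are the flags discussed in this thread.
subprocess.run(
    [
        "python", "koboldcpp.py",
        "model.ggml.bin",      # placeholder model path
        "--gpulayers", "8",    # number of layers to offload to the GPU
        "--stream",            # enable token streaming in the web UI
    ],
    check=True,
)
```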
-
If you have multiple GPUs, will koboldcpp split the LLM model to combine the GPU memory?
-
Will GPU acceleration (below) be integrated into Koboldcpp? If so, when is it likely? Very exciting stuff.
ggml-org#1375