How does Koboldcpp/llama.cpp Change the Math on System Requirements? #215
Replies: 6 comments 2 replies
-
FWIW, if you can't find the answer you're looking for here, try searching around Local Llama on Reddit. There's also a wiki there with some interesting numbers, but I'm not sure how frequently it's updated.
-
Well, looks like I'll get to do some first-hand research. I jumped on that machine I posted earlier, and should have it by this weekend. Seemed like too good a deal to pass up. That configuration typically goes for almost $4,000.
-
So, I purchased the machine I referred to earlier in this thread. I’ve had it for a couple of days now, and my experiences thus far are as follows:
-
Thanks for the response! I've thought about using both flags, but I was unclear on what parameters to use. For --useclblast, is it just 1 for on and 0 for off? As for --gpulayers, how do I know what a good starting point would be? My GPU has 8 gigs of VRAM, so 8?
-
Ah, got it after some tinkering. It's --useclblast 0 0 on my machine, and for --gpulayers, it looks like 10 is about the most I can do, at least with the Guanaco 65B model I'm currently using. It does seem to help a lot, though!
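For anyone finding this thread later, here's a rough sketch of what that kind of launch can look like. The model filename below is only a placeholder, and the layer count is simply the value that happened to fit on an 8 GB card here; check --help on your own build, since flags shift between KoboldCpp releases.

```
# Hypothetical example: offload 10 layers to the GPU via CLBlast, using
# OpenCL platform 0, device 0. Replace the model path with your own file.
python koboldcpp.py guanaco-65B.ggmlv3.q4_0.bin --useclblast 0 0 --gpulayers 10
```

As a very rough rule of thumb: a 4-bit 65B GGML file is on the order of 35-40 GB spread over roughly 80 layers, so each layer works out to about half a gigabyte. With 8 GB of VRAM and some headroom left for context and scratch buffers, landing somewhere around 10-14 layers lines up with what's reported above.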
-
Somehow, for me --gpulayers slows down both prompt ingestion and token generation, even when using only around 10 layers. But that could be because I only have an AMD RX 580 8GB.
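One thing that might be worth ruling out in a case like this (a guess on my part, not something confirmed in the thread): the two numbers after --useclblast are the OpenCL platform and device indices, so on a machine that exposes more than one OpenCL platform, 0 0 can end up pointing at the CPU or an integrated GPU rather than the discrete card. If you have clinfo installed, it will list the available platforms and devices:

```
# List the OpenCL platforms and devices; the order they appear in corresponds
# to the indices passed to --useclblast <platform> <device>.
clinfo -l
```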
-
I have a 10-year-old HP ZBook with an NVIDIA Quadro K1100M, an Intel i7-4800MQ processor, and 32 GB of RAM, and I'm shocked at how well it's running Koboldcpp.
I just tested with TheBloke/WizardLM-30B-Uncensored-GGML. Performance is by no means amazing, but it's really not at all bad, which is shocking in its own right. I'm able to generate 80 tokens in under 5 minutes or 150 tokens in under 10.
I've been doing most of my text generation via Runpod, because I didn't want to spend $5,000 or more on a system with a massive GPU. However, considering how well this old beater laptop is handling a 30B model, how much would I really need to spend now to get something that performs significantly better with Koboldcpp? If I wanted to be able to run 65B models, how would something like this perform? https://www.lenovo.com/us/en/p/workstations/thinkstation-p-series/thinkstation-p360-tiny/30fa0024us?cid=us:seo:41h72l&nis=8
If not that machine, what configurations are people using successfully?
Thanks,