How does Koboldcpp/llama.cpp Change the Math on System Requirements? #215
Replies: 6 comments 2 replies
-
FWIW, if you can't find the answer you're looking for here, try searching around Local Llama on Reddit. There's also a wiki there with some interesting numbers, but I'm not sure how frequently it's updated.
-
Well, looks like I'll get to do some first-hand research. I jumped on that machine I posted earlier, and should have it by this weekend. Seemed like too good a deal to pass up. That configuration typically goes for almost $4,000.
-
So, I purchased the machine I referred to earlier in this thread. I’ve had it for a couple of days now, and my experiences thus far are as follows:
-
Thanks for the response! I've thought about using both flags, but I was unclear on what parameters to use. For --useclblast, is it just 1 for on and 0 for off? As for --gpulayers, how do I know what a good starting point would be? My GPU has 8 gigs of VRAM, so 8?
-
Ah, got it after some tinkering. It's --useclblast 0 0 on my machine, and for --gpulayers, it looks like 10 is about the most I can do, at least with the Guanaco 65B model I'm currently using. It does seem to help a lot, though!
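For anyone finding this thread later, here's a rough sketch of what that kind of launch can look like. The model filename below is only a placeholder, and the layer count is simply the value that happened to fit on an 8 GB card here; check --help on your own build, since flags shift between KoboldCpp releases.

```
# Hypothetical example: offload 10 layers to the GPU via CLBlast, using
# OpenCL platform 0, device 0. Replace the model path with your own file.
python koboldcpp.py guanaco-65B.ggmlv3.q4_0.bin --useclblast 0 0 --gpulayers 10
```

As a very rough rule of thumb: a 4-bit 65B GGML file is on the order of 35-40 GB spread over roughly 80 layers, so each layer works out to about half a gigabyte. With 8 GB of VRAM and some headroom left for context and scratch buffers, landing somewhere around 10-14 layers lines up with what's reported above.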
-
Somehow, for me --gpulayers slows down both prompt ingestion and token generation, even when using only around 10 layers. But that could be because I only have an AMD RX 580 8GB.
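One thing that might be worth ruling out in a case like this (a guess on my part, not something confirmed in the thread): the two numbers after --useclblast are the OpenCL platform and device indices, so on a machine that exposes more than one OpenCL platform, 0 0 can end up pointing at the CPU or an integrated GPU rather than the discrete card. If you have clinfo installed, it will list the available platforms and devices:

```
# List the OpenCL platforms and devices; the order they appear in corresponds
# to the indices passed to --useclblast <platform> <device>.
clinfo -l
```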
-
I have a 10-year-old HP ZBook with an NVIDIA Quadro K1100M, an Intel i7-4800MQ processor, and 32 GB of RAM, and I'm shocked at how well it's running Koboldcpp.
I just tested with TheBloke/WizardLM-30B-Uncensored-GGML. Performance is by no means amazing, but it's really not at all bad, which is shocking in its own right. I'm able to generate 80 tokens in under 5 minutes or 150 tokens in under 10.
I've been doing most of my text generation via Runpod, because I didn't want to spend $5,000 or more on a system with a massive GPU. However, considering how well this old beater laptop is handling a 30B model, how much would I really need to spend now to get something that performs significantly better with Koboldcpp? If I wanted to be able to run 65B models, how would something like this perform? https://www.lenovo.com/us/en/p/workstations/thinkstation-p-series/thinkstation-p360-tiny/30fa0024us?cid=us:seo:41h72l&nis=8
If not that machine, what configurations are people using successfully?
Thanks,