Replies: 5 comments · 4 replies
-
I have three 1080 Ti GPUs in my system with 68GB RAM. My setup takes somewhere around 30s to 50s to generate a response. Currently the codebase uses LangChain with the Transformers library, and in that setup we can only use unquantized models. I am looking into LlamaIndex as a replacement for LangChain; we might then be able to integrate quantized models, which should give a speed boost.
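(For anyone curious what that could look like: below is a rough sketch of loading a 4-bit quantized model through the Transformers + bitsandbytes route. The model name and quantization settings are only examples, not something the project currently ships with.)

```python
# Sketch only: load a 4-bit quantized model via transformers + bitsandbytes.
# Model name and quantization settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"  # example model from this thread

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
```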
-
Thank you for the reply. Quantized model compatibility would be a nice step forward for the project! :-)
-
I am using 64GB RAM, an i5 with 6 cores, and an NVIDIA GeForce GTX 1060 with 6GB VRAM. The OS I am currently running is Ubuntu 22.04. The model I used is "vicuna-7b-1.1-HF" with the "hkunlp/instructor-large" instructor embeddings. The answers took about 2 to 3 minutes to compute, which is slower than I anticipated, but they were pretty good. I tracked GPU and CPU usage during the tests: GPU usage was only around 5 to 10%, which confuses me and might actually have come from a different program running at the same time, while the CPU sat at 99% almost the entire time. So I also have a question: I asked ChatGPT about the performance of a system like mine, and it told me that an LLM like ChatGPT does not depend that much on the GPU but rather on the CPU, because the operations are not that parallelizable. Is that actually the case?
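(One thing worth checking, given the 5 to 10% GPU usage: whether the model actually ended up on the GPU at all. A quick sanity check, assuming a standard PyTorch/Transformers setup:)

```python
# Sanity check: is CUDA visible, and did the model land on the GPU?
import torch

print(torch.cuda.is_available())          # False would explain the 99% CPU load
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce GTX 1060 6GB"

# After loading the model (variable name is illustrative):
# print(next(model.parameters()).device)  # should print "cuda:0", not "cpu"
```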
-
I was curious about the performance on GPU but I don't own a beefy PC, so I ported parts of this project to Modal.com. (I have no affiliation with Modal whatsoever; it was just a cloud provider I found that bills on demand and was relatively easy to use.)
If anyone wants to take a look and experiment with different types of GPU: https://github.com/fjsousa/modal-localGPT/. They have free credits as well.
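(A minimal sketch of what a GPU-backed function on Modal can look like; the GPU type, app name, and installed packages are placeholders, Modal's API has changed between versions, and the linked repo has the actual port.)

```python
# Minimal Modal sketch: run a function on a cloud GPU and report which GPU it got.
# GPU type and installed packages are placeholders.
import modal

app = modal.App("localgpt-gpu-test")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="T4", image=image)
def check_gpu() -> str:
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(check_gpu.remote())
```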
-
Running everything with default settings on CPU, on an old HP ProLiant DL360p Gen8 with dual 8-core CPUs, inside a minimal Ubuntu 22.04 container on Proxmox. I allocated 32GB of the 80GB in the system, but it only uses 3GB of RAM. Proxmox claims 50% CPU usage no matter how many cores I allocate (16 or 32). It takes about 3 to 4 minutes to produce an answer in this setup. A quad-core i7-1165G7 laptop with 16GB of RAM, also in CPU mode, takes about 5 minutes to generate an answer. I'm not sure yet whether the size of the ingested documents matters: a couple dozen PDFs, about half a gig worth, take 4 hours to ingest on the ProLiant, and after 18 hours on the laptop I killed the ingest job because I needed the machine for work.
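(If anyone wants to see where the ingest time goes, a rough way to benchmark just the embedding step, which is usually the slow part on CPU, is sketched below. It uses the instructor model mentioned earlier in the thread; the chunk texts are placeholders.)

```python
# Rough benchmark of the embedding step only.
# Model name matches the one used elsewhere in this thread; texts are placeholders.
import time
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    model_kwargs={"device": "cpu"},  # try "cuda" if a GPU is available
)

chunks = ["an example chunk of PDF text"] * 100

start = time.time()
embeddings.embed_documents(chunks)
print(f"{len(chunks) / (time.time() - start):.1f} chunks/sec")
```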
-
Hi,
The average response from the model TheBloke/Wizard-Vicuna-7B-Uncensored-HF takes about 50 seconds to 1.5 minutes.
I downgraded to CUDA 11.8, but CPU and GPU performance on this laptop seems very similar in localGPT.
What kind of performance should I expect with this laptop and GPU?
Is there any room for optimization and speed-up with this LLM?
Could you share your performance for comparison?
Best regards! :-)
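(For a like-for-like comparison between setups, a raw tokens-per-second number may be more useful than wall-clock answer times. A rough sketch, with an arbitrary prompt and generation length, using the model named in the comment above:)

```python
# Rough tokens-per-second measurement for comparing setups.
# Prompt and max_new_tokens are arbitrary; model name is from the comment above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain what localGPT does.", return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")
```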