Replies: 5 comments · 4 replies
-
I have three 1080 Ti GPUs in my system with 68GB RAM. My setup takes somewhere around 30s to 50s to generate a response. Currently the codebase uses LangChain with the Transformers library, and in that setup we can only use unquantized models. I am looking into LlamaIndex as a replacement for LangChain; we might then be able to integrate quantized models, which should give a speed boost.
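(For anyone curious what that could look like: below is a rough sketch of loading a 4-bit quantized model through the Transformers + bitsandbytes route. The model name and quantization settings are only examples, not something the project currently ships with.)

```python
# Sketch only: load a 4-bit quantized model via transformers + bitsandbytes.
# Model name and quantization settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"  # example model from this thread

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)
```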
-
Thank you for the reply. Quantized model compatibility would be a nice step forward for the project! :-)
-
I am using 64GB RAM, an i5 with 6 cores, and an NVIDIA GeForce GTX 1060 with 6GB VRAM. The OS I am currently running is Ubuntu 22.04. The model I used is "vicuna-7b-1.1-HF" with the "hkunlp/instructor-large" instructor embeddings. The answers took about 2 to 3 minutes to compute, which is slower than I anticipated, but they were pretty good. I tracked GPU and CPU usage during the tests: GPU usage was only around 5 to 10%, which confuses me and might actually have come from a different program running at the same time, while the CPU sat at 99% almost the entire time. So I also have a question: I asked ChatGPT about the performance of a system like mine, and it told me that an LLM like ChatGPT does not depend that much on the GPU but rather on the CPU, because the operations are not that parallelizable. Is that actually the case?
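(One thing worth checking, given the 5 to 10% GPU usage: whether the model actually ended up on the GPU at all. A quick sanity check, assuming a standard PyTorch/Transformers setup:)

```python
# Sanity check: is CUDA visible, and did the model land on the GPU?
import torch

print(torch.cuda.is_available())          # False would explain the 99% CPU load
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce GTX 1060 6GB"

# After loading the model (variable name is illustrative):
# print(next(model.parameters()).device)  # should print "cuda:0", not "cpu"
```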
-
I was curious about the performance on GPU but I don't own a beefy PC, so I ported parts of this project to Modal.com. (I have no affiliation with Modal whatsoever; it was just a cloud provider I found that bills on demand and was relatively easy to use.)
If anyone wants to take a look and experiment with different types of GPU: https://github.com/fjsousa/modal-localGPT/. They have free credits as well.
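(A minimal sketch of what a GPU-backed function on Modal can look like; the GPU type, app name, and installed packages are placeholders, Modal's API has changed between versions, and the linked repo has the actual port.)

```python
# Minimal Modal sketch: run a function on a cloud GPU and report which GPU it got.
# GPU type and installed packages are placeholders.
import modal

app = modal.App("localgpt-gpu-test")
image = modal.Image.debian_slim().pip_install("torch")

@app.function(gpu="T4", image=image)
def check_gpu() -> str:
    import torch
    return torch.cuda.get_device_name(0)

@app.local_entrypoint()
def main():
    print(check_gpu.remote())
```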
-
Running everything with default settings on CPU, on an old HP ProLiant DL360p Gen8 with dual 8-core CPUs, inside a minimal Ubuntu 22.04 container on Proxmox. I allocated 32GB of the 80GB in the system, but it only uses 3GB of RAM. Proxmox claims 50% CPU usage no matter how many cores I allocate (16 or 32). It takes about 3 to 4 minutes to produce an answer in this setup. A quad-core i7-1165G7 laptop with 16GB of RAM, also in CPU mode, takes about 5 minutes to generate an answer. I'm not sure yet whether the size of the ingested documents matters: a couple dozen PDFs, about half a gig worth, take 4 hours to ingest on the ProLiant, and after 18 hours on the laptop I killed the ingest job because I needed the machine for work.
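(If anyone wants to see where the ingest time goes, a rough way to benchmark just the embedding step, which is usually the slow part on CPU, is sketched below. It uses the instructor model mentioned earlier in the thread; the chunk texts are placeholders.)

```python
# Rough benchmark of the embedding step only.
# Model name matches the one used elsewhere in this thread; texts are placeholders.
import time
from langchain.embeddings import HuggingFaceInstructEmbeddings

embeddings = HuggingFaceInstructEmbeddings(
    model_name="hkunlp/instructor-large",
    model_kwargs={"device": "cpu"},  # try "cuda" if a GPU is available
)

chunks = ["an example chunk of PDF text"] * 100

start = time.time()
embeddings.embed_documents(chunks)
print(f"{len(chunks) / (time.time() - start):.1f} chunks/sec")
```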
-
Hi,
The average response from the model TheBloke/Wizard-Vicuna-7B-Uncensored-HF takes about 50 seconds to 1.5 minutes.
I downgraded to CUDA 11.8, but CPU and GPU performance on this laptop seems very similar in localGPT.
What kind of performance should I expect with this laptop and GPU?
Is there any room for optimization and speed-up with this LLM?
Could you share your performance for comparison?
Best regards! :-)
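(For a like-for-like comparison between setups, a raw tokens-per-second number may be more useful than wall-clock answer times. A rough sketch, with an arbitrary prompt and generation length, using the model named in the comment above:)

```python
# Rough tokens-per-second measurement for comparing setups.
# Prompt and max_new_tokens are arbitrary; model name is from the comment above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Wizard-Vicuna-7B-Uncensored-HF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Explain what localGPT does.", return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=128)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/sec")
```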