Your GPU is probably not being used at all, which would explain the slow response speed.
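
One quick way to verify this is to watch VRAM usage while the model is generating. A minimal sketch, assuming an NVIDIA card with nvidia-smi on the PATH:

```python
import subprocess
import time

# Poll VRAM usage while the model answers; if the number never rises
# above its idle value, the GPU is not being used at all.
# Uses nvidia-smi's documented --query-gpu interface.
for _ in range(10):
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(used)
    time.sleep(1)
```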

You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (2 bytes each), the weights alone are 14 GB in size.
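
The arithmetic, as a quick sketch (weights only; the KV cache and runtime buffers come on top):

```python
# Weight memory for a 7-billion-parameter model at common bit widths.
PARAMS = 7e9  # 7 billion parameters

for bits in (16, 8, 4, 2):
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{bits:>2}-bit weights: {gb:5.2f} GB")
# 16-bit: 14.00 GB, 8-bit: 7.00 GB, 4-bit: 3.50 GB, 2-bit: 1.75 GB
```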

As your GPU only has 6 GB of VRAM, it will probably not be usable for any reasonably sized model.

For example, I have an RTX 3070 with 8 GB, and even with the 2-bit quantized version of a 7-billion-parameter model (which probably has very low quality) I run out of GPU RAM, because cuBLAS requires extra space.
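
The same arithmetic as a rough fit check, with the runtime overhead as an explicit parameter. The overhead value below is a placeholder assumption; the actual cuBLAS workspace and context buffers depend on the backend and context length:

```python
def vram_headroom_gb(params, bits, vram_gb, overhead_gb):
    """VRAM left after the weights plus an assumed flat runtime overhead
    (cuBLAS workspace, context buffers); negative means it won't fit."""
    weights_gb = params * bits / 8 / 1e9
    return vram_gb - weights_gb - overhead_gb

# 2-bit 7B model on an 8 GB card, with a placeholder 2 GB overhead figure.
# My 3070 still ran out of memory in practice, so the real overhead there
# was evidently larger than this placeholder suggests.
print(f"{vram_headroom_gb(7e9, 2, 8.0, overhead_gb=2.0):+.2f} GB headroom")
```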
