Your GPU is probably not being used at all, which would explain the slow response speed.
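
One quick way to verify this is to watch VRAM usage while the model is generating. A minimal sketch, assuming an NVIDIA card with nvidia-smi on the PATH:

```python
import subprocess
import time

# Poll VRAM usage while the model answers; if the number never rises
# above its idle value, the GPU is not being used at all.
# Uses nvidia-smi's documented --query-gpu interface.
for _ in range(10):
    used = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(used)
    time.sleep(1)
```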

You are using a 7-billion-parameter model without quantization, which means that with 16-bit weights (2 bytes each), the weights alone are 14 GB in size.
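
The arithmetic, as a quick sketch (weights only; the KV cache and runtime buffers come on top):

```python
# Weight memory for a 7-billion-parameter model at common bit widths.
PARAMS = 7e9  # 7 billion parameters

for bits in (16, 8, 4, 2):
    gb = PARAMS * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{bits:>2}-bit weights: {gb:5.2f} GB")
# 16-bit: 14.00 GB, 8-bit: 7.00 GB, 4-bit: 3.50 GB, 2-bit: 1.75 GB
```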

As your GPU only has 6 GB of VRAM, it will probably not be usable for any reasonably sized model.

For example, I have an RTX 3070 with 8 GB, and even with the 2-bit quantized version of a 7-billion-parameter model (which probably has very low quality) I run out of GPU RAM, because cuBLAS requires extra space.
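
The same arithmetic as a rough fit check, with the runtime overhead as an explicit parameter. The overhead value below is a placeholder assumption; the actual cuBLAS workspace and context buffers depend on the backend and context length:

```python
def vram_headroom_gb(params, bits, vram_gb, overhead_gb):
    """VRAM left after the weights plus an assumed flat runtime overhead
    (cuBLAS workspace, context buffers); negative means it won't fit."""
    weights_gb = params * bits / 8 / 1e9
    return vram_gb - weights_gb - overhead_gb

# 2-bit 7B model on an 8 GB card, with a placeholder 2 GB overhead figure.
# My 3070 still ran out of memory in practice, so the real overhead there
# was evidently larger than this placeholder suggests.
print(f"{vram_headroom_gb(7e9, 2, 8.0, overhead_gb=2.0):+.2f} GB headroom")
```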
