Using vLLM with a Tesla T4 on SageMaker Studio (ml.g4dn.xlarge instance) #5165
Replies: 11 comments
- You should use the 'half' or 'float16' dtype, since the T4 doesn't support 'bfloat16'.
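A minimal sketch of that suggestion against vLLM's offline `LLM` API; the model name is only an illustrative Mistral checkpoint, not necessarily the one used in the thread:

```python
from vllm import LLM, SamplingParams

# Force fp16 instead of bf16; the T4 (compute capability 7.5) has no bf16 support.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1", dtype="half")

out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```

Note that fp16 weights for a 7B model are still roughly 14 GB, so a 16 GB T4 can run out of memory even with the right dtype, which is what the following replies deal with.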
- I get an OOM error by using either 'half' or 'float16'.
- @paulovasconcellos-hotmart You could lower the max-model-len parameter until the error message goes away.
- Also, I see that you set gpu_memory_utilization=0.5, which is too small (it leaves almost 8 GiB of the card unused). You can also try increasing that.
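A sketch combining both suggestions, with illustrative values (2048-token context, 90% of the card); the model name is again only an example:

```python
from vllm import LLM

# A shorter context shrinks the KV cache vLLM pre-allocates, and a higher
# gpu_memory_utilization gives vLLM more of the 16 GB card to work with.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    dtype="half",
    max_model_len=2048,          # lower this until the error goes away
    gpu_memory_utilization=0.9,  # 0.5 would hand vLLM only about half the GPU
)
```

If this still runs out of memory, the quantization step discussed below is what actually makes a 7B model fit on a T4.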
- Hey @esmeetu, I tried to run the following code using … and I received the following error: …
- @paulovasconcellos-hotmart Add the quantization parameter to your code.
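For example, with an AWQ checkpoint (the exact model name here is an assumption), the quantized weights are around 4 GB, which leaves room for the KV cache on a 16 GB T4:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",  # an AWQ-quantized Mistral build
    dtype="half",                # AWQ kernels run in fp16
    quantization="awq",
    gpu_memory_utilization=0.9,
    max_model_len=4096,
)
```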
- I ran with the quantization parameter and … Do you think I can increase max-model-len?
- @paulovasconcellos-hotmart Of course. You can increase that parameter gradually until you get an OOM error; then you will know how much model length your T4 can support.
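A rough way to follow that advice programmatically. This is only a sketch, assuming it is acceptable to rebuild the engine per attempt; in practice, restarting the process for each candidate length is more reliable, since GPU memory is not always fully released:

```python
import gc
import torch
from vllm import LLM

def probe_max_len(model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
                  candidates=(2048, 4096, 8192, 16384)):
    """Try increasingly large max_model_len values and report the last one that loads."""
    best = None
    for max_len in candidates:
        try:
            llm = LLM(model=model, dtype="half", quantization="awq",
                      gpu_memory_utilization=0.9, max_model_len=max_len)
            best = max_len
            del llm
        except (ValueError, RuntimeError):
            # vLLM raises ValueError when the KV cache cannot hold max_model_len,
            # and CUDA OOM surfaces as a RuntimeError subclass.
            break
        finally:
            gc.collect()
            torch.cuda.empty_cache()
    return best

print("Largest max_model_len that fits:", probe_max_len())
```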
- I'm trying to do nearly the same thing and always get CUDA memory errors: `python -m vllm.entrypoints.openai.api_server --model TheBloke/dolphin-2.1-mistral-7B-AWQ --tensor-parallel-size 1 --dtype half --gpu-memory-utilization .95`. Running from a git installation or as Docker gives the same result. The machine has 4 x 3070 8 GB, Ubuntu 20.04. I also tried with and without --tensor-parallel-size 1/2, to no avail.
- I'm having an issue running the openai-mock server on Colab (ngrok tunneled to a public URL).
- Use this; it works for me.


- Hi everyone. I'm trying to use vLLM on a T4, but I'm facing some problems. I'm trying to run Mistral models using vllm 0.2.1. With the following code, I receive a `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your Tesla T4 GPU has compute capability 7.5.` If I use another dtype or remove the `quantization` parameter, I get an OOM error.