Your current environment
```text
0: NVIDIA GeForce RTX 5070 Ti 16303MiB
1: NVIDIA GeForce RTX 5090 D 32607MiB
vllm-0.15.2rc1.dev93+g11a4c9d30.cu128-cp312-cp312-linux_x86_64.whl
```
🐛 Describe the bug
I am trying to deploy a vLLM service locally for my OpenClaw backend, so I need a sufficiently long context length.
I set VLLM_PP_LAYER_PARTITION=16,32 to assign the model layers across the two GPUs (an RTX 5070 Ti with 16GB and an RTX 5090D with 32GB):
```bash
VLLM_PP_LAYER_PARTITION=16,32 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 \
  vllm serve ./Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --gpu-memory-utilization 0.9 \
  --served-model-name Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --pipeline-parallel-size 2 \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
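As a quick sanity check on the partition (assuming the model has 48 hidden layers, which I believe is the `num_hidden_layers` value in its `config.json`), the split covers the whole model, with fewer layers on the smaller card:

```python
# Sanity check of VLLM_PP_LAYER_PARTITION=16,32.
# Assumption: Qwen3-Coder-30B-A3B has 48 transformer layers (config.json num_hidden_layers).
partition = [int(x) for x in "16,32".split(",")]
print(sum(partition))  # 48 -> 16 layers on the 16GB 5070 Ti, 32 on the 32GB 5090 D
```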
I encountered the following error:
```text
(EngineCore_DP0 pid=1094) ValueError: To serve at least one request with the models's max seq len (262144), (24.0 GiB KV cache is needed, which is larger than the available KV cache memory (3.18 GiB). Based on the available memory, the estimated maximum model length is 34736. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
```
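For what it's worth, the 34736 figure looks like a simple linear scaling of the max seq len by the ratio of available to required KV cache memory (this is only my reading of the message, not necessarily how the estimator actually works):

```python
# Rough reconstruction of the estimate in the error message.
# Assumption: the estimated max length scales linearly with available KV cache memory.
max_seq_len = 262_144   # model's max seq len reported in the error
needed_gib = 24.0       # KV cache needed for the full 262,144-token context
available_gib = 3.18    # KV cache memory reported as available

print(round(max_seq_len * available_gib / needed_gib))  # ~34,734, close to the reported 34736
```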
Following the suggestion, I added `--max-model-len=34736` to the command.
After doing so, the following logs appeared:
```text
(EngineCore_DP0 pid=1351) INFO 02-08 06:33:47 [kv_cache_utils.py:1307] GPU KV cache size: 104,224 tokens
(EngineCore_DP0 pid=1351) INFO 02-08 06:33:47 [kv_cache_utils.py:1312] Maximum concurrency for 34,736 tokens per request: 3.00x
```
This suggests that the KV cache can actually hold 104,224 tokens, so the estimated maximum model length should be 104,224, not the 34,736 reported in the earlier error.
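The two numbers in the startup log are consistent with each other if the reported "maximum concurrency" is simply the total KV cache tokens divided by the per-request token budget (my assumption), which is why a single request of up to ~104k tokens looks like it should fit:

```python
# Consistency check on the startup log.
# Assumption: "maximum concurrency" = total KV cache tokens / tokens per request.
kv_cache_tokens = 104_224  # "GPU KV cache size" from the log
max_model_len = 34_736     # the value I passed via --max-model-len

print(kv_cache_tokens / max_model_len)  # ~3.00, matching the logged "3.00x"
```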
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.