Your current environment
```text
0: NVIDIA GeForce RTX 5070 Ti 16303MiB
1: NVIDIA GeForce RTX 5090 D 32607MiB
vllm-0.15.2rc1.dev93+g11a4c9d30.cu128-cp312-cp312-linux_x86_64.whl
```
🐛 Describe the bug
I am trying to deploy a vLLM service locally for my OpenClaw backend, so I need a sufficiently long context length.
I set VLLM_PP_LAYER_PARTITION=16,32 to assign the model layers across the two GPUs (an RTX 5070 Ti with 16GB and an RTX 5090D with 32GB):
```bash
VLLM_PP_LAYER_PARTITION=16,32 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1 \
  vllm serve ./Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --gpu-memory-utilization 0.9 \
  --served-model-name Qwen3-Coder-30B-A3B-Instruct-FP8 \
  --pipeline-parallel-size 2 \
  --enable-chunked-prefill \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
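As a quick sanity check on the partition (assuming the model has 48 hidden layers, which I believe is the `num_hidden_layers` value in its `config.json`), the split covers the whole model, with fewer layers on the smaller card:

```python
# Sanity check of VLLM_PP_LAYER_PARTITION=16,32.
# Assumption: Qwen3-Coder-30B-A3B has 48 transformer layers (config.json num_hidden_layers).
partition = [int(x) for x in "16,32".split(",")]
print(sum(partition))  # 48 -> 16 layers on the 16GB 5070 Ti, 32 on the 32GB 5090 D
```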
I encountered the following error:
```text
(EngineCore_DP0 pid=1094) ValueError: To serve at least one request with the models's max seq len (262144), (24.0 GiB KV cache is needed, which is larger than the available KV cache memory (3.18 GiB). Based on the available memory, the estimated maximum model length is 34736. Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
```
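For what it's worth, the 34736 figure looks like a simple linear scaling of the max seq len by the ratio of available to required KV cache memory (this is only my reading of the message, not necessarily how the estimator actually works):

```python
# Rough reconstruction of the estimate in the error message.
# Assumption: the estimated max length scales linearly with available KV cache memory.
max_seq_len = 262_144   # model's max seq len reported in the error
needed_gib = 24.0       # KV cache needed for the full 262,144-token context
available_gib = 3.18    # KV cache memory reported as available

print(round(max_seq_len * available_gib / needed_gib))  # ~34,734, close to the reported 34736
```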
Following the suggestion, I added `--max-model-len=34736` to the command.
After doing so, the following logs appeared:
```text
(EngineCore_DP0 pid=1351) INFO 02-08 06:33:47 [kv_cache_utils.py:1307] GPU KV cache size: 104,224 tokens
(EngineCore_DP0 pid=1351) INFO 02-08 06:33:47 [kv_cache_utils.py:1312] Maximum concurrency for 34,736 tokens per request: 3.00x
```
This suggests that the KV cache can actually hold 104,224 tokens, so the estimated maximum model length should be 104,224, not the 34,736 reported in the earlier error.
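The two numbers in the startup log are consistent with each other if the reported "maximum concurrency" is simply the total KV cache tokens divided by the per-request token budget (my assumption), which is why a single request of up to ~104k tokens looks like it should fit:

```python
# Consistency check on the startup log.
# Assumption: "maximum concurrency" = total KV cache tokens / tokens per request.
kv_cache_tokens = 104_224  # "GPU KV cache size" from the log
max_model_len = 34_736     # the value I passed via --max-model-len

print(kv_cache_tokens / max_model_len)  # ~3.00, matching the logged "3.00x"
```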
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.