[BUG]: vLLM serve hangs when running KVBM in vLLM

### Describe the Bug

When I tested running KVBM in vLLM, the vLLM server hung during a benchmark with random data (batch_size: 64, input_len: 1024, output_len: 1024). After responding to some requests, it stopped responding and GPU utilization dropped to 0.

### Steps to Reproduce

Start the container:
```
root@llmops-tdc01:/llmops-data/zty/repo/dynamo# ./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds -v /mnt/llm32/:/models --image harbor.transwarp.io/aip/vllm-runtime:0.6.0
+ docker run --gpus all -it --rm --network host --runtime nvidia --shm-size=10G --ulimit memlock=-1 --ulimit stack=67108864 --ulimit nofile=65536:65536 -e HF_TOKEN -v /mnt/llm32/:/models -v /llmops-data/zty/repo/dynamo/container/..:/workspace -v /llmops-data/zty/tmp:/tmp -v /mnt/:/mnt -v /llmops-data/zty/repo/dynamo/container/.cache/huggingface:/root/.cache/huggingface -v /run/udev:/run/udev:ro -w /workspace --cap-add CAP_SYS_PTRACE --cap-add=IPC_LOCK --ipc host --privileged harbor.transwarp.io/aip/vllm-runtime:0.6.0
```
Start `vllm serve` with the KVBM connector:
```
DYN_KVBM_CPU_CACHE_GB=4 DYN_KVBM_DISK_CACHE_GB=8 CUDA_VISIBLE_DEVICES=1 vllm serve /models/Qwen/Qwen3-30B-A3B/ --served-model-name=atom -tp 1 --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}'
```
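Run the benchmark. The benchmark command itself is not included in this report, so the following is only a minimal Python sketch that approximates the described load (64 concurrent requests, 1024 random input tokens, 1024 output tokens). It assumes the server listens on vLLM's default port 8000 and that `/v1/completions` accepts a token-ID prompt and the vLLM-specific `ignore_eos` field. `VOCAB_LIMIT` and the helper names are illustrative, not part of any tool.

```python
import json
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://127.0.0.1:8000"  # assumption: vLLM's default port
MODEL = "atom"                      # from --served-model-name
BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 64, 1024, 1024
VOCAB_LIMIT = 10_000  # assumption: any range safely below the model's vocab size


def build_payload(rng: random.Random) -> dict:
    """Build one completion request whose prompt is INPUT_LEN random token IDs."""
    return {
        "model": MODEL,
        "prompt": [rng.randrange(VOCAB_LIMIT) for _ in range(INPUT_LEN)],
        "max_tokens": OUTPUT_LEN,
        "ignore_eos": True,  # assumption: forces the full OUTPUT_LEN to be generated
    }


def send(payload: dict, timeout_s: float = 600.0) -> int:
    """POST one request and return the number of generated tokens.

    The socket timeout makes a hung server surface as an exception
    instead of blocking the client forever.
    """
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.load(resp)
    return body["usage"]["completion_tokens"]


def run() -> None:
    """Send BATCH_SIZE requests concurrently and report each result."""
    rng = random.Random(0)
    payloads = [build_payload(rng) for _ in range(BATCH_SIZE)]
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        for i, n in enumerate(pool.map(send, payloads)):
            print(f"request {i}: {n} completion tokens")
```

Calling `run()` against the server started above should either print 64 result lines or raise a timeout error, which makes the hang easy to distinguish from a slow run.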


### Expected Behavior

The server completes all benchmark requests without hanging.

### Actual Behavior

`vllm serve` hangs: after responding to some requests it stops responding, and GPU utilization drops to 0.

### Environment

H100 GPU
Ubuntu 22.04

### Additional Context
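
As a sanity check on the sizes reported in the log below, here is a rough Python sketch of the KV cache arithmetic. It assumes `inner_dim` counts the elements stored per token, per layer, for each of K and V, and that the `DYN_KVBM_*_CACHE_GB` values are read as 10^9 bytes. Both assumptions are inferred from the logged numbers, not taken from the KVBM source.

```python
GIB = 1024**3

# Values copied from the log ("Initializing KvbmWorker" / "Inferred layout").
NUM_LAYERS = 48
OUTER_DIM = 2            # K and V
INNER_DIM = 512          # assumption: elements per token, per layer, per K/V
DTYPE_BYTES = 2          # dtype_width_bytes
PAGE_SIZE = 16           # tokens per block
NUM_DEVICE_BLOCKS = 8297

# Values from the launch command and the benchmark description.
CPU_CACHE_GB, DISK_CACHE_GB = 4, 8
BATCH_SIZE, INPUT_LEN, OUTPUT_LEN = 64, 1024, 1024

bytes_per_token = NUM_LAYERS * OUTER_DIM * INNER_DIM * DTYPE_BYTES  # 98304
bytes_per_block = bytes_per_token * PAGE_SIZE                       # 1572864

gpu_tokens = NUM_DEVICE_BLOCKS * PAGE_SIZE                # 132752, as logged
gpu_gib = NUM_DEVICE_BLOCKS * bytes_per_block / GIB       # about 12.15, as logged

# Assumption: "GB" means 10**9 bytes; this reproduces the logged block counts.
host_blocks = CPU_CACHE_GB * 10**9 // bytes_per_block     # 2543, as logged
disk_blocks = DISK_CACHE_GB * 10**9 // bytes_per_block    # 5086, as logged

# Peak KV demand if all requests reach full length at the same time.
peak_tokens = BATCH_SIZE * (INPUT_LEN + OUTPUT_LEN)       # 131072
peak_blocks = peak_tokens // PAGE_SIZE                    # 8192

print(f"GPU KV cache: {gpu_tokens} tokens ({gpu_gib:.2f} GiB)")
print(f"host blocks: {host_blocks}, disk blocks: {disk_blocks}")
print(f"benchmark peak: {peak_blocks} of {NUM_DEVICE_BLOCKS} GPU blocks "
      f"({peak_tokens / gpu_tokens:.1%})")
```

Under these assumptions the benchmark's peak demand is about 98.7% of the GPU KV cache. That does not establish the cause of the hang, but it suggests the hang happens while the GPU cache is almost full, which may be when KVBM offloading to host and disk is most active.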

Startup log from `vllm serve`:
```
/opt/dynamo/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 11-04 12:23:12 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1240) INFO 11-04 12:23:17 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1240) INFO 11-04 12:23:17 [utils.py:233] non-default args: {'model_tag': '/models/Qwen/Qwen3-30B-A3B/', 'model': '/models/Qwen/Qwen3-30B-A3B/', 'served_model_name': ['atom'], 'kv_transfer_config': KVTransferConfig(kv_connector='DynamoConnector', engine_id='ec002b7b-674b-45d1-9cf8-5511bb5654bd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_both', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path='dynamo.llm.vllm_integration.connector')}
(APIServer pid=1240) INFO 11-04 12:23:17 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=1240) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1240) INFO 11-04 12:23:17 [model.py:1510] Using max model len 40960
(APIServer pid=1240) INFO 11-04 12:23:17 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
/opt/dynamo/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 11-04 12:23:22 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:27 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:27 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/models/Qwen/Qwen3-30B-A3B/', speculative_config=None, tokenizer='/models/Qwen/Qwen3-30B-A3B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=atom, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=1379) W1104 12:23:28.668000 1379 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_DP0 pid=1379) W1104 12:23:28.668000 1379 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [nixl_connector.py:56] NIXL is available
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [factory.py:51] Creating v1 connector with name: DynamoConnector and engine_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) WARNING 11-04 12:23:29 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
2025-11-04T12:23:29.663106Z  INFO _core::llm::block_manager::vllm::connector::worker: KvConnectorWorker initialized with worker_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [gpu_model_runner.py:2602] Starting to load model /models/Qwen/Qwen3-30B-A3B/...
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [cuda.py:366] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=1379) Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:15<00:00,  1.03it/s]
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:46 [default_loader.py:267] Loading weights took 15.67 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:47 [gpu_model_runner.py:2653] Model loading took 56.8814 GiB and 16.046670 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:57 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/b435bf2ea3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:57 [backends.py:559] Dynamo bytecode transform time: 9.94 s
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:58 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=1379) INFO 11-04 12:24:34 [backends.py:218] Compiling a graph for dynamic shape takes 36.33 s
(EngineCore_DP0 pid=1379) WARNING 11-04 12:24:35 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/opt/dynamo/venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H100_80GB_HBM3.json']
(EngineCore_DP0 pid=1379) INFO 11-04 12:24:39 [monitor.py:34] torch.compile takes 46.27 s in total
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:40 [gpu_worker.py:298] Available KV cache memory: 12.15 GiB
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [kv_cache_utils.py:1087] GPU KV cache size: 132,752 tokens
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [kv_cache_utils.py:1091] Maximum concurrency for 40,960 tokens per request: 3.24x
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [utils.py:114] Connectors do not specify a kv cache layout, defaulting to NHD.
2025-11-04T12:25:41.391597Z  INFO _core::llm::block_manager::vllm::connector::worker: Auto-detected device layout from tensor shape: LayerSeparate { outer_contiguous: true }
2025-11-04T12:25:41.391681Z  INFO dynamo_llm::block_manager::distributed::worker: Initializing KvbmWorker with params: num_device_blocks=8297, page_size=16, dtype_width_bytes=2
2025-11-04T12:25:41.392177Z  INFO dynamo_llm::block_manager::distributed::worker: Inferred layout: num_layers=48, outer_dim=2, outer_contiguous=true, page_size=16, inner_dim=512
2025-11-04T12:25:41.392572Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-worker-to-leader
(EngineCore_DP0 pid=1379) 2025-11-04 12:25:41,406 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1379) 2025-11-04 12:25:41,572 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=1379) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:11<00:00,  5.81it/s]
(EngineCore_DP0 pid=1379) Capturing CUDA graphs (decode, FULL): 100%|██████████| 67/67 [00:08<00:00,  8.02it/s]
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:02 [gpu_model_runner.py:3480] Graph capturing finished in 21 secs, took 1.30 GiB
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:02 [core.py:210] init engine (profile, create kv cache, warmup model) took 135.53 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:03 [factory.py:51] Creating v1 connector with name: DynamoConnector and engine_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) WARNING 11-04 12:26:03 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
2025-11-04T12:26:03.156287Z  INFO _core::llm::block_manager::vllm::connector::leader: KvConnectorLeader initialized with worker_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
2025-11-04T12:26:03.156296Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Bound to pub: tcp://127.0.0.1:44677 and pull: tcp://127.0.0.1:45339
2025-11-04T12:26:03.156305Z  INFO dynamo_llm::block_manager::distributed::leader: Syncing leader barrier with 1 workers on barrier id kvbm-worker-to-leader
2025-11-04T12:26:03.156332Z  WARN _core::llm::block_manager::vllm::connector::leader: DYN_KVBM_METRICS_PORT not present or couldn’t be interpreted, falling back to 6880
2025-11-04T12:26:03.156346Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:03.159962Z  INFO dynamo_llm::block_manager::distributed::leader: Worker to leader barrier synced with 1 workers
2025-11-04T12:26:03.159971Z  INFO dynamo_llm::block_manager::distributed::leader: Syncing leader barrier with 1 workers on barrier id kvbm-leader-to-worker
2025-11-04T12:26:03.198154Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-leader-to-worker
2025-11-04T12:26:03.201731Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 received leader data: KvbmLeaderData { pub_url: "tcp://127.0.0.1:44677", ack_url: "tcp://127.0.0.1:45339", num_host_blocks: 2543, num_disk_blocks: 5086 }
2025-11-04T12:26:03.201764Z  INFO dynamo_llm::block_manager::distributed::leader: Worker to leader barrier synced with 1 workers
2025-11-04T12:26:04.158212Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:04.158294Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:05.159672Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:05.159727Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:06.161207Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:06.161266Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:06.857268Z  INFO dynamo_llm::block_manager::block::transfer::context: Creating pinned buffer pool: 10 buffers × 12KB each
2025-11-04T12:26:06.857298Z  INFO dynamo_llm::block_manager::block::transfer::context: Total pool memory: 0MB
2025-11-04T12:26:06.857545Z  INFO dynamo_llm::block_manager::block::transfer::context: Successfully created pinned buffer pool: 10/10 buffers allocated
2025-11-04T12:26:06.863358Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.864149Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.864565Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.870478Z  INFO dynamo_llm::block_manager::block::transfer::context: Creating pinned buffer pool: 10 buffers × 0KB each
2025-11-04T12:26:06.870489Z  INFO dynamo_llm::block_manager::block::transfer::context: Total pool memory: 0MB
2025-11-04T12:26:06.887231Z  INFO dynamo_llm::block_manager::block::transfer::context: Successfully created pinned buffer pool: 10/10 buffers allocated
2025-11-04T12:26:07.162904Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:07.162950Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:08.163792Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:08.163846Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:09.165085Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:09.165135Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:10.166471Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:10.166818Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:11.093876Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageWorker: Bound to sub: tcp://127.0.0.1:44677 and push: tcp://127.0.0.1:45339
2025-11-04T12:26:11.093966Z  INFO dynamo_llm::block_manager::distributed::worker: worker layout allocation finished.
2025-11-04T12:26:11.094002Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-leader-ready
2025-11-04T12:26:11.167655Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:11.167679Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:11.167981Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Worker ping successful. Startup complete.
2025-11-04T12:26:11.168122Z  INFO dynamo_llm::block_manager::distributed::leader: Syncing leader readiness barrier with 1 workers on barrier id kvbm-leader-ready
2025-11-04T12:26:11.252581Z  INFO _core::llm::block_manager::vllm::connector::leader: KvConnectorLeader init complete.
(APIServer pid=1240) INFO 11-04 12:26:11 [nixl_connector.py:56] NIXL is available
(APIServer pid=1240) INFO 11-04 12:26:11 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 8297
(APIServer pid=1240) INFO 11-04 12:26:11 [api_server.py:1634] Supported_tasks: ['generate']
[1;36m(APIServer pid=1240)[0;0m WARNING 11-04 12:26:11 [model.py:1389] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [serving_responses.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [serving_chat.py:139] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [api_server.py:1912] Starting vLLM API server 0 on http://0.0.0.0:8000
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:34] Available routes are:
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /openapi.json, Methods: HEAD, GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /docs, Methods: HEAD, GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: HEAD, GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /redoc, Methods: HEAD, GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /health, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /load, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /ping, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /ping, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /tokenize, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /detokenize, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/models, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /version, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/completions, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/embeddings, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /pooling, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /classify, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /score, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/score, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /rerank, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/rerank, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /v2/rerank, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /invocations, Methods: POST
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:11 [launcher.py:42] Route: /metrics, Methods: GET
[1;36m(APIServer pid=1240)[0;0m INFO:     Started server process [1240]
[1;36m(APIServer pid=1240)[0;0m INFO:     Waiting for application startup.
[1;36m(APIServer pid=1240)[0;0m INFO:     Application startup complete.
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:41 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:40720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:26:52 [loggers.py:127] Engine 000: Avg prompt throughput: 1759.2 tokens/s, Avg generation throughput: 1039.3 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.0%, Prefix cache hit rate: 1.4%
[2m2025-11-04T12:26:52.371053Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-390a35d289c5426a8d03eea062196fab
[2m2025-11-04T12:26:58.101432Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-b3048005a1b94bcdac5d64101f6331a6
[2m2025-11-04T12:26:58.117641Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-b8e7f8f81da442148387291c511d1f9f
[2m2025-11-04T12:26:58.117749Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-b821e0a4a76e4cffa11a0e27a0b1c815
[2m2025-11-04T12:26:58.117865Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-c2d3ecd675724a258d5ef54502c6c26b
[2m2025-11-04T12:26:58.117960Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-6271521528f14891ae207dfb2753d9d8
[2m2025-11-04T12:26:58.118022Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-90e71ffaf9064d94aac0080cd7440858
[2m2025-11-04T12:26:58.118083Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-a26a2f210b0c4e1a925e45c1a160021d
[2m2025-11-04T12:26:58.131886Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-01d948d770e246dc87915578989b662f
[2m2025-11-04T12:26:58.131971Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-f0fdf603606c49f08b1f560bb07f813b
[2m2025-11-04T12:26:58.132031Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-0c27a8f4d38c415098b3726b7fea55e2
[2m2025-11-04T12:26:58.132083Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-e6f022d430d346bebae52c77ce18f3d5
[2m2025-11-04T12:26:58.132121Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-53aacf26949442699139b39307b1fda1
[2m2025-11-04T12:26:58.132152Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-2af8df610f384331ae0dd90ac8f71bf2
[2m2025-11-04T12:26:58.132183Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-5ceba92386f5412c95075e45144715b2
[2m2025-11-04T12:26:58.132209Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-5abed36665764d9390ca019aae0bedeb
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:02 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 562.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 1.4%
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35878 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:35964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36024 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36092 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36132 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36144 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36204 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36284 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36434 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO:     172.17.124.32:36338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:12 [loggers.py:127] Engine 000: Avg prompt throughput: 7135.5 tokens/s, Avg generation throughput: 1504.9 tokens/s, Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 64.7%, Prefix cache hit rate: 1.4%
[2m2025-11-04T12:27:16.535413Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-e163081639e540a790dd0cc86a2a314b
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:22 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2637.1 tokens/s, Running: 63 reqs, Waiting: 0 reqs, GPU KV cache usage: 84.6%, Prefix cache hit rate: 1.4%
[2m2025-11-04T12:27:25.935344Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-63ce7b1f70204b7e98fc228812ce782a
[2m2025-11-04T12:27:31.574046Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-605ca596e1eb45209cbc674a0f0fee4f
[2m2025-11-04T12:27:31.599095Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-8f16f148bfbc4cc5820f412e8ac559d3
[2m2025-11-04T12:27:31.599286Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-74ea826a312949fcb4510f2d79f3ff9d
[2m2025-11-04T12:27:31.599461Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-5003c4c70d0f4ab48b75d57e1fa08d18
[2m2025-11-04T12:27:31.599538Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-1800453b32c3415d843311b04e313e61
[2m2025-11-04T12:27:31.599663Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-57184a8426814b4fa21d52acaa9e9438
[2m2025-11-04T12:27:31.599751Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-a8a7e843a6a24fae84ddf42a5f0eba67
[2m2025-11-04T12:27:31.599844Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-02b019fdd99a4c33bb2cf68827decf2f
[2m2025-11-04T12:27:31.625536Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-354a1f0891b04219b67ca954de7c46fc
[2m2025-11-04T12:27:31.625662Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-3e1f35ff08d94fef81dc51137cc874ea
[2m2025-11-04T12:27:31.625735Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-0660b7df3689496d8f8e2ce1ce915090
[2m2025-11-04T12:27:31.625795Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-65182cd0f7bc49adac3d025fc6d0b718
[2m2025-11-04T12:27:31.625861Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-83f9f3e9142e4e1b826c7a29a09d0707
[2m2025-11-04T12:27:31.625921Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-1bc18c91257544318a9bee54bd371133
[2m2025-11-04T12:27:31.625974Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-92dcd0fbd1aa47c48f27f8d8b574a9e2
[2m2025-11-04T12:27:31.649470Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-1221f9b30f7c456a8a09552227375ab8
[2m2025-11-04T12:27:31.649553Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-248fc50b4e82454ca137fc16b9ff07e1
[2m2025-11-04T12:27:31.649607Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-60bfded7645848788afc5adfa0e6b795
[2m2025-11-04T12:27:31.649650Z[0m [32m INFO[0m [2m_core::llm::block_manager::vllm::connector::leader::slot[0m[2m:[0m request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 [3mrequest_id[0m[2m=[0mchatcmpl-6f2426ac6dd34bddb84cd6dbab6d8f22
[2m2025-11-04T12:27:31.649688Z[0m [32m INFO[0m 
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:32 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2318.8 tokens/s, Running: 0 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 56.5%
[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:42 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 71.4%

```
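
For anyone triaging from the log above: the hang shows up in the final `loggers.py` line, where generation throughput is 0.0 tokens/s, `Running` is 0, `Waiting` is 2, and GPU KV cache usage stays at 98.8%. Below is a minimal sketch for spotting that pattern automatically in a captured log. It is not part of vLLM or Dynamo; `parse_stats` and `is_stalled` are hypothetical helper names, and the regular expression assumes the stats line format shown in this log.

```python
import re

# Leftover ANSI color codes such as "[1;36m" or "[0m"; the ESC byte may be missing.
ANSI_RE = re.compile(r"\x1b?\[[0-9;]+m")

# Fields of interest in the periodic engine stats line (format assumed from this log).
STATS_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def parse_stats(line):
    """Return engine stats parsed from one log line, or None if it is not a stats line."""
    match = STATS_RE.search(ANSI_RE.sub("", line))
    if match is None:
        return None
    return {
        "gen_tps": float(match["gen"]),
        "running": int(match["running"]),
        "waiting": int(match["waiting"]),
        "kv_usage_pct": float(match["kv"]),
    }


def is_stalled(stats):
    """True when requests are queued but nothing is running or generating."""
    return stats["waiting"] > 0 and stats["running"] == 0 and stats["gen_tps"] == 0.0


# Last stats line from the log in this issue.
sample = (
    "[1;36m(APIServer pid=1240)[0;0m INFO 11-04 12:27:42 [loggers.py:127] "
    "Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 2 reqs, "
    "GPU KV cache usage: 98.8%, Prefix cache hit rate: 71.4%"
)

stats = parse_stats(sample)
print(stats, is_stalled(stats))
```

Running each line of a saved log through `parse_stats` and `is_stalled` gives the timestamp at which the engine stopped scheduling, which can then be lined up with the last `request set to finish` messages from the KVBM connector.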

### Screenshots

<img width="1503" height="844" alt="Image" src="https://github.com/user-attachments/assets/98fb7de7-1909-4d15-8fac-dba8b98d8a30" />
