Open
Labels
bug (Something isn't working)
Description
Describe the Bug
When I tested running KVBM in vLLM, the vLLM service hung during a benchmark with random data (batch_size: 64, input_len: 1024, output_len: 1024). After responding to some requests, it stopped responding and GPU utilization dropped to 0.
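A minimal sketch of the kind of load described above, for anyone trying to reproduce it without the original benchmark tool. It assumes the server listens on vLLM's default `http://localhost:8000` and serves the OpenAI-compatible `/v1/completions` route. The helper names, the token-id range, and the use of `ignore_eos` (assumed here to be accepted as a vLLM-specific extra parameter) are illustrative choices, not taken from the report.

```python
import json
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE_URL = "http://localhost:8000"  # assumption: vLLM's default port
MODEL = "atom"                      # --served-model-name from the report


def build_payload(rng, input_len=1024, output_len=1024):
    """Build one completion request whose prompt is random token ids."""
    # Assumption: ids in [1000, 100000) are valid for the model's vocabulary.
    prompt_ids = [rng.randrange(1000, 100000) for _ in range(input_len)]
    return {
        "model": MODEL,
        "prompt": prompt_ids,
        "max_tokens": output_len,
        "ignore_eos": True,  # assumed vLLM extension: keep generating to max_tokens
    }


def send(payload, timeout_s=600):
    """POST one request; a hang shows up as a timeout error instead of blocking forever."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)


def run(batch_size=64, seed=0):
    """Send batch_size requests concurrently and collect the responses."""
    rng = random.Random(seed)
    payloads = [build_payload(rng) for _ in range(batch_size)]
    with ThreadPoolExecutor(max_workers=batch_size) as pool:
        return list(pool.map(send, payloads))
```

Calling `run()` against a live server sends 64 concurrent requests of 1024 input and 1024 output tokens each. The per-request timeout is there so that a stalled server surfaces as an exception rather than a silent wait.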
Steps to Reproduce
Start the container:
root@llmops-tdc01:/llmops-data/zty/repo/dynamo# ./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds -v /mnt/llm32/:/models --image harbor.transwarp.io/aip/vllm-runtime:0.6.0
+ docker run --gpus all -it --rm --network host --runtime nvidia --shm-size=10G --ulimit memlock=-1 --ulimit stack=67108864 --ulimit nofile=65536:65536 -e HF_TOKEN -v /mnt/llm32/:/models -v /llmops-data/zty/repo/dynamo/container/..:/workspace -v /llmops-data/zty/tmp:/tmp -v /mnt/:/mnt -v /llmops-data/zty/repo/dynamo/container/.cache/huggingface:/root/.cache/huggingface -v /run/udev:/run/udev:ro -w /workspace --cap-add CAP_SYS_PTRACE --cap-add=IPC_LOCK --ipc host --privileged harbor.transwarp.io/aip/vllm-runtime:0.6.0
Start vllm serve:
DYN_KVBM_CPU_CACHE_GB=4 DYN_KVBM_DISK_CACHE_GB=8 CUDA_VISIBLE_DEVICES=1 vllm serve /models/Qwen/Qwen3-30B-A3B/ --served-model-name=atom -tp 1 --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "dynamo.llm.vllm_integration.connector"}'
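The same launch can be expressed programmatically, which makes it easier to vary the cache sizes while narrowing the problem down. This is only a sketch: `build_launch` is a hypothetical helper, and every value in it is copied from the command above.

```python
import json
import shlex


def build_launch(model_path="/models/Qwen/Qwen3-30B-A3B/",
                 cpu_cache_gb=4, disk_cache_gb=8, gpu="1"):
    """Return (env, argv) equivalent to the vllm serve command in this report."""
    kv_transfer_config = {
        "kv_connector": "DynamoConnector",
        "kv_role": "kv_both",
        "kv_connector_module_path": "dynamo.llm.vllm_integration.connector",
    }
    env = {
        "DYN_KVBM_CPU_CACHE_GB": str(cpu_cache_gb),
        "DYN_KVBM_DISK_CACHE_GB": str(disk_cache_gb),
        "CUDA_VISIBLE_DEVICES": gpu,
    }
    argv = [
        "vllm", "serve", model_path,
        "--served-model-name=atom",
        "-tp", "1",
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    return env, argv


env, argv = build_launch()
# Print a shell-quoted command line for copy and paste.
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))
```

To launch directly from Python, the pair can be passed to `subprocess.Popen(argv, env={**os.environ, **env})`.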
Expected Behavior
vLLM keeps serving requests and the benchmark runs to completion.
Actual Behavior
vllm serve hangs.
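One way to tell a hang from a slow run is to compare two snapshots of the server's Prometheus metrics. The sketch below assumes the server exposes `/metrics` on `http://localhost:8000` and that the metric names `vllm:num_requests_running`, `vllm:num_requests_waiting`, and `vllm:generation_tokens_total` are present in this vLLM version. The function names are hypothetical, and the parser is simplified (it assumes no trailing timestamps on sample lines).

```python
import urllib.request

WATCHED = (
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:generation_tokens_total",
)


def fetch_metrics(base_url="http://localhost:8000", timeout_s=5):
    """Download the Prometheus text exposition from the server."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout_s) as resp:
        return resp.read().decode()


def parse_gauges(metrics_text, names=WATCHED):
    """Sum the samples of each named metric across all label sets."""
    totals = {name: 0.0 for name in names}
    for line in metrics_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        metric = series.split("{", 1)[0]
        if metric in totals:
            totals[metric] += float(value)
    return totals


def looks_stalled(before, after):
    """Heuristic: requests are still in flight but no new tokens were generated."""
    in_flight = (after["vllm:num_requests_running"]
                 + after["vllm:num_requests_waiting"])
    no_progress = (after["vllm:generation_tokens_total"]
                   <= before["vllm:generation_tokens_total"])
    return in_flight > 0 and no_progress
```

Taking `parse_gauges(fetch_metrics())` twice, some seconds apart, and passing both results to `looks_stalled` gives a rough signal. If `/metrics` itself times out, that is also useful information about where the server is stuck.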
Environment
H100 GPU
Ubuntu 22.04
Additional Context
Log output:
/opt/dynamo/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO 11-04 12:23:12 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1240) INFO 11-04 12:23:17 [api_server.py:1839] vLLM API server version 0.11.0
(APIServer pid=1240) INFO 11-04 12:23:17 [utils.py:233] non-default args: {'model_tag': '/models/Qwen/Qwen3-30B-A3B/', 'model': '/models/Qwen/Qwen3-30B-A3B/', 'served_model_name': ['atom'], 'kv_transfer_config': KVTransferConfig(kv_connector='DynamoConnector', engine_id='ec002b7b-674b-45d1-9cf8-5511bb5654bd', kv_buffer_device='cuda', kv_buffer_size=1000000000.0, kv_role='kv_both', kv_rank=None, kv_parallel_size=1, kv_ip='127.0.0.1', kv_port=14579, kv_connector_extra_config={}, kv_connector_module_path='dynamo.llm.vllm_integration.connector')}
(APIServer pid=1240) INFO 11-04 12:23:17 [model.py:547] Resolved architecture: Qwen3MoeForCausalLM
(APIServer pid=1240) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1240) INFO 11-04 12:23:17 [model.py:1510] Using max model len 40960
(APIServer pid=1240) INFO 11-04 12:23:17 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=8192.
/opt/dynamo/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
INFO 11-04 12:23:22 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:27 [core.py:644] Waiting for init message from front-end.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:27 [core.py:77] Initializing a V1 LLM engine (v0.11.0) with config: model='/models/Qwen/Qwen3-30B-A3B/', speculative_config=None, tokenizer='/models/Qwen/Qwen3-30B-A3B/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=atom, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention","vllm.sparse_attn_indexer"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=1379) W1104 12:23:28.668000 1379 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_DP0 pid=1379) W1104 12:23:28.668000 1379 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [nixl_connector.py:56] NIXL is available
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [factory.py:51] Creating v1 connector with name: DynamoConnector and engine_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) WARNING 11-04 12:23:29 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
2025-11-04T12:23:29.663106Z  INFO _core::llm::block_manager::vllm::connector::worker: KvConnectorWorker initialized with worker_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:29 [topk_topp_sampler.py:55] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [gpu_model_runner.py:2602] Starting to load model /models/Qwen/Qwen3-30B-A3B/...
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [gpu_model_runner.py:2634] Loading model from scratch...
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:30 [cuda.py:366] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards: 100% Completed | 16/16 [00:15<00:00, 1.03it/s]
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:46 [default_loader.py:267] Loading weights took 15.67 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:47 [gpu_model_runner.py:2653] Model loading took 56.8814 GiB and 16.046670 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:57 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/b435bf2ea3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:57 [backends.py:559] Dynamo bytecode transform time: 9.94 s
(EngineCore_DP0 pid=1379) INFO 11-04 12:23:58 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=1379) INFO 11-04 12:24:34 [backends.py:218] Compiling a graph for dynamic shape takes 36.33 s
(EngineCore_DP0 pid=1379) WARNING 11-04 12:24:35 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/opt/dynamo/venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H100_80GB_HBM3.json']
(EngineCore_DP0 pid=1379) INFO 11-04 12:24:39 [monitor.py:34] torch.compile takes 46.27 s in total
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:40 [gpu_worker.py:298] Available KV cache memory: 12.15 GiB
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [kv_cache_utils.py:1087] GPU KV cache size: 132,752 tokens
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [kv_cache_utils.py:1091] Maximum concurrency for 40,960 tokens per request: 3.24x
(EngineCore_DP0 pid=1379) INFO 11-04 12:25:41 [utils.py:114] Connectors do not specify a kv cache layout, defaulting to NHD.
2025-11-04T12:25:41.391597Z  INFO _core::llm::block_manager::vllm::connector::worker: Auto-detected device layout from tensor shape: LayerSeparate { outer_contiguous: true }
2025-11-04T12:25:41.391681Z  INFO dynamo_llm::block_manager::distributed::worker: Initializing KvbmWorker with params: num_device_blocks=8297, page_size=16, dtype_width_bytes=2
2025-11-04T12:25:41.392177Z  INFO dynamo_llm::block_manager::distributed::worker: Inferred layout: num_layers=48, outer_dim=2, outer_contiguous=true, page_size=16, inner_dim=512
2025-11-04T12:25:41.392572Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-worker-to-leader
(EngineCore_DP0 pid=1379) 2025-11-04 12:25:41,406 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1379) 2025-11-04 12:25:41,572 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:11<00:00, 5.81it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 67/67 [00:08<00:00, 8.02it/s]
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:02 [gpu_model_runner.py:3480] Graph capturing finished in 21 secs, took 1.30 GiB
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:02 [core.py:210] init engine (profile, create kv cache, warmup model) took 135.53 seconds
(EngineCore_DP0 pid=1379) INFO 11-04 12:26:03 [factory.py:51] Creating v1 connector with name: DynamoConnector and engine_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
(EngineCore_DP0 pid=1379) WARNING 11-04 12:26:03 [base.py:86] Initializing KVConnectorBase_V1. This API is experimental and subject to change in the future as we iterate the design.
2025-11-04T12:26:03.156287Z  INFO _core::llm::block_manager::vllm::connector::leader: KvConnectorLeader initialized with worker_id: ec002b7b-674b-45d1-9cf8-5511bb5654bd
2025-11-04T12:26:03.156296Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Bound to pub: tcp://127.0.0.1:44677 and pull: tcp://127.0.0.1:45339
2025-11-04T12:26:03.156305Z  INFO dynamo_llm::block_manager::distributed::leader: Syncing leader barrier with 1 workers on barrier id kvbm-worker-to-leader
2025-11-04T12:26:03.156332Z  WARN _core::llm::block_manager::vllm::connector::leader: DYN_KVBM_METRICS_PORT not present or couldn’t be interpreted, falling back to 6880
2025-11-04T12:26:03.156346Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:03.159962Z  INFO dynamo_llm::block_manager::distributed::leader: Worker to leader barrier synced with 1 workers
2025-11-04T12:26:03.159971Z  INFO dynamo_llm::block_manager::distributed::leader: Syncing leader barrier with 1 workers on barrier id kvbm-leader-to-worker
2025-11-04T12:26:03.198154Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-leader-to-worker
2025-11-04T12:26:03.201731Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 received leader data: KvbmLeaderData { pub_url: "tcp://127.0.0.1:44677", ack_url: "tcp://127.0.0.1:45339", num_host_blocks: 2543, num_disk_blocks: 5086 }
2025-11-04T12:26:03.201764Z  INFO dynamo_llm::block_manager::distributed::leader: Worker to leader barrier synced with 1 workers
2025-11-04T12:26:04.158212Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:04.158294Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:05.159672Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:05.159727Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:06.161207Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:06.161266Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:06.857268Z  INFO dynamo_llm::block_manager::block::transfer::context: Creating pinned buffer pool: 10 buffers × 12KB each
2025-11-04T12:26:06.857298Z  INFO dynamo_llm::block_manager::block::transfer::context: Total pool memory: 0MB
2025-11-04T12:26:06.857545Z  INFO dynamo_llm::block_manager::block::transfer::context: Successfully created pinned buffer pool: 10/10 buffers allocated
2025-11-04T12:26:06.863358Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.864149Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.864565Z  INFO dynamo_llm::block_manager::pool::managed: building block pool
2025-11-04T12:26:06.870478Z  INFO dynamo_llm::block_manager::block::transfer::context: Creating pinned buffer pool: 10 buffers × 0KB each
2025-11-04T12:26:06.870489Z  INFO dynamo_llm::block_manager::block::transfer::context: Total pool memory: 0MB
2025-11-04T12:26:06.887231Z  INFO dynamo_llm::block_manager::block::transfer::context: Successfully created pinned buffer pool: 10/10 buffers allocated
2025-11-04T12:26:07.162904Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:07.162950Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:08.163792Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:08.163846Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:09.165085Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:09.165135Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:10.166471Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:10.166818Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:11.093876Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageWorker: Bound to sub: tcp://127.0.0.1:44677 and push: tcp://127.0.0.1:45339
2025-11-04T12:26:11.093966Z  INFO dynamo_llm::block_manager::distributed::worker: worker layout allocation finished.
2025-11-04T12:26:11.094002Z  INFO dynamo_llm::block_manager::distributed::worker: Worker 7587890605597737418 waiting on barrier kvbm-leader-ready
2025-11-04T12:26:11.167655Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Ping timed out. Retrying...
2025-11-04T12:26:11.167679Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Pinging workers...
2025-11-04T12:26:11.167981Z  INFO dynamo_llm::block_manager::distributed::zmq: ZmqActiveMessageLeader: Worker ping successful. Startup complete.
�[2m2025-11-04T12:26:11.168122Z�[0m �[32m INFO�[0m �[2mdynamo_llm::block_manager::distributed::leader�[0m�[2m:�[0m Syncing leader readiness barrier with 1 workers on barrier id kvbm-leader-ready
�[2m2025-11-04T12:26:11.252581Z�[0m �[32m INFO�[0m �[2m_core::llm::block_manager::vllm::connector::leader�[0m�[2m:�[0m KvConnectorLeader init complete.
(APIServer pid=1240) INFO 11-04 12:26:11 [nixl_connector.py:56] NIXL is available
(APIServer pid=1240) INFO 11-04 12:26:11 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 8297
(APIServer pid=1240) INFO 11-04 12:26:11 [api_server.py:1634] Supported_tasks: ['generate']
(APIServer pid=1240) WARNING 11-04 12:26:11 [model.py:1389] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1240) INFO 11-04 12:26:11 [serving_responses.py:137] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1240) INFO 11-04 12:26:11 [serving_chat.py:139] Using default chat sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1240) INFO 11-04 12:26:11 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}
(APIServer pid=1240) INFO 11-04 12:26:11 [api_server.py:1912] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:34] Available routes are:
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /docs, Methods: HEAD, GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /health, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /load, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /ping, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /ping, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /tokenize, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /detokenize, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/models, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /version, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/completions, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/embeddings, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /pooling, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /classify, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /score, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/score, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /rerank, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v1/rerank, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /v2/rerank, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /invocations, Methods: POST
(APIServer pid=1240) INFO 11-04 12:26:11 [launcher.py:42] Route: /metrics, Methods: GET
(APIServer pid=1240) INFO: Started server process [1240]
(APIServer pid=1240) INFO: Waiting for application startup.
(APIServer pid=1240) INFO: Application startup complete.
(APIServer pid=1240) INFO 11-04 12:26:41 [chat_utils.py:560] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1240) INFO: 172.17.124.32:40596 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40600 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40602 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40606 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40622 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40626 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40632 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40644 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40650 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40658 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40672 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40686 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40692 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40694 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40704 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:40720 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO 11-04 12:26:52 [loggers.py:127] Engine 000: Avg prompt throughput: 1759.2 tokens/s, Avg generation throughput: 1039.3 tokens/s, Running: 16 reqs, Waiting: 0 reqs, GPU KV cache usage: 21.0%, Prefix cache hit rate: 1.4%
2025-11-04T12:26:52.371053Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-390a35d289c5426a8d03eea062196fab
2025-11-04T12:26:58.101432Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-b3048005a1b94bcdac5d64101f6331a6
2025-11-04T12:26:58.117641Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-b8e7f8f81da442148387291c511d1f9f
2025-11-04T12:26:58.117749Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-b821e0a4a76e4cffa11a0e27a0b1c815
2025-11-04T12:26:58.117865Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-c2d3ecd675724a258d5ef54502c6c26b
2025-11-04T12:26:58.117960Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-6271521528f14891ae207dfb2753d9d8
2025-11-04T12:26:58.118022Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-90e71ffaf9064d94aac0080cd7440858
2025-11-04T12:26:58.118083Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-a26a2f210b0c4e1a925e45c1a160021d
2025-11-04T12:26:58.131886Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-01d948d770e246dc87915578989b662f
2025-11-04T12:26:58.131971Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-f0fdf603606c49f08b1f560bb07f813b
2025-11-04T12:26:58.132031Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-0c27a8f4d38c415098b3726b7fea55e2
2025-11-04T12:26:58.132083Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-e6f022d430d346bebae52c77ce18f3d5
2025-11-04T12:26:58.132121Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-53aacf26949442699139b39307b1fda1
2025-11-04T12:26:58.132152Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-2af8df610f384331ae0dd90ac8f71bf2
2025-11-04T12:26:58.132183Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-5ceba92386f5412c95075e45144715b2
2025-11-04T12:26:58.132209Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-5abed36665764d9390ca019aae0bedeb
(APIServer pid=1240) INFO 11-04 12:27:02 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 562.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 1.4%
(APIServer pid=1240) INFO: 172.17.124.32:35858 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35864 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35878 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35884 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35896 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35912 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35922 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35946 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35952 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35960 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35974 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35990 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36000 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36006 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:35964 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36010 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36024 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36036 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36040 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36018 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36048 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36078 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36064 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36092 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36104 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36128 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36118 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36134 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36132 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36160 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36174 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36144 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36204 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36192 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36182 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36216 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36218 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36250 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36226 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36230 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36256 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36240 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36278 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36268 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36284 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36190 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36314 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36292 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36300 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36416 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36370 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36426 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36354 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36430 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36122 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36386 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36434 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36400 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO: 172.17.124.32:36338 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1240) INFO 11-04 12:27:12 [loggers.py:127] Engine 000: Avg prompt throughput: 7135.5 tokens/s, Avg generation throughput: 1504.9 tokens/s, Running: 64 reqs, Waiting: 0 reqs, GPU KV cache usage: 64.7%, Prefix cache hit rate: 1.4%
2025-11-04T12:27:16.535413Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-e163081639e540a790dd0cc86a2a314b
(APIServer pid=1240) INFO 11-04 12:27:22 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2637.1 tokens/s, Running: 63 reqs, Waiting: 0 reqs, GPU KV cache usage: 84.6%, Prefix cache hit rate: 1.4%
2025-11-04T12:27:25.935344Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-63ce7b1f70204b7e98fc228812ce782a
2025-11-04T12:27:31.574046Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-605ca596e1eb45209cbc674a0f0fee4f
2025-11-04T12:27:31.599095Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-8f16f148bfbc4cc5820f412e8ac559d3
2025-11-04T12:27:31.599286Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-74ea826a312949fcb4510f2d79f3ff9d
2025-11-04T12:27:31.599461Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-5003c4c70d0f4ab48b75d57e1fa08d18
2025-11-04T12:27:31.599538Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-1800453b32c3415d843311b04e313e61
2025-11-04T12:27:31.599663Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-57184a8426814b4fa21d52acaa9e9438
2025-11-04T12:27:31.599751Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-a8a7e843a6a24fae84ddf42a5f0eba67
2025-11-04T12:27:31.599844Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-02b019fdd99a4c33bb2cf68827decf2f
2025-11-04T12:27:31.625536Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-354a1f0891b04219b67ca954de7c46fc
2025-11-04T12:27:31.625662Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-3e1f35ff08d94fef81dc51137cc874ea
2025-11-04T12:27:31.625735Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-0660b7df3689496d8f8e2ce1ce915090
2025-11-04T12:27:31.625795Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-65182cd0f7bc49adac3d025fc6d0b718
2025-11-04T12:27:31.625861Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-83f9f3e9142e4e1b826c7a29a09d0707
2025-11-04T12:27:31.625921Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-1bc18c91257544318a9bee54bd371133
2025-11-04T12:27:31.625974Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-92dcd0fbd1aa47c48f27f8d8b574a9e2
2025-11-04T12:27:31.649470Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-1221f9b30f7c456a8a09552227375ab8
2025-11-04T12:27:31.649553Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-248fc50b4e82454ca137fc16b9ff07e1
2025-11-04T12:27:31.649607Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-60bfded7645848788afc5adfa0e6b795
2025-11-04T12:27:31.649650Z  INFO _core::llm::block_manager::vllm::connector::leader::slot: request set to finish: cached_gpu_tokens: 0; cached_host_tokens: 0; cached_disk_tokens: 0 request_id=chatcmpl-6f2426ac6dd34bddb84cd6dbab6d8f22
2025-11-04T12:27:31.649688Z  INFO
(APIServer pid=1240) INFO 11-04 12:27:32 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2318.8 tokens/s, Running: 0 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 56.5%
(APIServer pid=1240) INFO 11-04 12:27:42 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 2 reqs, GPU KV cache usage: 98.8%, Prefix cache hit rate: 71.4%
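The hang shows up in the final engine stats line: 2 requests are waiting, none are running, generation throughput is 0.0 tokens/s, and GPU KV cache usage stays at 98.8%. Below is a minimal sketch for flagging that pattern in a captured log, assuming the stats line format shown above. `STATS_RE` and `find_stalls` are hypothetical helper names, not part of vLLM or Dynamo.

```python
import re

# Assumption: matches the periodic "Engine 000: ..." stats lines as they
# appear in this log; other vLLM versions may format them differently.
STATS_RE = re.compile(
    r"Avg generation throughput: (?P<gen>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv>[\d.]+)%"
)


def find_stalls(lines):
    """Return one dict per stats sample where requests are waiting
    but nothing is running and no tokens are being generated."""
    stalls = []
    for line in lines:
        m = STATS_RE.search(line)
        if m is None:
            continue  # not an engine stats line
        gen = float(m["gen"])
        running = int(m["running"])
        waiting = int(m["waiting"])
        if running == 0 and waiting > 0 and gen == 0.0:
            stalls.append({"waiting": waiting, "kv_usage_pct": float(m["kv"])})
    return stalls


# The last two stats lines from the log above.
sample = [
    "INFO 11-04 12:27:32 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 2318.8 tokens/s, Running: 0 reqs, Waiting: 2 reqs, "
    "GPU KV cache usage: 98.8%, Prefix cache hit rate: 56.5%",
    "INFO 11-04 12:27:42 [loggers.py:127] Engine 000: Avg prompt throughput: 0.0 tokens/s, "
    "Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 2 reqs, "
    "GPU KV cache usage: 98.8%, Prefix cache hit rate: 71.4%",
]

print(find_stalls(sample))
```

Only the 12:27:42 sample is flagged. The 12:27:32 line still reports a non-zero average generation throughput for the preceding interval.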
Screenshots

AniketKul