
Conversation

@heheda12345 (Collaborator) commented Sep 16, 2025

Purpose

The hybrid allocator currently requires every layer to have the same physical memory per block. But models like #24916 (bf16 for sliding window attention and fp8 for full attention) will have different physical memory per block across layers.

This PR supports such cases by giving different layers different block_sizes so that the physical memory per block is the same for every layer. For now, it requires that one layer's memory per block is a multiple of the other's.
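
As a rough illustration of that rule, here is a minimal sketch (not the PR's actual code; the helper name choose_block_sizes and the byte counts below are illustrative assumptions) of how per-layer block sizes can be scaled so that every layer ends up with the same page size in bytes:

    # A minimal sketch, not the PR's actual code: the helper name and the
    # example byte counts are assumptions for illustration only.
    def choose_block_sizes(bytes_per_token_per_layer: dict[str, int],
                           base_block_size: int = 16) -> dict[str, int]:
        """Scale each layer's block_size so that page_size_bytes
        (block_size * bytes_per_token) is identical for every layer."""
        max_bytes = max(bytes_per_token_per_layer.values())
        block_sizes = {}
        for layer, bytes_per_token in bytes_per_token_per_layer.items():
            # The constraint stated above: one layer's memory per block must
            # be a multiple of the other's.
            assert max_bytes % bytes_per_token == 0
            # A cheaper dtype packs more tokens into a page of the same size.
            block_sizes[layer] = base_block_size * (max_bytes // bytes_per_token)
        return block_sizes

    # Example: a bf16 sliding-window layer (2 bytes/element) vs. an fp8
    # full-attention layer (1 byte/element) with the same head layout.
    print(choose_block_sizes({"full_attn_fp8": 1 * 128, "swa_bf16": 2 * 128}))
    # -> {'full_attn_fp8': 32, 'swa_bf16': 16}

With bf16 sliding window attention and fp8 full attention as in #24916, the fp8 layers get block_size 32 while the bf16 layers keep block_size 16, so one block occupies the same number of bytes in both groups.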

To support prefix caching, we need:

  1. Make sure the prefix cache hit length is a multiple of all block_sizes. This is achieved by introducing an alignment factor in get_longest_cache_hit and setting the alignment requirement to the LCM of all block_sizes.
  2. Generate the BlockHash for each block size from the BlockHash computed with block_size=cache_config.block_size.

For 2, we can derive the block hash for a larger block_size from the hashes for a smaller block_size. For example, given the block hashes with block_size 16, we can get the block hashes with block_size 32 by combining each pair of adjacent block_size-16 hash values into a single block_size-32 hash value (a sketch of both steps follows the tables below):

block_hash with block_size 16:

Token: 0-15   16-31   32-47   48-63
Hash:  A      B       C       D

block_hash with block_size 32:

Token: 0-31   32-63
Hash:  AB     CD
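
The following minimal sketch illustrates both steps. It is illustrative only: the helper names (align_hit_length, merge_block_hashes) and the use of SHA-256 for combining hashes are assumptions, not the PR's actual utilities.

    import hashlib
    from functools import reduce
    from math import gcd

    def align_hit_length(hit_length: int, block_sizes: list[int]) -> int:
        """Step 1: truncate a prefix-cache hit so its length is a multiple of
        the LCM of all block_sizes, i.e. usable by every KV cache group."""
        alignment = reduce(lambda a, b: a * b // gcd(a, b), block_sizes)
        return hit_length // alignment * alignment

    def merge_block_hashes(small_hashes: list[bytes], ratio: int) -> list[bytes]:
        """Step 2: derive block hashes for block_size = ratio * small_block_size
        by combining each group of `ratio` consecutive smaller-block hashes."""
        return [
            hashlib.sha256(b"".join(small_hashes[i:i + ratio])).digest()
            for i in range(0, len(small_hashes) - ratio + 1, ratio)
        ]

    # Hashes A, B, C, D for block_size 16 become AB, CD for block_size 32.
    small = [hashlib.sha256(t).digest() for t in (b"A", b"B", b"C", b"D")]
    assert len(merge_block_hashes(small, ratio=2)) == 2
    # A 48-token hit aligned to LCM(16, 32) = 32 is truncated to 32 tokens.
    assert align_hit_length(48, [16, 32]) == 32

The key point is that a hit computed with the smaller block size can always be re-expressed in terms of the larger block size, because the alignment guarantees the hit ends on a boundary of every block size.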

Note: for non-hybrid models with a different hidden size per layer, like #22432, we may still keep the block size the same for all layers. I plan to handle that in a future PR.

Test Plan

Set the KV cache dtype of either sliding window attention or full attention to fp8 and run

test_correctness_sliding_window.py::test_sliding_window_retrieval[False-1-5-google/gemma-3-1b-it] 

Also run the necessary unit tests.

Test Result

Success

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the v1 label Sep 16, 2025
Signed-off-by: Chen Zhang <[email protected]>
f"num_heads ({num_heads}) is not " \
f"divisible by num_kv_heads ({num_kv_heads})"

# TODO in this PR: only for testing now. remove this hardcode later
Collaborator Author:

self reminder: remove this

for kv_cache_config in kv_cache_configs:
    kv_cache_config.num_blocks = min_num_blocks
# TODO: remove this print
print("kv_cache_configs", kv_cache_configs[0])
Collaborator Author:

self reminder: remove this

attn_layers = get_layers_from_vllm_config(self.vllm_config, Attention)

# TODO in this PR: revert this
def get_torch_dtype(kv_cache_dtype: str) -> torch.dtype:
Collaborator Author:

self reminder: remove this and do it in a future pr

@heheda12345 (Collaborator Author):

Wait, this PR only supports bf16 for full attention and fp8 for sliding window. Trying to fix fp8 for full attention and bf16 for sliding window.

@heheda12345 (Collaborator Author):

> Wait, this PR only supports bf16 for full attention and fp8 for sliding window. Trying to fix fp8 for full attention and bf16 for sliding window.

Fixed.

Signed-off-by: Chen Zhang <[email protected]>
@heheda12345 heheda12345 changed the title [Hybrid Allocator] Support layers with different hidden size [Hybrid Allocator] Support KV cache groups with different block_size Sep 16, 2025
block_size=32)),
],
)

Contributor:

Would you mind adding a test for the mixed dtype case?

    # Different dtype, align by using different block size
    kv_cache_specs_hybrid = {
        'layer_1': new_kv_cache_spec(dtype=torch.float8_e4m3fn),
        'layer_2': new_sliding_window_spec(dtype=torch.bfloat16),
    }
    kv_cache_config_hybrid = get_kv_cache_configs(
        vllm_config, [kv_cache_specs_hybrid],
        [mem_per_block_per_layer * 32])[0]
    assert kv_cache_config_hybrid == KVCacheConfig(
        num_blocks=32 * 2, # 2x blocks because baseline is BF16 (not FP32)
        kv_cache_tensors=[
            KVCacheTensor(size=mem_per_block_per_layer * 32,
                          shared_by=["layer_1", "layer_2"]),
        ],
        kv_cache_groups=[
            KVCacheGroupSpec(["layer_1"], new_kv_cache_spec(dtype=torch.float8_e4m3fn, block_size=32)),
            KVCacheGroupSpec(["layer_2"],
                             new_sliding_window_spec(dtype=torch.bfloat16,
                                                     block_size=16)),
        ],
    )

Contributor:

Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec here, I'd say.

Collaborator Author:

> Would you mind adding a test for the mixed dtype case?

From the perspective of this PR, there is no difference between the mixed dtype and the mixed head size cases. Feel free to add tests when you are working on mixed dtype support.

Collaborator Author:

> Similarly, this could use new_kv_cache_spec, as there is nothing specific to new_sliding_window_spec here, I'd say.

For models with only full attention, we can take a much simpler path because we don't need to ensure all layers have the same page_size_bytes. I'm working on that in another PR.


mergify bot commented Sep 21, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @heheda12345.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Sep 21, 2025
@josephrocca commented Oct 2, 2025

Edit: I just remembered that DCP doesn't currently support FP8 KV cache, so it seems likely that that's the issue here?


Unsure if this is expected, since IIUC this PR is not yet finished, but I get a crash during startup at `assert hash_block_size == self.kv_cache_spec.block_size` when testing this PR with -dcp 4 like so:

git clone --branch two_dtype_kv_cache https://github.com/heheda12345/vllm && cd vllm && git reset --hard aaf8bc9366fa270dc0b5eea81dec3a01206bd6ef
VLLM_USE_PRECOMPILED=1 uv pip install --editable .[flashinfer]
vllm serve RedHatAI/DeepSeek-R1-0528-quantized.w4a16 --tensor-parallel-size 4 -dcp 4 --served-model-name default --max-model-len 9216 --kv-cache-dtype fp8_e4m3

It works fine without -dcp 4. It might be a bug with -dcp rather than this PR.

I've only tested on a 4xH200 machine.

Click here for crash logs
/root/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 10-02 06:59:56 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=55382) INFO 10-02 06:59:58 [api_server.py:1911] vLLM API server version 0.1.dev9506+gaaf8bc936
(APIServer pid=55382) INFO 10-02 06:59:58 [utils.py:328] non-default args: {'model_tag': 'RedHatAI/DeepSeek-R1-0528-quantized.w4a16', 'host': '0.0.0.0', 'port': 4000, 'model': 'RedHatAI/DeepSeek-R1-0528-quantized.w4a16', 'max_model_len': 9216, 'served_model_name': ['default'], 'tensor_parallel_size': 4, 'decode_context_parallel_size': 4, 'kv_cache_dtype': 'fp8_e4m3'}
(APIServer pid=55382) INFO 10-02 07:00:05 [__init__.py:706] Resolved architecture: DeepseekV3ForCausalLM
(APIServer pid=55382) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=55382) INFO 10-02 07:00:05 [__init__.py:1782] Using max model len 9216
(APIServer pid=55382) INFO 10-02 07:00:05 [cache.py:174] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=55382) INFO 10-02 07:00:05 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=55382) INFO 10-02 07:00:06 [cuda.py:174] Forcing kv cache block size to 64 for FlashMLA backend.
/root/venv/lib/python3.12/site-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
INFO 10-02 07:00:09 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=55692) INFO 10-02 07:00:12 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=55692) INFO 10-02 07:00:12 [core.py:75] Initializing a V1 LLM engine (v0.1.dev9506+gaaf8bc936) with config: model='RedHatAI/DeepSeek-R1-0528-quantized.w4a16', speculative_config=None, tokenizer='RedHatAI/DeepSeek-R1-0528-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=9216, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=default, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=55692) WARNING 10-02 07:00:12 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 176 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=55692) INFO 10-02 07:00:12 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_50b22551'), local_subscribe_addr='ipc:///tmp/23eee754-53b0-46a3-9d64-b18be4d99e16', remote_subscribe_addr=None, remote_addr_ipv6=False)
[... repeated pynvml deprecation warnings and "Automatically detected platform cuda" lines from the four worker processes elided ...]
W1002 07:00:17.496000 55841 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1002 07:00:17.496000 55841 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1002 07:00:17.496000 55842 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1002 07:00:17.496000 55842 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1002 07:00:17.501000 55843 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1002 07:00:17.501000 55843 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W1002 07:00:17.504000 55844 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W1002 07:00:17.504000 55844 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 10-02 07:00:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0911e17a'), local_subscribe_addr='ipc:///tmp/26e19166-8cb8-4612-9642-7eeaf8adbad2', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-02 07:00:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_4394977f'), local_subscribe_addr='ipc:///tmp/9aaf23ba-3d42-4e97-a821-5199effb2be4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-02 07:00:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_8ffeaa41'), local_subscribe_addr='ipc:///tmp/ee4394b3-b8cf-4ede-9ceb-8aa5e152341d', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-02 07:00:19 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_1325f30e'), local_subscribe_addr='ipc:///tmp/4043c4b9-7faf-4c4c-bd7a-cebf9ed78bc8', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W1002 07:00:19.317282572 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1002 07:00:20.559870191 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1002 07:00:20.818073215 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1002 07:00:20.824695277 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-02 07:00:20 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:20 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:20 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:20 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:20 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:20 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:20 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:20 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-02 07:00:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-02 07:00:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-02 07:00:21 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-02 07:00:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_b4e5bab4'), local_subscribe_addr='ipc:///tmp/da574df2-ff6a-4888-8790-a7924d439f03', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-02 07:00:21 [__init__.py:1439] Found nccl from library libnccl.so.2
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-02 07:00:21 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:21 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:21 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:21 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:21 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 10-02 07:00:21 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:21 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-02 07:00:21 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_4082f51b'), local_subscribe_addr='ipc:///tmp/b9fbcccf-f2af-4268-a648-3e198bafaac5', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-02 07:00:21 [parallel_state.py:1206] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 10-02 07:00:21 [parallel_state.py:1206] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 10-02 07:00:21 [parallel_state.py:1206] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
INFO 10-02 07:00:21 [parallel_state.py:1206] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 10-02 07:00:21 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 10-02 07:00:21 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 10-02 07:00:21 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 10-02 07:00:21 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(Worker_TP3 pid=55844) INFO 10-02 07:00:21 [gpu_model_runner.py:2357] Starting to load model RedHatAI/DeepSeek-R1-0528-quantized.w4a16...
(Worker_TP2 pid=55843) INFO 10-02 07:00:21 [gpu_model_runner.py:2357] Starting to load model RedHatAI/DeepSeek-R1-0528-quantized.w4a16...
(Worker_TP1 pid=55842) INFO 10-02 07:00:21 [gpu_model_runner.py:2357] Starting to load model RedHatAI/DeepSeek-R1-0528-quantized.w4a16...
(Worker_TP0 pid=55841) INFO 10-02 07:00:21 [gpu_model_runner.py:2357] Starting to load model RedHatAI/DeepSeek-R1-0528-quantized.w4a16...
(Worker_TP3 pid=55844) INFO 10-02 07:00:22 [gpu_model_runner.py:2389] Loading model from scratch...
(Worker_TP2 pid=55843) INFO 10-02 07:00:22 [gpu_model_runner.py:2389] Loading model from scratch...
(Worker_TP1 pid=55842) INFO 10-02 07:00:22 [gpu_model_runner.py:2389] Loading model from scratch...
(Worker_TP0 pid=55841) INFO 10-02 07:00:22 [gpu_model_runner.py:2389] Loading model from scratch...
(Worker_TP3 pid=55844) set kv_cache_dtype to fp8_e4m3 for layer model.layers.0.self_attn.attn
(Worker_TP2 pid=55843) set kv_cache_dtype to fp8_e4m3 for layer model.layers.0.self_attn.attn
(Worker_TP0 pid=55841) set kv_cache_dtype to fp8_e4m3 for layer model.layers.0.self_attn.attn
(Worker_TP1 pid=55842) set kv_cache_dtype to fp8_e4m3 for layer model.layers.0.self_attn.attn
(Worker_TP2 pid=55843) INFO 10-02 07:00:22 [cuda.py:258] Using FlashMLA backend on V1 engine.
(Worker_TP3 pid=55844) INFO 10-02 07:00:22 [cuda.py:258] Using FlashMLA backend on V1 engine.
(Worker_TP0 pid=55841) INFO 10-02 07:00:22 [cuda.py:258] Using FlashMLA backend on V1 engine.
(Worker_TP1 pid=55842) INFO 10-02 07:00:22 [cuda.py:258] Using FlashMLA backend on V1 engine.
[... "set kv_cache_dtype to fp8_e4m3" lines for layers 1-3 on all four workers elided ...]
(Worker_TP0 pid=55841) INFO 10-02 07:00:22 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP1 pid=55842) INFO 10-02 07:00:22 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP2 pid=55843) INFO 10-02 07:00:22 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
(Worker_TP3 pid=55844) INFO 10-02 07:00:22 [compressed_tensors_moe.py:121] Using CompressedTensorsWNA16MarlinMoEMethod
[... "set kv_cache_dtype to fp8_e4m3 for layer model.layers.<N>.self_attn.attn" lines for layers 4-60 on all four workers elided ...]
(Worker_TP3 pid=55844) INFO 10-02 07:00:23 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP2 pid=55843) INFO 10-02 07:00:23 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP0 pid=55841) INFO 10-02 07:00:23 [weight_utils.py:348] Using model weights format ['*.safetensors']
(Worker_TP1 pid=55842) INFO 10-02 07:00:23 [weight_utils.py:348] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/63 [00:00<?, ?it/s]
[... per-shard progress lines elided ...]
Loading safetensors checkpoint shards:  65% Completed | 41/63 [02:14<01:16,  3.49s/it]
(Worker_TP3 pid=55844) INFO 10-02 07:02:40 [default_loader.py:268] Loading weights took 136.75 seconds
(Worker_TP3 pid=55844) WARNING 10-02 07:02:40 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
(Worker_TP3 pid=55844) WARNING 10-02 07:02:40 [kv_cache.py:100] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
(Worker_TP2 pid=55843) INFO 10-02 07:02:41 [default_loader.py:268] Loading weights took 137.44 seconds
(Worker_TP2 pid=55843) WARNING 10-02 07:02:41 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
(Worker_TP2 pid=55843) WARNING 10-02 07:02:41 [kv_cache.py:100] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
Loading safetensors checkpoint shards:  67% Completed | 42/63 [02:18<01:13,  3.49s/it]
(Worker_TP3 pid=55844) INFO 10-02 07:02:42 [gpu_model_runner.py:2411] Model loading took 87.9393 GiB and 140.081488 seconds
(Worker_TP2 pid=55843) INFO 10-02 07:02:43 [gpu_model_runner.py:2411] Model loading took 87.9393 GiB and 141.136247 seconds
[... per-shard progress lines elided ...]
Loading safetensors checkpoint shards: 100% Completed | 63/63 [03:21<00:00,  3.20s/it]
(Worker_TP0 pid=55841) 
(Worker_TP0 pid=55841) INFO 10-02 07:03:45 [default_loader.py:268] Loading weights took 201.63 seconds
(Worker_TP0 pid=55841) WARNING 10-02 07:03:45 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
(Worker_TP0 pid=55841) WARNING 10-02 07:03:45 [kv_cache.py:100] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
(Worker_TP1 pid=55842) INFO 10-02 07:03:48 [default_loader.py:268] Loading weights took 203.87 seconds
(Worker_TP1 pid=55842) WARNING 10-02 07:03:48 [kv_cache.py:86] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
(Worker_TP1 pid=55842) WARNING 10-02 07:03:48 [kv_cache.py:100] Using KV cache scaling factor 1.0 for fp8_e4m3. This may cause accuracy issues. Please make sure k/v_scale scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=55841) INFO 10-02 07:03:49 [gpu_model_runner.py:2411] Model loading took 87.9393 GiB and 206.858939 seconds
(Worker_TP1 pid=55842) INFO 10-02 07:03:51 [gpu_model_runner.py:2411] Model loading took 87.9393 GiB and 208.650490 seconds
(Worker_TP3 pid=55844) INFO 10-02 07:03:59 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/89087ccd47/rank_3_0/backbone for vLLM's torch.compile
(Worker_TP3 pid=55844) INFO 10-02 07:03:59 [backends.py:550] Dynamo bytecode transform time: 7.79 s
(Worker_TP2 pid=55843) INFO 10-02 07:03:59 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/89087ccd47/rank_2_0/backbone for vLLM's torch.compile
(Worker_TP2 pid=55843) INFO 10-02 07:03:59 [backends.py:550] Dynamo bytecode transform time: 7.91 s
(Worker_TP3 pid=55844) INFO 10-02 07:04:02 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.172 s
(Worker_TP2 pid=55843) INFO 10-02 07:04:03 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.205 s
(Worker_TP0 pid=55841) INFO 10-02 07:04:03 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/89087ccd47/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=55841) INFO 10-02 07:04:03 [backends.py:550] Dynamo bytecode transform time: 11.51 s
(Worker_TP1 pid=55842) INFO 10-02 07:04:03 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/89087ccd47/rank_1_0/backbone for vLLM's torch.compile
(Worker_TP1 pid=55842) INFO 10-02 07:04:03 [backends.py:550] Dynamo bytecode transform time: 11.58 s
(Worker_TP0 pid=55841) INFO 10-02 07:04:08 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 4.928 s
(Worker_TP1 pid=55842) INFO 10-02 07:04:08 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 4.936 s
(Worker_TP2 pid=55843) INFO 10-02 07:04:11 [monitor.py:34] torch.compile takes 7.91 s in total
(Worker_TP3 pid=55844) INFO 10-02 07:04:11 [monitor.py:34] torch.compile takes 7.79 s in total
(Worker_TP1 pid=55842) INFO 10-02 07:04:11 [monitor.py:34] torch.compile takes 11.58 s in total
(Worker_TP0 pid=55841) INFO 10-02 07:04:11 [monitor.py:34] torch.compile takes 11.51 s in total
(Worker_TP3 pid=55844) INFO 10-02 07:04:13 [gpu_worker.py:299] Available KV cache memory: 33.99 GiB
(Worker_TP2 pid=55843) INFO 10-02 07:04:13 [gpu_worker.py:299] Available KV cache memory: 33.99 GiB
(Worker_TP0 pid=55841) INFO 10-02 07:04:13 [gpu_worker.py:299] Available KV cache memory: 33.99 GiB
(Worker_TP1 pid=55842) INFO 10-02 07:04:13 [gpu_worker.py:299] Available KV cache memory: 33.99 GiB
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1048] Multiplying the GPU KV cache size by the dcp_world_size 4.
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1052] GPU KV cache size: 4,154,112 tokens
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1056] Maximum concurrency for 9,216 tokens per request: 450.75x
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1048] Multiplying the GPU KV cache size by the dcp_world_size 4.
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1052] GPU KV cache size: 4,154,112 tokens
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1056] Maximum concurrency for 9,216 tokens per request: 450.75x
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1048] Multiplying the GPU KV cache size by the dcp_world_size 4.
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1052] GPU KV cache size: 4,154,112 tokens
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1056] Maximum concurrency for 9,216 tokens per request: 450.75x
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1048] Multiplying the GPU KV cache size by the dcp_world_size 4.
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1052] GPU KV cache size: 4,154,112 tokens
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:14 [kv_cache_utils.py:1056] Maximum concurrency for 9,216 tokens per request: 450.75x
(EngineCore_DP0 pid=55692) kv_cache_configs KVCacheConfig(num_blocks=16227, kv_cache_tensors=[KVCacheTensor(size=598192128, shared_by=['model.layers.0.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.1.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.2.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.3.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.4.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.5.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.6.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.7.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.8.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.9.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.10.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.11.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.12.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.13.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.14.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.15.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.16.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.17.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.18.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.19.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.20.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.21.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.22.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.23.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.24.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.25.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.26.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.27.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.28.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.29.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.30.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.31.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.32.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.33.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.34.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.35.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.36.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.37.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.38.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.39.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.40.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.41.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.42.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.43.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.44.self_attn.attn']), 
KVCacheTensor(size=598192128, shared_by=['model.layers.45.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.46.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.47.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.48.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.49.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.50.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.51.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.52.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.53.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.54.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.55.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.56.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.57.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.58.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.59.self_attn.attn']), KVCacheTensor(size=598192128, shared_by=['model.layers.60.self_attn.attn'])], kv_cache_groups=[KVCacheGroupSpec(layer_names=['model.layers.0.self_attn.attn', 'model.layers.1.self_attn.attn', 'model.layers.2.self_attn.attn', 'model.layers.3.self_attn.attn', 'model.layers.4.self_attn.attn', 'model.layers.5.self_attn.attn', 'model.layers.6.self_attn.attn', 'model.layers.7.self_attn.attn', 'model.layers.8.self_attn.attn', 'model.layers.9.self_attn.attn', 'model.layers.10.self_attn.attn', 'model.layers.11.self_attn.attn', 'model.layers.12.self_attn.attn', 'model.layers.13.self_attn.attn', 'model.layers.14.self_attn.attn', 'model.layers.15.self_attn.attn', 'model.layers.16.self_attn.attn', 'model.layers.17.self_attn.attn', 'model.layers.18.self_attn.attn', 'model.layers.19.self_attn.attn', 'model.layers.20.self_attn.attn', 'model.layers.21.self_attn.attn', 'model.layers.22.self_attn.attn', 'model.layers.23.self_attn.attn', 'model.layers.24.self_attn.attn', 'model.layers.25.self_attn.attn', 'model.layers.26.self_attn.attn', 'model.layers.27.self_attn.attn', 'model.layers.28.self_attn.attn', 'model.layers.29.self_attn.attn', 'model.layers.30.self_attn.attn', 'model.layers.31.self_attn.attn', 'model.layers.32.self_attn.attn', 'model.layers.33.self_attn.attn', 'model.layers.34.self_attn.attn', 'model.layers.35.self_attn.attn', 'model.layers.36.self_attn.attn', 'model.layers.37.self_attn.attn', 'model.layers.38.self_attn.attn', 'model.layers.39.self_attn.attn', 'model.layers.40.self_attn.attn', 'model.layers.41.self_attn.attn', 'model.layers.42.self_attn.attn', 'model.layers.43.self_attn.attn', 'model.layers.44.self_attn.attn', 'model.layers.45.self_attn.attn', 'model.layers.46.self_attn.attn', 'model.layers.47.self_attn.attn', 'model.layers.48.self_attn.attn', 'model.layers.49.self_attn.attn', 'model.layers.50.self_attn.attn', 'model.layers.51.self_attn.attn', 'model.layers.52.self_attn.attn', 'model.layers.53.self_attn.attn', 'model.layers.54.self_attn.attn', 'model.layers.55.self_attn.attn', 'model.layers.56.self_attn.attn', 'model.layers.57.self_attn.attn', 'model.layers.58.self_attn.attn', 'model.layers.59.self_attn.attn', 'model.layers.60.self_attn.attn'], kv_cache_spec=FullAttentionSpec(block_size=64, num_kv_heads=1, head_size=576, dtype=torch.uint8, use_mla=True, sliding_window=None, attention_chunk_size=None))])
(Worker_TP3 pid=55844) 2025-10-02 07:04:14,399 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=55842) 2025-10-02 07:04:14,399 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP2 pid=55843) 2025-10-02 07:04:14,399 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=55841) 2025-10-02 07:04:14,399 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP3 pid=55844) 2025-10-02 07:04:14,958 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP2 pid=55843) 2025-10-02 07:04:14,958 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP0 pid=55841) 2025-10-02 07:04:14,958 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=55842) 2025-10-02 07:04:14,958 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  64%|████████████████████████████████████████████████▊                           | 43/67 [00:09<00:05,  4.51it/s]
(Worker_TP3 pid=55844) INFO 10-02 07:04:24 [custom_all_reduce.py:203] Registering 8241 cuda graph addresses
(Worker_TP2 pid=55843) INFO 10-02 07:04:24 [custom_all_reduce.py:203] Registering 8241 cuda graph addresses
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████| 67/67 [00:15<00:00,  4.43it/s]
(Worker_TP0 pid=55841) INFO 10-02 07:04:30 [custom_all_reduce.py:203] Registering 8241 cuda graph addresses
(Worker_TP1 pid=55842) INFO 10-02 07:04:30 [custom_all_reduce.py:203] Registering 8241 cuda graph addresses
(Worker_TP3 pid=55844) INFO 10-02 07:04:31 [gpu_model_runner.py:3137] Graph capturing finished in 16 secs, took -0.68 GiB
(Worker_TP3 pid=55844) INFO 10-02 07:04:31 [gpu_worker.py:392] Free memory on device (139.19/139.8 GiB) on startup. Desired GPU memory utilization is (0.9, 125.82 GiB). Actual usage is 87.94 GiB for weight, 1.9 GiB for peak activation, 2.0 GiB for non-torch memory, and -0.68 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=37059928780` to fit into requested memory, or `--kv-cache-memory=51419895296` to fully utilize gpu memory. Current kv cache memory in use is 36491600588 bytes.
(Worker_TP2 pid=55843) INFO 10-02 07:04:31 [gpu_model_runner.py:3137] Graph capturing finished in 16 secs, took -0.68 GiB
(Worker_TP2 pid=55843) INFO 10-02 07:04:31 [gpu_worker.py:392] Free memory on device (139.19/139.8 GiB) on startup. Desired GPU memory utilization is (0.9, 125.82 GiB). Actual usage is 87.94 GiB for weight, 1.9 GiB for peak activation, 2.0 GiB for non-torch memory, and -0.68 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=37059928780` to fit into requested memory, or `--kv-cache-memory=51419895296` to fully utilize gpu memory. Current kv cache memory in use is 36491600588 bytes.
(Worker_TP1 pid=55842) INFO 10-02 07:04:31 [gpu_model_runner.py:3137] Graph capturing finished in 16 secs, took -0.68 GiB
(Worker_TP1 pid=55842) INFO 10-02 07:04:31 [gpu_worker.py:392] Free memory on device (139.19/139.8 GiB) on startup. Desired GPU memory utilization is (0.9, 125.82 GiB). Actual usage is 87.94 GiB for weight, 1.9 GiB for peak activation, 2.0 GiB for non-torch memory, and -0.68 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=37059928780` to fit into requested memory, or `--kv-cache-memory=51419895296` to fully utilize gpu memory. Current kv cache memory in use is 36491600588 bytes.
(Worker_TP0 pid=55841) INFO 10-02 07:04:31 [gpu_model_runner.py:3137] Graph capturing finished in 16 secs, took -0.68 GiB
(Worker_TP0 pid=55841) INFO 10-02 07:04:31 [gpu_worker.py:392] Free memory on device (139.19/139.8 GiB) on startup. Desired GPU memory utilization is (0.9, 125.82 GiB). Actual usage is 87.94 GiB for weight, 1.9 GiB for peak activation, 2.0 GiB for non-torch memory, and -0.68 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=37059928780` to fit into requested memory, or `--kv-cache-memory=51419895296` to fully utilize gpu memory. Current kv cache memory in use is 36491600588 bytes.
(EngineCore_DP0 pid=55692) INFO 10-02 07:04:31 [core.py:214] init engine (profile, create kv cache, warmup model) took 40.02 seconds
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712] EngineCore failed to start.
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712] Traceback (most recent call last):
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/engine/core.py", line 703, in run_engine_core
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/engine/core.py", line 502, in __init__
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     self.scheduler: SchedulerInterface = Scheduler(
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]                                          ^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/core/sched/scheduler.py", line 166, in __init__
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     self.kv_cache_manager = KVCacheManager(
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]                             ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/core/kv_cache_manager.py", line 105, in __init__
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     self.coordinator = get_kv_cache_coordinator(
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]                        ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/core/kv_cache_coordinator.py", line 455, in get_kv_cache_coordinator
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     return UnitaryKVCacheCoordinator(kv_cache_config, max_model_len,
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]   File "/root/vllm/vllm/v1/core/kv_cache_coordinator.py", line 251, in __init__
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]     assert hash_block_size == self.kv_cache_spec.block_size
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:32 [core.py:712] AssertionError
(EngineCore_DP0 pid=55692) ERROR 10-02 07:04:34 [multiproc_executor.py:154] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.
(EngineCore_DP0 pid=55692) Process EngineCore_DP0:
(EngineCore_DP0 pid=55692) Traceback (most recent call last):
(EngineCore_DP0 pid=55692)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=55692)     self.run()
(EngineCore_DP0 pid=55692)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=55692)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/engine/core.py", line 716, in run_engine_core
(EngineCore_DP0 pid=55692)     raise e
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/engine/core.py", line 703, in run_engine_core
(EngineCore_DP0 pid=55692)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=55692)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/engine/core.py", line 502, in __init__
(EngineCore_DP0 pid=55692)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore_DP0 pid=55692)     self.scheduler: SchedulerInterface = Scheduler(
(EngineCore_DP0 pid=55692)                                          ^^^^^^^^^^
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/core/sched/scheduler.py", line 166, in __init__
(EngineCore_DP0 pid=55692)     self.kv_cache_manager = KVCacheManager(
(EngineCore_DP0 pid=55692)                             ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/core/kv_cache_manager.py", line 105, in __init__
(EngineCore_DP0 pid=55692)     self.coordinator = get_kv_cache_coordinator(
(EngineCore_DP0 pid=55692)                        ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/core/kv_cache_coordinator.py", line 455, in get_kv_cache_coordinator
(EngineCore_DP0 pid=55692)     return UnitaryKVCacheCoordinator(kv_cache_config, max_model_len,
(EngineCore_DP0 pid=55692)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692)   File "/root/vllm/vllm/v1/core/kv_cache_coordinator.py", line 251, in __init__
(EngineCore_DP0 pid=55692)     assert hash_block_size == self.kv_cache_spec.block_size
(EngineCore_DP0 pid=55692)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=55692) AssertionError
(APIServer pid=55382) Traceback (most recent call last):
(APIServer pid=55382)   File "/root/venv/bin/vllm", line 10, in <module>
(APIServer pid=55382)     sys.exit(main())
(APIServer pid=55382)              ^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=55382)     args.dispatch_function(args)
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=55382)     uvloop.run(run_server(args))
(APIServer pid=55382)   File "/root/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=55382)     return __asyncio.run(
(APIServer pid=55382)            ^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=55382)     return runner.run(main)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=55382)     return self._loop.run_until_complete(task)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=55382)   File "/root/venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=55382)     return await main
(APIServer pid=55382)            ^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 1956, in run_server
(APIServer pid=55382)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 1976, in run_server_worker
(APIServer pid=55382)     async with build_async_engine_client(
(APIServer pid=55382)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=55382)     return await anext(self.gen)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 180, in build_async_engine_client
(APIServer pid=55382)     async with build_async_engine_client_from_engine_args(
(APIServer pid=55382)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=55382)     return await anext(self.gen)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 222, in build_async_engine_client_from_engine_args
(APIServer pid=55382)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=55382)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/utils/__init__.py", line 1595, in inner
(APIServer pid=55382)     return fn(*args, **kwargs)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/async_llm.py", line 209, in from_vllm_config
(APIServer pid=55382)     return cls(
(APIServer pid=55382)            ^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=55382)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=55382)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=55382)     return AsyncMPClient(*client_args)
(APIServer pid=55382)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=55382)     super().__init__(
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=55382)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=55382)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=55382)     next(self.gen)
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=55382)     wait_for_engine_startup(
(APIServer pid=55382)   File "/root/vllm/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=55382)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=55382) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Notes:

  • I noticed that in these logs the KV cache is reported as roughly 2x the size compared to running without this PR. I'm unsure whether that's a relevant or expected detail; I (perhaps naively) didn't expect this PR to increase the KV cache capacity. A rough sanity check on the logged numbers is sketched below.
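
A rough back-of-envelope check, using only the numbers printed in the log above (per-layer KVCacheTensor size 598192128 bytes, num_blocks=16227, block_size=64, head_size=576, dcp_world_size=4) and assuming an fp8 KV cache stores 1 byte per element while bf16 stores 2; the bf16 figure is only an estimate for comparison, not something printed by vLLM:

```python
# All constants are copied from the KVCacheConfig / kv_cache_utils log lines above.
tensor_bytes = 598_192_128      # KVCacheTensor.size for one layer
num_blocks = 16_227             # KVCacheConfig.num_blocks
block_size = 64                 # FullAttentionSpec.block_size
head_size = 576                 # MLA latent width (num_kv_heads=1)
dcp_world_size = 4

bytes_per_block = tensor_bytes / num_blocks      # 36864.0
bytes_per_token = bytes_per_block / block_size   # 576.0
bytes_per_elem = bytes_per_token / head_size     # 1.0 -> consistent with fp8 (uint8)

tokens_fp8 = num_blocks * block_size * dcp_world_size
print(f"fp8 KV cache capacity:  {tokens_fp8:,} tokens")   # 4,154,112, matches the log

# With a bf16 KV cache (2 bytes per element) the same per-layer memory would hold
# roughly half as many blocks, hence about half the tokens.
tokens_bf16_estimate = tokens_fp8 // 2
print(f"bf16 KV cache estimate: {tokens_bf16_estimate:,} tokens")
```

If that reasoning applies here, the ~2x capacity would simply come from the fp8 KV cache halving the per-token footprint, rather than from anything DCP-specific.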

@heheda12345
Copy link
Collaborator Author

Thanks for catching this! I haven't tried DCP yet. But why do you need this PR for deepseek-r1?

@josephrocca
Copy link

Hi @heheda12345, thanks for your comment. I was actually just testing this PR in case it solves a weird bug with DCP inference, seemingly related to incorrect KV cache storage/retrieval, which causes some requests to use the wrong KV data during inference when prefix caching is enabled.

From your question, it sounds like this PR is not related to DeepSeek R1/V3. I'm too inexperienced with this stuff to have realised that ^^'

@heheda12345 heheda12345 requested a review from ApostaC as a code owner October 4, 2025 05:21
@heheda12345
Copy link
Collaborator Author

Will rebase it next week to avoid the conflict with #25101

@mergify mergify bot removed the needs-rebase label Oct 6, 2025
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
Signed-off-by: Chen Zhang <[email protected]>
@heheda12345
Copy link
Collaborator Author

Will handle the DCP-related crash after #26296
