Description
Your current environment
I'm using the vLLM Production Stack Helm chart (0.1.7)
Helm values
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "dotsocr"
      repository: "vllm/vllm-openai"
      tag: "nightly"
      modelURL: "rednote-hilab/dots.ocr"
      replicaCount: 1
      requestCPU: 10
      requestMemory: "20Gi"
      requestGPU: 1
      vllmConfig:
        v1: 1
        tensorParallelSize: 1
        pipelineParallelSize: 2
        maxModelLen: 4096
        dtype: "bfloat16"
        extraArgs:
          - "--trust-remote-code"
          - "--served-model-name=dotsocr"
      shmSize: "20Gi"
      raySpec:
        headNode:
          requestCPU: 15
          requestMemory: "20Gi"
          requestGPU: 1
🐛 Describe the bug
I'm running into an issue specific to the model rednote-hilab/dots.ocr when enabling distributed execution with:
- tensor_parallel_size = 1
- pipeline_parallel_size = 2 (or higher)
The exact same setup works with other models (Qwen, DeepSeek-OCR, ...); only dots.ocr crashes during engine initialization.
I tried both the "latest" and "nightly" image tags, with the same result.
Note:
- If I simply change modelURL to another model, the deployment starts successfully.
- Single-node multi-GPU (tensor_parallel_size >= 2 and pipeline_parallel_size = 1) works fine with dots.ocr.
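For reference, the same engine configuration expressed through the offline Python API (an untested sketch on my side, simply mirroring the flags of the vllm serve command shown in the logs below; it assumes the same 2-node, 1-GPU-per-node Ray cluster is already running):

# Untested repro sketch: same configuration as the Helm deployment, via the
# offline Python API. Flags mirror the `vllm serve` command in the logs below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="rednote-hilab/dots.ocr",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    tensor_parallel_size=1,
    pipeline_parallel_size=2,            # crashes; PP=1 with TP>=2 works fine
    distributed_executor_backend="ray",  # Ray cluster spanning both nodes
)

# Initialization already fails before any request is issued, but for
# completeness this is how a request would be sent:
out = llm.generate(["test"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)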
Error logs
Below is the raw error log (trimmed for readability).
The crash consistently happens inside:
gpu_model_runner.py → sync_and_slice_intermediate_tensors → assert self.intermediate_tensors is not None
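My reading of the traceback (simplified sketch below, not the actual vLLM source): the failing worker is the one at ip=10.2.1.15, which the parallel_state log assigns PP rank 1, and during profile_run its _dummy_run tries to slice self.intermediate_tensors before they have ever been allocated for this model, so the assert fires.

# Simplified illustration of the failing path as I understand it from the
# traceback below -- NOT the real vLLM implementation, just the shape of the bug.
class ModelRunnerSketch:
    def __init__(self, is_first_pp_rank: bool):
        self.is_first_pp_rank = is_first_pp_rank
        # Non-first pipeline stages expect this to be populated before
        # _dummy_run(); with dots.ocr it apparently never is.
        self.intermediate_tensors = None

    def sync_and_slice_intermediate_tensors(self, num_tokens: int):
        assert self.intermediate_tensors is not None  # <-- AssertionError in the log
        return {k: v[:num_tokens] for k, v in self.intermediate_tensors.items()}

    def dummy_run(self, num_tokens: int):
        if not self.is_first_pp_rank:
            # PP rank 1 (the second node) takes this branch during profiling.
            return self.sync_and_slice_intermediate_tensors(num_tokens)
        return None

# PP rank 1 during profile_run -> raises AssertionError, matching the trace:
ModelRunnerSketch(is_first_pp_rank=False).dummy_run(num_tokens=2048)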
Full stack:
[2025-11-13 06:17:09,402] Ray cluster status: 2/2 nodes alive.
[2025-11-13 06:17:09,402] Cluster is ready.
Ray is ready. Starting vLLM...
Executing: vllm serve rednote-hilab/dots.ocr --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --max-model-len 4096 --dtype bfloat16 --tensor-parallel-size 1 --pipeline-parallel-size 2 --trust-remote-code --served-model-name=dotsocr
(APIServer pid=54) INFO 11-13 06:17:20 [api_server.py:1897] vLLM API server version 0.11.1rc7.dev109+gca00b1bfc
(APIServer pid=54) INFO 11-13 06:17:20 [utils.py:253] non-default args: {'model_tag': 'rednote-hilab/dots.ocr', 'host': '0.0.0.0', 'model': 'rednote-hilab/dots.ocr', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 4096, 'served_model_name': ['dotsocr'], 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 2}
(APIServer pid=54) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54) A new version of the following files was downloaded from https://huggingface.co/rednote-hilab/dots.ocr:
(APIServer pid=54) - configuration_dots.py
(APIServer pid=54) . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=54) INFO 11-13 06:17:27 [model.py:631] Resolved architecture: DotsOCRForCausalLM
(APIServer pid=54) INFO 11-13 06:17:27 [model.py:1736] Using max model len 4096
(APIServer pid=54) INFO 11-13 06:17:27 [scheduler.py:254] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:35 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc7.dev109+gca00b1bfc) with config: model='rednote-hilab/dots.ocr', speculative_config=None, tokenizer='rednote-hilab/dots.ocr', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=2, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=dotsocr, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': True, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=1769) WARNING 11-13 06:17:35 [ray_utils.py:331] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,548 INFO worker.py:1691 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,550 INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.2.4.18:6379...
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,564 INFO worker.py:2012 -- Connected to Ray cluster.
(EngineCore_DP0 pid=1769) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:35 [ray_utils.py:396] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=1769) warnings.warn(
(EngineCore_DP0 pid=1769) (raylet, ip=10.2.1.15) [2025-11-13 06:17:36,375 E 85 85] (raylet) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:66] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:69] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'HF_TOKEN', 'VLLM_USAGE_SOURCE', 'VLLM_API_KEY', 'LD_LIBRARY_PATH', 'CUDA_HOME']
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:74] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[2025-11-13 06:18:05,592 E 1769 1855] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) [2025-11-13 06:18:06,107 E 274 309] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.08it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.51it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 2.43it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) WARNING 11-13 06:17:45 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:46 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:46 [parallel_state.py:1325] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1769) Process EngineCore_DP0:
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:53 [gpu_model_runner.py:3047] Starting to load model rednote-hilab/dots.ocr...
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) WARNING 11-13 06:17:45 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:17:46 [parallel_state.py:1325] rank 1 in world size 2 is assigned as DP rank 0, PP rank 1, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:18:13 [cuda.py:408] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:18:13 [cuda.py:417] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:04 [weight_utils.py:441] Time spent downloading weights for rednote-hilab/dots.ocr: 170.155065 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:18:14 [cuda.py:408] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:18:14 [cuda.py:417] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:05 [default_loader.py:314] Loading weights took 0.88 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:06 [gpu_model_runner.py:3126] Model loading took 4.0605 GiB memory and 192.645870 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:21:07 [gpu_model_runner.py:3876] Encoder cache will be initialized with a budget of 14400 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] EngineCore failed to start.
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] Traceback (most recent call last):
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] super().__init__(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 474, in collective_rpc
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return fn(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2961, in get
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1026, in get_objects
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] raise value.as_instanceof_cause()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ray.exceptions.RayTaskError(AssertionError): ray::RayWorkerWrapper.execute_method() (pid=274, ip=10.2.1.15, actor_id=61c5edededf45d7a32260d8609000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f348a83be90>)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 343, in execute_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] raise e
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 332, in execute_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] self.model_runner.profile_run()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3594, in _dummy_run
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] intermediate_tensors = self.sync_and_slice_intermediate_tensors(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2072, in sync_and_slice_intermediate_tensors
(EngineCore_DP0 pid=1769) Traceback (most recent call last):
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] assert self.intermediate_tensors is not None
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] AssertionError
The engine then aborts with:
RuntimeError: Engine core initialization failed. Failed core proc(s): {}
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.