
[Bug]: dots.ocr multi-node deployment fails #28658

@htagourti

Description

Your current environment

I'm using the vLLM Production Stack Helm chart (0.1.7).


Helm values

servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
    - name: "dotsocr"
      repository: "vllm/vllm-openai"
      tag: "nightly"
      modelURL: "rednote-hilab/dots.ocr"

      replicaCount: 1

      requestCPU: 10
      requestMemory: "20Gi"
      requestGPU: 1

      vllmConfig:
        v1: 1
        tensorParallelSize: 1
        pipelineParallelSize: 2
        maxModelLen: 4096
        dtype: "bfloat16"
        extraArgs:
          - "--trust-remote-code"
          - "--served-model-name=dotsocr"

      shmSize: "20Gi"

      raySpec:
        headNode:
          requestCPU: 15
          requestMemory: "20Gi"
          requestGPU: 1

πŸ› Describe the bug

I'm running into an issue specific to the model rednote-hilab/dots.ocr when enabling distributed execution with:

  • tensor_parallel_size = 1
  • pipeline_parallel_size = 2 (or higher)

The exact same setup works with other models (Qwen, DeepSeek-OCR, ...).
Only dots.ocr crashes during engine initialization.
I tried both the "latest" and "nightly" image tags, with the same result.

Note:
If I simply change modelURL to another model, the deployment starts successfully.
Single-node multi-GPU (tensor_parallel_size >= 2 and pipeline_parallel_size = 1) also works fine with dots.ocr.
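For reference, the same failing configuration can be reproduced outside the Helm chart with vLLM's offline LLM API. This is only a minimal sketch, assuming a Ray cluster spanning the two nodes (1 GPU each) is already running and reachable through RAY_ADDRESS; the keyword arguments mirror the flags the chart passes to vllm serve:

# Minimal reproduction sketch (offline API instead of the Helm chart).
# Assumes a Ray cluster with 2 x 1 GPU is already reachable via RAY_ADDRESS.
from vllm import LLM, SamplingParams

llm = LLM(
    model="rednote-hilab/dots.ocr",
    trust_remote_code=True,
    dtype="bfloat16",
    max_model_len=4096,
    tensor_parallel_size=1,
    pipeline_parallel_size=2,            # this is the setting that crashes
    distributed_executor_backend="ray",
    # tensor_parallel_size=2 with pipeline_parallel_size=1 works (single node)
)

# The crash happens during engine initialization, before any request is served.
print(llm.generate(["test"], SamplingParams(max_tokens=8)))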


Error logs

Below is the raw error log (trimmed for readability).
The crash consistently happens inside:

gpu_model_runner.py → sync_and_slice_intermediate_tensors → assert self.intermediate_tensors is not None
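
To make the assertion easier to read, here is a simplified, self-contained sketch of the pattern involved (illustrative only, not vLLM's actual code; the class and method names are made up). In pipeline parallelism, every stage except the first consumes intermediate hidden states from the previous stage, and the runner is expected to have pre-allocated placeholder tensors for them before slicing; if that allocation never happened for this model, the slice hits exactly this assert on PP rank 1:

# Illustrative-only sketch of the failing pattern; not vLLM's implementation.
import torch


class ToyPipelineStageRunner:
    """Mimics a per-stage model runner in a 2-stage pipeline."""

    def __init__(self, pp_rank: int):
        self.pp_rank = pp_rank            # 0 = first stage, 1 = second stage
        self.intermediate_tensors = None  # inputs expected from the previous stage

    def init_intermediate_tensors(self, max_tokens: int, hidden_size: int = 1536) -> None:
        # A PP-aware model tells the runner which placeholder tensors to allocate
        # (vLLM's PP-capable models expose a similar make_empty_intermediate_tensors hook).
        self.intermediate_tensors = {
            "hidden_states": torch.zeros(max_tokens, hidden_size),
            "residual": torch.zeros(max_tokens, hidden_size),
        }

    def sync_and_slice_intermediate_tensors(self, num_tokens: int) -> dict:
        # Mirrors the failing check: a non-first stage must already have its
        # placeholders allocated, otherwise there is nothing to slice.
        assert self.intermediate_tensors is not None
        return {k: v[:num_tokens] for k, v in self.intermediate_tensors.items()}

    def dummy_run(self, num_tokens: int) -> None:
        if self.pp_rank == 0:
            return  # the first stage builds its inputs from token embeddings instead
        self.sync_and_slice_intermediate_tensors(num_tokens)


runner = ToyPipelineStageRunner(pp_rank=1)
# Skipping the next line reproduces an AssertionError like the one on PP rank 1 below.
runner.init_intermediate_tensors(max_tokens=4096)
runner.dummy_run(num_tokens=2048)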

Full stack:

[2025-11-13 06:17:09,402] Ray cluster status: 2/2 nodes alive.
[2025-11-13 06:17:09,402] Cluster is ready.
Ray is ready. Starting vLLM...
Executing: vllm serve rednote-hilab/dots.ocr --host 0.0.0.0 --port 8000 --distributed-executor-backend ray --max-model-len 4096 --dtype bfloat16 --tensor-parallel-size 1 --pipeline-parallel-size 2 --trust-remote-code --served-model-name=dotsocr
(APIServer pid=54) INFO 11-13 06:17:20 [api_server.py:1897] vLLM API server version 0.11.1rc7.dev109+gca00b1bfc
(APIServer pid=54) INFO 11-13 06:17:20 [utils.py:253] non-default args: {'model_tag': 'rednote-hilab/dots.ocr', 'host': '0.0.0.0', 'model': 'rednote-hilab/dots.ocr', 'trust_remote_code': True, 'dtype': 'bfloat16', 'max_model_len': 4096, 'served_model_name': ['dotsocr'], 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 2}
(APIServer pid=54) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=54) A new version of the following files was downloaded from https://huggingface.co/rednote-hilab/dots.ocr:
(APIServer pid=54) - configuration_dots.py
(APIServer pid=54) . Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
(APIServer pid=54) INFO 11-13 06:17:27 [model.py:631] Resolved architecture: DotsOCRForCausalLM
(APIServer pid=54) INFO 11-13 06:17:27 [model.py:1736] Using max model len 4096
(APIServer pid=54) INFO 11-13 06:17:27 [scheduler.py:254] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:35 [core.py:94] Initializing a V1 LLM engine (v0.11.1rc7.dev109+gca00b1bfc) with config: model='rednote-hilab/dots.ocr', speculative_config=None, tokenizer='rednote-hilab/dots.ocr', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=2, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=dotsocr, enable_prefix_caching=True, chunked_prefill_enabled=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': True, 'use_inductor': None, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {}, 'max_cudagraph_capture_size': 512, 'local_cache_dir': None}
(EngineCore_DP0 pid=1769) WARNING 11-13 06:17:35 [ray_utils.py:331] Tensor parallel size (2) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 2 GPUs available.
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,548	INFO worker.py:1691 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,550	INFO worker.py:1832 -- Connecting to existing Ray cluster at address: 10.2.4.18:6379...
(EngineCore_DP0 pid=1769) 2025-11-13 06:17:35,564	INFO worker.py:2012 -- Connected to Ray cluster.
(EngineCore_DP0 pid=1769) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2051: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:35 [ray_utils.py:396] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=1769)   warnings.warn(
(EngineCore_DP0 pid=1769) (raylet, ip=10.2.1.15) [2025-11-13 06:17:36,375 E 85 85] (raylet) main.cc:975: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:66] RAY_NON_CARRY_OVER_ENV_VARS from config: set()
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:69] Copying the following environment variables to workers: ['VLLM_WORKER_MULTIPROC_METHOD', 'HF_TOKEN', 'VLLM_USAGE_SOURCE', 'VLLM_API_KEY', 'LD_LIBRARY_PATH', 'CUDA_HOME']
(EngineCore_DP0 pid=1769) INFO 11-13 06:17:44 [ray_env.py:74] If certain env vars should NOT be copied, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json file
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
[2025-11-13 06:18:05,592 E 1769 1855] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) [2025-11-13 06:18:06,107 E 274 309] core_worker_process.cc:825: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) 
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) 
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  2.08it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.51it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) 
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  2.43it/s]
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405)
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) WARNING 11-13 06:17:45 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) [Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:46 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:46 [parallel_state.py:1325] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1769) Process EngineCore_DP0:
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:17:53 [gpu_model_runner.py:3047] Starting to load model rednote-hilab/dots.ocr...
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) WARNING 11-13 06:17:45 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [repeated 8x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:17:46 [parallel_state.py:1325] rank 1 in world size 2 is assigned as DP rank 0, PP rank 1, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:18:13 [cuda.py:408] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:18:13 [cuda.py:417] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:04 [weight_utils.py:441] Time spent downloading weights for rednote-hilab/dots.ocr: 170.155065 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:18:14 [cuda.py:408] Valid backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:18:14 [cuda.py:417] Using FLASH_ATTN backend.
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:05 [default_loader.py:314] Loading weights took 0.88 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=405) INFO 11-13 06:21:06 [gpu_model_runner.py:3126] Model loading took 4.0605 GiB memory and 192.645870 seconds
(EngineCore_DP0 pid=1769) (RayWorkerWrapper pid=274, ip=10.2.1.15) INFO 11-13 06:21:07 [gpu_model_runner.py:3876] Encoder cache will be initialized with a budget of 14400 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] EngineCore failed to start.
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] Traceback (most recent call last):
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 846, in run_engine_core
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 619, in __init__
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     super().__init__(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 228, in _initialize_kv_caches
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 126, in determine_available_memory
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return self.collective_rpc("determine_available_memory")
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 474, in collective_rpc
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return fn(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2961, in get
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                                   ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1026, in get_objects
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     raise value.as_instanceof_cause()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] ray.exceptions.RayTaskError(AssertionError): ray::RayWorkerWrapper.execute_method() (pid=274, ip=10.2.1.15, actor_id=61c5edededf45d7a32260d8609000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0x7f348a83be90>)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 343, in execute_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     raise e
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 332, in execute_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 318, in determine_available_memory
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     self.model_runner.profile_run()
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3923, in profile_run
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                                         ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     return func(*args, **kwargs)
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3594, in _dummy_run
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     intermediate_tensors = self.sync_and_slice_intermediate_tensors(
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2072, in sync_and_slice_intermediate_tensors
(EngineCore_DP0 pid=1769) Traceback (most recent call last):
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]     assert self.intermediate_tensors is not None
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1769) ERROR 11-13 06:21:25 [core.py:855] AssertionError

The engine then aborts with:

RuntimeError: Engine core initialization failed. Failed core proc(s): {}

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
