Summary
vLLM consistently crashes under load when serving Qwen/Qwen3-VL-8B-Instruct-FP8 on the latest nightly vllm/vllm-openai Docker image with `--async-scheduling` enabled.
Disabling async scheduling makes the issue disappear.
This appears closely related to the previously fixed async multimodal CPU tensor race condition (PR #31373), but the crash is still reproducible on a newer nightly (0.14.0rc1.dev221+g97a01308e).
Environment
Host OS
- Debian GNU/Linux (bare metal / VM)
GPU
- 4 × NVIDIA GeForce RTX 5060 Ti (16 GB each) on the host; 2 of them (device=1,2) are used for this deployment
- Driver: 590.44.01
- CUDA: 13.1
nvidia-smi
Driver Version: 590.44.01   CUDA Version: 13.1
GPU: RTX 5060 Ti (16 GB) × 4
Docker image
vllm/vllm-openai:nightly (ID: 31e08c7f6d05)
vLLM version (nightly)
vLLM API server version: 0.14.0rc1.dev221+g97a01308e
Model
Qwen/Qwen3-VL-8B-Instruct-FP8 (multimodal; image enabled, video disabled)
Launch Command
docker run -d --security-opt apparmor=unconfined \
--name qwen3vl-8b-fp8 \
--restart unless-stopped \
--gpus '"device=1,2"' \
-p 9003:9000 \
-v /mnt/aidisk/cache:/root/.cache \
-e VLLM_SLEEP_WHEN_IDLE=1 \
-e HF_HUB_OFFLINE=1 \
vllm/vllm-openai:nightly \
--model Qwen/Qwen3-VL-8B-Instruct-FP8 \
-tp 2 \
--gpu-memory-utilization 0.85 \
--limit-mm-per-prompt.video 0 \
--limit-mm-per-prompt.image 1 \
--mm-processor-cache-gb 2 \
--mm-encoder-tp-mode data \
--kv-cache-dtype fp8 \
--async-scheduling \
--max-model-len 16384 \
--swap-space 0 \
--port 9000
⚠️ Removing `--async-scheduling` makes the system stable.
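For reference, a minimal single-request smoke test against this deployment (assumptions only: the host port mapping 9003:9000 from the command above, the OpenAI Python client, and a placeholder image URL; the model name may need to match whatever /v1/models reports):

# Hypothetical smoke test, not taken from the original report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:9003/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-VL-8B-Instruct-FP8",  # adjust to the served model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image
            {"type": "text", "text": "Describe this image."},
        ],
    }],
    max_tokens=64,
)
print(resp.choices[0].message.content)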
What Happens
After a period of sustained traffic (multiple concurrent chat/completion requests):
- Worker processes exit: `Parent process exited, terminating worker`
- Engine crashes: `EngineCore_DP0 died unexpectedly`, followed by `vllm.v1.engine.exceptions.EngineDeadError`
- API starts returning 500 Internal Server Error for all in-flight requests
- Server shuts down shortly after
No explicit CUDA OOM occurs. GPU memory usage is within configured limits.
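A load-reproduction sketch along these lines (assumptions only: the AsyncOpenAI client, a placeholder image URL, and roughly 8 requests kept in flight to mirror the "Running: 8 reqs" seen in the logs; this is not the exact traffic that triggered the crash):

# Hypothetical load generator for exercising the server under async scheduling.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:9003/v1", api_key="EMPTY")

async def one_request(i: int) -> None:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3-VL-8B-Instruct-FP8",  # adjust to the served model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.jpg"}},  # placeholder image
                {"type": "text", "text": f"Request {i}: describe this image."},
            ],
        }],
        max_tokens=128,
    )
    print(i, resp.choices[0].finish_reason)

async def main() -> None:
    sem = asyncio.Semaphore(8)  # keep ~8 concurrent requests in flight

    async def bounded(i: int) -> None:
        async with sem:
            await one_request(i)

    await asyncio.gather(*(bounded(i) for i in range(500)))

asyncio.run(main())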
Relevant Logs (excerpt)
(Worker_TP1 pid=53) Parent process exited, terminating worker
(Worker_TP0 pid=52) Parent process exited, terminating worker
(APIServer pid=1) ERROR Engine core proc EngineCore_DP0 died unexpectedly
(APIServer pid=1) vllm.v1.engine.exceptions.EngineDeadError
(APIServer pid=1) AsyncLLM output_handler failed
Full logs are attached below.
Expected Behavior
- vLLM should remain stable under async scheduling for multimodal models, or
- Fail gracefully with a clear error message instead of killing the engine.
Actual Behavior
- Engine crashes under load
- All active requests return HTTP 500
- Requires container restart
Regression / Related Issues
This looks strongly related to prior async multimodal issues:
- #30624 – [Bug]: `masked_scatter_size_check` failed when running Qwen3VLMoE (`masked_scatter_size_check` failures with Qwen3-VL)
- #31373 – [BugFix] Re-fix async multimodal cpu tensor race condition
That PR fixed crashes on earlier nightlies, but the issue seems to still reproduce (or has regressed) on newer nightlies.
Workaround
✅ Disable async scheduling:
# remove this flag
--async-scheduling

This makes Qwen3-VL-8B stable in our environment.
Full Logs
(APIServer pid=1) INFO 01-04 01:37:26 [loggers.py:257] Engine 000: Avg prompt throughput: 1187.8 tokens/s, Avg generation throughput: 250.2 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 72.9%, Prefix cache hit rate: 8.2%, MM cache hit rate: 13.7%
(APIServer pid=1) INFO: 172.20.30.1:62540 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 01:37:36 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 158.1 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 66.1%, Prefix cache hit rate: 8.2%, MM cache hit rate: 13.6%
(APIServer pid=1) INFO: 172.20.30.1:28328 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 01:37:46 [loggers.py:257] Engine 000: Avg prompt throughput: 1187.8 tokens/s, Avg generation throughput: 38.9 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 65.7%, Prefix cache hit rate: 8.1%, MM cache hit rate: 13.6%
(APIServer pid=1) INFO: 172.20.30.1:27806 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 01:37:56 [loggers.py:257] Engine 000: Avg prompt throughput: 1187.2 tokens/s, Avg generation throughput: 94.6 tokens/s, Running: 8 reqs, Waiting: 0 reqs, GPU KV cache usage: 65.5%, Prefix cache hit rate: 8.1%, MM cache hit rate: 13.5%
(APIServer pid=1) INFO: 172.20.30.1:6477 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 172.20.30.1:35101 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker_TP1 pid=53) INFO 01-04 01:38:11 [multiproc_executor.py:709] Parent process exited, terminating worker
(Worker_TP0 pid=52) INFO 01-04 01:38:11 [multiproc_executor.py:709] Parent process exited, terminating worker
(APIServer pid=1) ERROR 01-04 01:38:11 [core_client.py:610] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
(Worker_TP1 pid=53) INFO 01-04 01:38:11 [multiproc_executor.py:753] WorkerProc shutting down.
(APIServer pid=1) INFO 01-04 01:38:11 [loggers.py:257] Engine 000: Avg prompt throughput: 798.0 tokens/s, Avg generation throughput: 30.6 tokens/s, Running: 7 reqs, Waiting: 0 reqs, GPU KV cache usage: 55.4%, Prefix cache hit rate: 8.1%, MM cache hit rate: 13.5%
(Worker_TP0 pid=52) INFO 01-04 01:38:11 [multiproc_executor.py:753] WorkerProc shutting down.
(APIServer pid=1) INFO: 172.20.30.1:35101 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] Traceback (most recent call last):
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 495, in output_handler
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 899, in get_output_async
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 01-04 01:38:11 [async_llm.py:543] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 172.20.30.1:62052 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 172.20.30.1:39847 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: 172.20.30.1:42332 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[... ~60 further "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error responses for the remaining in-flight requests omitted ...]
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
WARNING 01-04 01:38:27 [argparse_utils.py:195] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
WARNING 01-04 01:38:27 [argparse_utils.py:342] Found duplicate keys --async-scheduling
(APIServer pid=1) INFO 01-04 01:38:27 [api_server.py:1277] vLLM API server version 0.14.0rc1.dev221+g97a01308e
(APIServer pid=1) INFO 01-04 01:38:27 [utils.py:253] non-default args: {'model_tag': 'Qwen/Qwen3-VL-8B-Instruct-FP8', 'port': 9000, 'model': 'Qwen/Qwen3-VL-8B-Instruct-FP8', 'trust_remote_code': True, 'max_model_len': 16384, 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.85, 'swap_space': 0.0, 'kv_cache_dtype': 'fp8', 'limit_mm_per_prompt': {'video': 0, 'image': 1}, 'mm_processor_cache_gb': 2.0, 'mm_encoder_tp_mode': 'data', 'async_scheduling': True}
(APIServer pid=1) INFO 01-04 01:38:27 [arg_utils.py:599] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-VL-8B-Instruct-FP8] to model_path [/root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-8B-Instruct-FP8/snapshots/9cdc6310a8cb770ce18efaf4e9935334512aee45]
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1) INFO 01-04 01:38:27 [model.py:522] Resolved architecture: Qwen3VLForConditionalGeneration
(APIServer pid=1) INFO 01-04 01:38:27 [model.py:1510] Using max model len 16384
(APIServer pid=1) WARNING 01-04 01:38:27 [vllm.py:1453] Current vLLM config is not set.
(APIServer pid=1) INFO 01-04 01:38:27 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 01-04 01:38:27 [vllm.py:635] Disabling NCCL for DP synchronization when using async scheduling.
(APIServer pid=1) INFO 01-04 01:38:27 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-04 01:38:27 [cache.py:205] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=1) INFO 01-04 01:38:27 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(EngineCore_DP0 pid=30) INFO 01-04 01:38:35 [core.py:96] Initializing a V1 LLM engine (v0.14.0rc1.dev221+g97a01308e) with config: model='/root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-8B-Instruct-FP8/snapshots/9cdc6310a8cb770ce18efaf4e9935334512aee45', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-8B-Instruct-FP8/snapshots/9cdc6310a8cb770ce18efaf4e9935334512aee45', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False), seed=0, served_model_name=/root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-8B-Instruct-FP8/snapshots/9cdc6310a8cb770ce18efaf4e9935334512aee45, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False}, 'local_cache_dir': None}
(EngineCore_DP0 pid=30) WARNING 01-04 01:38:35 [multiproc_executor.py:882] Reducing Torch parallelism from 8 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 01-04 01:38:42 [parallel_state.py:1214] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:50735 backend=nccl
INFO 01-04 01:38:42 [parallel_state.py:1214] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:50735 backend=nccl
INFO 01-04 01:38:42 [pynccl.py:111] vLLM is using nccl==2.27.5
WARNING 01-04 01:38:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 01-04 01:38:42 [symm_mem.py:67] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available.
WARNING 01-04 01:38:42 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 01-04 01:38:42 [custom_all_reduce.py:165] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 01-04 01:38:42 [parallel_state.py:1425] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
INFO 01-04 01:38:42 [parallel_state.py:1425] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank N/A
(Worker_TP0 pid=52) INFO 01-04 01:38:46 [gpu_model_runner.py:3762] Starting to load model /root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-8B-Instruct-FP8/snapshots/9cdc6310a8cb770ce18efaf4e9935334512aee45...
(Worker_TP0 pid=52) INFO 01-04 01:38:46 [mm_encoder_attention.py:83] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=53) INFO 01-04 01:38:46 [mm_encoder_attention.py:83] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=52) INFO 01-04 01:38:47 [cuda.py:351] Using FLASHINFER attention backend out of potential backends: ('FLASHINFER', 'TRITON_ATTN')
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:02<00:02, 2.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.62s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:05<00:00, 2.58s/it]
(Worker_TP0 pid=52)
(Worker_TP1 pid=53) WARNING 01-04 01:38:52 [kv_cache.py:90] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP1 pid=53) WARNING 01-04 01:38:52 [kv_cache.py:104] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP1 pid=53) WARNING 01-04 01:38:52 [kv_cache.py:143] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=52) INFO 01-04 01:38:52 [default_loader.py:308] Loading weights took 5.22 seconds
(Worker_TP0 pid=52) WARNING 01-04 01:38:52 [kv_cache.py:90] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for FP8 Attention backends (flash-attn or flashinfer).
(Worker_TP0 pid=52) WARNING 01-04 01:38:52 [kv_cache.py:104] Using KV cache scaling factor 1.0 for fp8_e4m3. If this is unintended, verify that k/v_scale scaling factors are properly set in the checkpoint.
(Worker_TP0 pid=52) WARNING 01-04 01:38:52 [kv_cache.py:143] Using uncalibrated q_scale 1.0 and/or prob_scale 1.0 with fp8 attention. This may cause accuracy issues. Please make sure q/prob scaling factors are available in the fp8 checkpoint.
(Worker_TP0 pid=52) INFO 01-04 01:38:53 [gpu_model_runner.py:3859] Model loading took 5.7777 GiB memory and 5.792738 seconds
(Worker_TP0 pid=52) INFO 01-04 01:38:53 [gpu_model_runner.py:4669] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_TP1 pid=53) INFO 01-04 01:38:53 [gpu_model_runner.py:4669] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(Worker_TP0 pid=52) INFO 01-04 01:39:24 [backends.py:644] Using cache directory: /root/.cache/vllm/torch_compile_cache/59a33dbaf1/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=52) INFO 01-04 01:39:24 [backends.py:704] Dynamo bytecode transform time: 13.63 s
(Worker_TP1 pid=53) INFO 01-04 01:39:36 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.451 s
(Worker_TP0 pid=52) INFO 01-04 01:39:36 [backends.py:226] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.429 s
(Worker_TP0 pid=52) INFO 01-04 01:39:36 [monitor.py:34] torch.compile takes 15.06 s in total
(Worker_TP0 pid=52) INFO 01-04 01:39:37 [gpu_worker.py:361] Available KV cache memory: 4.74 GiB
(EngineCore_DP0 pid=30) INFO 01-04 01:39:37 [kv_cache_utils.py:1305] GPU KV cache size: 138,160 tokens
(EngineCore_DP0 pid=30) INFO 01-04 01:39:37 [kv_cache_utils.py:1310] Maximum concurrency for 16,384 tokens per request: 8.43x
(Worker_TP0 pid=52) 2026-01-04 01:39:37,965 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP1 pid=53) 2026-01-04 01:39:37,965 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker_TP0 pid=52) INFO 01-04 01:39:38 [kernel_warmup.py:64] Warming up FlashInfer attention.
(Worker_TP0 pid=52) 2026-01-04 01:39:38,003 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker_TP1 pid=53) INFO 01-04 01:39:38 [kernel_warmup.py:64] Warming up FlashInfer attention.
(Worker_TP1 pid=53) 2026-01-04 01:39:38,009 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:05<00:00, 9.26it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 35/35 [00:03<00:00, 10.97it/s]
(Worker_TP0 pid=52) INFO 01-04 01:39:47 [gpu_model_runner.py:4810] Graph capturing finished in 9 secs, took 1.54 GiB
(EngineCore_DP0 pid=30) INFO 01-04 01:39:48 [core.py:273] init engine (profile, create kv cache, warmup model) took 54.90 seconds
(EngineCore_DP0 pid=30) INFO 01-04 01:39:51 [core.py:185] Batch queue is enabled with size 2
(EngineCore_DP0 pid=30) INFO 01-04 01:39:52 [vllm.py:640] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 01-04 01:39:52 [api_server.py:1020] Supported tasks: ['generate']
(APIServer pid=1) WARNING 01-04 01:39:52 [model.py:1331] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1) INFO 01-04 01:39:52 [serving_responses.py:201] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 01-04 01:39:52 [serving_chat.py:142] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 01-04 01:39:52 [serving_chat.py:178] Warming up chat template processing...
(APIServer pid=1) INFO 01-04 01:39:53 [chat_utils.py:599] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 01-04 01:39:53 [serving_chat.py:214] Chat template warmup completed in 773.7ms
(APIServer pid=1) INFO 01-04 01:39:53 [serving_completion.py:78] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 01-04 01:39:53 [serving_chat.py:142] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1) INFO 01-04 01:39:53 [api_server.py:1351] Starting vLLM API server 0 on http://0.0.0.0:9000
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /pause, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /resume, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /is_paused, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /classify, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /score, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 01-04 01:39:53 [launcher.py:46] Route: /pooling, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(APIServer pid=1) INFO: 172.17.0.1:45616 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO: 172.17.0.1:45620 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 01:50:13 [loggers.py:257] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=1) INFO 01-04 01:50:23 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:41560 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 02:50:23 [loggers.py:257] Engine 000: Avg prompt throughput: 1.1 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
(APIServer pid=1) INFO: 172.17.0.1:41562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=1) INFO 01-04 02:50:33 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.4 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%