
I was able to run vLLM worker version 2.11.0, but later versions fail to run the GPT-OSS 120B model #276

@sarwarbeing-ai

Description

I was able to run version 2.11.0 with the GPT-OSS 120B model, but the later versions fail to start.

Moreover, I can no longer find version 2.11.0.

Why are only a few versions available to select on the console?

I'm getting the following error. It looks like a memory issue, which was not the case with version 2.11.0.
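
For context, the ValueError below says that the free memory on cuda:0 at startup (78.59 of 79.19 GiB) is less than what gpu_memory_utilization=1.0 requests (the full 79.19 GiB). A minimal sketch of the obvious workaround, assuming the engine args can be overridden and using the same plain vLLM AsyncLLMEngine.from_engine_args call that appears in the traceback; the 0.95 value is only illustrative, not the worker's actual configuration mechanism:

# Sketch only, not the worker's code: ask vLLM for 95% of VRAM instead of 100%,
# so the startup check (free memory >= gpu_memory_utilization * total) can pass.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="openai/gpt-oss-120b",
    max_model_len=8192,
    gpu_memory_utilization=0.95,  # the failing runs below use 1.0 (all 79.19 GiB)
)
engine = AsyncLLMEngine.from_engine_args(engine_args)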

2026-03-11T17:57:09.855212259Z (EngineCore_DP0 pid=469) WARNING 03-11 17:57:09 [multiproc_executor.py:921] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2026-03-11T17:57:10.584281362Z (EngineCore_DP0 pid=469) INFO 03-11 17:57:10 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:54953 backend=nccl
2026-03-11T17:57:10.630311523Z (EngineCore_DP0 pid=469) INFO 03-11 17:57:10 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
2026-03-11T17:57:10.976518836Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] WorkerProc failed to start.
2026-03-11T17:57:10.976556087Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] Traceback (most recent call last):
2026-03-11T17:57:10.976560871Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main
2026-03-11T17:57:10.976584022Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] worker = WorkerProc(*args, **kwargs)
2026-03-11T17:57:10.976588046Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in __init__
2026-03-11T17:57:10.976592161Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] self.worker.init_device()
2026-03-11T17:57:10.976595821Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
2026-03-11T17:57:10.976599341Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] self.worker.init_device() # type: ignore
2026-03-11T17:57:10.976602722Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
2026-03-11T17:57:10.976606321Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] self.requested_memory = request_memory(init_snapshot, self.cache_config)
2026-03-11T17:57:10.976609921Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 102, in request_memory
2026-03-11T17:57:10.976613467Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] raise ValueError(
2026-03-11T17:57:10.976617689Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:10 [multiproc_executor.py:783] ValueError: Free memory on device cuda:0 (78.59/79.19 GiB) on startup is less than desired GPU memory utilization (1.0, 79.19 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
2026-03-11T17:57:11.184945645Z (EngineCore_DP0 pid=469) Process EngineCore_DP0:
2026-03-11T17:57:11.184967974Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] EngineCore failed to start.
2026-03-11T17:57:11.185015061Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] Traceback (most recent call last):
2026-03-11T17:57:11.185025103Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:57:11.185030932Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:57:11.185036396Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:57:11.185070111Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] super().__init__(
2026-03-11T17:57:11.185077962Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:57:11.185083496Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] self.model_executor = executor_class(vllm_config)
2026-03-11T17:57:11.185090807Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:57:11.185094595Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] super().__init__(vllm_config)
2026-03-11T17:57:11.185098474Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:57:11.185117570Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] self._init_executor()
2026-03-11T17:57:11.185136634Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:57:11.185175049Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:57:11.185220764Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:57:11.185227391Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] raise e from None
2026-03-11T17:57:11.185225511Z (EngineCore_DP0 pid=469) Traceback (most recent call last):
2026-03-11T17:57:11.185253631Z (EngineCore_DP0 pid=469) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2026-03-11T17:57:11.185262937Z (EngineCore_DP0 pid=469) self.run()
2026-03-11T17:57:11.185267391Z (EngineCore_DP0 pid=469) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2026-03-11T17:57:11.185274436Z (EngineCore_DP0 pid=469) self._target(*self._args, **self._kwargs)
2026-03-11T17:57:11.185282362Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
2026-03-11T17:57:11.185289118Z (EngineCore_DP0 pid=469) raise e
2026-03-11T17:57:11.185296545Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:57:11.185300354Z (EngineCore_DP0 pid=469) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:57:11.185303956Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:57:11.185307909Z (EngineCore_DP0 pid=469) super().__init__(
2026-03-11T17:57:11.185311523Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:57:11.185315453Z (EngineCore_DP0 pid=469) self.model_executor = executor_class(vllm_config)
2026-03-11T17:57:11.185319241Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:57:11.185323333Z (EngineCore_DP0 pid=469) super().__init__(vllm_config)
2026-03-11T17:57:11.185329353Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:57:11.185336848Z (EngineCore_DP0 pid=469) self._init_executor()
2026-03-11T17:57:11.185340544Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:57:11.185346520Z (EngineCore_DP0 pid=469) self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:57:11.185353480Z (EngineCore_DP0 pid=469) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:57:11.185359832Z (EngineCore_DP0 pid=469) raise e from None
2026-03-11T17:57:11.185231125Z (EngineCore_DP0 pid=469) ERROR 03-11 17:57:11 [core.py:1006] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:57:11.185368922Z (EngineCore_DP0 pid=469) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:57:11.215290308Z /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
2026-03-11T17:57:11.215326127Z warnings.warn('resource_tracker: There appear to be %d '
2026-03-11T17:57:11.216241910Z engine.py :171 2026-03-11 17:57:11,213 Error initializing vLLM engine: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
2026-03-11T17:57:11.218532827Z {"requestId": null, "message": "Worker startup failed: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\nTraceback (most recent call last):\n File "/src/handler.py", line 42, in \n vllm_engine = vLLMEngine()\n File "/src/engine.py", line 31, in init\n self.llm = self._initialize_llm() if engine is None else engine.llm\n File "/src/engine.py", line 172, in _initialize_llm\n raise e\n File "/src/engine.py", line 166, in _initialize_llm\n engine = AsyncLLMEngine.from_engine_args(self.engine_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 251, in from_engine_args\n return cls(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 148, in init\n self.engine_core = EngineCoreClient.make_async_mp_client(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 124, in make_async_mp_client\n return AsyncMPClient(*client_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 835, in init\n super().init(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 490, in init\n with launch_core_engines(vllm_config, executor_class, log_stats) as (\n File "/usr/lib/python3.10/contextlib.py", line 142, in exit\n next(self.gen)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines\n wait_for_engine_startup(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup\n raise RuntimeError(\nRuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\n", "level": "ERROR"}
2026-03-11T17:57:27.803239378Z engine_args.py :384 2026-03-11 17:57:27,802 Setting max_num_batched_tokens to 8192
2026-03-11T17:57:27.811674918Z engine.py :28 2026-03-11 17:57:27,811 Engine args: AsyncEngineArgs(model='openai/gpt-oss-120b', enable_return_routed_experts=False, model_weights='', served_model_name=None, tokenizer=None, hf_config_path=None, runner='auto', convert='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', allowed_media_domains=None, download_dir=None, safetensors_load_strategy='lazy', load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', seed=2026, max_model_len=8192, cudagraph_capture_sizes=None, max_cudagraph_capture_size=None, distributed_executor_backend='mp', pipeline_parallel_size=1, master_addr='127.0.0.1', master_port=29501, nnodes=1, node_rank=0, tensor_parallel_size=1, prefill_context_parallel_size=1, decode_context_parallel_size=1, dcp_kv_cache_interleave_size=1, cp_kv_cache_interleave_size=1, data_parallel_size=1, data_parallel_rank=None, data_parallel_start_rank=None, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_hybrid_lb=False, data_parallel_external_lb=False, data_parallel_backend='mp', enable_expert_parallel=False, all2all_backend='allgather_reducescatter', enable_dbo=False, ubatch_size=0, dbo_decode_token_threshold=32, dbo_prefill_token_threshold=512, disable_nccl_for_dp_synchronization=None, eplb_config=EPLBConfig(window_size=1000, step_interval=3000, num_redundant_experts=0, log_balancedness=False, log_balancedness_interval=1, use_async=False, policy='default'), enable_eplb=False, expert_placement_strategy='linear', _api_process_count=1, _api_process_rank=0, max_parallel_loading_workers=0, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='sha256', disable_sliding_window=False, disable_cascade_attn=False, swap_space=4.0, cpu_offload_gb=0, gpu_memory_utilization=1.0, kv_cache_memory_bytes=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, logprobs_mode='raw_logprobs', disable_log_stats=False, aggregate_engine_logging=False, revision='b5c939de8f754692c1647ca79fbf85e8c1e70f8a', code_revision=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, allow_deprecated_quantization=False, enforce_eager=False, disable_custom_all_reduce=False, limit_mm_per_prompt={}, enable_mm_embeds=False, interleave_mm_strings=False, media_io_kwargs={}, mm_processor_kwargs=None, mm_processor_cache_gb=4, mm_processor_cache_type='lru', mm_shm_cache_max_object_size_mb=128, mm_encoder_only=False, mm_encoder_tp_mode='weights', mm_encoder_attn_backend=None, io_processor_plugin=None, skip_mm_profiling=False, video_pruning_rate=None, enable_lora=False, max_loras=1, max_lora_rank=16, default_mm_loras=None, fully_sharded_loras=False, max_cpu_loras=0, lora_dtype='auto', enable_tower_connector_lora=False, specialize_active_lora=False, ray_workers_use_nsight=False, num_gpu_blocks_override=None, model_loader_extra_config={}, ignore_patterns=['original/
**/*
'], enable_chunked_prefill=False, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=None, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), reasoning_parser='', reasoning_parser_plugin=None, logits_processor_pattern=None, speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_logging_iteration_details=False, enable_mm_processor_stats=False, scheduling_policy='fcfs', scheduler_cls=None, pooler_config=None, compilation_config={'level': None, 'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}, attention_config=AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=True, use_trtllm_attention=None, disable_flashinfer_prefill=False, disable_flashinfer_q_quantization=False), kernel_config=KernelConfig(enable_flashinfer_autotune=None), enable_flashinfer_autotune=None, worker_cls='auto', worker_extension_cls='', profiler_config=ProfilerConfig(profiler=None, torch_profiler_dir='', torch_profiler_with_stack=True, torch_profiler_with_flops=False, torch_profiler_use_gzip=True, torch_profiler_dump_cuda_time_total=True, torch_profiler_record_shapes=False, torch_profiler_with_memory=False, ignore_frontend=False, delay_iterations=0, max_iterations=0), kv_transfer_config=None, kv_events_config=None, ec_transfer_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', override_attention_dtype=None, attention_backend=None, calculate_kv_scales=False, mamba_cache_dtype='auto', mamba_ssm_cache_dtype='auto', mamba_block_size=None, mamba_cache_mode='none', additional_config={}, use_tqdm_on_load=True, pt_load_map_location='cpu', logits_processors=None, async_scheduling=None, stream_interval=1, kv_sharing_fast_prefill=False, optimization_level=<OptimizationLevel.O2: 2>, kv_offloading_size=None, kv_offloading_backend='native', tokens_only=False, weight_transfer_config=None, enable_log_requests=False)
2026-03-11T17:57:28.012938639Z Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-03-11T17:57:28.012994423Z _http.py :857 2026-03-11 17:57:28,012 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-03-11T17:57:28.072459198Z INFO 03-11 17:57:28 [model.py:529] Resolved architecture: GptOssForCausalLM
2026-03-11T17:57:28.310563478Z
Parse safetensors files: 0%| | 0/15 [00:00<?, ?it/s]
Parse safetensors files: 13%|█▎ | 2/15 [00:00<00:00, 16.94it/s]
Parse safetensors files: 100%|██████████| 15/15 [00:00<00:00, 97.52it/s]
2026-03-11T17:57:28.314727806Z INFO 03-11 17:57:28 [model.py:1549] Using max model len 8192
2026-03-11T17:57:28.442157053Z WARNING 03-11 17:57:28 [arg_utils.py:1949] This model does not officially support disabling chunked prefill. Disabling this manually may cause the engine to crash or produce incorrect outputs.
2026-03-11T17:57:28.656240852Z WARNING 03-11 17:57:28 [parallel.py:648] max_parallel_loading_workers is currently not supported and will be ignored.
2026-03-11T17:57:28.657053008Z INFO 03-11 17:57:28 [config.py:314] Overriding max cuda graph capture size to 1024 for performance.
2026-03-11T17:57:28.657500356Z INFO 03-11 17:57:28 [vllm.py:689] Asynchronous scheduling is enabled.
2026-03-11T17:57:31.010803207Z (EngineCore_DP0 pid=269) INFO 03-11 17:57:31 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=b5c939de8f754692c1647ca79fbf85e8c1e70f8a, tokenizer_revision=b5c939de8f754692c1647ca79fbf85e8c1e70f8a, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=2026, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
2026-03-11T17:57:31.011162295Z (EngineCore_DP0 pid=269) WARNING 03-11 17:57:31 [multiproc_executor.py:921] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2026-03-11T17:57:31.737919732Z (EngineCore_DP0 pid=269) INFO 03-11 17:57:31 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36567 backend=nccl
2026-03-11T17:57:31.795371926Z (EngineCore_DP0 pid=269) INFO 03-11 17:57:31 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
2026-03-11T17:57:32.164303068Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] WorkerProc failed to start.
2026-03-11T17:57:32.164350742Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] Traceback (most recent call last):
2026-03-11T17:57:32.164359129Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main
2026-03-11T17:57:32.164367142Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] worker = WorkerProc(*args, **kwargs)
2026-03-11T17:57:32.164373128Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in __init__
2026-03-11T17:57:32.164378978Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] self.worker.init_device()
2026-03-11T17:57:32.164385914Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
2026-03-11T17:57:32.164392090Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] self.worker.init_device() # type: ignore
2026-03-11T17:57:32.164402257Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
2026-03-11T17:57:32.164410813Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] self.requested_memory = request_memory(init_snapshot, self.cache_config)
2026-03-11T17:57:32.164416421Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 102, in request_memory
2026-03-11T17:57:32.164422725Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] raise ValueError(
2026-03-11T17:57:32.164430387Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [multiproc_executor.py:783] ValueError: Free memory on device cuda:0 (78.59/79.19 GiB) on startup is less than desired GPU memory utilization (1.0, 79.19 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
2026-03-11T17:57:32.371934776Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] EngineCore failed to start.
2026-03-11T17:57:32.371984245Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] Traceback (most recent call last):
2026-03-11T17:57:32.371990551Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:57:32.372041053Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:57:32.372048164Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:57:32.372052362Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] super().__init__(
2026-03-11T17:57:32.372086121Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:57:32.372091697Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] self.model_executor = executor_class(vllm_config)
2026-03-11T17:57:32.372097768Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:57:32.372103135Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] super().__init__(vllm_config)
2026-03-11T17:57:32.372108964Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:57:32.372114576Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] self._init_executor()
2026-03-11T17:57:32.372128496Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:57:32.372137189Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:57:32.372145579Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:57:32.372155926Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] raise e from None
2026-03-11T17:57:32.372168462Z (EngineCore_DP0 pid=269) ERROR 03-11 17:57:32 [core.py:1006] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:57:32.371945016Z (EngineCore_DP0 pid=269) Process EngineCore_DP0:
2026-03-11T17:57:32.372352445Z (EngineCore_DP0 pid=269) Traceback (most recent call last):
2026-03-11T17:57:32.372373133Z (EngineCore_DP0 pid=269) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2026-03-11T17:57:32.372378084Z (EngineCore_DP0 pid=269) self.run()
2026-03-11T17:57:32.372381713Z (EngineCore_DP0 pid=269) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2026-03-11T17:57:32.372385302Z (EngineCore_DP0 pid=269) self._target(*self._args, **self._kwargs)
2026-03-11T17:57:32.372389142Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
2026-03-11T17:57:32.372392733Z (EngineCore_DP0 pid=269) raise e
2026-03-11T17:57:32.372397804Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:57:32.372401574Z (EngineCore_DP0 pid=269) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:57:32.372405148Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:57:32.372408792Z (EngineCore_DP0 pid=269) super().__init__(
2026-03-11T17:57:32.372412463Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:57:32.372416356Z (EngineCore_DP0 pid=269) self.model_executor = executor_class(vllm_config)
2026-03-11T17:57:32.372419883Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:57:32.372423679Z (EngineCore_DP0 pid=269) super().__init__(vllm_config)
2026-03-11T17:57:32.372449307Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:57:32.372464838Z (EngineCore_DP0 pid=269) self._init_executor()
2026-03-11T17:57:32.372469087Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:57:32.372472786Z (EngineCore_DP0 pid=269) self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:57:32.372476282Z (EngineCore_DP0 pid=269) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:57:32.372479746Z (EngineCore_DP0 pid=269) raise e from None
2026-03-11T17:57:32.372484861Z (EngineCore_DP0 pid=269) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:57:32.396058510Z engine.py :171 2026-03-11 17:57:32,394 Error initializing vLLM engine: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
2026-03-11T17:57:32.396779689Z /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
2026-03-11T17:57:32.396804401Z warnings.warn('resource_tracker: There appear to be %d '
2026-03-11T17:57:32.398077118Z {"requestId": null, "message": "Worker startup failed: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\nTraceback (most recent call last):\n File "/src/handler.py", line 42, in \n vllm_engine = vLLMEngine()\n File "/src/engine.py", line 31, in init\n self.llm = self._initialize_llm() if engine is None else engine.llm\n File "/src/engine.py", line 172, in _initialize_llm\n raise e\n File "/src/engine.py", line 166, in _initialize_llm\n engine = AsyncLLMEngine.from_engine_args(self.engine_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 251, in from_engine_args\n return cls(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 148, in init\n self.engine_core = EngineCoreClient.make_async_mp_client(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 124, in make_async_mp_client\n return AsyncMPClient(*client_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 835, in init\n super().init(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 490, in init\n with launch_core_engines(vllm_config, executor_class, log_stats) as (\n File "/usr/lib/python3.10/contextlib.py", line 142, in exit\n next(self.gen)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines\n wait_for_engine_startup(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup\n raise RuntimeError(\nRuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\n", "level": "ERROR"}
2026-03-11T17:57:55.750204070Z engine_args.py :384 2026-03-11 17:57:55,749 Setting max_num_batched_tokens to 8192
2026-03-11T17:57:55.758908546Z engine.py :28 2026-03-11 17:57:55,758 Engine args: AsyncEngineArgs(model='openai/gpt-oss-120b', enable_return_routed_experts=False, model_weights='', served_model_name=None, tokenizer=None, hf_config_path=None, runner='auto', convert='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', allowed_media_domains=None, download_dir=None, safetensors_load_strategy='lazy', load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', seed=2026, max_model_len=8192, cudagraph_capture_sizes=None, max_cudagraph_capture_size=None, distributed_executor_backend='mp', pipeline_parallel_size=1, master_addr='127.0.0.1', master_port=29501, nnodes=1, node_rank=0, tensor_parallel_size=1, prefill_context_parallel_size=1, decode_context_parallel_size=1, dcp_kv_cache_interleave_size=1, cp_kv_cache_interleave_size=1, data_parallel_size=1, data_parallel_rank=None, data_parallel_start_rank=None, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_hybrid_lb=False, data_parallel_external_lb=False, data_parallel_backend='mp', enable_expert_parallel=False, all2all_backend='allgather_reducescatter', enable_dbo=False, ubatch_size=0, dbo_decode_token_threshold=32, dbo_prefill_token_threshold=512, disable_nccl_for_dp_synchronization=None, eplb_config=EPLBConfig(window_size=1000, step_interval=3000, num_redundant_experts=0, log_balancedness=False, log_balancedness_interval=1, use_async=False, policy='default'), enable_eplb=False, expert_placement_strategy='linear', _api_process_count=1, _api_process_rank=0, max_parallel_loading_workers=0, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='sha256', disable_sliding_window=False, disable_cascade_attn=False, swap_space=4.0, cpu_offload_gb=0, gpu_memory_utilization=1.0, kv_cache_memory_bytes=None, max_num_batched_tokens=8192, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, logprobs_mode='raw_logprobs', disable_log_stats=False, aggregate_engine_logging=False, revision='b5c939de8f754692c1647ca79fbf85e8c1e70f8a', code_revision=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, allow_deprecated_quantization=False, enforce_eager=False, disable_custom_all_reduce=False, limit_mm_per_prompt={}, enable_mm_embeds=False, interleave_mm_strings=False, media_io_kwargs={}, mm_processor_kwargs=None, mm_processor_cache_gb=4, mm_processor_cache_type='lru', mm_shm_cache_max_object_size_mb=128, mm_encoder_only=False, mm_encoder_tp_mode='weights', mm_encoder_attn_backend=None, io_processor_plugin=None, skip_mm_profiling=False, video_pruning_rate=None, enable_lora=False, max_loras=1, max_lora_rank=16, default_mm_loras=None, fully_sharded_loras=False, max_cpu_loras=0, lora_dtype='auto', enable_tower_connector_lora=False, specialize_active_lora=False, ray_workers_use_nsight=False, num_gpu_blocks_override=None, model_loader_extra_config={}, ignore_patterns=['original/
**/*
'], enable_chunked_prefill=False, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=None, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), reasoning_parser='', reasoning_parser_plugin=None, logits_processor_pattern=None, speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_logging_iteration_details=False, enable_mm_processor_stats=False, scheduling_policy='fcfs', scheduler_cls=None, pooler_config=None, compilation_config={'level': None, 'mode': None, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': [], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': None, 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': None, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': None, 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': None, 'pass_config': {}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': None, 'static_all_moe_layers': []}, attention_config=AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=True, use_trtllm_attention=None, disable_flashinfer_prefill=False, disable_flashinfer_q_quantization=False), kernel_config=KernelConfig(enable_flashinfer_autotune=None), enable_flashinfer_autotune=None, worker_cls='auto', worker_extension_cls='', profiler_config=ProfilerConfig(profiler=None, torch_profiler_dir='', torch_profiler_with_stack=True, torch_profiler_with_flops=False, torch_profiler_use_gzip=True, torch_profiler_dump_cuda_time_total=True, torch_profiler_record_shapes=False, torch_profiler_with_memory=False, ignore_frontend=False, delay_iterations=0, max_iterations=0), kv_transfer_config=None, kv_events_config=None, ec_transfer_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', override_attention_dtype=None, attention_backend=None, calculate_kv_scales=False, mamba_cache_dtype='auto', mamba_ssm_cache_dtype='auto', mamba_block_size=None, mamba_cache_mode='none', additional_config={}, use_tqdm_on_load=True, pt_load_map_location='cpu', logits_processors=None, async_scheduling=None, stream_interval=1, kv_sharing_fast_prefill=False, optimization_level=<OptimizationLevel.O2: 2>, kv_offloading_size=None, kv_offloading_backend='native', tokens_only=False, weight_transfer_config=None, enable_log_requests=False)
2026-03-11T17:57:55.953585909Z Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-03-11T17:57:55.953623648Z _http.py :857 2026-03-11 17:57:55,953 Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
2026-03-11T17:57:56.018117630Z INFO 03-11 17:57:56 [model.py:529] Resolved architecture: GptOssForCausalLM
2026-03-11T17:57:56.289472987Z
Parse safetensors files: 0%| | 0/15 [00:00<?, ?it/s]
Parse safetensors files: 7%|▋ | 1/15 [00:00<00:01, 8.73it/s]
Parse safetensors files: 100%|██████████| 15/15 [00:00<00:00, 79.44it/s]
2026-03-11T17:57:56.293821140Z INFO 03-11 17:57:56 [model.py:1549] Using max model len 8192
2026-03-11T17:57:56.422384313Z WARNING 03-11 17:57:56 [arg_utils.py:1949] This model does not officially support disabling chunked prefill. Disabling this manually may cause the engine to crash or produce incorrect outputs.
2026-03-11T17:57:56.552676514Z WARNING 03-11 17:57:56 [parallel.py:648] max_parallel_loading_workers is currently not supported and will be ignored.
2026-03-11T17:57:56.553296471Z INFO 03-11 17:57:56 [config.py:314] Overriding max cuda graph capture size to 1024 for performance.
2026-03-11T17:57:56.553733410Z INFO 03-11 17:57:56 [vllm.py:689] Asynchronous scheduling is enabled.
2026-03-11T17:57:58.953702409Z (EngineCore_DP0 pid=274) INFO 03-11 17:57:58 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='openai/gpt-oss-120b', speculative_config=None, tokenizer='openai/gpt-oss-120b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=b5c939de8f754692c1647ca79fbf85e8c1e70f8a, tokenizer_revision=b5c939de8f754692c1647ca79fbf85e8c1e70f8a, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=2026, served_model_name=openai/gpt-oss-120b, enable_prefix_caching=False, enable_chunked_prefill=False, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
2026-03-11T17:57:58.953878239Z (EngineCore_DP0 pid=274) WARNING 03-11 17:57:58 [multiproc_executor.py:921] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
2026-03-11T17:57:59.683477991Z (EngineCore_DP0 pid=274) INFO 03-11 17:57:59 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:50047 backend=nccl
2026-03-11T17:57:59.735978629Z (EngineCore_DP0 pid=274) INFO 03-11 17:57:59 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
2026-03-11T17:58:00.143821875Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] WorkerProc failed to start.
2026-03-11T17:58:00.143858881Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] Traceback (most recent call last):
2026-03-11T17:58:00.143863874Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 754, in worker_main
2026-03-11T17:58:00.143868621Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] worker = WorkerProc(*args, **kwargs)
2026-03-11T17:58:00.143872434Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in __init__
2026-03-11T17:58:00.143876166Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] self.worker.init_device()
2026-03-11T17:58:00.143884390Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/worker_base.py", line 322, in init_device
2026-03-11T17:58:00.143911100Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] self.worker.init_device() # type: ignore
2026-03-11T17:58:00.143915165Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/gpu_worker.py", line 252, in init_device
2026-03-11T17:58:00.143918729Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] self.requested_memory = request_memory(init_snapshot, self.cache_config)
2026-03-11T17:58:00.143922447Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/worker/utils.py", line 102, in request_memory
2026-03-11T17:58:00.143926167Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] raise ValueError(
2026-03-11T17:58:00.143930053Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [multiproc_executor.py:783] ValueError: Free memory on device cuda:0 (78.59/79.19 GiB) on startup is less than desired GPU memory utilization (1.0, 79.19 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
2026-03-11T17:58:00.349488651Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] EngineCore failed to start.
2026-03-11T17:58:00.349527052Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] Traceback (most recent call last):
2026-03-11T17:58:00.349531919Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:58:00.349536395Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:58:00.349540390Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:58:00.349545063Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] super().__init__(
2026-03-11T17:58:00.349548899Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:58:00.349552637Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] self.model_executor = executor_class(vllm_config)
2026-03-11T17:58:00.349556699Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:58:00.349560275Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] super().__init__(vllm_config)
2026-03-11T17:58:00.349568302Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:58:00.349572143Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] self._init_executor()
2026-03-11T17:58:00.349575522Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:58:00.349579269Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:58:00.349582814Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:58:00.349587956Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] raise e from None
2026-03-11T17:58:00.349591586Z (EngineCore_DP0 pid=274) ERROR 03-11 17:58:00 [core.py:1006] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:58:00.349610857Z (EngineCore_DP0 pid=274) Process EngineCore_DP0:
2026-03-11T17:58:00.349812337Z (EngineCore_DP0 pid=274) Traceback (most recent call last):
2026-03-11T17:58:00.349857455Z (EngineCore_DP0 pid=274) File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
2026-03-11T17:58:00.349862651Z (EngineCore_DP0 pid=274) self.run()
2026-03-11T17:58:00.349866641Z (EngineCore_DP0 pid=274) File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
2026-03-11T17:58:00.349870415Z (EngineCore_DP0 pid=274) self._target(*self._args, **self._kwargs)
2026-03-11T17:58:00.349874819Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 1010, in run_engine_core
2026-03-11T17:58:00.349878690Z (EngineCore_DP0 pid=274) raise e
2026-03-11T17:58:00.349882379Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 996, in run_engine_core
2026-03-11T17:58:00.349885900Z (EngineCore_DP0 pid=274) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
2026-03-11T17:58:00.349889398Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 740, in __init__
2026-03-11T17:58:00.349893002Z (EngineCore_DP0 pid=274) super().__init__(
2026-03-11T17:58:00.349896450Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 106, in __init__
2026-03-11T17:58:00.349900154Z (EngineCore_DP0 pid=274) self.model_executor = executor_class(vllm_config)
2026-03-11T17:58:00.349904082Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 98, in __init__
2026-03-11T17:58:00.349907989Z (EngineCore_DP0 pid=274) super().__init__(vllm_config)
2026-03-11T17:58:00.349911433Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
2026-03-11T17:58:00.349915020Z (EngineCore_DP0 pid=274) self._init_executor()
2026-03-11T17:58:00.349918493Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 166, in _init_executor
2026-03-11T17:58:00.349922141Z (EngineCore_DP0 pid=274) self.workers = WorkerProc.wait_for_ready(unready_workers)
2026-03-11T17:58:00.349925533Z (EngineCore_DP0 pid=274) File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/multiproc_executor.py", line 680, in wait_for_ready
2026-03-11T17:58:00.349929177Z (EngineCore_DP0 pid=274) raise e from None
2026-03-11T17:58:00.349934194Z (EngineCore_DP0 pid=274) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
2026-03-11T17:58:00.372057502Z /usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
2026-03-11T17:58:00.372095607Z warnings.warn('resource_tracker: There appear to be %d '
2026-03-11T17:58:00.372632706Z engine.py :171 2026-03-11 17:58:00,370 Error initializing vLLM engine: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}
2026-03-11T17:58:00.375071150Z {"requestId": null, "message": "Worker startup failed: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\nTraceback (most recent call last):\n File "/src/handler.py", line 42, in \n vllm_engine = vLLMEngine()\n File "/src/engine.py", line 31, in init\n self.llm = self._initialize_llm() if engine is None else engine.llm\n File "/src/engine.py", line 172, in _initialize_llm\n raise e\n File "/src/engine.py", line 166, in _initialize_llm\n engine = AsyncLLMEngine.from_engine_args(self.engine_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 251, in from_engine_args\n return cls(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 148, in init\n self.engine_core = EngineCoreClient.make_async_mp_client(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 124, in make_async_mp_client\n return AsyncMPClient(*client_args)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 835, in init\n super().init(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 490, in init\n with launch_core_engines(vllm_config, executor_class, log_stats) as (\n File "/usr/lib/python3.10/contextlib.py", line 142, in exit\n next(self.gen)\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 925, in launch_core_engines\n wait_for_engine_startup(\n File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 984, in wait_for_engine_startup\n raise RuntimeError(\nRuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore_DP0': 1}\n", "level": "ERROR"}
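
For reference, the failing check is simple arithmetic: requested memory = gpu_memory_utilization × total VRAM, and startup aborts when free VRAM is below that. A quick sketch (not the worker's code) that mirrors the check using torch.cuda.mem_get_info, so the numbers in the error can be confirmed on the pod before launching:

# Rough pre-flight check mirroring the startup condition in the error above.
import torch

free, total = torch.cuda.mem_get_info(0)  # bytes, for cuda:0
gpu_memory_utilization = 1.0              # value from the failing engine args
requested = gpu_memory_utilization * total

gib = 1024 ** 3
print(f"free {free / gib:.2f} GiB / total {total / gib:.2f} GiB, "
      f"requested {requested / gib:.2f} GiB")
if free < requested:
    print("Startup would fail: lower gpu_memory_utilization or free up GPU memory.")

With gpu_memory_utilization=1.0 this can only pass on a completely empty GPU, so even the roughly 0.6 GiB already in use here (79.19 − 78.59 GiB) is enough to trip it.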
