
Can't run Qwen3-30B-A3B on H100, same config works fine on A100/L40s #213

@szulcmaciej

Description


Hi!
I've been benchmarking different GPUs for throughput with Qwen3-30B-A3B FP8.
First I tried A100 and L40s, and everything was fine there; I got my numbers.

Then I tried H100, and it fails on vLLM startup (`Error initializing vLLM engine`). I've tried restarting it, creating a new worker, etc., but it always ends up with the same error, and I'm not sure what the issue is or how to fix it.
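
For reference, these are the non-default engine settings, reconstructed from the `AsyncEngineArgs` dump in the logs below; how the worker actually builds them (env vars vs. code) is not shown anywhere, so treat this as a sketch for reproduction rather than the worker's real code. The same settings were used on all three GPUs:

```python
# Sketch: the non-default engine args visible in the log dump below,
# expressed directly as vLLM AsyncEngineArgs. This is a reconstruction,
# not the worker's actual configuration code.
from vllm import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507-FP8",
    kv_cache_dtype="fp8",          # FP8 KV cache, per the config.py log line
    max_model_len=8192,
    gpu_memory_utilization=0.95,
    max_num_seqs=256,
)
```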

Here are the logs:

```text
2025-08-21T15:03:48.762848707Z INFO 08-21 15:03:48 [__init__.py:235] Automatically detected platform cuda.
2025-08-21T15:03:50.200115872Z engine.py           :27   2025-08-21 15:03:50,199 Engine args: AsyncEngineArgs(model='Qwen/Qwen3-30B-A3B-Instruct-2507-FP8', served_model_name=None, tokenizer=None, hf_config_path=None, task='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='fp8', seed=0, max_model_len=8192, cuda_graph_sizes=[], distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, data_parallel_rank=None, data_parallel_start_rank=None, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_hybrid_lb=False, data_parallel_backend='mp', enable_expert_parallel=False, enable_eplb=False, num_redundant_experts=0, eplb_window_size=1000, eplb_step_interval=3000, eplb_log_balancedness=False, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', disable_sliding_window=False, disable_cascade_attn=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, logprobs_mode='raw_logprobs', disable_log_stats=False, revision=None, code_revision=None, rope_scaling={}, rope_theta=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, limit_mm_per_prompt={}, interleave_mm_strings=False, media_io_kwargs={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, default_mm_loras=None, fully_sharded_loras=False, max_cpu_loras=None, lora_dtype='auto', lora_extra_vocab_size=256, num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config={}, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=False, guided_decoding_backend='outlines', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, logits_processor_pattern=None, speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config={}, override_pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":null,"local_cache_dir":null}, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, kv_events_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', override_attention_dtype=None, calculate_kv_scales=False, additional_config={}, reasoning_parser='', use_tqdm_on_load=True, 
pt_load_map_location='cpu', enable_multimodal_encoder_data_parallel=False, async_scheduling=False, enable_prompt_adapter=False, disable_log_requests=False)
2025-08-21T15:03:55.869296691Z INFO 08-21 15:03:55 [config.py:1604] Using max model len 8192
2025-08-21T15:03:56.284128956Z INFO 08-21 15:03:56 [config.py:1733] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
2025-08-21T15:03:56.429006608Z INFO 08-21 15:03:56 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
2025-08-21T15:04:04.812408188Z INFO 08-21 15:04:04 [__init__.py:235] Automatically detected platform cuda.
2025-08-21T15:04:06.050142092Z engine.py           :27   2025-08-21 15:04:06,049 Engine args: AsyncEngineArgs(model='Qwen/Qwen3-30B-A3B-Instruct-2507-FP8', served_model_name=None, tokenizer=None, hf_config_path=None, task='auto', skip_tokenizer_init=False, enable_prompt_embeds=False, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path='', download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='fp8', seed=0, max_model_len=8192, cuda_graph_sizes=[], distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, data_parallel_rank=None, data_parallel_start_rank=None, data_parallel_size_local=None, data_parallel_address=None, data_parallel_rpc_port=None, data_parallel_hybrid_lb=False, data_parallel_backend='mp', enable_expert_parallel=False, enable_eplb=False, num_redundant_experts=0, eplb_window_size=1000, eplb_step_interval=3000, eplb_log_balancedness=False, max_parallel_loading_workers=None, block_size=16, enable_prefix_caching=False, prefix_caching_hash_algo='builtin', disable_sliding_window=False, disable_cascade_attn=False, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.95, max_num_batched_tokens=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, max_num_seqs=256, max_logprobs=20, logprobs_mode='raw_logprobs', disable_log_stats=False, revision=None, code_revision=None, rope_scaling={}, rope_theta=None, hf_token=None, hf_overrides={}, tokenizer_revision=None, quantization=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, limit_mm_per_prompt={}, interleave_mm_strings=False, media_io_kwargs={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, default_mm_loras=None, fully_sharded_loras=False, max_cpu_loras=None, lora_dtype='auto', lora_extra_vocab_size=256, num_scheduler_steps=1, multi_step_stream_outputs=True, ray_workers_use_nsight=False, num_gpu_blocks_override=None, num_lookahead_slots=0, model_loader_extra_config={}, ignore_patterns=None, preemption_mode=None, scheduler_delay_factor=0.0, enable_chunked_prefill=None, disable_chunked_mm_input=False, disable_hybrid_kv_cache_manager=False, guided_decoding_backend='outlines', guided_decoding_disable_fallback=False, guided_decoding_disable_any_whitespace=False, guided_decoding_disable_additional_properties=False, logits_processor_pattern=None, speculative_config=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config={}, override_pooler_config=None, compilation_config={"level":0,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":[],"use_inductor":true,"compile_sizes":null,"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":null,"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":null,"local_cache_dir":null}, worker_cls='auto', worker_extension_cls='', kv_transfer_config=None, kv_events_config=None, generation_config='auto', enable_sleep_mode=False, override_generation_config={}, model_impl='auto', override_attention_dtype=None, calculate_kv_scales=False, additional_config={}, reasoning_parser='', use_tqdm_on_load=True, 
pt_load_map_location='cpu', enable_multimodal_encoder_data_parallel=False, async_scheduling=False, enable_prompt_adapter=False, disable_log_requests=False)
2025-08-21T15:04:11.457497375Z INFO 08-21 15:04:11 [config.py:1604] Using max model len 8192
2025-08-21T15:04:11.513592686Z INFO 08-21 15:04:11 [config.py:1733] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
2025-08-21T15:04:11.610994571Z INFO 08-21 15:04:11 [config.py:2434] Chunked prefill is enabled with max_num_batched_tokens=2048.
2025-08-21T15:04:12.535464541Z engine.py           :170  2025-08-21 15:04:12,534 Error initializing vLLM engine:
2025-08-21T15:04:12.535490158Z         An attempt has been made to start a new process before the
2025-08-21T15:04:12.535492124Z         current process has finished its bootstrapping phase.
2025-08-21T15:04:12.535494995Z         This probably means that you are not using fork to start your
2025-08-21T15:04:12.535496221Z         child processes and you have forgotten to use the proper idiom
2025-08-21T15:04:12.535497789Z         in the main module:
2025-08-21T15:04:12.535500407Z             if __name__ == '__main__':
2025-08-21T15:04:12.535502128Z                 freeze_support()
2025-08-21T15:04:12.535503123Z                 ...
2025-08-21T15:04:12.535505680Z         The "freeze_support()" line can be omitted if the program
2025-08-21T15:04:12.535506883Z         is not going to be frozen to produce an executable.
2025-08-21T15:04:12.538381491Z Traceback (most recent call last):
2025-08-21T15:04:12.538402698Z   File "<string>", line 1, in <module>
2025-08-21T15:04:12.538403987Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
2025-08-21T15:04:12.538405747Z     exitcode = _main(fd, parent_sentinel)
2025-08-21T15:04:12.538407086Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 125, in _main
2025-08-21T15:04:12.538408240Z     prepare(preparation_data)
2025-08-21T15:04:12.538409789Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 236, in prepare
2025-08-21T15:04:12.538411223Z     _fixup_main_from_path(data['init_main_from_path'])
2025-08-21T15:04:12.538412513Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
2025-08-21T15:04:12.538414114Z     main_content = runpy.run_path(main_path,
2025-08-21T15:04:12.538415032Z   File "/usr/lib/python3.10/runpy.py", line 289, in run_path
2025-08-21T15:04:12.538416090Z     return _run_module_code(code, init_globals, run_name,
2025-08-21T15:04:12.538417078Z   File "/usr/lib/python3.10/runpy.py", line 96, in _run_module_code
2025-08-21T15:04:12.538418403Z     _run_code(code, mod_globals, init_globals,
2025-08-21T15:04:12.538419499Z   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
2025-08-21T15:04:12.538420498Z     exec(code, run_globals)
2025-08-21T15:04:12.538421571Z   File "/src/handler.py", line 6, in <module>
2025-08-21T15:04:12.538422496Z     vllm_engine = vLLMEngine()
2025-08-21T15:04:12.538423268Z   File "/src/engine.py", line 30, in __init__
2025-08-21T15:04:12.538424303Z     self.llm = self._initialize_llm() if engine is None else engine.llm
2025-08-21T15:04:12.538425251Z   File "/src/engine.py", line 171, in _initialize_llm
2025-08-21T15:04:12.538426036Z     raise e
2025-08-21T15:04:12.538427510Z   File "/src/engine.py", line 165, in _initialize_llm
2025-08-21T15:04:12.538428407Z     engine = AsyncLLMEngine.from_engine_args(self.engine_args)
2025-08-21T15:04:12.538429306Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 190, in from_engine_args
2025-08-21T15:04:12.538430941Z     return cls(
2025-08-21T15:04:12.538431743Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 117, in __init__
2025-08-21T15:04:12.538432846Z     self.engine_core = EngineCoreClient.make_async_mp_client(
2025-08-21T15:04:12.538433623Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
2025-08-21T15:04:12.538434524Z     return AsyncMPClient(*client_args)
2025-08-21T15:04:12.538435311Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 677, in __init__
2025-08-21T15:04:12.538436237Z     super().__init__(
2025-08-21T15:04:12.538437417Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 408, in __init__
2025-08-21T15:04:12.538438380Z     with launch_core_engines(vllm_config, executor_class,
2025-08-21T15:04:12.538448595Z   File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
2025-08-21T15:04:12.538449617Z     return next(self.gen)
2025-08-21T15:04:12.538450466Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 680, in launch_core_engines
2025-08-21T15:04:12.538453547Z     local_engine_manager = CoreEngineProcManager(
2025-08-21T15:04:12.538454690Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 133, in __init__
2025-08-21T15:04:12.538455796Z     proc.start()
2025-08-21T15:04:12.538456678Z   File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
2025-08-21T15:04:12.538457621Z     self._popen = self._Popen(self)
2025-08-21T15:04:12.538458594Z   File "/usr/lib/python3.10/multiprocessing/context.py", line 288, in _Popen
2025-08-21T15:04:12.538460962Z     return Popen(process_obj)
2025-08-21T15:04:12.538461720Z   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
2025-08-21T15:04:12.538462628Z     super().__init__(process_obj)
2025-08-21T15:04:12.538463465Z   File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
2025-08-21T15:04:12.538464370Z     self._launch(process_obj)
2025-08-21T15:04:12.538465248Z   File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 42, in _launch
2025-08-21T15:04:12.538466454Z     prep_data = spawn.get_preparation_data(process_obj._name)
2025-08-21T15:04:12.538467392Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 154, in get_preparation_data
2025-08-21T15:04:12.538468298Z     _check_not_importing_main()
2025-08-21T15:04:12.538469048Z   File "/usr/lib/python3.10/multiprocessing/spawn.py", line 134, in _check_not_importing_main
2025-08-21T15:04:12.538470161Z     raise RuntimeError('''
2025-08-21T15:04:12.538471053Z RuntimeError:
2025-08-21T15:04:12.538472138Z         An attempt has been made to start a new process before the
2025-08-21T15:04:12.538473003Z         current process has finished its bootstrapping phase.
2025-08-21T15:04:12.538474696Z         This probably means that you are not using fork to start your
2025-08-21T15:04:12.538475518Z         child processes and you have forgotten to use the proper idiom
2025-08-21T15:04:12.538476419Z         in the main module:
2025-08-21T15:04:12.538477958Z             if __name__ == '__main__':
2025-08-21T15:04:12.538478808Z                 freeze_support()
2025-08-21T15:04:12.538479711Z                 ...
2025-08-21T15:04:12.538481175Z         The "freeze_support()" line can be omitted if the program
2025-08-21T15:04:12.538482022Z         is not going to be frozen to produce an executable.
2025-08-21T15:04:13.468176690Z engine.py           :170  2025-08-21 15:04:13,467 Error initializing vLLM engine: Engine core initialization failed. See root cause above. Failed core proc(s): {}
2025-08-21T15:04:13.469869348Z Traceback (most recent call last):
2025-08-21T15:04:13.469874242Z   File "/src/handler.py", line 6, in <module>
2025-08-21T15:04:13.469875445Z     vllm_engine = vLLMEngine()
2025-08-21T15:04:13.469877633Z   File "/src/engine.py", line 30, in __init__
2025-08-21T15:04:13.469879003Z     self.llm = self._initialize_llm() if engine is None else engine.llm
2025-08-21T15:04:13.469881501Z   File "/src/engine.py", line 171, in _initialize_llm
2025-08-21T15:04:13.469883456Z     raise e
2025-08-21T15:04:13.469885468Z   File "/src/engine.py", line 165, in _initialize_llm
2025-08-21T15:04:13.469886718Z     engine = AsyncLLMEngine.from_engine_args(self.engine_args)
2025-08-21T15:04:13.469888363Z   File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 653, in from_engine_args
2025-08-21T15:04:13.469890531Z     return async_engine_cls.from_vllm_config(
2025-08-21T15:04:13.469891700Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 163, in from_vllm_config
2025-08-21T15:04:13.469893429Z     return cls(
2025-08-21T15:04:13.469894677Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/async_llm.py", line 117, in __init__
2025-08-21T15:04:13.469904371Z     self.engine_core = EngineCoreClient.make_async_mp_client(
2025-08-21T15:04:13.469905665Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
2025-08-21T15:04:13.469906807Z     return AsyncMPClient(*client_args)
2025-08-21T15:04:13.469908131Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 677, in __init__
2025-08-21T15:04:13.469909266Z     super().__init__(
2025-08-21T15:04:13.469911346Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core_client.py", line 408, in __init__
2025-08-21T15:04:13.469912333Z     with launch_core_engines(vllm_config, executor_class,
2025-08-21T15:04:13.469913543Z   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
2025-08-21T15:04:13.469914639Z     next(self.gen)
2025-08-21T15:04:13.469915795Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
2025-08-21T15:04:13.469916789Z     wait_for_engine_startup(
2025-08-21T15:04:13.469917895Z   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
2025-08-21T15:04:13.469918891Z     raise RuntimeError("Engine core initialization failed. "
2025-08-21T15:04:13.469920299Z RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

The worker then keeps restarting the engine, and every attempt fails with exactly the same error.
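
For what it's worth, the traceback itself points at the standard Python `spawn` start-method pitfall rather than anything model- or GPU-memory-related: `/src/handler.py` constructs the engine at module import time (`vllm_engine = vLLMEngine()` on line 6), vLLM's v1 engine launches its core process with multiprocessing's `spawn` method, and the spawned child re-imports `handler.py`, which then tries to build a second engine before bootstrapping has finished; that is exactly the situation the `freeze_support()` message describes. A spawn-safe layout would defer engine construction behind the usual main-module guard. A minimal sketch follows; only the single line shown in the traceback is known, so the rest of the file is assumed:

```python
# handler.py -- spawn-safe layout (a sketch: only the vLLMEngine() call on
# line 6 is visible in the traceback above; the rest of this file is assumed).
from engine import vLLMEngine  # /src/engine.py from the traceback


def main() -> None:
    # Construct the engine only when this module runs as the main program.
    # When vLLM's spawned engine-core child re-imports handler.py,
    # __name__ != "__main__", so no second engine is created and the
    # "bootstrapping phase" RuntimeError above cannot fire.
    vllm_engine = vLLMEngine()
    # ... start serving requests with vllm_engine here ...


if __name__ == "__main__":
    main()
```

Why the same image behaves differently on A100/L40s is not obvious from these logs; one possibility (an assumption, not verified) is that the engine-core process is started with `fork` there and with `spawn` on H100, since only `spawn` re-imports the main module. vLLM's `VLLM_WORKER_MULTIPROC_METHOD` environment variable influences that choice and may be worth experimenting with.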
