[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues #22606

Status: Open · wants to merge 1 commit into base: main
Conversation

@frankwang28 frankwang28 commented Aug 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

Purpose

A newer version of DeepGEMM fixes several bugs, notably the SMXX layout assertions. The version of DeepGEMM currently pinned in the Dockerfile does not include these fixes, so we should bump the git reference accordingly. For reference, I encountered the issue when serving DeepSeek-R1-0528 on 8xB200 with DeepGEMM, expert parallelism (DP=8, TP=1), and the pplx All2All backend.
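For context, the change amounts to moving the pinned git reference forward in the Dockerfile's DeepGEMM install step. The sketch below shows the general shape of such a pin; the variable name and commit reference are placeholders, not the actual values in this PR:

```shell
# Hedged sketch of pinning DeepGEMM to a newer commit in a Docker build step.
# DEEPGEMM_GIT_REF is a placeholder name; the real Dockerfile may structure
# this differently, and the commit to use is the one chosen in this PR.
DEEPGEMM_GIT_REF="<commit-with-smxx-layout-fix>"   # placeholder, not the PR's actual ref
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git checkout "${DEEPGEMM_GIT_REF}"
```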

Test Plan

Launch vLLM before and after bumping the DeepGEMM version and compare startup behavior.
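A launch command matching the setup described in the Purpose section might look like the following. This is an illustrative sketch, not the exact command used; flag and environment variable names reflect current vLLM conventions and may differ by version:

```shell
# Hedged example: serve DeepSeek-R1-0528 on 8 GPUs with expert parallelism
# (DP=8, TP=1) and the pplx All2All backend, with DeepGEMM enabled.
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-R1-0528 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel
```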

Test Result

Before:

```
(EngineCore_1 pid=303) WARNING 08-10 16:16:08 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_6 pid=308) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_3 pid=305) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_0 pid=302) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_4 pid=306) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_7 pid=309) WARNING 08-10 16:16:10 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_5 pid=307) WARNING 08-10 16:16:10 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] EngineCore failed to start.
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] Traceback (most recent call last):
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 670, in run_engine_core
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 937, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     super().__init__(vllm_config, local_client, handshake_address,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 475, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 87, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self._initialize_kv_caches(vllm_config)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 164, in _initialize_kv_caches
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.model_executor.determine_available_memory())
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     output = self.collective_rpc("determine_available_memory")
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2974, in run_method
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.model_runner.profile_run()
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in profile_run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2246, in _dummy_run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     outputs = self.model(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]               ^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 836, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     output = self.compiled_callable(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 697, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     def forward(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     raise e
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "<eval_with_key>.124", line 950, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     submod_8 = self.submod_8(getitem_18, s0, [... long fused fx-graph argument list elided ...])
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.compiled_graph_for_general_shape(*args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return compiled_fn(full_args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     all_outs = call_func_at_runtime_with_args(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     out = normalize_as_list(f(args))
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                             ^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     outs = compiled_fn(args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return compiled_fn(runtime_args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.current_callable(inputs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return model(new_inputs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/root/.cache/vllm/torch_compile_cache/1ef39f9610/rank_0_2/inductor_cache/g6/cg6a6v2v4v5jenw63o4yj3sa5ytgpvlig3c4scbqnekrl4ewtclu.py", line 684, in call
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     buf6 = torch.ops.vllm.moe_forward.default(reinterpret_tensor(buf5, (s0, 7168), (7168, 1), 0), buf4, 'model.layers.3.mlp.experts')
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._op(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1708, in moe_forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.forward_impl(hidden_states, router_logits)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1607, in forward_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.forward_impl_chunked(hidden_states, router_logits)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1590, in forward_impl_chunked
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     process_chunk(chunk_start,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1550, in process_chunk
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     final_hidden_states = self.quant_method.apply(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                           ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1039, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 770, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     fused_out = self._maybe_chunk_fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 545, in _maybe_chunk_fused_experts
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._do_fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 492, in _do_fused_experts
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.fused_experts.apply(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py", line 150, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     experts.apply(output, hidden_states, w1, w2, topk_weights, topk_ids,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 295, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     fp8_m_grouped_gemm_nt_masked((a2q, a2q_scale), (w2, w2_scale), output,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 126, in fp8_m_grouped_gemm_nt_masked
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return _grouped_masked_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] RuntimeError: Failed: Assertion error csrc/jit_kernels/impls/smxx_layout.hpp:136 'mn % 4 == 0 and num_groups == 1'
```

After:

```
(EngineCore_7 pid=309) WARNING 08-10 16:36:31 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_5 pid=307) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_6 pid=308) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_3 pid=305) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_0 pid=302) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_4 pid=306) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_1 pid=303) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(APIServer pid=1) DEBUG 08-10 16:36:37 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 08-10 16:36:47 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 08-10 16:36:57 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(EngineCore_1 pid=303) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.27 s in total
(EngineCore_4 pid=306) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.38 s in total
(EngineCore_6 pid=308) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.44 s in total
(EngineCore_3 pid=305) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.95 s in total
(EngineCore_0 pid=302) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.36 s in total
(EngineCore_5 pid=307) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.28 s in total
(EngineCore_7 pid=309) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 66.78 s in total
(EngineCore_2 pid=304) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.46 s in total
(EngineCore_0 pid=302) INFO 08-10 16:37:02 [eplb_state.py:425] Rearranging experts (profile)...
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank7]:[W810 16:37:04.415670924 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 7] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank1]:[W810 16:37:04.421049663 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank5]:[W810 16:37:04.421938427 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 5] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank2]:[W810 16:37:04.428433311 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank3]:[W810 16:37:04.433295013 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 3] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank6]:[W810 16:37:04.438892132 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 6] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank0]:[W810 16:37:04.444570276 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
```
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance[rank4]:[W810 16:37:04.453769311 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 4]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
(EngineCore_0 pid=302) INFO 08-10 16:37:06 [eplb_state.py:551] Rearranged experts (profile) in 3.56 seconds.

...

(APIServer pid=1) INFO 08-10 16:37:37 [api_server.py:1853] Starting vLLM API server 0 on http://0.0.0.0:9451
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:29] Available routes are:
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

(Optional) Documentation Update

@mergify mergify bot added the ci/build label Aug 10, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the DEEPGEMM_GIT_REF in the Dockerfile to a newer commit to fix an smxx layout assertion error. The change is well-justified by the provided logs showing the error is resolved. The update to the specific commit hash is appropriate given the DeepGEMM repository does not use tags or releases. The change correctly addresses the bug and improves the stability of the vLLM build.
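For context, the change amounts to updating a single pinned git reference in the Dockerfile. A minimal sketch of the pinning pattern (the variable name `DEEPGEMM_GIT_REF` comes from the review above; the commit hash and the clone/install steps shown here are illustrative, not the exact vLLM build recipe):

```shell
# Dockerfile fragment (illustrative): pin DeepGEMM to an exact commit,
# since the upstream repository publishes no tags or releases.
ARG DEEPGEMM_GIT_REF=1234abcd   # hypothetical hash; bump it to pick up fixes

RUN git clone https://github.com/deepseek-ai/DeepGEMM.git /opt/deepgemm \
 && git -C /opt/deepgemm checkout "${DEEPGEMM_GIT_REF}" \
 && pip install --no-build-isolation /opt/deepgemm
```

Pinning to a hash keeps the image build reproducible; bumping the ref is then a one-line change like this PR.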

Signed-off-by: frankwang28 <[email protected]>
@frankwang28 frankwang28 force-pushed the bump-deepgemm-git-ref branch from c508d21 to 5d563a4 on August 10, 2025 23:48

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@simon-mo simon-mo enabled auto-merge (squash) August 12, 2025 15:46
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 12, 2025
@mgoin

mgoin commented Aug 12, 2025

cc @yewentao256 @MatthewBonanni


@mgoin mgoin left a comment


LGTM, thanks for testing


@yewentao256 yewentao256 left a comment


Looks good to me, thanks for the work!

@mgoin mgoin added this to the v0.10.1 milestone Aug 12, 2025
Labels
ci/build ready ONLY add when PR is ready to merge/full CI is needed
4 participants