[Bugfix] Bump DeepGEMM Version to Fix SMXX Layout Issues #22606

Status: Open · wants to merge 1 commit into base: main
Conversation

@frankwang28 frankwang28 commented Aug 10, 2025

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing a test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.

Purpose

A newer version of DeepGEMM fixes several bugs, notably the SMXX layout assertions. The version of DeepGEMM currently pinned in the Dockerfile does not include these fixes, so we should bump the git reference accordingly. For reference, I encountered the issue when serving DeepSeek-R1-0528 on 8xB200 with DeepGEMM, expert parallelism (DP=8, TP=1), and the pplx All2All backend.
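For context, the change amounts to moving the pinned git reference forward in the Dockerfile's DeepGEMM install step. The sketch below shows the general shape of such a pin; the variable name and commit reference are placeholders, not the actual values in this PR:

```shell
# Hedged sketch of pinning DeepGEMM to a newer commit in a Docker build step.
# DEEPGEMM_GIT_REF is a placeholder name; the real Dockerfile may structure
# this differently, and the commit to use is the one chosen in this PR.
DEEPGEMM_GIT_REF="<commit-with-smxx-layout-fix>"   # placeholder, not the PR's actual ref
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git checkout "${DEEPGEMM_GIT_REF}"
```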

Test Plan

Launch vLLM before and after bumping the DeepGEMM version and compare startup behavior.
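A launch command matching the setup described in the Purpose section might look like the following. This is an illustrative sketch, not the exact command used; flag and environment variable names reflect current vLLM conventions and may differ by version:

```shell
# Hedged example: serve DeepSeek-R1-0528 on 8 GPUs with expert parallelism
# (DP=8, TP=1) and the pplx All2All backend, with DeepGEMM enabled.
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-R1-0528 \
    --data-parallel-size 8 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel
```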

Test Result

Before:

```
(EngineCore_1 pid=303) WARNING 08-10 16:16:08 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_6 pid=308) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_3 pid=305) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_0 pid=302) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_4 pid=306) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) WARNING 08-10 16:16:09 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_7 pid=309) WARNING 08-10 16:16:10 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_5 pid=307) WARNING 08-10 16:16:10 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] EngineCore failed to start.
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] Traceback (most recent call last):
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 670, in run_engine_core
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     engine_core = DPEngineCoreProc(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 937, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     super().__init__(vllm_config, local_client, handshake_address,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 475, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 87, in __init__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self._initialize_kv_caches(vllm_config)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 164, in _initialize_kv_caches
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.model_executor.determine_available_memory())
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     output = self.collective_rpc("determine_available_memory")
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/__init__.py", line 2974, in run_method
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.model_runner.profile_run()
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in profile_run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     = self._dummy_run(self.max_num_tokens, is_profile=True)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return func(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2246, in _dummy_run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     outputs = self.model(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]               ^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 836, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     hidden_states = self.model(input_ids, positions, intermediate_tensors,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 272, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     output = self.compiled_callable(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 655, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 697, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     def forward(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 406, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     raise e
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 393, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "<eval_with_key>.124", line 950, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     submod_8 = self.submod_8(getitem_18, s0, [... long fused fx-graph argument list elided ...])
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_piecewise_backend.py", line 112, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.compiled_graph_for_general_shape(*args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return fn(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/aot_autograd.py", line 1209, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return compiled_fn(full_args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 328, in runtime_wrapper
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     all_outs = call_func_at_runtime_with_args(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/utils.py", line 126, in call_func_at_runtime_with_args
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     out = normalize_as_list(f(args))
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                             ^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 689, in inner_fn
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     outs = compiled_fn(args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 495, in wrapper
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return compiled_fn(runtime_args)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/output_code.py", line 460, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.current_callable(inputs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/utils.py", line 2404, in run
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return model(new_inputs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/root/.cache/vllm/torch_compile_cache/1ef39f9610/rank_0_2/inductor_cache/g6/cg6a6v2v4v5jenw63o4yj3sa5ytgpvlig3c4scbqnekrl4ewtclu.py", line 684, in call
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     buf6 = torch.ops.vllm.moe_forward.default(reinterpret_tensor(buf5, (s0, 7168), (7168, 1), 0), buf4, 'model.layers.3.mlp.experts')
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 756, in __call__
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._op(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1708, in moe_forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.forward_impl(hidden_states, router_logits)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1607, in forward_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.forward_impl_chunked(hidden_states, router_logits)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1590, in forward_impl_chunked
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     process_chunk(chunk_start,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/layer.py", line 1550, in process_chunk
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     final_hidden_states = self.quant_method.apply(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                           ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 1039, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self.fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._call_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return forward_call(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 770, in forward
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     fused_out = self._maybe_chunk_fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 545, in _maybe_chunk_fused_experts
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return self._do_fused_experts(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/modular_kernel.py", line 492, in _do_fused_experts
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     self.fused_experts.apply(
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/batched_triton_or_deep_gemm_moe.py", line 150, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     experts.apply(output, hidden_states, w1, w2, topk_weights, topk_ids,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/batched_deep_gemm_moe.py", line 295, in apply
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     fp8_m_grouped_gemm_nt_masked((a2q, a2q_scale), (w2, w2_scale), output,
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]   File "/usr/local/lib/python3.12/dist-packages/vllm/utils/deep_gemm.py", line 126, in fp8_m_grouped_gemm_nt_masked
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]     return _grouped_masked_impl(*args, **kwargs)
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_2 pid=304) ERROR 08-10 16:16:12 [core.py:683] RuntimeError: Failed: Assertion error csrc/jit_kernels/impls/smxx_layout.hpp:136 'mn % 4 == 0 and num_groups == 1'
```

After:

```
(EngineCore_7 pid=309) WARNING 08-10 16:36:31 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_5 pid=307) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_2 pid=304) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_6 pid=308) WARNING 08-10 16:36:32 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_3 pid=305) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_0 pid=302) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_4 pid=306) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(EngineCore_1 pid=303) WARNING 08-10 16:36:33 [pplx_prepare_finalize.py:110] The PPLX backend does not support expert mapping. The provided `expert_map` will be ignored.
(APIServer pid=1) DEBUG 08-10 16:36:37 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 08-10 16:36:47 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(APIServer pid=1) DEBUG 08-10 16:36:57 [utils.py:750] Waiting for 8 local, 0 remote core engine proc(s) to start.
(EngineCore_1 pid=303) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.27 s in total
(EngineCore_4 pid=306) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.38 s in total
(EngineCore_6 pid=308) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.44 s in total
(EngineCore_3 pid=305) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.95 s in total
(EngineCore_0 pid=302) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 68.36 s in total
(EngineCore_5 pid=307) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.28 s in total
(EngineCore_7 pid=309) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 66.78 s in total
(EngineCore_2 pid=304) INFO 08-10 16:37:02 [monitor.py:34] torch.compile takes 67.46 s in total
(EngineCore_0 pid=302) INFO 08-10 16:37:02 [eplb_state.py:425] Rearranging experts (profile)...
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank7]:[W810 16:37:04.415670924 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 7] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank1]:[W810 16:37:04.421049663 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 1] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank5]:[W810 16:37:04.421938427 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 5] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank2]:[W810 16:37:04.428433311 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 2] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank3]:[W810 16:37:04.433295013 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 3] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank6]:[W810 16:37:04.438892132 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 6] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance
[rank0]:[W810 16:37:04.444570276 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
```
Warning: please use at least NVCC 12.9 for the best DeepGEMM performance[rank4]:[W810 16:37:04.453769311 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 4]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
(EngineCore_0 pid=302) INFO 08-10 16:37:06 [eplb_state.py:551] Rearranged experts (profile) in 3.56 seconds.

...

(APIServer pid=1) INFO 08-10 16:37:37 [api_server.py:1853] Starting vLLM API server 0 on http://0.0.0.0:9451
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:29] Available routes are:
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /health, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /load, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /ping, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /ping, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /version, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/embeddings, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /pooling, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /classify, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /score, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/score, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/audio/translations, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v1/rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /v2/rerank, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 08-10 16:37:37 [launcher.py:37] Route: /metrics, Methods: GET
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.

(Optional) Documentation Update

@mergify mergify bot added the ci/build label Aug 10, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the DEEPGEMM_GIT_REF in the Dockerfile to a newer commit to fix an smxx layout assertion error. The change is well-justified by the provided logs showing the error is resolved. The update to the specific commit hash is appropriate given the DeepGEMM repository does not use tags or releases. The change correctly addresses the bug and improves the stability of the vLLM build.
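For context, the change amounts to updating a single pinned git reference in the Dockerfile. A minimal sketch of the pinning pattern (the variable name `DEEPGEMM_GIT_REF` comes from the review above; the commit hash and the clone/install steps shown here are illustrative, not the exact vLLM build recipe):

```shell
# Dockerfile fragment (illustrative): pin DeepGEMM to an exact commit,
# since the upstream repository publishes no tags or releases.
ARG DEEPGEMM_GIT_REF=1234abcd   # hypothetical hash; bump it to pick up fixes

RUN git clone https://github.com/deepseek-ai/DeepGEMM.git /opt/deepgemm \
 && git -C /opt/deepgemm checkout "${DEEPGEMM_GIT_REF}" \
 && pip install --no-build-isolation /opt/deepgemm
```

Pinning to a hash keeps the image build reproducible; bumping the ref is then a one-line change like this PR.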

Signed-off-by: frankwang28 <[email protected]>
@frankwang28 frankwang28 force-pushed the bump-deepgemm-git-ref branch from c508d21 to 5d563a4 on August 10, 2025 23:48

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@simon-mo simon-mo enabled auto-merge (squash) August 12, 2025 15:46
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 12, 2025
@mgoin

mgoin commented Aug 12, 2025

cc @yewentao256 @MatthewBonanni


@mgoin mgoin left a comment


LGTM, thanks for testing


@yewentao256 yewentao256 left a comment


Looks good to me, thanks for the work!

@mgoin mgoin added this to the v0.10.1 milestone Aug 12, 2025
Labels
ci/build ready ONLY add when PR is ready to merge/full CI is needed
4 participants