[Bug] Intermittent segfault in Triton MoE kernel during piecewise CUDA graph warmup on B200 #21629

@yhyang201

Description

During piecewise CUDA graph warmup compilation for nvidia/Qwen3.5-397B-A17B-NVFP4 (4-GPU, FP4 quantization) on B200, a segmentation fault occurs in the Triton NVIDIA driver backend while executing the fused MoE experts kernel.

The crash happens at ~93% of the "Compiling num tokens" phase (69/74 iterations).

Error Stack Trace

Fatal Python error: Segmentation fault

Current thread (most recent call first):
  File "triton/backends/nvidia/driver.py", line 668 in inner
  File "triton/backends/nvidia/driver.py", line 712 in __call__
  File "triton/runtime/jit.py", line 757 in run
  File "triton_kernels/matmul_ogs.py", line 467 in matmul_ogs
  File "sglang/srt/layers/moe/fused_moe_triton/triton_kernels_moe.py", line 306 in triton_kernel_fused_experts_with_bias
  File "sglang/srt/layers/moe/moe_runner/triton_kernels.py", line 115 in run
  File "sglang/srt/layers/moe/moe_runner/runner.py", line 117 in run
  File "sglang/srt/layers/quantization/unquant.py", line 423 in forward_cuda
  File "sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1034 in run_moe_core
  File "sglang/srt/layers/moe/fused_moe_triton/layer.py", line 1013 in forward_impl
  File "sglang/srt/models/gpt_oss.py", line 269 in moe_impl
  ...
  File "sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 406 in warmup_compile
  File "sglang/srt/model_executor/piecewise_cuda_graph_runner.py", line 309 in __init__
  File "sglang/srt/model_executor/model_runner.py", line 2450 in init_piecewise_cuda_graphs
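
The `Fatal Python error: Segmentation fault` header and the "most recent call first" layout above match what Python's standard `faulthandler` module prints. As a minimal sketch for capturing the same kind of dump for every thread in a file, the helper below could be called early in a worker process. The helper name and the log path are hypothetical and not part of sglang or Triton.

```python
import faulthandler


def enable_crash_dump(path):
    """Dump Python tracebacks of all threads to `path` on a fatal signal.

    `path` is a hypothetical log location chosen by the caller.
    The returned file object must stay open for as long as the handler
    is enabled, because faulthandler writes to its file descriptor.
    """
    log = open(path, "w")
    faulthandler.enable(file=log, all_threads=True)
    return log


if __name__ == "__main__":
    # Hypothetical usage: keep the handle alive for the process lifetime.
    crash_log = enable_crash_dump("segfault_traceback.log")
```

Setting the environment variable `PYTHONFAULTHANDLER=1` enables the same handler without code changes, writing to stderr.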

Environment

  • GPU: NVIDIA B200
  • Model: nvidia/Qwen3.5-397B-A17B-NVFP4 (FP4 quantization, TP=4)
  • Attention backend: trtllm_mha
  • CI job: stage-c-test-4-gpu-b200 (0) in PR Test run

Analysis

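The segfault originates in the Triton NVIDIA driver backend (triton/backends/nvidia/driver.py, line 668, in inner), reached through matmul_ogs on the fused MoE experts path. The faulting frame is inside Triton's driver backend rather than in sglang code, so this appears to be a Triton + B200 (SM100) driver-level issue during piecewise CUDA graph warmup. The trace alone does not show whether the fault occurs during JIT compilation or during kernel execution.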
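
Because the crash is intermittent, it may help to measure how often it occurs before and after a candidate fix. The sketch below is a hypothetical helper, not an sglang tool: it reruns a command a fixed number of times and counts the runs killed by SIGSEGV. It assumes a POSIX system, where `subprocess` reports a signal-terminated child as a negative return code, and the command passed in is a placeholder for whatever reproduces the warmup.

```python
import signal
import subprocess


def count_segfaults(cmd, runs):
    """Run `cmd` `runs` times and count how many runs died from SIGSEGV.

    `cmd` is an argument list (hypothetical: the command that triggers
    the piecewise CUDA graph warmup). Output is discarded; only the
    exit status is inspected.
    """
    segfaults = 0
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a child killed by signal N has returncode == -N.
        if proc.returncode == -signal.SIGSEGV:
            segfaults += 1
    return segfaults
```

If the command is launched through a shell wrapper, the wrapper typically exits with status 139 (128 + 11) instead, so the comparison would need adjusting in that case.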
