
trtllm_fp4_block_scale_moe produces incorrect results for nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4 #2732

@wenscarl

Description


trtllm_fp4_block_scale_moe produces significantly different results from flashinfer_cutlass_fused_moe
when run with real model weights and activations from nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4.
The discrepancy is not reproducible with randomly generated data: random inputs always produce
matching results between the two kernels. This strongly suggests a kernel-level numerical-accuracy
bug triggered only by specific real-world input distributions.

The companion model nvidia/Qwen3-235B-A22B-Instruct-2507-NVFP4 is not affected: both random
and real data produce matching results between trtllm_fp4_block_scale_moe and cutlass_fused_moe.

Environment

  • SGLang with --moe-runner-backend flashinfer_trtllm
  • FlashInfer (version with trtllm_fp4_block_scale_moe support)
  • GPU: NVIDIA (G)B200
  • Model (affected): nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4
  • Model (unaffected): nvidia/Qwen3-235B-A22B-Instruct-2507-NVFP4

Reproducer

A standalone reproducer script is at
python/sglang/srt/layers/moe/fused_moe_triton/repro_fp4.py.
It loads a .pt dump of real kernel inputs captured during inference and runs both
trtllm_fp4_block_scale_moe and cutlass_fused_moe side-by-side, reporting the norm difference.

Step 1 — Capture a dump

Run SGLang with the patched layer.py (which dumps kernel inputs to /tmp/dbg_moe_repro.pt on the
first call where the trtllm/cutlass norm difference exceeds 10%):

python -m sglang.launch_server \
    --model-path nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4 \
    --moe-runner-backend flashinfer_trtllm \
    <other args>

Then send a prompt; the dump will be written to /tmp/dbg_moe_repro.pt automatically.
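The trigger logic in the patched layer.py can be sketched roughly as follows. This is a minimal stdlib sketch under assumptions: `maybe_dump` and its `save_fn` callback are hypothetical names introduced here for illustration; the actual patch calls `torch.save` directly on the captured inputs.

```python
def maybe_dump(rel_norm_diff, payload, save_fn,
               path="/tmp/dbg_moe_repro.pt", threshold=0.10):
    """Write the captured kernel inputs once, on the first call where the
    trtllm/cutlass relative norm difference exceeds the threshold (10%)."""
    if rel_norm_diff > threshold and not maybe_dump.fired:
        maybe_dump.fired = True   # only the first offending call is dumped
        save_fn(payload, path)    # in the real patch: torch.save(payload, path)
        return True
    return False

maybe_dump.fired = False
```

Later calls are ignored even if they also exceed the threshold, so the dump always reflects the first divergent invocation.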

Step 2 — Run the reproducer

# Real data from dump → large norm difference (bug visible)
python repro_fp4.py /tmp/dbg_moe_repro_affected.pt

# Random data with same shapes → no significant difference (bug hidden)
python repro_fp4.py /tmp/dbg_moe_repro_affected.pt --random

# Random hidden states, real weights/scales/router from dump → bisect hidden-state contribution
python repro_fp4.py /tmp/dbg_moe_repro_affected.pt --random-hidden

# Unaffected model: both modes should show no significant difference
python repro_fp4.py /tmp/dbg_moe_repro_unaffected.pt
python repro_fp4.py /tmp/dbg_moe_repro_unaffected.pt --random

Reproducer modes summary

Flag             Hidden states   Weights / router / scales   Purpose
(none)           from dump       from dump                   Full real-data comparison
--random         random          random                      Confirm random data is fine
--random-hidden  random          from dump                   Bisect hidden-state contribution
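The three modes map onto two boolean flags. A minimal argparse sketch of how repro_fp4.py might select a mode (the function name `parse_mode` is hypothetical; the real script's flag handling may differ):

```python
import argparse

def parse_mode(argv):
    """Return which of the three reproducer modes the flags select."""
    p = argparse.ArgumentParser()
    p.add_argument("dump_path", help="path to the .pt dump")
    p.add_argument("--random", action="store_true",
                   help="randomize all inputs (same shapes as the dump)")
    p.add_argument("--random-hidden", action="store_true",
                   help="randomize hidden states only; keep dumped weights/router/scales")
    args = p.parse_args(argv)
    if args.random:
        return "random"
    if args.random_hidden:
        return "random-hidden"
    return "real"
```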

Observed Output

Affected model (nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4) — real data

trtllm:  norm=2.419e+03  nan=0
tensor([[-0.7344,  0.2832, -0.3535,  ..., -0.8594,  0.3477,  0.3887],
        [-0.7344,  0.2832, -0.3535,  ..., -0.8594,  0.3477,  0.3887],
        [-0.7344,  0.2832, -0.3535,  ..., -0.8594,  0.3477,  0.3887],
        ...,
        [-0.7305,  0.2773, -0.3633,  ..., -0.8477,  0.3066,  0.4004],
        [-0.7305,  0.2773, -0.3633,  ..., -0.8477,  0.3066,  0.4004],
        [-0.7305,  0.2773, -0.3633,  ..., -0.8477,  0.3066,  0.4004]],
       device='cuda:0')

cutlass: norm=6.900e+02  nan=0
tensor([[ 0.0801, -0.0464, -0.1387,  ..., -0.0227, -0.1289,  0.0483],
        [ 0.0801, -0.0464, -0.1387,  ..., -0.0227, -0.1289,  0.0483],
        [ 0.0801, -0.0464, -0.1387,  ..., -0.0227, -0.1289,  0.0483],
        ...,
        [ 0.0767, -0.0452, -0.1436,  ..., -0.0176, -0.1309,  0.0520],
        [ 0.0767, -0.0452, -0.1436,  ..., -0.0176, -0.1309,  0.0520],
        [ 0.0767, -0.0452, -0.1436,  ..., -0.0176, -0.1309,  0.0520]],
       device='cuda:0')

trtllm vs cutlass:  rel_norm_diff=2.506e+00  max_diff=1.875e+00  mean_diff=3.969e-0

Affected model — --random

trtllm:  norm=7.091e+14  nan=0
tensor([[-2.0616e+11, -3.9460e+10,  2.2012e+10,  ...,  1.9596e+10,
          2.7166e+11,  1.3583e+11],
        [ 1.7395e+11,  2.8991e+11, -1.2818e+10,  ..., -1.5784e+11,
          4.4292e+10, -5.9861e+10],
        [ 1.6750e+11,  5.0332e+09, -7.5497e+09,  ..., -2.8991e+11,
          1.0039e+11,  5.5298e+10],
        ...,
        [-1.5368e+10, -3.1998e+11, -1.8039e+11,  ..., -1.7287e+11,
         -1.2992e+11, -1.6106e+11],
        [-1.2885e+10, -2.6441e+10,  8.3752e+10,  ...,  3.7044e+10,
          4.3594e+11, -3.7849e+10],
        [-1.0670e+10,  5.8385e+09,  4.7245e+10,  ..., -1.2616e+11,
         -7.7309e+10,  3.7581e+10]], device='cuda:0')

cutlass: norm=7.091e+14  nan=0
tensor([[-2.0508e+11, -3.9192e+10,  2.2280e+10,  ...,  1.9327e+10,
          2.7166e+11,  1.3637e+11],
        [ 1.7502e+11,  2.8991e+11, -1.2952e+10,  ..., -1.5891e+11,
          4.4560e+10, -6.0398e+10],
        [ 1.6858e+11,  4.9661e+09, -7.3820e+09,  ..., -2.8991e+11,
          1.0093e+11,  5.6103e+10],
        ...,
        [-1.5099e+10, -3.1998e+11, -1.7931e+11,  ..., -1.7287e+11,
         -1.2939e+11, -1.5999e+11],
        [-1.3086e+10, -2.6844e+10,  8.3215e+10,  ...,  3.7044e+10,
          4.3594e+11, -3.8655e+10],
        [-9.8650e+09,  6.0062e+09,  4.6976e+10,  ..., -1.2563e+11,
         -7.7309e+10,  3.7849e+10]], device='cuda:0')

trtllm vs cutlass:  rel_norm_diff=7.259e-05  max_diff=4.295e+09  mean_diff=3.269e+08

Unaffected model (nvidia/Qwen3-235B-A22B-Instruct-2507-NVFP4) — real data

trtllm:  norm=2.852e+02  nan=0
tensor([[-0.0435,  0.0116, -0.0164,  ...,  0.0066,  0.1069, -0.0093],
        [-0.0435,  0.0116, -0.0164,  ...,  0.0066,  0.1069, -0.0093],
        [-0.0435,  0.0116, -0.0164,  ...,  0.0066,  0.1069, -0.0093],
        ...,
        [-0.0432,  0.0114, -0.0167,  ...,  0.0061,  0.1069, -0.0099],
        [-0.0432,  0.0114, -0.0167,  ...,  0.0061,  0.1069, -0.0099],
        [-0.0432,  0.0114, -0.0167,  ...,  0.0061,  0.1069, -0.0099]],
       device='cuda:0')

cutlass: norm=2.849e+02  nan=0
tensor([[-0.0437,  0.0117, -0.0165,  ...,  0.0066,  0.1074, -0.0093],
        [-0.0437,  0.0117, -0.0165,  ...,  0.0066,  0.1074, -0.0093],
        [-0.0437,  0.0117, -0.0165,  ...,  0.0066,  0.1074, -0.0093],
        ...,
        [-0.0435,  0.0109, -0.0168,  ...,  0.0061,  0.1069, -0.0095],
        [-0.0435,  0.0109, -0.0168,  ...,  0.0061,  0.1069, -0.0095],
        [-0.0435,  0.0109, -0.0168,  ...,  0.0061,  0.1069, -0.0095]],
       device='cuda:0')

trtllm vs cutlass:  rel_norm_diff=1.050e-03  max_diff=7.812e-03  mean_diff=3.126e-04
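For reference, the three comparison metrics printed above are presumably computed along these lines. This is a plain-Python sketch on nested lists; the real script operates on torch tensors, and `compare` is a hypothetical name:

```python
import math

def compare(a_rows, b_rows):
    """rel_norm_diff, max_diff, mean_diff between two 2-D outputs,
    given here as nested lists of floats (a = trtllm, b = cutlass)."""
    a = [x for row in a_rows for x in row]
    b = [x for row in b_rows for x in row]
    diff = [x - y for x, y in zip(a, b)]
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    rel_norm_diff = norm(diff) / norm(b)              # relative to the cutlass reference
    max_diff = max(abs(d) for d in diff)              # worst single element
    mean_diff = sum(abs(d) for d in diff) / len(diff) # average absolute error
    return rel_norm_diff, max_diff, mean_diff
```

On the affected model, rel_norm_diff of order 1e+00 on real data versus 1e-04 on random data is what separates the buggy case from the benign one.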

(Actual numbers to be filled in once .pt files are attached.)

.pt Dump Files

  • Affected (nvidia/Qwen3-Coder-480B-A35B-Instruct-NVFP4): dbg_moe_repro_affected.pt (to be attached)
  • Unaffected (nvidia/Qwen3-235B-A22B-Instruct-2507-NVFP4): dbg_moe_repro_unaffected.pt (to be attached)

Each .pt file contains:

{
    "hidden_states_bf16": ...,  # [T, H] bfloat16 original hidden states before FP4 quantization
    "router_logits":      ...,  # [T, E] float32
    "routing_bias":       ...,  # [E] float32 or None
    "trtllm": {
        "hidden_states":             ...,  # [T, H//2] uint8 FP4-packed
        "hidden_states_scale":       ...,  # [T, H//16] fp8 linear block scales
        "gemm1_weights":             ...,  # trtllm-shuffled w13 FP4 weights
        "gemm1_weights_scale":       ...,  # trtllm-shuffled w13 block scales
        "gemm2_weights":             ...,  # trtllm-shuffled w2 FP4 weights
        "gemm2_weights_scale":       ...,  # trtllm-shuffled w2 block scales
        "output1_scale_gate_scalar": ...,  # g1_alphas  [E]
        "output1_scale_scalar":      ...,  # g1_scale_c [E]
        "output2_scale_scalar":      ...,  # g2_alphas  [E]
        "num_experts": int, "top_k": int, "n_group": int, "topk_group": int,
        "intermediate_size": int, "local_expert_offset": int, "local_num_experts": int,
        "routed_scaling_factor": float, "routing_method_type": int,
        "tune_max_num_tokens": int,
        "output": ...,  # kernel result [T, H] bfloat16
    },
    "cutlass": {
        "input":                  ...,  # [T, H//2] FP4-packed
        "input_sf":               ...,  # [T, H//16] fp8 swizzled block scales
        "fc1_expert_weights":     ...,  # w13 FP4 weights (cutlass layout)
        "fc2_expert_weights":     ...,  # w2 FP4 weights (cutlass layout)
        "token_selected_experts": ...,  # [T, top_k] int32
        "token_final_scales":     ...,  # [T, top_k] float32
        "w13_input_scale_quant":  ...,  # scalar float32
        "w13_blockscale_swizzled":...,  # swizzled w13 block scales
        "w2_input_scale_quant":   ...,  # scalar float32
        "w2_blockscale_swizzled": ...,  # swizzled w2 block scales
        "g1_alphas":              ...,  # [E]
        "g2_alphas":              ...,  # [E]
        "ep_size": int, "ep_rank": int, "tp_size": int, "tp_rank": int,
        "tune_max_num_tokens": int, "activation_type": int,
        "output": ...,  # kernel result [T, H] bfloat16
    },
}
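To sanity-check a dump before running the comparison, the top-level keys can be verified with a small stdlib helper (a hypothetical helper for illustration, not part of the repro script; `dump` is the dict returned by `torch.load`):

```python
EXPECTED_TOP_KEYS = {"hidden_states_bf16", "router_logits", "routing_bias",
                     "trtllm", "cutlass"}

def missing_keys(dump):
    """Return the top-level keys a dbg_moe_repro dump should contain
    but does not, sorted for stable reporting."""
    return sorted(EXPECTED_TOP_KEYS - set(dump))
```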

Tracking issue: #2714.

Reproducer:

repro_fp4.py
