
[Feature]: Someone please upstream this gfx1201/RDNA4 FP8 Patch into vllm-rocm #28649

@Rob-P-Smith

🚀 The feature, motivation and pitch

Disclosure:

Yes: I used an LLM to prepare this post; obviously, who has time to type all this up…
Yes: I used an LLM to find and patch in what was necessary.
Yes: It works and provides a significant performance uplift.
Yes: The work is incomplete for full FP8 coverage across all models, but it's a start.
Yes: I'm bitter AF about having to take the time to do this.
No: I will not jump through all the hoops to upstream this as a PR myself; I'll continue to patch locally as required, since I don't have time to deal with the PR/CR process.

BLUF: the LLM goes faster with small code changes. Benchmark results:

[Benchmark result screenshots from the original issue]

Native FP8 WMMA Support for AMD RDNA4 (RX 9070 XT / R9700) in vLLM

Summary

Successfully enabled native FP8 WMMA operations on AMD RDNA4 GPUs in vLLM, achieving significant performance improvements by using the hardware's 128 AI accelerators instead of dequantizing FP8 weights to FP32. This has been in production deployment for 4 days with zero observed aberrations.

Performance Results

Testing with FP8-quantized Qwen3 models on AMD Radeon R9700:

| Model | Before (TPS) | After (TPS) | Improvement |
|---|---|---|---|
| Qwen3-0.6B | ~160 decode | ~200 decode | 25% faster |
| Qwen3-30B-2507 | ~52 decode | ~85 decode | 63% faster |

Additionally, I saw prefill performance nearly double in certain scenarios with these rough kernel configs at prompt lengths up to 10,000 tokens, with gains tapering as the token count grows due to memory pressure. Significant additional performance remains untapped pending further kernel tuning.

With further improvements this should scale well: raising the memory speed from 1258 to 1375 yields a further 5% uplift, indicating there is room for memory-transfer optimization to make more efficient use of the existing bandwidth.


System Environment

  • Hardware: AMD Radeon AI Pro R9700 (RDNA4 gfx1201)
    • 128 AI Accelerators
    • Native FP8 E4M3FN support
    • 16x16 WMMA instruction tile size
  • Software Stack:
    • vLLM Version: 0.11.1rc6.dev223+g404d7a9d1
    • ROCm Version: 7.0.0
    • AMD Driver: 6.16.6
    • Base Image: rocm/vllm-dev:nightly

The Problem

By default, vLLM on RDNA4 was dequantizing FP8 weights to FP32 for all operations, completely wasting the hardware’s 128 AI accelerators. The execution path fell back to torch_channelwise_w8a8_scaled_mm() which explicitly upcasts to FP32:

# Default broken path
output = torch._scaled_mm(
    qinput, weight,
    scale_a=TORCH_DEVICE_IDENTITY,
    scale_b=TORCH_DEVICE_IDENTITY,
    out_dtype=torch.float32  # ← Upcast to FP32!
)

This meant zero performance benefit from FP8 quantization on RDNA4. In fact, it ran like trash, absolutely horrible.


The Solution

Follow the MI350X Triton kernel path by:

  1. Adding RDNA4 (gfx1201) to platform detection
  2. Patching AITER's architecture mapping to recognize gfx1201
  3. Adding RDNA4-specific matrix sizes to kernel tuning
  4. Providing optimized kernel configurations

Important Note: AITER's C++/ASM kernels do not work on RDNA4 and must be disabled (VLLM_ROCM_USE_AITER=0). However, vLLM's FP8 code path imports AITER's Triton kernels (aiter.ops.triton.gemm_a8w8_blockscale). These Triton kernels check AITER's architecture mapping and will crash with a KeyError if gfx1201 is not recognized. Therefore, we must patch AITER's architecture detection before vLLM starts, then Triton automatically compiles these kernels down to native WMMA instructions when it detects FP8 data types on gfx1201 hardware.

This routes FP8 operations through native WMMA instructions instead of dequantization.


Implementation Details

1. vLLM Code Modifications

Two files need to be modified in the vLLM source code:

File: vllm/platforms/rocm.py

Original:

def on_mi3xx() -> bool:
    GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
    return any(arch in GPU_ARCH for arch in ["gfx942", "gfx950"])

Modified:

def on_mi3xx() -> bool:
    GPU_ARCH = torch.cuda.get_device_properties("cuda").gcnArchName
    # Added gfx1201 (RDNA4) to enable FP8 Triton kernel path
    return any(arch in GPU_ARCH for arch in ["gfx942", "gfx950", "gfx1201"])

Change: Added "gfx1201" to enable RDNA4 to use the MI350X Triton kernel code path.


File: vllm/model_executor/layers/quantization/utils/fp8_utils.py

Original:

def is_aiter_triton_kernel_tuned(n, k):
    return (n, k) in [
        (1024, 8192),
        (2112, 7168),
        (3072, 1536),
        (32768, 8192),
        (4096, 7168),
        (4608, 7168),
        (512, 7168),
        (7168, 2048),
        (7168, 256),
        (8192, 1024),
        (8192, 32768),
    ]

Modified:

def is_aiter_triton_kernel_tuned(n, k):
    # MI350 tuned sizes
    mi350_sizes = [
        (1024, 8192),
        (2112, 7168),
        (3072, 1536),
        (32768, 8192),
        (4096, 7168),
        (4608, 7168),
        (512, 7168),
        (7168, 2048),
        (7168, 256),
        (8192, 1024),
        (8192, 32768),
    ]

    # RDNA4 (gfx1201) specific sizes verified to work
    rdna4_sizes = [
        (1024, 1024),   # K, V projections
        (2048, 1024),   # Q projection
        (3072, 1024),   # Gate, Up projections
        (1024, 3072),   # Down projection
        (1024, 2048),   # O projection (transposed)
        (512, 512),     # Small models
        (1024, 512),    # Asymmetric
        (512, 1024),    # Asymmetric reverse
        (2048, 2048),   # Medium models
        (4096, 4096),   # 7B class models
        (8192, 8192),   # 70B class models
    ]

    # Check architecture to include RDNA4 sizes
    import torch
    arch_name = torch.cuda.get_device_properties(0).gcnArchName

    if "gfx12" in arch_name:
        # For RDNA4, include both RDNA4 and MI350 sizes
        return (n, k) in (rdna4_sizes + mi350_sizes)
    else:
        # Other architectures use original MI350 sizes only
        return (n, k) in mi350_sizes

Changes:

  • Refactored original list into mi350_sizes variable
  • Added 11 RDNA4-specific matrix dimensions in rdna4_sizes
  • Added architecture detection that returns combined list for RDNA4
  • The sizes used were tested and found to work correctly
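
As a quick sanity check after applying this patch (illustrative only; run it inside the patched container, since the function queries the GPU architecture at runtime), the helper can be called directly. The shapes below are taken from the rdna4_sizes and mi350_sizes lists above:

# Quick check of the patched helper; (2048, 1024) is an rdna4_sizes entry,
# (2112, 7168) is an mi350_sizes entry, and (1000, 1000) is deliberately untuned.
from vllm.model_executor.layers.quantization.utils.fp8_utils import (
    is_aiter_triton_kernel_tuned,
)

print(is_aiter_triton_kernel_tuned(2048, 1024))  # True on gfx1201
print(is_aiter_triton_kernel_tuned(2112, 7168))  # True on any architecture
print(is_aiter_triton_kernel_tuned(1000, 1000))  # False: not a tuned shape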

2. AITER Architecture Patch (Required)

Why This Is Needed:

While AITER's C++/ASM kernels don't work on RDNA4 (hence VLLM_ROCM_USE_AITER=0), vLLM still imports AITER's Triton kernels:

from aiter.ops.triton.gemm_a8w8_blockscale import gemm_a8w8_blockscale

When these Triton kernels execute, they internally call arch_info.get_device() which looks up the GPU architecture in AITER's _ARCH_TO_DEVICE dictionary. Since gfx1201 is not in this dictionary by default, it throws a KeyError and crashes.

The solution: Patch AITER's architecture mapping before vLLM starts.

AITER Patch Wrapper Script:

rdna4_aiter_wrapper.sh

#!/bin/bash
# RDNA4 FP8 vLLM Wrapper
# Patches AITER architecture detection and disables AITER C++/ASM kernels

echo "=========================================="
echo "RDNA4 FP8 Startup"
echo "=========================================="

# Disable AITER's C++/ASM implementations (they don't work on RDNA4)
export VLLM_ROCM_USE_AITER=0

echo "Environment: VLLM_ROCM_USE_AITER=0"

# Patch AITER's architecture mapping BEFORE vLLM imports it
echo ""
echo "Patching AITER architecture detection for gfx1201..."
python3 -c "
import aiter.ops.triton.utils.arch_info as arch_info
if 'gfx1201' not in arch_info._ARCH_TO_DEVICE:
    arch_info._ARCH_TO_DEVICE['gfx1201'] = 'MI350X'
    print('[AITER Patch] ✓ Added gfx1201 -> MI350X mapping')
else:
    print('[AITER Patch] gfx1201 already mapped')
"

if [ $? -eq 0 ]; then
    echo "✓ AITER patch applied"
else
    echo "✗ AITER patch failed - vLLM will crash"
    exit 1
fi

echo ""
echo "Launching vLLM..."
exec vllm serve "$@"

What This Does:

  1. Disables AITER C++/ASM kernels via VLLM_ROCM_USE_AITER=0
  2. Patches AITER's Triton code by adding gfx1201 -> MI350X to the architecture mapping
  3. Launches vLLM which can now successfully import and use AITER's Triton kernels

Docker Integration:

Set this script as your Docker ENTRYPOINT:

COPY rdna4_aiter_wrapper.sh /workspace/
RUN chmod +x /workspace/rdna4_aiter_wrapper.sh
ENTRYPOINT ["/workspace/rdna4_aiter_wrapper.sh"]

Or mount it into the container and override the entrypoint (e.g., in a systemd-managed docker run):

-v /path/to/rdna4_aiter_wrapper.sh:/workspace/wrapper.sh \
--entrypoint /workspace/wrapper.sh \

Alternative: Inline Patch

If you don't want a wrapper script, add this to your Docker ENTRYPOINT or startup command:

python3 -c "import aiter.ops.triton.utils.arch_info as arch_info; arch_info._ARCH_TO_DEVICE['gfx1201'] = 'MI350X'" && \
export VLLM_ROCM_USE_AITER=0 && \
vllm serve "$@"
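
For completeness, here is a minimal in-process variant (hypothetical, not part of the original patch): it applies the mapping in the same interpreter that runs the vLLM CLI, so the monkey patch is guaranteed to be visible to the imported Triton kernels. The vllm.entrypoints.cli.main entrypoint path is an assumption based on the vLLM version listed above; adjust it if your install differs.

#!/usr/bin/env python3
# rdna4_patch_launcher.py - hypothetical in-process launcher (not from the original post)
# Usage: python3 rdna4_patch_launcher.py <model> [additional vllm serve args...]
import os
import sys

# Disable AITER's C++/ASM kernels before anything imports vLLM
os.environ["VLLM_ROCM_USE_AITER"] = "0"

# Add the gfx1201 -> MI350X mapping in this interpreter, before vLLM imports the kernels
import aiter.ops.triton.utils.arch_info as arch_info  # noqa: E402
arch_info._ARCH_TO_DEVICE.setdefault("gfx1201", "MI350X")

# Entrypoint path assumed for vLLM 0.11.x
from vllm.entrypoints.cli.main import main  # noqa: E402

if __name__ == "__main__":
    sys.argv = ["vllm", "serve", *sys.argv[1:]]
    main()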

3. Kernel Configuration Files

16 JSON configuration files are required to optimize FP8 operations for RDNA4. These files tell vLLM's Triton compiler how to tile and execute FP8 matrix multiplications. The values are an educated GUESS, but they still produce a massive performance uplift; there is significant room to improve throughput by properly tuning them.

File Locations:

  • Linear layers: vllm/model_executor/layers/quantization/utils/configs/
  • MoE layers: vllm/model_executor/layers/fused_moe/configs/

File Naming: N={n},K={k},device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

  • device_name=0x7551 = AMD Radeon RX 9070 XT device ID
  • dtype=fp8_w8a8 = FP8 weights and activations
  • block_shape=[128,128] = Quantization block size
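
For convenience, a tiny helper (illustrative, not part of vLLM) that reproduces the linear-layer naming scheme above; 0x7551 is the device ID reported for this card:

# Builds the expected linear-layer config filename for a given (N, K) pair
def rdna4_linear_config_filename(n: int, k: int, device_id: str = "0x7551") -> str:
    return (
        f"N={n},K={k},device_name={device_id},"
        "dtype=fp8_w8a8,block_shape=[128,128].json"
    )

print(rdna4_linear_config_filename(2048, 1024))
# -> N=2048,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json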

Kernel Configuration Files

Linear Layer Configs (15 files)

All linear configs use this structure with batch size keys ("16", "32", "64"):

All 15 linear config files:

N=1024,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=1024,K=1536,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=1024,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=1024,K=3072,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=2048,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=2048,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=2048,K=4096,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=2048,K=768,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=2560,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=256,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=3072,K=1024,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=384,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=4096,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=512,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

N=768,K=2048,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "32": {
        "BLOCK_SIZE_K": 32,
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "GROUP_SIZE_M": 16,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    },
    "64": {
        "BLOCK_SIZE_K": 64,
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "GROUP_SIZE_M": 8,
        "kpack": 1,
        "matrix_instr_nonkdim": 16,
        "num_warps": 4
    }
}

MoE Layer Config (1 file)

MoE configs use num_stages instead of kpack and matrix_instr_nonkdim:

E=128,N=384,device_name=0x7551,dtype=fp8_w8a8,block_shape=[128,128].json

{
    "16": {
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "BLOCK_SIZE_K": 32,
        "GROUP_SIZE_M": 8,
        "num_warps": 4,
        "num_stages": 2
    },
    "32": {
        "BLOCK_SIZE_M": 32,
        "BLOCK_SIZE_N": 32,
        "BLOCK_SIZE_K": 32,
        "GROUP_SIZE_M": 16,
        "num_warps": 4,
        "num_stages": 2
    },
    "64": {
        "BLOCK_SIZE_M": 64,
        "BLOCK_SIZE_N": 64,
        "BLOCK_SIZE_K": 64,
        "GROUP_SIZE_M": 8,
        "num_warps": 4,
        "num_stages": 2
    }
}

Configuration Parameters Explained

| Parameter | Description | RDNA4 Values |
|---|---|---|
| BLOCK_SIZE_M/N/K | Tile size for GEMM operations | 32, 64 (multiples of 16) |
| GROUP_SIZE_M | Batch grouping for processing | 8, 16 |
| num_warps | Warps per thread block | 4 (aligns with CU SIMDs) |
| matrix_instr_nonkdim | WMMA instruction size | 16 (fixed for RDNA4) |
| kpack | K-dimension packing factor | 1 (conservative) |
| num_stages | Pipeline stages (MoE only) | 2 |

Notes:

  • All dimensions must be multiples of 16 (WMMA tile size)
  • Configs are conservative; significant tuning potential remains
  • num_warps=4 aligns well with RDNA4’s compute unit structure
  • Larger block sizes (128x128) could be explored for batch operations, but they caused performance instability/inconsistency during prefill, so those entries were removed in favor of the default values
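
Given the multiples-of-16 constraint in the notes above, a small checker like the following (illustrative, not part of vLLM) can catch invalid hand-edited configs before they reach Triton:

# Validates that every tile dimension in the given config files is a multiple of
# the 16-wide WMMA tile noted above.
import json
import sys

def check_config(path: str) -> bool:
    ok = True
    with open(path) as f:
        cfg = json.load(f)
    for batch_key, params in cfg.items():
        for name in ("BLOCK_SIZE_M", "BLOCK_SIZE_N", "BLOCK_SIZE_K"):
            if params[name] % 16 != 0:
                print(f"{path} [{batch_key}] {name}={params[name]} is not a multiple of 16")
                ok = False
    return ok

if __name__ == "__main__":
    results = [check_config(p) for p in sys.argv[1:]]
    sys.exit(0 if all(results) else 1)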

Key Technical Details

Triton automatically detects and uses WMMA instructions when all these conditions are met:

  1. GPU Architecture: gfx1201 (RDNA4) is detected
  2. Data Types: FP8 (float8_e4m3fn) input tensors
  3. Operation: Matrix multiplication via tl.dot()
  4. Tile Sizes: Dimensions are multiples of 16 (WMMA tile size)

When vLLM imports aiter.ops.triton.gemm_a8w8_blockscale, it’s just importing Triton code. The Triton compiler examines the target architecture, sees FP8 operations on gfx1201, and automatically generates assembly code using RDNA4’s 16x16x16 WMMA instructions (v_wmma_f32_16x16x16_fp8_fp8).

Aside from the gfx1201 mapping insertion into AITER described above, no runtime patching or AITER library support is needed; pure Triton compilation does the heavy lifting.
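
To make the conditions above concrete, here is a standalone toy kernel (not vLLM's kernel; the block size and launch details are my own assumptions) that satisfies all four: FP8 E4M3 inputs, a tl.dot matmul, and a 16-multiple tile, so Triton should lower it to WMMA on gfx1201:

import torch
import triton
import triton.language as tl

@triton.jit
def tiny_fp8_dot(a_ptr, b_ptr, c_ptr, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    # Row-major indexing of contiguous BLOCK x BLOCK matrices
    a = tl.load(a_ptr + offs[:, None] * BLOCK + offs[None, :])
    b = tl.load(b_ptr + offs[:, None] * BLOCK + offs[None, :])
    c = tl.dot(a, b)  # fp8 x fp8 with fp32 accumulator; should lower to WMMA here
    tl.store(c_ptr + offs[:, None] * BLOCK + offs[None, :], c)

BLOCK = 32  # multiple of the 16x16 WMMA tile
a = torch.randn(BLOCK, BLOCK, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(BLOCK, BLOCK, device="cuda").to(torch.float8_e4m3fn)
c = torch.empty(BLOCK, BLOCK, device="cuda", dtype=torch.float32)
tiny_fp8_dot[(1,)](a, b, c, BLOCK=BLOCK)
print(c.sum())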

Execution Path

FP8 Model Load
    ↓
on_mi3xx() returns True (gfx1201 recognized)
    ↓
rocm_aiter_gemm_w8a8_blockscale_impl() [fp8_utils.py:71]
    ↓
is_aiter_triton_kernel_tuned(n, k) returns True
    ↓
Imports aiter.ops.triton.gemm_a8w8_blockscale (Triton code)
    ↓
Triton compiler detects gfx1201 + FP8 types
    ↓
Auto-generates WMMA instructions (v_wmma_f32_16x16x16_fp8_fp8)
    ↓
✓ Native FP8 WMMA execution!

Verification

Check Platform Detection

from vllm.platforms import rocm
import torch
print(f"GPU: {torch.cuda.get_device_properties(0).gcnArchName}")  # Should be gfx1201
print(f"Detected as MI3xx: {rocm.on_mi3xx()}")  # Should be True

Check AITER is Disabled

echo $VLLM_ROCM_USE_AITER  # Should output: 0

Look for Log Messages

✅ GOOD: "Using configuration from .../N=2048,K=1024,device_name=0x7551,dtype=fp8_w8a8..."
❌ BAD:  "Using default W8A8 Block FP8 kernel config"

Known Limitations

  1. Minimum Batch Size: RDNA4 requires M ≥ 16 for FP8 operations. Single-token generation may need padding; it showed failures in testing.
  2. AITER Runtime Patch Required: AITER's C++/ASM kernels don't work on RDNA4, but we still need AITER's Triton kernels. This requires patching AITER's architecture mapping at runtime before vLLM starts; otherwise you'll get a KeyError: 'gfx1201' crash.
  3. Kernel Configs: Current configs are conservative guesses. Significant performance improvements possible with tuning.
  4. Model Coverage: Configs cover common LLM dimensions. New models may need additional configs.

Future Optimization Opportunities

  1. Kernel Tuning:

    • Larger block sizes (128x128) for batch operations
    • Experiment with kpack=2,4 for better memory bandwidth
    • Profile optimal GROUP_SIZE_M values per matrix size
  2. Additional Matrix Sizes:

    • Cover more model architectures (Llama, Mistral, etc.)
    • Add configs for larger models (70B+, 405B)
    • Support for different TP splits (TP=4, TP=8)
  3. Batch Size Handling:

    • Implement automatic padding for M < 16 (rough sketch after this list)
    • Optimize for common batch sizes (1, 4, 8, 16, 32)
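
A rough sketch of the M ≥ 16 padding idea from item 3 (my own assumption, not existing vLLM code): pad the activation rows up to the 16-row WMMA minimum, run the FP8 GEMM, then slice back to the original M.

import torch

def pad_rows_to_16(x: torch.Tensor) -> tuple[torch.Tensor, int]:
    # Pad the batch (M) dimension up to the next multiple of 16 with zeros
    m = x.shape[0]
    pad = (-m) % 16
    if pad:
        filler = torch.zeros(pad, *x.shape[1:], dtype=x.dtype, device=x.device)
        x = torch.cat([x, filler], dim=0)
    return x, m

# Hypothetical use around the blockscale GEMM imported earlier:
#   x_pad, m = pad_rows_to_16(x_fp8)
#   out = gemm_a8w8_blockscale(x_pad, w_fp8, x_scale, w_scale, ...)[:m]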

Contributing

This work enables native FP8 support on RDNA4 for the first time in vLLM. If you use these modifications:

  1. Test other models: Try different FP8 models and report results
  2. Tune configs: Experiment with kernel parameters and share improvements
  3. Add matrix sizes: Contribute configs for models not yet covered
  4. Upstream: Help prepare patches for vLLM mainline

Acknowledgments

This implementation follows the MI350X Triton kernel path and builds upon AMD’s AITER library and vLLM’s FP8 quantization framework.

Key insight: RDNA4 uses the same standard FP8 E4M3FN format as MI350X (not FNUZ like MI300), making it compatible with MI350X’s Triton kernels after proper platform detection.
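For reference, PyTorch exposes both FP8 flavors as distinct dtypes, which is a quick way to see the distinction this relies on:

# Standard OCP E4M3FN (RDNA4 / MI350X) vs. the FNUZ variant (MI300-era)
import torch

x = torch.tensor([1.0])
print(x.to(torch.float8_e4m3fn))    # dtype=torch.float8_e4m3fn
print(x.to(torch.float8_e4m3fnuz))  # dtype=torch.float8_e4m3fnuz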


Status: Production-ready with ongoing optimization opportunities.

Last Updated: 2025-11-13

TunableOp results that further improved performance:

/vllm-tunableop/2025-11-08$ cat tunableop_results0.csv
Validator,PT_VERSION,2.9.0
Validator,ROCM_VERSION,7.0.0.0-38-9428210
Validator,HIPBLASLT_VERSION,100000-976b9c4a87
Validator,GCN_ARCH_NAME,gfx1201
Validator,ROCBLAS_VERSION,5.0.0.976b9c4a87
GemmTunableOp_BFloat16_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Rocblas_48090,0.0158741
GemmTunableOp_BFloat16_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Hipblaslt_47905,0.0160022
GemmTunableOp_BFloat16_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Hipblaslt_48249,0.0161586
GemmTunableOp_BFloat16_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Hipblaslt_48359,0.614931
GemmTunableOp_BFloat16_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.617321
GemmTunableOp_BFloat16_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Hipblaslt_47847,0.0357815
GemmTunableOp_BFloat16_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_47758,0.0159846
GemmTunableOp_BFloat16_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.612155
GemmTunableOp_BFloat16_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Hipblaslt_47652,0.0236782
GemmTunableOp_BFloat16_TN,tn_128_96_2048_ld_2048_2048_128,Gemm_Hipblaslt_48089,0.0157361
GemmTunableOp_BFloat16_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Rocblas_47754,0.020133
GemmTunableOp_BFloat16_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.0129801
GemmTunableOp_BFloat16_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Hipblaslt_47852,0.0159458
GemmTunableOp_BFloat16_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Hipblaslt_48089,0.015883
GemmTunableOp_BFloat16_TN,tn_128_3072_2048_ld_2048_2048_128,Default,0.0215866
GemmTunableOp_BFloat16_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Rocblas_48019,0.0165829
GemmTunableOp_BFloat16_TN,tn_128_256_2048_ld_2048_2048_128,Gemm_Rocblas_48275,0.0158254
GemmTunableOp_BFloat16_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Hipblaslt_48093,0.0155757
GemmTunableOp_BFloat16_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_48360,0.616733
GemmTunableOp_BFloat16_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Hipblaslt_47762,0.0152749
GemmTunableOp_BFloat16_TN,tn_128_128_2048_ld_2048_2048_128,Gemm_Hipblaslt_48093,0.0156526
GemmTunableOp_BFloat16_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_47762,0.0155398
GemmTunableOp_BFloat16_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Hipblaslt_47760,0.0158461
GemmTunableOp_BFloat16_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_48360,0.0173334
GemmTunableOp_BFloat16_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Hipblaslt_48260,0.0172693
GemmTunableOp_BFloat16_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Hipblaslt_47753,0.019419
GemmTunableOp_BFloat16_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Hipblaslt_48164,0.0179382
GemmTunableOp_float_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0465673
GemmTunableOp_float_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0472677
GemmTunableOp_float_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Rocblas_61376,0.0508242
GemmTunableOp_float_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Hipblaslt_61377,1.32029
GemmTunableOp_float_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_-9,1.05431
GemmTunableOp_float_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.348349
GemmTunableOp_float_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_61376,0.0598218
GemmTunableOp_float_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Hipblaslt_61377,1.34172
GemmTunableOp_float_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.167474
GemmTunableOp_float_TN,tn_128_96_2048_ld_2048_2048_128,Default,0.0472857
GemmTunableOp_float_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.117973
GemmTunableOp_float_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.00946049
GemmTunableOp_float_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0464592
GemmTunableOp_float_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0456692
GemmTunableOp_float_TN,tn_128_3072_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.123511
GemmTunableOp_float_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0447937
GemmTunableOp_float_TN,tn_128_256_2048_ld_2048_2048_128,Default,0.057404
GemmTunableOp_float_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0444132
GemmTunableOp_float_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_61383,1.37875
GemmTunableOp_float_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Rocblas_61378,0.0454013
GemmTunableOp_float_TN,tn_128_128_2048_ld_2048_2048_128,Default,0.0487681
GemmTunableOp_float_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_61377,0.0460332
GemmTunableOp_float_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.0819472
GemmTunableOp_float_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_61383,0.0825464
GemmTunableOp_float_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Rocblas_61377,0.104956
GemmTunableOp_float_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Hipblaslt_61382,0.129388
GemmTunableOp_float_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Rocblas_61377,0.124408
GemmTunableOp_Half_TN,tn_128_48_2048_ld_2048_2048_128,Gemm_Rocblas_58331,0.0138765
GemmTunableOp_Half_TN,tn_128_24_2048_ld_2048_2048_128,Gemm_Rocblas_58329,0.0139045
GemmTunableOp_Half_TN,tn_128_192_2048_ld_2048_2048_128,Gemm_Rocblas_58334,0.0139913
GemmTunableOp_Half_TN,tn_75968_4_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.609519
GemmTunableOp_Half_TN,tn_75968_1_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.608598
GemmTunableOp_Half_TN,tn_128_8192_2048_ld_2048_2048_128,Gemm_Rocblas_58149,0.0350798
GemmTunableOp_Half_TN,tn_128_384_2048_ld_2048_2048_128,Gemm_Hipblaslt_58325,0.0146849
GemmTunableOp_Half_TN,tn_75968_8_2048_ld_2048_2048_75968,Gemm_Rocblas_58331,0.602502
GemmTunableOp_Half_TN,tn_128_4096_2048_ld_2048_2048_128,Gemm_Hipblaslt_57525,0.0239766
GemmTunableOp_Half_TN,tn_128_96_2048_ld_2048_2048_128,Gemm_Hipblaslt_58323,0.0139522
GemmTunableOp_Half_TN,tn_128_2048_2048_ld_2048_2048_128,Gemm_Hipblaslt_57524,0.0194073
GemmTunableOp_Half_TN,tn_128_3072_2048_ld_2048_2048_128,Default,0.021287
GemmTunableOp_Half_TN,tn_128_8_2048_ld_2048_2048_128,Gemm_Hipblaslt_58325,0.0141572
GemmTunableOp_Half_TN,tn_128_4_2048_ld_2048_2048_128,Gemm_Rocblas_58325,0.0143933
GemmTunableOp_Half_TN,tn_128_2_2048_ld_2048_2048_128,Gemm_Rocblas_58324,0.0139865
GemmTunableOp_Half_TN,tn_128_1_2048_ld_2048_2048_128,Gemm_Rocblas_-9,0.00908485
GemmTunableOp_Half_TN,tn_128_16_2048_ld_2048_2048_128,Gemm_Hipblaslt_58322,0.0141397
GemmTunableOp_Half_TN,tn_128_256_2048_ld_2048_2048_128,Gemm_Rocblas_58324,0.0140341
GemmTunableOp_Half_TN,tn_75968_16_2048_ld_2048_2048_75968,Gemm_Rocblas_58329,0.606477
GemmTunableOp_Half_TN,tn_128_32_2048_ld_2048_2048_128,Gemm_Hipblaslt_58328,0.0134837
GemmTunableOp_Half_TN,tn_128_128_2048_ld_2048_2048_128,Gemm_Hipblaslt_58324,0.0138293
GemmTunableOp_Half_TN,tn_128_64_2048_ld_2048_2048_128,Gemm_Hipblaslt_58330,0.0137585
GemmTunableOp_Half_TN,tn_128_512_2048_ld_2048_2048_128,Gemm_Hipblaslt_58333,0.0150417
GemmTunableOp_Half_TN,tn_128_1024_2048_ld_2048_2048_128,Gemm_Rocblas_58328,0.0173309
GemmTunableOp_Half_TN,tn_128_767_2048_ld_2048_2048_128,Gemm_Rocblas_57801,0.0174229
GemmTunableOp_Half_TN,tn_128_1536_2048_ld_2048_2048_128,Gemm_Rocblas_57496,0.0187534
GemmTunableOp_Half_TN,tn_128_1013_2048_ld_2048_2048_128,Gemm_Hipblaslt_57891,0.0178402
