**sunxxuns** commented on Jan 22, 2026

## AMD MI350 Benchmark - OpenPI Pi0 (3.5B)

### Full Policy Inference (batch=1)

```
python scripts/benchmark_policy_inference.py
```

| Metric | AMD MI350 | NVIDIA H200 | Ratio (MI350 / H200) |
| --- | --- | --- | --- |
| Latency | 57.6 ms | 29.6 ms | 1.9x |
| Throughput | 17.36 Hz | 33.74 Hz | 0.51x |
| Memory | 7.10 GB | 7.03 GB | ~same |

Uses `torch.compile(mode="max-autotune")` + Aiter Flash Attention.
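
For reference, a minimal sketch of how a latency/throughput number like this can be taken; `load_policy` and `make_dummy_observation` are hypothetical stand-ins, not the repo's API (the real harness is `scripts/benchmark_policy_inference.py`):

```python
import time
import torch

# Hypothetical helpers -- the actual model loading and input construction
# live in scripts/benchmark_policy_inference.py.
policy = load_policy().cuda().eval()
policy = torch.compile(policy, mode="max-autotune")
obs = make_dummy_observation(batch_size=1)

with torch.no_grad():
    for _ in range(10):          # warmup: excludes compile/autotune time
        policy(obs)
    torch.cuda.synchronize()     # on ROCm builds, torch.cuda maps to HIP

    iters = 100
    t0 = time.perf_counter()
    for _ in range(iters):
        policy(obs)
    torch.cuda.synchronize()

latency_ms = (time.perf_counter() - t0) / iters * 1e3
print(f"latency: {latency_ms:.1f} ms  throughput: {1e3 / latency_ms:.2f} Hz")
```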

### Kernel Breakdown

| Category | MI350 Time | MI350 % | H200 % |
| --- | --- | --- | --- |
| Triton Fused | 39.55 ms | 72.3% | 77.2% |
| GEMM | 5.56 ms | 10.2% | 18.2% |
| Flash Attention | 2.43 ms | 4.4% | 2.3% |
| Other | 7.19 ms | 13.1% | 2.3% |
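
A breakdown like this can be reproduced with `torch.profiler` (a sketch, not necessarily the exact method used for the numbers above; `policy` and `obs` are the hypothetical handles from the previous sketch):

```python
from torch.profiler import profile, ProfilerActivity

# ProfilerActivity.CUDA also captures HIP kernels on ROCm builds of PyTorch.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    policy(obs)  # one compiled inference step

# Per-kernel GPU times; Triton-fused, GEMM, and flash-attention kernels can
# then be bucketed by kernel name to produce the category table above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```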

### Precision Verification

```
python scripts/verify_precision.py
```

| Metric | Value |
| --- | --- |
| Cosine Similarity | 1.000000 |
| Result | PASSED |
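
A cosine-similarity check of this kind is straightforward to sketch; `eager_out` / `optimized_out` are hypothetical output tensors, and the pass threshold is an assumption (the real check is `scripts/verify_precision.py`):

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Flatten to 1-D and compare in float32 so low-precision outputs
    # (bf16/fp16) don't distort the metric.
    return F.cosine_similarity(
        a.flatten().float(), b.flatten().float(), dim=0
    ).item()

sim = cosine_sim(eager_out, optimized_out)  # hypothetical output tensors
print(f"Cosine Similarity: {sim:.6f}")
print("Result:", "PASSED" if sim > 0.9999 else "FAILED")  # assumed threshold
```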

### Branches

- `amd-bench` - MI350 development
- `h200-bench` - H200 reference

**sunxxuns** (Author) commented:

## AMD MI350 Benchmark Results

### Single-GPU Inference (3.3B Model)

```
python scripts/benchmark_mi350.py
```

| Batch | Seq Len | Samples/s | Latency / sample |
| --- | --- | --- | --- |
| 1 | 512 | 150 | 6.7 ms |
| 8 | 1024 | 132 | 7.6 ms |
| 16 | 512 | 266 | 3.8 ms |
| 32 | 256 | 534 | 1.9 ms |
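
The latency column matches `1 / samples_per_second`, i.e. it is per-sample latency. A sweep of this shape could be measured roughly as follows; `load_model` and the dummy token inputs are assumptions (the real script is `scripts/benchmark_mi350.py`):

```python
import time
import torch

model = load_model().cuda().eval()  # hypothetical loader for the 3.3B model

for batch, seq in [(1, 512), (8, 1024), (16, 512), (32, 256)]:
    x = torch.randint(0, 32000, (batch, seq), device="cuda")  # dummy token ids
    with torch.no_grad():
        for _ in range(5):                      # warmup
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            model(x)
        torch.cuda.synchronize()
    step_s = (time.perf_counter() - t0) / 20
    print(f"batch={batch:>2} seq={seq:>4} "
          f"samples/s={batch / step_s:6.0f} "
          f"latency/sample={step_s / batch * 1e3:.1f} ms")
```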

### 8-GPU DDP Training (3.3B Model)

```
torchrun --nproc_per_node=8 scripts/benchmark_mi350_ddp.py
```

| Batch/GPU | Total Batch | Seq Len | Samples/s | Step Time |
| --- | --- | --- | --- | --- |
| 4 | 32 | 512 | 225 | 142 ms |
| 8 | 64 | 512 | 329 | 195 ms |
| 8 | 64 | 1024 | 196 | 327 ms |
| 16 | 128 | 512 | 407 | 315 ms |
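
A minimal DDP skeleton for a run launched this way might look as follows; the `load_model` / `next_batch` helpers, the AdamW hyperparameters, and the HF-style `.loss` output are all assumptions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; the "nccl" backend is backed
# by RCCL on ROCm.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(load_model().cuda(), device_ids=[local_rank])   # hypothetical loader
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed hparams

for step in range(100):
    batch = next_batch(device=local_rank)  # hypothetical data helper
    loss = model(**batch).loss             # assumes an HF-style .loss output
    optimizer.zero_grad()
    loss.backward()                        # DDP all-reduces gradients here
    optimizer.step()

dist.destroy_process_group()
```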

### Training Convergence (8 GPUs)

| Step | Loss |
| --- | --- |
| 0 | 10.56 |
| 20 | 2.22 |
| 50 | 0.007 |
| 100 | 0.005 |

Loss reduction: 99.95% (from 10.56 at step 0 to 0.005 at step 100).

### Accuracy

| Comparison | Cosine Similarity |
| --- | --- |
| Eager vs Optimized | 0.999999 |

### Enable Optimizations

```python
from transformers.models.gemma.modeling_gemma import set_use_aiter_attention
set_use_aiter_attention(True)
```

**sunxxuns** changed the title from *amd-demo: MI350 optimized kernels benchmark* to *[test] amd-demo: MI350 optimized kernels benchmark* on Jan 22, 2026

Summary:
- Aiter Flash Attention for AMD GPUs
- Triton kernels for RMSNorm, GELU+Mul (see the RMSNorm sketch after this list)
- Full policy inference: 142ms latency, 7Hz (Pi0 3.5B, batch=1)
- 8-GPU DDP training: 407 samples/s (3.3B model)
- Training convergence verified
- Perfetto traces included
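
As an illustration of what a fused Triton RMSNorm can look like, here is a minimal single-pass kernel. This is a generic sketch, not the PR's actual kernel; it assumes a contiguous input whose row width fits in one block. Note that Gemma's RMSNorm variant scales by `1 + weight` rather than the plain `weight` used here.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight, fused into one pass
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Treat all leading dims as rows; assumes x is contiguous.
    rows, n_cols = x.reshape(-1, x.shape[-1]).shape
    out = torch.empty_like(x, dtype=torch.float32)
    BLOCK = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out.to(x.dtype)
```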