**sunxxuns** commented on Jan 22, 2026

## AMD MI350 Benchmark - OpenPI Pi0 (3.5B)

### Full Policy Inference (batch=1)

```
python scripts/benchmark_policy_inference.py
```

| Metric | AMD MI350 | NVIDIA H200 | Ratio (MI350 / H200) |
| --- | --- | --- | --- |
| Latency | 57.6 ms | 29.6 ms | 1.9x |
| Throughput | 17.36 Hz | 33.74 Hz | 0.51x |
| Memory | 7.10 GB | 7.03 GB | ~same |

Uses `torch.compile(mode="max-autotune")` + Aiter Flash Attention.
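
For reference, a minimal sketch of how a latency/throughput number like this can be taken; `load_policy` and `make_dummy_observation` are hypothetical stand-ins, not the repo's API (the real harness is `scripts/benchmark_policy_inference.py`):

```python
import time
import torch

# Hypothetical helpers -- the actual model loading and input construction
# live in scripts/benchmark_policy_inference.py.
policy = load_policy().cuda().eval()
policy = torch.compile(policy, mode="max-autotune")
obs = make_dummy_observation(batch_size=1)

with torch.no_grad():
    for _ in range(10):          # warmup: excludes compile/autotune time
        policy(obs)
    torch.cuda.synchronize()     # on ROCm builds, torch.cuda maps to HIP

    iters = 100
    t0 = time.perf_counter()
    for _ in range(iters):
        policy(obs)
    torch.cuda.synchronize()

latency_ms = (time.perf_counter() - t0) / iters * 1e3
print(f"latency: {latency_ms:.1f} ms  throughput: {1e3 / latency_ms:.2f} Hz")
```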

### Kernel Breakdown

| Category | MI350 Time | MI350 % | H200 % |
| --- | --- | --- | --- |
| Triton Fused | 39.55 ms | 72.3% | 77.2% |
| GEMM | 5.56 ms | 10.2% | 18.2% |
| Flash Attention | 2.43 ms | 4.4% | 2.3% |
| Other | 7.19 ms | 13.1% | 2.3% |
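
A breakdown like this can be reproduced with `torch.profiler` (a sketch, not necessarily the exact method used for the numbers above; `policy` and `obs` are the hypothetical handles from the previous sketch):

```python
from torch.profiler import profile, ProfilerActivity

# ProfilerActivity.CUDA also captures HIP kernels on ROCm builds of PyTorch.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    policy(obs)  # one compiled inference step

# Per-kernel GPU times; Triton-fused, GEMM, and flash-attention kernels can
# then be bucketed by kernel name to produce the category table above.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```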

### Precision Verification

```
python scripts/verify_precision.py
```

| Metric | Value |
| --- | --- |
| Cosine Similarity | 1.000000 |
| Result | PASSED |
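
A cosine-similarity check of this kind is straightforward to sketch; `eager_out` / `optimized_out` are hypothetical output tensors, and the pass threshold is an assumption (the real check is `scripts/verify_precision.py`):

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: torch.Tensor, b: torch.Tensor) -> float:
    # Flatten to 1-D and compare in float32 so low-precision outputs
    # (bf16/fp16) don't distort the metric.
    return F.cosine_similarity(
        a.flatten().float(), b.flatten().float(), dim=0
    ).item()

sim = cosine_sim(eager_out, optimized_out)  # hypothetical output tensors
print(f"Cosine Similarity: {sim:.6f}")
print("Result:", "PASSED" if sim > 0.9999 else "FAILED")  # assumed threshold
```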

### Branches

- `amd-bench` - MI350 development
- `h200-bench` - H200 reference

**sunxxuns** (Author) commented:

## AMD MI350 Benchmark Results

### Single-GPU Inference (3.3B Model)

```
python scripts/benchmark_mi350.py
```

| Batch | Seq Len | Samples/s | Latency / sample |
| --- | --- | --- | --- |
| 1 | 512 | 150 | 6.7 ms |
| 8 | 1024 | 132 | 7.6 ms |
| 16 | 512 | 266 | 3.8 ms |
| 32 | 256 | 534 | 1.9 ms |
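
The latency column matches `1 / samples_per_second`, i.e. it is per-sample latency. A sweep of this shape could be measured roughly as follows; `load_model` and the dummy token inputs are assumptions (the real script is `scripts/benchmark_mi350.py`):

```python
import time
import torch

model = load_model().cuda().eval()  # hypothetical loader for the 3.3B model

for batch, seq in [(1, 512), (8, 1024), (16, 512), (32, 256)]:
    x = torch.randint(0, 32000, (batch, seq), device="cuda")  # dummy token ids
    with torch.no_grad():
        for _ in range(5):                      # warmup
            model(x)
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        for _ in range(20):
            model(x)
        torch.cuda.synchronize()
    step_s = (time.perf_counter() - t0) / 20
    print(f"batch={batch:>2} seq={seq:>4} "
          f"samples/s={batch / step_s:6.0f} "
          f"latency/sample={step_s / batch * 1e3:.1f} ms")
```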

### 8-GPU DDP Training (3.3B Model)

```
torchrun --nproc_per_node=8 scripts/benchmark_mi350_ddp.py
```

| Batch/GPU | Total Batch | Seq Len | Samples/s | Step Time |
| --- | --- | --- | --- | --- |
| 4 | 32 | 512 | 225 | 142 ms |
| 8 | 64 | 512 | 329 | 195 ms |
| 8 | 64 | 1024 | 196 | 327 ms |
| 16 | 128 | 512 | 407 | 315 ms |
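
A minimal DDP skeleton for a run launched this way might look as follows; the `load_model` / `next_batch` helpers, the AdamW hyperparameters, and the HF-style `.loss` output are all assumptions:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK/LOCAL_RANK/WORLD_SIZE; the "nccl" backend is backed
# by RCCL on ROCm.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(load_model().cuda(), device_ids=[local_rank])   # hypothetical loader
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # assumed hparams

for step in range(100):
    batch = next_batch(device=local_rank)  # hypothetical data helper
    loss = model(**batch).loss             # assumes an HF-style .loss output
    optimizer.zero_grad()
    loss.backward()                        # DDP all-reduces gradients here
    optimizer.step()

dist.destroy_process_group()
```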

### Training Convergence (8 GPUs)

| Step | Loss |
| --- | --- |
| 0 | 10.56 |
| 20 | 2.22 |
| 50 | 0.007 |
| 100 | 0.005 |

Loss reduction: 99.95% (from 10.56 at step 0 to 0.005 at step 100).

### Accuracy

| Comparison | Cosine Similarity |
| --- | --- |
| Eager vs Optimized | 0.999999 |

### Enable Optimizations

```python
from transformers.models.gemma.modeling_gemma import set_use_aiter_attention
set_use_aiter_attention(True)
```

**sunxxuns** changed the title from *amd-demo: MI350 optimized kernels benchmark* to *[test] amd-demo: MI350 optimized kernels benchmark* on Jan 22, 2026

Summary:
- Aiter Flash Attention for AMD GPUs
- Triton kernels for RMSNorm, GELU+Mul (see the RMSNorm sketch after this list)
- Full policy inference: 142ms latency, 7Hz (Pi0 3.5B, batch=1)
- 8-GPU DDP training: 407 samples/s (3.3B model)
- Training convergence verified
- Perfetto traces included
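
As an illustration of what a fused Triton RMSNorm can look like, here is a minimal single-pass kernel. This is a generic sketch, not the PR's actual kernel; it assumes a contiguous input whose row width fits in one block. Note that Gemma's RMSNorm variant scales by `1 + weight` rather than the plain `weight` used here.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight, fused into one pass
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(out_ptr + row * n_cols + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Treat all leading dims as rows; assumes x is contiguous.
    rows, n_cols = x.reshape(-1, x.shape[-1]).shape
    out = torch.empty_like(x, dtype=torch.float32)
    BLOCK = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(rows,)](x, weight, out, n_cols, eps, BLOCK=BLOCK)
    return out.to(x.dtype)
```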