torchao.float8: update with AMD MI300X benchmark results (#2736)
I got a devgpu with 8 AMD MI300X GPUs, ran the torchtitan benchmarks (without any performance debugging), and added the numbers I saw to the readme.
The tensorwise number looks lower than expected; we can debug/fix this in a future PR.
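For context, the scaling recipes referenced in this change are selected through torchao's float8 conversion API. Below is a minimal, hedged sketch (not part of this PR or of the benchmark scripts) assuming the `convert_to_float8_training` and `Float8LinearConfig.from_recipe_name` API documented in this README; exact names and signatures may differ across torchao versions.

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# toy model; the real benchmarks train Llama 3 8B via torchtitan
model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(4)]).cuda().bfloat16()

# pick one of the recipes listed in the README: "tensorwise" (fastest, default),
# "rowwise", or "rowwise_with_gw_hp" (most accurate)
config = Float8LinearConfig.from_recipe_name("tensorwise")
convert_to_float8_training(model, config=config)  # swaps nn.Linear -> Float8Linear in place

model = torch.compile(model)  # float8 speedups depend on torch.compile-generated fused kernels
out = model(torch.randn(8192, 4096, device="cuda", dtype=torch.bfloat16))
```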
torchao/float8/README.md: 22 additions & 15 deletions
@@ -14,6 +14,7 @@ and up to [**1.25x at 8 GPU / 8B parameter count scale**](#training-benchmarks),
* seamless composability with [DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html), including [FSDP2 with float8 weight all-gather](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359) and [Async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487)
* seamless composability with [PyTorch Activation Checkpointing](https://pytorch.org/blog/activation-checkpointing-techniques/)
* three different scaling recipes to trade off performance vs accuracy: tensorwise (fastest), rowwise, rowwise_with_gw_hp (most accurate)
+ * supports both NVIDIA and AMD hardware

ℹ️ <em>See the [feature tracker](https://github.com/pytorch/ao/issues/556) for upcoming features.</em>
- E2E speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
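The bullets above mention FSDP2 with float8 weight all-gather; the sketch below is a hedged illustration (not from this PR) of how that composability is typically wired up. It assumes `Float8LinearConfig` exposes an `enable_fsdp_float8_all_gather` flag and that `fully_shard` is importable from `torch.distributed.fsdp` (the import path differs on older PyTorch versions); it is meant to be launched with torchrun, not run standalone.

```python
# launch with: torchrun --nproc_per_node=<num_gpus> this_file.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2; older versions: torch.distributed._composable.fsdp
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(4)]).cuda().bfloat16()

# convert first, then shard, so FSDP2 all-gathers float8 weights instead of bf16 weights
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)
for layer in model:
    fully_shard(layer)
fully_shard(model)

out = model(torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16))
dist.destroy_process_group()
```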
@@ -210,7 +217,7 @@ and tensorwise scaling. The training benchmarks were all run using:
**Reproducing training benchmarks**

To reproduce these benchmarks, you can follow these steps:

- 1. On a machine with 8 H100 GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
+ 1. On a machine with compatible GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
3. From the `torchao/benchmarks/float8/training/` directory, you can run the following commands to reproduce the benchmarks above:
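The actual reproduction commands live in the scripts in that directory. As a rough single-GPU sanity check of float8 vs bf16 throughput (for example on one MI300X or H100), something like the hedged sketch below can be used; it is not one of the README's reproduction commands, assumes torchao's in-place `convert_to_float8_training` API, and times only a stack of linear layers rather than an end-to-end model.

```python
import copy
import time
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def time_per_iter(model, x, iters=50):
    # warmup, then time forward + backward; synchronize so GPU work is included
    for _ in range(10):
        model(x).sum().backward()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x).sum().backward()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

m_bf16 = nn.Sequential(*[nn.Linear(8192, 8192, bias=False) for _ in range(4)]).cuda().bfloat16()
m_fp8 = copy.deepcopy(m_bf16)
convert_to_float8_training(m_fp8)  # default (tensorwise) recipe

# compile both models so the comparison is apples-to-apples; float8 relies on torch.compile
m_bf16, m_fp8 = torch.compile(m_bf16), torch.compile(m_fp8)
x = torch.randn(16384, 8192, device="cuda", dtype=torch.bfloat16)
print(f"bf16:   {time_per_iter(m_bf16, x) * 1e3:.2f} ms/iter")
print(f"float8: {time_per_iter(m_fp8, x) * 1e3:.2f} ms/iter")
```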