
Commit 927cbfb

Update float8 README.md (#2774)
1. Move e2e benchmarks closer to the top.
2. Simplify some of the key features text to make it more concise.
1 parent 2db4c76 commit 927cbfb

1 file changed: +45 -61 lines changed


torchao/float8/README.md

Lines changed: 45 additions & 61 deletions
@@ -10,15 +10,55 @@ and composable with key systems such as autograd, ```torch.compile``` and distri

 * e2e pretraining speedups of up to [**1.5x at 512 GPU / 405B parameter count scale**](https://pytorch.org/blog/training-using-float8-fsdp2/),
 and up to [**1.25x at 8 GPU / 8B parameter count scale**](#training-benchmarks), with performance and accuracy validated on up to [**2k GPUs**](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/), via [torchtitan's float8 integration](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* seamless composability with [torch.compile](https://docs.pytorch.org/docs/stable/torch.compiler.html)
-* seamless composability with [DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html), including [FSDP2 with float8 weight all-gather](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359) and [Async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487)
-* seamless composability with [PyTorch Activation Checkpointing](https://pytorch.org/blog/activation-checkpointing-techniques/)
-* three different scaling recipes to trade off performance vs accuracy: tensorwise (fastest), rowwise, rowwise_with_gw_hp (most accurate)
+* seamless composability with [torch.compile](https://docs.pytorch.org/docs/stable/torch.compiler.html), [DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html), [FSDP2 with float8 weight all-gather](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359), [Async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487), and [PyTorch AC](https://pytorch.org/blog/activation-checkpointing-techniques/)
+* three recipes to trade off performance vs accuracy: `tensorwise` (fastest), `rowwise`, `rowwise_with_gw_hp` (most accurate)
 * supports both NVIDIA and AMD hardware

 ℹ️ <em>See the [feature tracker](https://github.com/pytorch/ao/issues/556) for upcoming features.</em>

-ℹ️ <em>These APIs are training-only and float8-only, and we plan to [unify them with the rest of torchao](https://github.com/pytorch/ao/issues/894) in the future.</em>
+# e2e training benchmarks
+
+[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance.
+
+#### NVIDIA H100
+
+- Single-node training on 8xH100 GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
+- pytorch version: `2.7.0a0+gitb98af95`, torchao version: `0.10.0+git890e0ac8`, torchtitan version: `0.0.2`
+
+| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
+| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
+| Llama3-8b | none (bfloat16) | 47.65 | 6150 | -
+| Llama3-8b | tensorwise with float8 all-gather | 47.77 | 7689.5 | 25.03%
+| Llama3-8b | rowwise with bfloat16 all-gather | 47.79 | 6768 | 10.05%
+
+#### AMD MI300x
+
+- Single-node training on 8xMI300X GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
+- pytorch version: `2.9.0.dev20250811+rocm6.4`, torchao version `0.13.0+git4fc4068d6`, torchtitan commit `2c8b5947991239913d67e2f7d22a255c3e2a9694`
+
+| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
+| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
+| Llama3-8b | none (bfloat16) | 39.09 | 5376.5 | -
+| Llama3-8b | tensorwise with float8 all-gather | 39.07 | 6166.0 | 14.68%
+| Llama3-8b | rowwise_with_gw_hp with bfloat16 all-gather | 39.32 | 6100.0 | 13.46%
+| Llama3-8b | rowwise with bfloat16 all-gather | 39.32 | 5891.0 | 9.57%
+
+**Important notes**:
+- E2E speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
+- Rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve.
+
+**Reproducing training benchmarks**
+To reproduce these benchmarks, you can follow these steps:
+
+1. On a machine with compatible GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
+including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
+2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
+3. From the `torchao/benchmarks/float8/training/` directory, you can run the following commands to reproduce the benchmarks above:
+- bf16 + compile: `TORCHTITAN_ROOT=<path> ./torchtitan_benchmark.sh`
+- float8 tensorwise with float8 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="tensorwise" ./torchtitan_benchmark.sh`
+- float8 rowwise with bf16 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="rowwise" ./torchtitan_benchmark.sh`
+
+See the float8 training benchmarking [guide](.torchao/benchmarks/float8/training/README.md) for more details.

 # Single GPU User API
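
The updated feature list above names three scaling recipes but this hunk does not show how one is selected. Below is a minimal sketch, assuming torchao's float8 training API (`Float8LinearConfig`, `convert_to_float8_training`) covered in the README's Single GPU User API section; the toy model and recipe choice are illustrative, not part of this commit.

```python
# Minimal sketch, not part of this commit: picking one of the recipes named
# above ("tensorwise", "rowwise", "rowwise_with_gw_hp") before training.
# Assumes torchao's float8 training API: Float8LinearConfig and
# convert_to_float8_training.
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# toy stand-in for a real transformer; float8 matmuls pay off at large M, K, N
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# "tensorwise" is fastest, "rowwise_with_gw_hp" is most accurate
config = Float8LinearConfig.from_recipe_name("rowwise")

# swaps eligible nn.Linear modules for float8-enabled linears, in place
convert_to_float8_training(model, config=config)

# float8 training is designed to compose with torch.compile
model = torch.compile(model)
```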

@@ -167,62 +207,6 @@ python test/float8/test_fsdp2/test_fsdp2.py
 ./test/float8/test_everything.sh
 ```

-# Benchmarking
-
-```bash
-# benchmark the torch._scaled_mm function on LLaMa 2 70B shapes
-./benchmarks/float8/bench_matmul.py
-
-# benchmark fw/bw of `Linear` and `Float8Linear` on LLaMa 2 70B shapes
-# make sure to turn on torch.compile to get the best performance
-./benchmarks/float8/bench_linear_float8.py -o ../tmp/test.txt --compile
-```
-
-### Training benchmarks
-
-[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance, for both rowwise
-and tensorwise scaling. The training benchmarks were all run using:
-
-#### NVIDIA H100
-
-- Single-node training on 8xH100 GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
-- pytorch version: `2.7.0a0+gitb98af95`, torchao version: `0.10.0+git890e0ac8`, torchtitan version: `0.0.2`
-
-| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
-| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
-| Llama3-8b | none (bfloat16) | 47.65 | 6150 | -
-| Llama3-8b | tensorwise with float8 all-gather | 47.77 | 7689.5 | 25.03%
-| Llama3-8b | rowwise with bfloat16 all-gather | 47.79 | 6768 | 10.05%
-
-#### AMD MI300x
-
-- Single-node training on 8xMI300X GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
-- pytorch version: `2.9.0.dev20250811+rocm6.4`, torchao version `0.13.0+git4fc4068d6`, torchtitan commit `2c8b5947991239913d67e2f7d22a255c3e2a9694`
-
-| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
-| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
-| Llama3-8b | none (bfloat16) | 39.09 | 5376.5 | -
-| Llama3-8b | tensorwise with float8 all-gather | 39.07 | 6166.0 | 14.68%
-| Llama3-8b | rowwise_with_gw_hp with bfloat16 all-gather | 39.32 | 6100.0 | 13.46%
-| Llama3-8b | rowwise with bfloat16 all-gather | 39.32 | 5891.0 | 9.57%
-
-**Important notes**:
-- E2E speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
-- Rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve.
-
-**Reproducing training benchmarks**
-To reproduce these benchmarks, you can follow these steps:
-
-1. On a machine with compatible GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
-including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
-2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
-3. From the `torchao/benchmarks/float8/training/` directory, you can run the following commands to reproduce the benchmarks above:
-- bf16 + compile: `TORCHTITAN_ROOT=<path> ./torchtitan_benchmark.sh`
-- float8 tensorwise with float8 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="tensorwise" ./torchtitan_benchmark.sh`
-- float8 rowwise with bf16 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="rowwise" ./torchtitan_benchmark.sh`
-
-See the float8 training benchmarking [guide](.torchao/benchmarks/float8/training/README.md) for more details.
-
 # E2E training + inference flow

 The first step in the E2E is to train your model and save a checkpoint. The second step is to load the checkpoint and optionally apply inference quantization before serving the model.
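
The context line above describes the two-step flow, but the diff stops before the README's example. As a hedged sketch only, assuming torchao's inference quantization API (`quantize_`, `Float8DynamicActivationFloat8WeightConfig`), which is not shown in this diff, the flow could look like:

```python
# Hedged sketch of the two-step flow described above; the concrete example
# lives in the full README, not in this diff. Assumes torchao's inference
# quantization API (quantize_, Float8DynamicActivationFloat8WeightConfig).
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, quantize_

# step 1: train with float8 and save a checkpoint
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
convert_to_float8_training(model, config=Float8LinearConfig.from_recipe_name("tensorwise"))
# ... training loop elided ...
# with dynamic scaling recipes the float8 linears keep high-precision weights,
# so the saved state_dict should line up with the plain bf16 architecture
torch.save(model.state_dict(), "checkpoint.pt")

# step 2: load the checkpoint and optionally quantize for inference before serving
inference_model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
inference_model.load_state_dict(torch.load("checkpoint.pt", weights_only=True))
quantize_(inference_model, Float8DynamicActivationFloat8WeightConfig())
```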
