
Commit 927cbfb

Update float8 README.md (#2774)
1. Move e2e benchmarks closer to the top.
2. Simplify some of the key features text to make it more concise.
1 parent 2db4c76 commit 927cbfb

1 file changed: +45 -61 lines changed


torchao/float8/README.md

Lines changed: 45 additions & 61 deletions
@@ -10,15 +10,55 @@ and composable with key systems such as autograd, ```torch.compile``` and distri

 * e2e pretraining speedups of up to [**1.5x at 512 GPU / 405B parameter count scale**](https://pytorch.org/blog/training-using-float8-fsdp2/),
 and up to [**1.25x at 8 GPU / 8B parameter count scale**](#training-benchmarks), with performance and accuracy validated on up to [**2k GPUs**](https://pytorch.org/blog/accelerating-large-scale-training-and-convergence-with-pytorch-float8-rowwise-on-crusoe-2k-h200s/), via [torchtitan's float8 integration](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md)
-* seamless composability with [torch.compile](https://docs.pytorch.org/docs/stable/torch.compiler.html)
-* seamless composability with [DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html), including [FSDP2 with float8 weight all-gather](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359) and [Async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487)
-* seamless composability with [PyTorch Activation Checkpointing](https://pytorch.org/blog/activation-checkpointing-techniques/)
-* three different scaling recipes to trade off performance vs accuracy: tensorwise (fastest), rowwise, rowwise_with_gw_hp (most accurate)
+* seamless composability with [torch.compile](https://docs.pytorch.org/docs/stable/torch.compiler.html), [DTensor](https://docs.pytorch.org/docs/stable/distributed.tensor.html), [FSDP2 with float8 weight all-gather](https://dev-discuss.pytorch.org/t/enabling-float8-all-gather-in-fsdp2/2359), [Async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487), and [PyTorch AC](https://pytorch.org/blog/activation-checkpointing-techniques/)
+* three recipes to trade off performance vs accuracy: `tensorwise` (fastest), `rowwise`, `rowwise_with_gw_hp` (most accurate)
 * supports both NVIDIA and AMD hardware

 ℹ️ <em>See the [feature tracker](https://github.com/pytorch/ao/issues/556) for upcoming features.</em>

-ℹ️ <em>These APIs are training-only and float8-only, and we plan to [unify them with the rest of torchao](https://github.com/pytorch/ao/issues/894) in the future.</em>
+# e2e training benchmarks
+
+[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance.
+
+#### NVIDIA H100
+
+- Single-node training on 8xH100 GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
+- pytorch version: `2.7.0a0+gitb98af95`, torchao version: `0.10.0+git890e0ac8`, torchtitan version: `0.0.2`
+
+| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
+| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
+| Llama3-8b | none (bfloat16) | 47.65 | 6150 | -
+| Llama3-8b | tensorwise with float8 all-gather | 47.77 | 7689.5 | 25.03%
+| Llama3-8b | rowwise with bfloat16 all-gather | 47.79 | 6768 | 10.05%
+
+#### AMD MI300x
+
+- Single-node training on 8xMI300X GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
+- pytorch version: `2.9.0.dev20250811+rocm6.4`, torchao version `0.13.0+git4fc4068d6`, torchtitan commit `2c8b5947991239913d67e2f7d22a255c3e2a9694`
+
+| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
+| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
+| Llama3-8b | none (bfloat16) | 39.09 | 5376.5 | -
+| Llama3-8b | tensorwise with float8 all-gather | 39.07 | 6166.0 | 14.68%
+| Llama3-8b | rowwise_with_gw_hp with bfloat16 all-gather | 39.32 | 6100.0 | 13.46%
+| Llama3-8b | rowwise with bfloat16 all-gather | 39.32 | 5891.0 | 9.57%
+
+**Important notes**:
+- E2E speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
+- Rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve.
+
+**Reproducing training benchmarks**
+To reproduce these benchmarks, you can follow these steps:
+
+1. On a machine with compatible GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
+including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
+2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
+3. From the `torchao/benchmarks/float8/training/` directory, you can run the following commands to reproduce the benchmarks above:
+- bf16 + compile: `TORCHTITAN_ROOT=<path> ./torchtitan_benchmark.sh`
+- float8 tensorwise with float8 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="tensorwise" ./torchtitan_benchmark.sh`
+- float8 rowwise with bf16 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="rowwise" ./torchtitan_benchmark.sh`
+
+See the float8 training benchmarking [guide](.torchao/benchmarks/float8/training/README.md) for more details.

 # Single GPU User API
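
The updated feature list above names three scaling recipes but this hunk does not show how one is selected. Below is a minimal sketch, assuming torchao's float8 training API (`Float8LinearConfig`, `convert_to_float8_training`) covered in the README's Single GPU User API section; the toy model and recipe choice are illustrative, not part of this commit.

```python
# Minimal sketch, not part of this commit: picking one of the recipes named
# above ("tensorwise", "rowwise", "rowwise_with_gw_hp") before training.
# Assumes torchao's float8 training API: Float8LinearConfig and
# convert_to_float8_training.
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# toy stand-in for a real transformer; float8 matmuls pay off at large M, K, N
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()

# "tensorwise" is fastest, "rowwise_with_gw_hp" is most accurate
config = Float8LinearConfig.from_recipe_name("rowwise")

# swaps eligible nn.Linear modules for float8-enabled linears, in place
convert_to_float8_training(model, config=config)

# float8 training is designed to compose with torch.compile
model = torch.compile(model)
```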

@@ -167,62 +207,6 @@ python test/float8/test_fsdp2/test_fsdp2.py
 ./test/float8/test_everything.sh
 ```

-# Benchmarking
-
-```bash
-# benchmark the torch._scaled_mm function on LLaMa 2 70B shapes
-./benchmarks/float8/bench_matmul.py
-
-# benchmark fw/bw of `Linear` and `Float8Linear` on LLaMa 2 70B shapes
-# make sure to turn on torch.compile to get the best performance
-./benchmarks/float8/bench_linear_float8.py -o ../tmp/test.txt --compile
-```
-
-### Training benchmarks
-
-[Torchtitan](https://github.com/pytorch/torchtitan) was used to benchmark float8 training performance, for both rowwise
-and tensorwise scaling. The training benchmarks were all run using:
-
-#### NVIDIA H100
-
-- Single-node training on 8xH100 GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
-- pytorch version: `2.7.0a0+gitb98af95`, torchao version: `0.10.0+git890e0ac8`, torchtitan version: `0.0.2`
-
-| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
-| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
-| Llama3-8b | none (bfloat16) | 47.65 | 6150 | -
-| Llama3-8b | tensorwise with float8 all-gather | 47.77 | 7689.5 | 25.03%
-| Llama3-8b | rowwise with bfloat16 all-gather | 47.79 | 6768 | 10.05%
-
-#### AMD MI300x
-
-- Single-node training on 8xMI300X GPUs, batch size 1, sequence length 8192, steps 100, `torch.compile`, FSDP2, per-op SAC
-- pytorch version: `2.9.0.dev20250811+rocm6.4`, torchao version `0.13.0+git4fc4068d6`, torchtitan commit `2c8b5947991239913d67e2f7d22a255c3e2a9694`
-
-| Model | Scaling | Peak Memory (GB) | Median tokens/second | Speedup over baseline
-| ------------- | ---------------------------------- | ------------------| -------------------- | ---------------------
-| Llama3-8b | none (bfloat16) | 39.09 | 5376.5 | -
-| Llama3-8b | tensorwise with float8 all-gather | 39.07 | 6166.0 | 14.68%
-| Llama3-8b | rowwise_with_gw_hp with bfloat16 all-gather | 39.32 | 6100.0 | 13.46%
-| Llama3-8b | rowwise with bfloat16 all-gather | 39.32 | 5891.0 | 9.57%
-
-**Important notes**:
-- E2E speedups increase as M,K,N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
-- Rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve.
-
-**Reproducing training benchmarks**
-To reproduce these benchmarks, you can follow these steps:
-
-1. On a machine with compatible GPUs, clone torchtitan and follow local installation [steps](https://github.com/pytorch/torchtitan?tab=readme-ov-file#installation),
-including [downloading a tokenizer](https://github.com/pytorch/torchtitan?tab=readme-ov-file#downloading-a-tokenizer).
-2. Install torchao following these [steps](https://github.com/pytorch/ao/tree/main?tab=readme-ov-file#installation).
-3. From the `torchao/benchmarks/float8/training/` directory, you can run the following commands to reproduce the benchmarks above:
-- bf16 + compile: `TORCHTITAN_ROOT=<path> ./torchtitan_benchmark.sh`
-- float8 tensorwise with float8 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="tensorwise" ./torchtitan_benchmark.sh`
-- float8 rowwise with bf16 all-gather + compile: `TORCHTITAN_ROOT=<path> FLOAT8_RECIPE_WITH_BEST_SETTINGS="rowwise" ./torchtitan_benchmark.sh`
-
-See the float8 training benchmarking [guide](.torchao/benchmarks/float8/training/README.md) for more details.
-
 # E2E training + inference flow

 The first step in the E2E is to train your model and save a checkpoint. The second step is to load the checkpoint and optionally apply inference quantization before serving the model.
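
The context line above describes the two-step flow, but the diff stops before the README's example. As a hedged sketch only, assuming torchao's inference quantization API (`quantize_`, `Float8DynamicActivationFloat8WeightConfig`), which is not shown in this diff, the flow could look like:

```python
# Hedged sketch of the two-step flow described above; the concrete example
# lives in the full README, not in this diff. Assumes torchao's inference
# quantization API (quantize_, Float8DynamicActivationFloat8WeightConfig).
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, quantize_

# step 1: train with float8 and save a checkpoint
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
convert_to_float8_training(model, config=Float8LinearConfig.from_recipe_name("tensorwise"))
# ... training loop elided ...
# with dynamic scaling recipes the float8 linears keep high-precision weights,
# so the saved state_dict should line up with the plain bf16 architecture
torch.save(model.state_dict(), "checkpoint.pt")

# step 2: load the checkpoint and optionally quantize for inference before serving
inference_model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096)).to(torch.bfloat16).cuda()
inference_model.load_state_dict(torch.load("checkpoint.pt", weights_only=True))
quantize_(inference_model, Float8DynamicActivationFloat8WeightConfig())
```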
