
Commit 09011f5

regisss authored and kashif committed
Update BridgeTower blog post with H100 benchmark (huggingface#1426)
1 parent 219b914 commit 09011f5

File tree

1 file changed: 41 additions and 32 deletions


bridgetower.md

Lines changed: 41 additions & 32 deletions
@@ -12,9 +12,11 @@ authors:
<!-- {blog_metadata} -->
<!-- {authors} -->

-[Optimum Habana v1.6](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **almost x3 speedups compared to A100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. Two new features contribute to the performance improvement: hardware-accelerated data loading and a fast DDP implementation.
+*Update (29/08/2023): A benchmark on H100 was added to this blog post. Also, all performance numbers have been updated with newer versions of software.*

-*These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models.* This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.
+[Optimum Habana v1.7](https://github.com/huggingface/optimum-habana/tree/main) on Habana Gaudi2 achieves **x2.5 speedups compared to A100 and x1.4 compared to H100** when fine-tuning BridgeTower, a state-of-the-art vision-language model. This performance improvement relies on hardware-accelerated data loading to make the most of your devices.
+
+*These techniques apply to any other workloads constrained by data loading, which is frequently the case for many types of vision models.* This post will take you through the process and benchmark we used to compare BridgeTower fine-tuning on Habana Gaudi2, Nvidia H100 and Nvidia A100 80GB. It also demonstrates how easy it is to take advantage of these features in transformers-based models.

## BridgeTower
@@ -26,7 +28,9 @@ Pre-trained with only 4M images (see the detail [below](#benchmark)), BridgeTowe

## Hardware

-[Nvidia A100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/a100/) includes the 3rd generation of the [Tensor Core technology](https://www.nvidia.com/en-us/data-center/tensor-cores/). Although a newer generation got released recently (H100), this is still the fastest GPU that you will find at most cloud providers. We use here the 80GB-memory variant which also offers faster memory bandwidth than the 40GB one.
+[NVIDIA H100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/h100/) is the latest and fastest generation of Nvidia GPUs. It includes a dedicated Transformer Engine that enables fp8 mixed-precision runs. Each device has 80GB of memory.
+
+[Nvidia A100 Tensor Core GPU](https://www.nvidia.com/en-us/data-center/a100/) includes the 3rd generation of the [Tensor Core technology](https://www.nvidia.com/en-us/data-center/tensor-cores/). This is still the fastest GPU that you will find at most cloud providers. We use the 80GB-memory variant here, which also offers faster memory bandwidth than the 40GB one.

[Habana Gaudi2](https://habana.ai/products/gaudi2/) is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices called HPUs with 96GB of memory each. Check out [our previous blog post](https://huggingface.co/blog/habana-gaudi-2-bloom#habana-gaudi2) for a more in-depth introduction and a guide showing how to access it through the [Intel Developer Cloud](https://www.intel.com/content/www/us/en/secure/developer/devcloud/cloud-launchpad.html). Unlike many AI accelerators in the market, advanced features are very easy to apply to make the most of Gaudi2 with [Optimum Habana](https://huggingface.co/docs/optimum/habana/index), which enables users to port Transformers-compatible scripts to Gaudi with just a 2-line change.

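As an illustration, here is a minimal sketch of what that 2-line change looks like in practice. The checkpoint id and the Gaudi configuration name are assumptions for illustration only, and argument names may differ slightly between Optimum Habana versions:

```python
from transformers import AutoModel
# The 2-line change: import the Gaudi counterparts of Trainer and TrainingArguments.
from optimum.habana import GaudiConfig, GaudiTrainer, GaudiTrainingArguments

# Checkpoint id assumed for illustration (a BridgeTower Large checkpoint on the Hub).
model = AutoModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

training_args = GaudiTrainingArguments(   # drop-in replacement for TrainingArguments
    output_dir="bridgetower-newyorker",   # hypothetical output directory
    use_habana=True,                      # run on HPU
    use_lazy_mode=True,
    per_device_train_batch_size=48,
)

gaudi_config = GaudiConfig.from_pretrained("Habana/clip")  # example Gaudi config, assumed here
trainer = GaudiTrainer(                   # drop-in replacement for Trainer
    model=model,
    gaudi_config=gaudi_config,
    args=training_args,
    # train_dataset=..., data_collator=... as in any Transformers training script
)
```
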
@@ -37,7 +41,7 @@ To benchmark training, we are going to fine-tune a [BridgeTower Large checkpoint

We will further fine-tune this checkpoint on the [New Yorker Caption Contest dataset](https://huggingface.co/datasets/jmhessel/newyorker_caption_contest) which consists of cartoons from The New Yorker and the most voted captions.

-Hyperparameters are all the same for both accelerators, except the batch size: we managed to fit 40 samples on Gaudi2 against 32 on A100. You can check them out [here](https://huggingface.co/regisss/bridgetower-newyorker-gaudi2-8x#training-hyperparameters) for Gaudi2 and [there](https://huggingface.co/regisss/bridgetower-newyorker-a100-8x#training-hyperparameters) for A100.
+Hyperparameters are the same for all accelerators. We used a batch size of 48 samples for each device. You can check the hyperparameters out [here](https://huggingface.co/regisss/bridgetower-newyorker-gaudi2-8x#training-hyperparameters) for Gaudi2 and [there](https://huggingface.co/regisss/bridgetower-newyorker-a100-8x#training-hyperparameters) for A100.

**When dealing with datasets involving images, data loading is frequently a bottleneck** because many costly operations are computed on CPU (image decoding, image augmentations) and then full images are sent to the training devices. Ideally, *we would like to send only raw bytes to devices and then perform decoding and various image transformations on device*. But let's see first how to *easily* allocate more resources to data loading for accelerating your runs.

@@ -48,28 +52,30 @@ When image loading is done on CPU, a quick way to speed it up would be to alloca

The default is 0, which means that data is loaded in the main process. This may not be optimal as the main process has many things to manage. We can set it to 1 to have one fully dedicated subprocess for data loading. When several subprocesses are allocated, each one of them will be responsible for preparing a batch. This means that RAM consumption will increase with the number of workers. One recommendation would be to set it to the number of CPU cores, but those cores may not be fully free so you will have to try it out to find the best configuration.

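For reference, here is an illustrative sketch of where this knob lives; under the hood it is passed to PyTorch's `DataLoader(num_workers=...)`. The output directory and other values are placeholders only (the same argument also exists on `GaudiTrainingArguments` in Optimum Habana):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bridgetower-newyorker",   # hypothetical output directory
    bf16=True,                            # bfloat16/float32 mixed precision, as in the benchmark
    per_device_train_batch_size=48,
    dataloader_num_workers=2,             # 0 (default) = load batches in the main process
)
```
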
-Let's run the two following experiments:
-- a mixed-precision (*bfloat16*/*float*) run distributed across 8 devices where data loading is performed by the same process as everything else (i.e. `dataloader_num_workers=0`)
-- a mixed-precision (*bfloat16*/*float*) run distributed across 8 devices with 1 dedicated subprocess for data loading (i.e. `dataloader_num_workers=1`)
+Let's run the three following experiments:
+- a mixed-precision (*bfloat16*/*float32*) run distributed across 8 devices where data loading is performed by the same process as everything else (i.e. `dataloader_num_workers=0`)
+- a mixed-precision (*bfloat16*/*float32*) run distributed across 8 devices with 1 dedicated subprocess for data loading (i.e. `dataloader_num_workers=1`)
+- the same run with `dataloader_num_workers=2`

-Here are the throughputs we got on Gaudi2 and A100:
+Here are the throughputs we got on Gaudi2, H100 and A100:

-| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` |
-|:----------:|:--------------------------:|:--------------------------:|
-| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s |
-| A100 GPU | 210.5 samples/s | 296.6 samples/s |
+| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=2` |
+|:----------:|:--------------------------:|:--------------------------:|:--------------------------:|
+| Gaudi2 HPU | 601.5 samples/s | 747.4 samples/s | 768.7 samples/s |
+| H100 GPU | 336.5 samples/s | 580.1 samples/s | 602.1 samples/s |
+| A100 GPU | 227.5 samples/s | 339.7 samples/s | 345.4 samples/s |

-We first see that **Gaudi2 is x2.16 faster than A100** with `dataloader_num_workers=1` and x2.53 faster with `dataloader_num_workers=0`, which is on par with [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!
+We first see that **Gaudi2 is x1.28 faster than H100** with `dataloader_num_workers=2`, x1.29 faster with `dataloader_num_workers=1` and x1.79 faster with `dataloader_num_workers=0`. Gaudi2 is also much faster than the previous generation since it is **x2.23 faster than A100** with `dataloader_num_workers=2`, x2.20 faster with `dataloader_num_workers=1` and x2.64 faster with `dataloader_num_workers=0`, which is even better than [the speedups we previously reported](https://huggingface.co/blog/habana-gaudi-2-benchmark)!

-Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.20 on Gaudi2 and x1.41 on A100.
+Second, we see that **allocating more resources for data loading can lead to easy speedups**: x1.28 on Gaudi2, x1.79 on H100 and x1.52 on A100.

-We also ran experiments with several dedicated subprocesses for data loading but performance was not better than with `dataloader_num_workers=1` for both Gaudi2 and A100.
-Thus, **using `dataloader_num_workers=1` is usually a good first way of accelerating your runs involving images!**
+We also ran experiments with more dedicated subprocesses for data loading, but performance was not better than with `dataloader_num_workers=2` on any of the accelerators.
+Thus, **using `dataloader_num_workers>0` is usually a good first way of accelerating your runs involving images!**

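As a quick sanity check, the speedups quoted above can be recomputed directly from the throughput table (illustrative snippet):

```python
# Recompute the quoted speedups from the throughput table above (samples/s).
throughputs = {  # dataloader_num_workers = 0, 1, 2
    "Gaudi2 HPU": (601.5, 747.4, 768.7),
    "H100 GPU": (336.5, 580.1, 602.1),
    "A100 GPU": (227.5, 339.7, 345.4),
}

for device, (w0, _, w2) in throughputs.items():
    print(f"{device}: x{w2 / w0:.2f} going from dataloader_num_workers=0 to 2")

gaudi2, h100, a100 = (throughputs[d][2] for d in ("Gaudi2 HPU", "H100 GPU", "A100 GPU"))
print(f"Gaudi2 vs H100: x{gaudi2 / h100:.2f}, Gaudi2 vs A100: x{gaudi2 / a100:.2f}")
```
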
Tensorboard logs can be visualized [here](https://huggingface.co/regisss/bridgetower-newyorker-gaudi2-8x/tensorboard) for Gaudi2 and [there](https://huggingface.co/regisss/bridgetower-newyorker-a100-8x/tensorboard) for A100.


-### Optimum Habana's fast DDP
+<!-- ### Optimum Habana's fast DDP

Before delving into how to perform hardware-accelerated data loading, let's look at another very easy way of speeding up your distributed runs on Gaudi. The new release of Optimum Habana, version 1.6.0, introduced a new feature that allows users to choose the distribution strategy to use:
- `distribution_strategy="ddp"` to use PyTorch [`DistributedDataParallel`](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) (DDP)
@@ -79,12 +85,12 @@ Optimum Habana's fast DDP does not split parameter gradients into buckets as [DD

Simply using `distribution_strategy="fast_ddp"` (and keeping `dataloader_num_workers=1`) on Gaudi2 gives us 705.9 samples/s. **This is x1.10 faster than with DDP and x2.38 faster than A100!**

-So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.38 speedup compared to A100 with `dataloader_num_workers=1`.
+So adding just two training arguments (`dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`) led to a x1.33 speedup on Gaudi2 and to a x2.38 speedup compared to A100 with `dataloader_num_workers=1`. -->


### Hardware-accelerated data loading with Optimum Habana

-For even larger speedups, we are now going to move as many data loading operations as possible from the CPU to the accelerator devices (i.e. HPUs on Gaudi2 or GPUs on A100). This can be done on Gaudi2 using Habana's [media pipeline](https://docs.habana.ai/en/latest/Media_Pipeline/index.html).
+For even larger speedups, we are now going to move as many data loading operations as possible from the CPU to the accelerator devices (i.e. HPUs on Gaudi2 or GPUs on A100/H100). This can be done on Gaudi2 using Habana's [media pipeline](https://docs.habana.ai/en/latest/Media_Pipeline/index.html).

Given a dataset, most dataloaders follow the following recipe:

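The steps of that recipe are listed in the full post, outside this hunk. Roughly, a typical CPU-bound image dataloader looks like the following illustrative sketch, where decoding and augmentations all run on CPU before the batch is copied to the device (file names and transforms are hypothetical):

```python
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from torchvision import transforms


class ImageCaptionDataset(Dataset):
    """Illustrative CPU-side recipe: read bytes, then decode and augment on CPU."""

    def __init__(self, image_paths, captions):
        self.image_paths = image_paths
        self.captions = captions
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(288),  # augmentation computed on CPU
            transforms.ToTensor(),              # full image tensor built on CPU
        ])

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")  # decoding on CPU
        return self.transform(image), self.captions[idx]


# Hypothetical sample data; in the post this is the New Yorker Caption Contest dataset.
paths = ["cartoon_0.jpg", "cartoon_1.jpg"]
captions = ["a first caption", "a second caption"]
loader = DataLoader(ImageCaptionDataset(paths, captions), batch_size=2, num_workers=2)
# Only once a batch is fully prepared on CPU is it sent to the accelerator (HPU/GPU).
```
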
@@ -115,22 +121,23 @@ To implement this on Gaudi2, we have got you covered: the [contrastive image-tex

For interested readers, a lower-level overview is given in the documentation of Gaudi [here](https://docs.habana.ai/en/latest/Media_Pipeline/index.html) and the list of all supported operators is available [there](https://docs.habana.ai/en/latest/Media_Pipeline/Operators.html).

-We are now going to benchmark a run with `dataloader_num_workers=1`, `distribution_strategy="fast_ddp"` and `mediapipe_dataloader` since all these optimizations are compatible with each other:
+We are now going to re-run the previous experiments adding the `mediapipe_dataloader` argument since it is compatible with `dataloader_num_workers`:

-| Device | `dataloader_num_workers=0` | `dataloader_num_workers=1` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` | `dataloader_num_workers=1` + `distribution_strategy="fast_ddp"` + `mediapipe_dataloader` |
-|:----------:|:--------------------------:|:--------------------------:|:---------------:|:---------------:|
-| Gaudi2 HPU | 532.4 samples/s | 639.7 samples/s | 705.9 samples/s | 802.1 samples/s |
-| A100 GPU | 210.5 samples/s | 296.6 samples/s | / | / |
+| Device | `dataloader_num_workers=0` | `dataloader_num_workers=2` | `dataloader_num_workers=2` + `mediapipe_dataloader` |
+|:----------:|:--------------------------:|:--------------------------:|:---------------:|
+| Gaudi2 HPU | 601.5 samples/s | 768.7 samples/s | 847.7 samples/s |
+| H100 GPU | 336.5 samples/s | 602.1 samples/s | / |
+| A100 GPU | 227.5 samples/s | 345.4 samples/s | / |

-We got an additional x1.14 speedup compared to the previous run with `dataloader_num_workers=1` and `distribution_strategy="fast_ddp"`.
-This final run is thus x1.51 faster than our base run on Gaudi2 **simply adding 3 ready-to-use training arguments.** It is also **x2.70 faster than A100 with `dataloader_num_workers=1`!**
+We got an additional x1.10 speedup compared to the previous run with `dataloader_num_workers=2` only.
+This final run is thus x1.41 faster than our base run on Gaudi2 **by simply adding 2 ready-to-use training arguments.** It is also **x1.41 faster than H100** and **x2.45 faster than A100** with `dataloader_num_workers=2`!


### Reproducing this benchmark

To reproduce this benchmark, you first need to get access to Gaudi2 through the [Intel Developer Cloud](https://www.intel.com/content/www/us/en/secure/developer/devcloud/cloud-launchpad.html) (see [this guide](https://huggingface.co/blog/habana-gaudi-2-benchmark#how-to-get-access-to-gaudi2) for more information).

-Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py` that you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:
+Then, you need to install the latest version of Optimum Habana and run `run_bridgetower.py`, which you can find [here](https://github.com/huggingface/optimum-habana/blob/main/examples/contrastive-image-text/run_bridgetower.py). Here is how to do it:

```bash
pip install optimum[habana]
@@ -157,27 +164,29 @@ python ../gaudi_spawn.py --use_mpi --world_size 8 run_bridgetower.py \
--throughput_warmup_steps 3 \
--logging_steps 10
```
-which corresponds to the case `--dataloader_num_workers 0`. You can then add `--dataloader_num_workers 1`, `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` to test other configurations.
+which corresponds to the case `--dataloader_num_workers 0`. You can then add `--dataloader_num_workers N` and `--mediapipe_dataloader` to test other configurations.

To push your model and Tensorboard logs to the Hugging Face Hub, you will have to log in to your account beforehand with:
```bash
huggingface-cli login
```

-For A100, you can use the same `run_bridgetower.py` script with a few small changes:
+For A100 and H100, you can use the same `run_bridgetower.py` script with a few small changes (see the sketch after this list):
- Replace `GaudiTrainer` and `GaudiTrainingArguments` with `Trainer` and `TrainingArguments` from Transformers
- Remove references to `GaudiConfig`, `gaudi_config` and `HabanaDataloaderTrainer`
- Import `set_seed` directly from Transformers: `from transformers import set_seed`

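Here is a hedged sketch of what those changes boil down to; the checkpoint id and hyperparameters are illustrative assumptions, not an excerpt of the upstream script:

```python
from transformers import AutoModel, Trainer, TrainingArguments, set_seed  # instead of GaudiTrainer / GaudiTrainingArguments

set_seed(42)
# Checkpoint id assumed for illustration (a BridgeTower Large checkpoint on the Hub).
model = AutoModel.from_pretrained("BridgeTower/bridgetower-large-itm-mlm-itc")

training_args = TrainingArguments(   # GaudiTrainingArguments -> TrainingArguments
    output_dir="bridgetower-newyorker",
    bf16=True,
    per_device_train_batch_size=48,
    dataloader_num_workers=2,
)

# No GaudiConfig / gaudi_config and no HabanaDataloaderTrainer on A100/H100:
trainer = Trainer(                   # GaudiTrainer -> Trainer
    model=model,
    args=training_args,
    # train_dataset=..., data_collator=... as in the original script
)
```
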
-The results displayed in this benchmark were obtained with a Nvidia A100 80GB GCP instance with 8 GPUS.
+The results displayed in this benchmark were obtained with a Nvidia H100 Lambda instance and a Nvidia A100 80GB GCP instance, both with 8 devices, using [Nvidia's Docker images](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/index.html).
+
+Note that `--mediapipe_dataloader` is compatible with Gaudi2 only and will not work with A100/H100.

-Note that `--distribution_strategy fast_ddp` and `--mediapipe_dataloader` are compatible with Gaudi2 only and will not work with A100.
+fp8 results on H100 using [Transformer Engine](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html) are not available because the code currently crashes; supporting it would require modifying the modeling of BridgeTower in Transformers. We will revisit this comparison when fp8 is supported on Gaudi2.


## Conclusion

When dealing with images, we presented two solutions to speed up your training workflows: allocating more resources to the dataloader, and decoding and augmenting images directly on accelerator devices rather than on CPU.
-We showed that it leads to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is almost 3x faster than Nvidia A100 80GB with Transformers!**
+We showed that they lead to dramatic speedups when training a SOTA vision-language model like BridgeTower: **Habana Gaudi2 with Optimum Habana is about x1.4 faster than Nvidia H100 and x2.5 faster than Nvidia A100 80GB with Transformers!**
And this is super easy to use as you just need to provide a few additional training arguments.

To go further, we are looking forward to using HPU graphs for training models even faster and to presenting how to use DeepSpeed ZeRO-3 on Gaudi2 to accelerate the training of your LLMs. Stay tuned!
