Mixed Precision Training #336
-
In the example at https://github.com/google/flax/tree/main/examples/wmt, the model dtype is specified as bfloat16 (if running on TPU): https://github.com/google/flax/blob/2ac765a5c056dc57bcaa70ba5e0bd2f4933d2ed0/examples/wmt/train.py#L455

However, the optimiser states are entirely in float32, matching the dtype of the params (float32) and grads (float32): https://github.com/google/flax/blob/2ac765a5c056dc57bcaa70ba5e0bd2f4933d2ed0/examples/wmt/train.py#L497-L508

cc @marcvanzee
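(For context, here is a minimal sketch of the behaviour in question, using a hypothetical single-layer module rather than the WMT Transformer: in Flax linen, the `dtype` argument sets the computation dtype, while the parameters themselves are still initialised in float32.)

```python
import jax
import jax.numpy as jnp
import flax.linen as nn
from typing import Any

# Hypothetical toy model (not the WMT Transformer): `dtype` controls the dtype
# of the computation, while parameters are initialised in float32 by default.
class TinyDense(nn.Module):
    dtype: Any = jnp.bfloat16

    @nn.compact
    def __call__(self, x):
        return nn.Dense(features=4, dtype=self.dtype)(x)

variables = TinyDense().init(jax.random.PRNGKey(0), jnp.ones((1, 8), jnp.float32))
print(jax.tree_util.tree_map(lambda p: p.dtype, variables))
# All parameter leaves are float32, even though computation runs in bfloat16.
```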
-
The Adam optimiser, or one of its commonly used variants such as AdamW, holds two accumulators (mu and nu) per model weight. For a model of size 1GB, this results in an optimiser state of size 2GB. Thus, before even loading a batch of training data, 3GB of memory is already required on the accelerator device, of which 2/3 comes from the optimiser. This can limit the maximum permissible per-device batch size, and naturally raises the question of how one might reduce the memory requirements of the optimiser.

Of course, one might switch to an alternative optimiser that holds only one accumulator per weight and factors the second-order estimate, such as Adafactor, but the factored estimate of the second-order moments might degrade training results compared to the unfactored approach taken in Adam.

An alternative approach is mixed precision training. Here, the model weights are kept in full precision (float32), but the optimiser states are stored in half precision (bfloat16). Since each of the two accumulators then requires half as many bytes as the model parameter to which it is bound, the optimiser memory is halved: for a full precision model of size 1GB, the optimiser now only occupies 1GB. Training entirely in half precision, in which the model parameters are also kept in half precision, is generally not advised, as it leads to poor numerical stability. Consequently, for mixed precision training, the parameters should be kept in full precision and the optimiser states in half precision. However, it is less clear what data types (dtypes) are appropriate for the other variables in a train step.
Option 1:
Option 2:
Here is a Colab that implements both options for a simple network consisting of a single linear layer: https://drive.google.com/file/d/1xqq24YP_MPrf2j3u2i0MOypCDZOOpQJK/view?usp=sharing
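For illustration, here is a minimal sketch of the general idea, using Optax rather than the flax.optim API from the linked example; the toy single-layer setup and the `to_bf16` helper are assumptions for this sketch, not code from the Colab. Parameters and gradients stay in float32, while the Adam accumulators are stored in bfloat16 between steps.

```python
import jax
import jax.numpy as jnp
import optax

# Hypothetical toy parameters, kept in full precision (float32).
params = {"kernel": jnp.zeros((512, 512), jnp.float32)}
tx = optax.adam(1e-3)

def to_bf16(tree):
    # Cast float32 leaves (the mu/nu accumulators) to bfloat16;
    # integer leaves such as the step count are left untouched.
    return jax.tree_util.tree_map(
        lambda x: x.astype(jnp.bfloat16) if x.dtype == jnp.float32 else x, tree)

opt_state = to_bf16(tx.init(params))  # optimiser state stored in bfloat16

def loss_fn(p, x):
    return jnp.mean((x @ p["kernel"]) ** 2)

@jax.jit
def train_step(params, opt_state, x):
    grads = jax.grad(loss_fn)(params, x)                 # float32 grads
    updates, new_state = tx.update(grads, opt_state, params)
    new_params = optax.apply_updates(params, updates)    # params stay float32
    return new_params, to_bf16(new_state)                # re-cast accumulators

params, opt_state = train_step(params, opt_state, jnp.ones((8, 512), jnp.float32))
```

Note that Optax's `adam` also accepts a `mu_dtype` argument for storing the first moment in reduced precision; the sketch above simply casts the whole state by hand so that both accumulators end up in bfloat16.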