EasyLM provides a number of optimizers for training neural language models. The optimizers are implemented in the `optimizer.py` module.
Currently, the following optimizers are supported:
- AdamW
- PaLM: the optimizer described in the PaLM paper
In addition to optimizer configurations, the optimizer module also provides support for gradient accumulation.
The optimizer type can be selected by setting the `type` field in the optimizer
configuration. For example, to use the AdamW optimizer, we can set `type` to
`adamw` and configure the `adamw_optimizer` subfields:

```shell
python train.py --optimizer.type=adamw --optimizer.adamw_optimizer.lr=1e-4
```

To use gradient accumulation, we can set the `accumulate_gradient_steps` field
in the optimizer configuration. For example, to use gradient accumulation with
a step size of 2, we can set `accumulate_gradient_steps` to 2:

```shell
python train.py --optimizer.accumulate_gradient_steps=2
```

The following options are supported for the optimizer module:
- `type`: the optimizer type. Currently, `adamw` and `palm` are supported.
- `adamw_optimizer`: the configuration for the AdamW optimizer
- `palm_optimizer`: the configuration for the PaLM optimizer
- `accumulate_gradient_steps`: the number of steps for gradient accumulation
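To illustrate what gradient accumulation does, here is a minimal pure-Python sketch (not EasyLM's actual implementation): gradients from `accumulate_gradient_steps` consecutive micro-batches are averaged before a single parameter update is applied, which emulates a larger effective batch size.

```python
# Hypothetical sketch of gradient accumulation: average the gradients of
# N micro-batches, then apply one optimizer update with the result.

def accumulate_gradients(grad_fn, batches, accumulate_gradient_steps):
    """Average per-micro-batch gradients over `accumulate_gradient_steps` batches."""
    assert len(batches) == accumulate_gradient_steps
    total = 0.0
    for batch in batches:
        total += grad_fn(batch)  # gradient for one micro-batch
    # One update is applied with this averaged gradient
    return total / accumulate_gradient_steps

# With accumulate_gradient_steps=2, two micro-batch gradients are averaged:
g = accumulate_gradients(lambda b: 2.0 * b, [1.0, 3.0], 2)
# g == (2.0 + 6.0) / 2 == 4.0
```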
The AdamW optimizer implements AdamW with linear learning rate warmup and cosine learning rate decay. The following options are supported for the AdamW optimizer:
- `init_lr`: the initial learning rate
- `end_lr`: the final learning rate after decay
- `lr`: the peak learning rate
- `lr_warmup_steps`: the number of steps for linear learning rate warmup
- `lr_decay_steps`: the number of steps for cosine learning rate decay
- `b1`: the beta1 parameter for AdamW
- `b2`: the beta2 parameter for AdamW
- `clip_gradient`: the gradient clipping threshold
- `weight_decay`: the weight decay parameter for AdamW
- `bf16_momentum`: whether to use bf16 for momentum to save memory
- `multiply_by_parameter_scale`: whether to multiply the gradient by the parameter scale (as in Adafactor)
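The schedule described by these options can be sketched as follows. This is a hypothetical pure-Python illustration, not EasyLM's code; the exact boundary conventions (e.g. whether the peak is hit at or after the last warmup step) are assumptions.

```python
import math

# Hypothetical sketch: linear warmup from init_lr to lr over lr_warmup_steps,
# then cosine decay from lr to end_lr over lr_decay_steps.

def adamw_lr(step, init_lr, lr, end_lr, lr_warmup_steps, lr_decay_steps):
    if step < lr_warmup_steps:
        # Linear warmup: init_lr -> lr
        return init_lr + (lr - init_lr) * step / lr_warmup_steps
    # Cosine decay: lr -> end_lr, clamped at end_lr after lr_decay_steps
    progress = min((step - lr_warmup_steps) / lr_decay_steps, 1.0)
    return end_lr + 0.5 * (lr - end_lr) * (1.0 + math.cos(math.pi * progress))

# The peak learning rate `lr` is reached at the end of warmup, and the
# schedule settles at `end_lr` once lr_decay_steps have elapsed:
peak = adamw_lr(1000, init_lr=0.0, lr=1e-4, end_lr=1e-5,
                lr_warmup_steps=1000, lr_decay_steps=10000)
# peak == 1e-4
```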
The PaLM optimizer implements the optimizer described in the PaLM paper. It is essentially Adafactor with factoring disabled, an inverse square root learning rate decay schedule, and a matching weight decay schedule. The following options are supported for the PaLM optimizer:
- `lr`: the initial learning rate
- `lr_warmup_steps`: the number of steps for constant learning rate warmup
- `b1`: the beta1 parameter for Adafactor
- `b2`: the beta2 parameter for Adafactor
- `clip_gradient`: the gradient clipping threshold
- `weight_decay`: the weight decay parameter
- `bf16_momentum`: whether to use bf16 for momentum to save memory
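The PaLM-style schedule can be sketched like this. This is a hypothetical illustration based on the schedule described in the PaLM paper, not EasyLM's actual code: the learning rate is held constant at `lr` for `lr_warmup_steps` steps, then decays as the inverse square root of the step count.

```python
import math

# Hypothetical sketch: constant warmup at `lr`, then inverse square root decay.
# Continuity at the boundary is ensured by scaling with sqrt(lr_warmup_steps).

def palm_lr(step, lr, lr_warmup_steps):
    if step <= lr_warmup_steps:
        return lr                                    # constant warmup phase
    return lr * math.sqrt(lr_warmup_steps / step)    # inverse sqrt decay

# Constant during warmup, then decaying with the square root of the step:
# palm_lr(5000, lr=1e-2, lr_warmup_steps=10000)  -> 1e-2
# palm_lr(40000, lr=1e-2, lr_warmup_steps=10000) -> 5e-3
```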