Performance experiments over AdamW #28
Experiments with learning rates of 1e-5, 2e-5, and 3e-5: performance was worse in each case. Total batch size of 192 (~1.6 million tokens per batch), a decoupled weight decay of 0.1 for all runs, and a cosine warmup scheduler.
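
For reference, a minimal sketch of this kind of run configuration, assuming the `lion-pytorch` package and the `transformers` cosine-with-warmup schedule; the step counts, sequence length, and training-loop interface are illustrative assumptions rather than the exact setup used.

```python
from lion_pytorch import Lion
from transformers import get_cosine_schedule_with_warmup

# `model` and `train_loader` are assumed to exist already.
optimizer = Lion(
    model.parameters(),
    lr=1e-5,            # one of the swept rates (1e-5, 2e-5, 3e-5)
    betas=(0.9, 0.99),
    weight_decay=0.1,   # decoupled weight decay used for all runs
)

# Cosine schedule with linear warmup; the step counts are placeholders.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=1_000,
    num_training_steps=100_000,
)

# Total batch size 192; at an assumed sequence length of 8192 tokens that is
# 192 * 8192 ≈ 1.57M tokens per batch, consistent with the ~1.6M figure above.
for batch in train_loader:
    loss = model(batch, return_loss=True)  # assumed PaLM-style loss interface
    loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```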

Optimizer setup for PaLM:

```python
import torch
from lion_pytorch import Lion  # assumed import; Lion as published in lion-pytorch

# `LayerNorm` is the model's own LayerNorm class (its gain parameter is named
# `gamma`); import it from wherever the PaLM model is defined.


def decoupled_optimizer(
    model, learning_rate, weight_decay, beta_1, beta_2, use_lion=True,
):
    # Map every named parameter to its tensor so it can be looked up by name.
    param_dict = {}
    for param_name, param in model.named_parameters():
        param_dict[param_name] = param

    # Names of parameters that should NOT receive weight decay:
    # the token embedding weights and the LayerNorm gains.
    no_decay = []
    for module_name, module in model.named_modules():
        for module_type in (LayerNorm, torch.nn.Embedding):
            if isinstance(module, module_type):
                if module_name == "token_emb":
                    # Token embedding: exclude its weight from decay.
                    no_decay.append(f"{module_name}.weight")
                else:
                    # LayerNorm: exclude its gain ("gamma") from decay.
                    no_decay.append(f"{module_name}.gamma")
                # The module's type has been identified; stop checking.
                break

    # Names of parameters that SHOULD receive weight decay: Linear layer weights.
    decay = []
    for module_name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            decay.append(f"{module_name}.weight")

    # Build the actual parameter lists. The output projection
    # ("to_logits.weight") is excluded from the decay group.
    decay_param = [
        param_dict[name] for name in decay if name != "to_logits.weight"
    ]
    no_decay_param = [param_dict[name] for name in no_decay]

    # Two parameter groups: one with decoupled weight decay, one without.
    grouped_params = [
        {"params": decay_param, "weight_decay": weight_decay},
        {"params": no_decay_param, "weight_decay": 0.0},
    ]

    if use_lion:
        optimizer = Lion(
            grouped_params,
            lr=learning_rate,
            betas=(beta_1, beta_2),
        )
    else:
        # Assumed AdamW fallback for the baseline runs; the original snippet
        # only showed the Lion branch.
        optimizer = torch.optim.AdamW(
            grouped_params,
            lr=learning_rate,
            betas=(beta_1, beta_2),
        )

    return optimizer
```
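
For context, a hypothetical call to the helper above; the hyperparameter values are illustrative placeholders, and `model` is assumed to be an already instantiated PaLM model.

```python
# Hypothetical usage sketch; none of these values are the exact run settings.
optimizer = decoupled_optimizer(
    model,               # assumed: an instantiated PaLM model
    learning_rate=1e-4,  # placeholder value
    weight_decay=0.1,    # decoupled weight decay used in the runs above
    beta_1=0.9,
    beta_2=0.99,
    use_lion=True,
)
```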

Experiments with learning rates of 1e-6 and 3e-6: performance was worse in each case. Total batch size of 192 (~1.6 million tokens per batch), a decoupled weight decay of 0.1 for all runs, and a cosine warmup scheduler.

Hi @xiangning-chen, I appreciate your great research. I am testing numerous language models of varying scales based on Phil's PaLM model. I was wondering if you could provide any input on incorporating Lion into natural language experiments. Thank you, Enrico

Imagine this as one experiment to include: https://github.com/KellerJordan/modded-nanogpt/blob/master/README.md

Hi Phil,
I have been testing some different Lion hyperparameters with PaLM at the 1B scale (total batch size of 192, ~1.6 million tokens per batch), using a decoupled weight decay of 0.1 for all runs and a linear warmup scheduler. So far the best configuration was:
This had about a 0.2 loss improvement over AdamW. Memory consumption was ~4% lower, and there was a speedup of about 0.14 s per iteration, lowering the iteration time from 1.65 s to 1.51 s.
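
For what it is worth, a minimal sketch of one way such a per-step comparison can be instrumented; the `model(batch, return_loss=True)` call is an assumed interface, not the actual training code. Running the same step under both optimizers gives directly comparable iteration times and peak memory.

```python
import time

import torch


def timed_step(model, optimizer, batch):
    """Run one training step and return (step_time_s, peak_mem_bytes)."""
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    start = time.perf_counter()

    loss = model(batch, return_loss=True)  # assumed loss interface
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    peak_mem = torch.cuda.max_memory_allocated()
    return elapsed, peak_mem
```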
Wandb logs:
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-43-11---Vmlldzo0MzE0MTcy
https://wandb.ai/a_man_chooses/palm/reports/loss-23-05-09-21-48-41---Vmlldzo0MzE0MjAz
I am going to be testing at the 2B scale next and will report the results. I am also going to try adjusting the learning rate and betas further. I was wondering if you had noticed a significant difference in performance as you increased the size of the model?
Thank you,
Enrico