Hi,
It was not clear to me from the article what your final PPL results are for each model.
Can you share them too?
At first glance I thought you achieved the same or comparable PPL results, but now I am not sure. Can you clarify?
Do you have a baseline model with comparable PPL to the original base model?
Can someone use what you did as a baseline for smaller-scale research (4-8 "commodity" GPUs, for example)?
Extra detail on total training time:
I noticed that you count in tokens instead of steps,
where `tokens_per_global_batch = global_batch_size * seq_len`.
Using the parameters in the script, `required_steps = max_tokens / tokens_per_global_batch` yields, in steps (see the sketch after the table):
| config | num GPUs | max tokens | seq len | per-GPU batch | global batch size | tokens per batch | required steps | PPL |
|---|---|---|---|---|---|---|---|---|
| single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
| single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
| single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
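
To make the arithmetic reproducible, here is a minimal sketch of the calculation behind the table (the parameter names are mine, not necessarily those used in the training script):

```python
# Minimal sketch of the step-count calculation above.
# Parameter names are illustrative, not taken from the training script.
def required_steps(max_tokens, seq_len, per_gpu_batch, num_gpus):
    global_batch_size = per_gpu_batch * num_gpus           # data-parallel scaling
    tokens_per_global_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_global_batch

for num_gpus in (1, 2, 4):
    steps = required_steps(max_tokens=1.8e9, seq_len=128,
                           per_gpu_batch=32, num_gpus=num_gpus)
    print(f"{num_gpus} GPU(s): {steps:,.0f} steps")
# 1 GPU(s): 439,453 steps
# 2 GPU(s): 219,727 steps
# 4 GPU(s): 109,863 steps
```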
Comparing with the base_wiki103 config from the original repo
(they used only data parallelism), we get:
| config | num GPUs | total tokens | seq len | per-GPU batch | global batch size | tokens per batch | steps | PPL |
|---|---|---|---|---|---|---|---|---|
| original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |
=> They trained on more tokens overall (1.92B vs. 1.8B), at a longer sequence length.
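
As a quick sanity check on that comparison (numbers taken from the table above):

```python
# Total tokens in the original base_wiki103 run, from the table above.
steps, seq_len, global_batch = 200_000, 150, 64
tokens_per_batch = seq_len * global_batch      # 9,600 tokens per step
total_tokens = steps * tokens_per_batch        # 1,920,000,000
print(f"{total_tokens / 1e9:.2f}B tokens")     # 1.92B vs. the 1.8B budget here
```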
If your results are really comparable, the model you present here is worth using as a baseline for future Transformer-XL experiments because it's faster. Right?