Share PPL results #21

@saareliad

Description

Hi,
It was not clear to me from the article what your final PPL results are for each model.
Can you share them as well?

At first glance I thought you achieved the same or comparable PPL results, but I am not sure about that now. Can you clarify?

Do you have a baseline model with comparable PPL to the original base model?

Can someone use what you did as a baseline for smaller-scale research (e.g., 4-8 "commodity" GPUs)?

Extra detail on total training time:
I noticed that you count in tokens instead of steps,
where tokens_per_global_batch = global_batch_size * seq_len.
Using the parameters in the script, a simple calculation yields the following step counts:

| config | num GPUs | max tokens | seq len | base batch | global batch size | tokens per batch | required steps | PPL |
|---|---|---|---|---|---|---|---|---|
| single machine | 1 | 1.8B | 128 | 32 | 32 | 4096 | 439453.125 | ? |
| single machine | 2 | 1.8B | 128 | 32 | 64 | 8192 | 219726.5625 | ? |
| single machine | 4 | 1.8B | 128 | 32 | 128 | 16384 | 109863.2813 | ? |
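The step counts in the table above follow directly from the token budget and batch geometry. A minimal sketch of the calculation (the helper name `required_steps` is mine, not from the repo):

```python
def required_steps(max_tokens, seq_len, global_batch_size):
    """Steps needed to consume max_tokens, where each optimizer step
    processes global_batch_size * seq_len tokens."""
    tokens_per_batch = global_batch_size * seq_len
    return max_tokens / tokens_per_batch

# Single-machine configs from this repo's script: 1.8B tokens, seq_len=128
for num_gpus, gbs in [(1, 32), (2, 64), (4, 128)]:
    print(num_gpus, "GPUs:", required_steps(1.8e9, 128, gbs))  # 439453.125, 219726.5625, 109863.28125

# Original base_wiki103 config: 1.92B tokens, seq_len=150, global batch 64
print(required_steps(1.92e9, 150, 64))  # -> 200000.0
```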

Comparing with the base_wiki103 config from the original repo
(they used only data parallelism), we get:

| config | num GPUs | tokens | seq len | base batch | global batch size | tokens per batch | steps | PPL |
|---|---|---|---|---|---|---|---|---|
| original-base-wt103 | don't care | 1.92B | 150 | don't care | 64 | 9600 | 200000 | 24 |

=> They trained on many more tokens.
If your results are really comparable, the model you present here is worth using as a baseline for future TransformerXL experiments because it is faster. Right?
