Replies: 2 comments
- A possible reason is that the local mcore model does not support flash-attn.
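One way to test this hypothesis is to confirm that flash-attn is installed and that the baseline run actually uses it. A minimal sketch, assuming the flash_attn pip package and Megatron-LM's --use-flash-attn flag for the legacy model path:

```bash
# Sketch: verify the flash-attn package is importable, then rerun the
# legacy build with its flash-attn path explicitly enabled for comparison.
python -c "import flash_attn; print(flash_attn.__version__)"
bash pretrain_gpt_cli.sh --use-flash-attn
```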
- Marking as stale. No activity in 60 days.
Your question
I run pretrain_gpt with the same architecture, data, training hyperparameters, and hardware, with and without megatron_core when building the model.
I notice clearly worse wall-clock time and memory usage with the megatron_core build:
Environment:
For the data, I use the c4_en dataset from Hugging Face and tokenize it with the GPT-2 tokenizer. I use the first 3.6e7 documents (the first 10%) for the experiments.
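For reference, the preprocessing step could look like the sketch below. Input/output paths and the worker count are placeholders, and the flags are those of Megatron-LM's tools/preprocess_data.py in recent versions (not verified against commit 9de386d):

```bash
# Hypothetical sketch: tokenize a JSONL dump of the c4_en subset with the
# GPT-2 BPE tokenizer using Megatron-LM's preprocessing tool.
# Paths and --workers are placeholders.
python tools/preprocess_data.py \
    --input c4_en_first_10pct.jsonl \
    --output-prefix c4_en_gpt2 \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --tokenizer-type GPT2BPETokenizer \
    --append-eod \
    --workers 16
```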
To Reproduce
megatron-lm commit hash: 9de386d
I customize a script based on pretrain_gpt_distributed.sh and rename it pretrain_gpt_cli.sh. To reproduce the experiment, please run the following bash command:
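A minimal sketch of what that invocation could look like; it assumes pretrain_gpt_cli.sh forwards extra flags to pretrain_gpt.py and that this commit exposes the --use-mcore-models switch (both are assumptions, since older commits wire in the mcore model differently):

```bash
# Hypothetical sketch, not the exact command from the post.
# Baseline run with the legacy model build:
bash pretrain_gpt_cli.sh

# Comparison run with the megatron_core model build, assuming the
# --use-mcore-models flag is available at this commit:
bash pretrain_gpt_cli.sh --use-mcore-models
```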
Is there any reason behind this?