Closed
Description
Hi Megatron team,
Could I check if the following is expected:
I saw that in your README, the reported throughput for the 70B model with TP=8, PP=2, DP=48 on 768 GPUs is 420.5 TFLOPs/s/GPU.
However, running the 70B model with TP=8, PP=8, DP=2 on 128 GPUs (H100s) with activation checkpointing, I only reach 296.5 TFLOPs/s/GPU.
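For context on how such per-GPU numbers are commonly derived: a widely used estimate for dense transformer training is ~6 FLOPs per parameter per token (2 forward + 4 backward), and full activation recomputation adds roughly one extra forward pass, raising the factor to ~8. A minimal sketch of that arithmetic (the tokens/sec value below is a hypothetical placeholder, not a measured figure from this issue):

```python
def model_tflops_per_gpu(n_params, tokens_per_sec, n_gpus, recompute=False):
    """Rough model-FLOPs throughput estimate per GPU.

    Uses the standard ~6 FLOPs/param/token rule of thumb; with full
    activation recomputation the effective factor is ~8 (one extra
    forward pass). This ignores attention/vocab correction terms.
    """
    factor = 8 if recompute else 6
    return factor * n_params * tokens_per_sec / n_gpus / 1e12

# Hypothetical example: 70B parameters, 128 GPUs, 70k tokens/sec aggregate.
print(model_tflops_per_gpu(70e9, 70_000, 128))        # plain estimate
print(model_tflops_per_gpu(70e9, 70_000, 128, True))  # with recompute
```

This also illustrates why reported numbers must be compared carefully: some tables count recomputed FLOPs as useful work and some do not, which alone can shift the figure by ~33%.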
May I ask which optimizations you used to maximise the throughput reported in the README table? I have already enabled --overlap-grad-reduce and --overlap-param-gather.
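For reference, a sketch of the flag combination being described, assuming a recent Megatron-LM CLI (node counts mirror the 128-GPU setup above; all other arguments are elided and hypothetical, not the settings actually used):

```shell
# Hypothetical sketch of the parallelism/overlap flags discussed above;
# remaining arguments (data, model size, batch sizes) are omitted.
torchrun --nproc_per_node 8 --nnodes 16 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --recompute-activations \
    --bf16 \
    ...
```

Note that in Megatron-LM, --overlap-param-gather is tied to --use-distributed-optimizer, and --sequence-parallel is typically enabled alongside TP=8 to cut activation memory.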
Thank you!