Closed
Description
Hi Megatron team,
Could I check if the following is expected:
I saw that in your README, the reported throughput for the 70B model with TP=8, PP=2, DP=48 on 768 GPUs is 420.5 TFLOPs/s/GPU.
However, running the 70B model with TP=8, PP=8, DP=2 on 128 GPUs (H100s) with activation checkpointing, I only reach 296.5 TFLOPs/s/GPU.
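For context on how such per-GPU numbers are commonly derived: a widely used estimate for dense transformer training is ~6 FLOPs per parameter per token (2 forward + 4 backward), and full activation recomputation adds roughly one extra forward pass, raising the factor to ~8. A minimal sketch of that arithmetic (the tokens/sec value below is a hypothetical placeholder, not a measured figure from this issue):

```python
def model_tflops_per_gpu(n_params, tokens_per_sec, n_gpus, recompute=False):
    """Rough model-FLOPs throughput estimate per GPU.

    Uses the standard ~6 FLOPs/param/token rule of thumb; with full
    activation recomputation the effective factor is ~8 (one extra
    forward pass). This ignores attention/vocab correction terms.
    """
    factor = 8 if recompute else 6
    return factor * n_params * tokens_per_sec / n_gpus / 1e12

# Hypothetical example: 70B parameters, 128 GPUs, 70k tokens/sec aggregate.
print(model_tflops_per_gpu(70e9, 70_000, 128))        # plain estimate
print(model_tflops_per_gpu(70e9, 70_000, 128, True))  # with recompute
```

This also illustrates why reported numbers must be compared carefully: some tables count recomputed FLOPs as useful work and some do not, which alone can shift the figure by ~33%.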
May I ask which optimizations you used to maximise the throughput reported in the README table? I have already enabled --overlap-grad-reduce and --overlap-param-gather.
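For reference, a sketch of the flag combination being described, assuming a recent Megatron-LM CLI (node counts mirror the 128-GPU setup above; all other arguments are elided and hypothetical, not the settings actually used):

```shell
# Hypothetical sketch of the parallelism/overlap flags discussed above;
# remaining arguments (data, model size, batch sizes) are omitted.
torchrun --nproc_per_node 8 --nnodes 16 pretrain_gpt.py \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 8 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --recompute-activations \
    --bf16 \
    ...
```

Note that in Megatron-LM, --overlap-param-gather is tied to --use-distributed-optimizer, and --sequence-parallel is typically enabled alongside TP=8 to cut activation memory.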
Thank you!