
[QUESTION] Throughput low for 70B training #1031

@clarence-lee-sheng

Description


Hi Megatron team,

Could I check whether the following is expected:

In your README, the reported throughput for the 70B model with TP=8, PP=2, DP=48 on 768 GPUs is 420.5 TFLOPs/s/GPU.
However, running the 70B model with TP=8, PP=8, DP=2 on 128 GPUs (H100s) with activation checkpointing, I only see 296.5 TFLOPs/s/GPU, i.e. roughly 70% of the reported figure.

May I check which optimizations you used to reach the throughput reported in the README table? I have already enabled --overlap-grad-reduce and --overlap-param-gather; a minimal sketch of my launch command is below.
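For reference, here is a minimal sketch of how I am launching the run. The values shown are placeholders, the 70B architecture flags, data paths, and batch sizes are elided, and my understanding is that --overlap-param-gather requires --use-distributed-optimizer, so that is enabled as well:

    # 16 nodes x 8 H100s = 128 GPUs; parallelism: TP=8 x PP=8 x DP=2.
    # --recompute-* enables full activation checkpointing.
    # 70B model config, data paths, and batch sizes are omitted here.
    torchrun --nnodes 16 --nproc_per_node 8 pretrain_gpt.py \
        --tensor-model-parallel-size 8 \
        --pipeline-model-parallel-size 8 \
        --bf16 \
        --recompute-granularity full \
        --recompute-method uniform \
        --use-distributed-optimizer \
        --overlap-grad-reduce \
        --overlap-param-gather \
        "$@"   # remaining model/data/batch arguments passed through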

Thank you!
