Why are we not multiplying the LM Head FLOPs per iteration by the checkpoint_activations_factor?
Line 253 in bd0aaba:

    flops_per_iteration += (6 * batch_size * seq_len * num_layers * (hidden_size**2)) * (vocab_size / (num_layers * hidden_size))
AFAIK the factor of 4 means 1 forward + 2 backward + 1 recomputed forward, where the extra forward is the recomputation needed for activation checkpointing. Don't we also need all 4 for the LM Head? A sketch of what I mean follows. cc @RaymondLi0 @NouamaneTazi
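For reference, the added term simplifies to 6 * batch_size * seq_len * hidden_size * vocab_size, i.e. a hard-coded factor of 3 (forward + 2x backward) on the LM head's 2 * B * s * h * V forward matmul FLOPs. Below is a minimal sketch of the overall estimate as I understand it, not the repository's exact code; the transformer-block term and the variable/function names are assumptions for illustration only:

```python
def estimate_flops_per_iteration(batch_size, seq_len, num_layers, hidden_size,
                                 vocab_size, checkpoint_activations=True):
    # 4 = 1 forward + 2 backward + 1 recomputed forward (activation checkpointing);
    # 3 = 1 forward + 2 backward when no recomputation is done.
    checkpoint_activations_factor = 4 if checkpoint_activations else 3

    # Transformer blocks: forward matmul FLOPs are roughly
    # 24 * B * s * l * h^2 (QKV/output projections + MLP) plus the attention
    # score/value terms, scaled by the fwd/bwd/recompute factor.
    transformer_flops = (checkpoint_activations_factor * 24 * batch_size * seq_len
                         * num_layers * hidden_size ** 2
                         * (1.0 + seq_len / (6.0 * hidden_size)))

    # LM head: one (B*s, h) x (h, V) matmul, i.e. 2 * B * s * h * V forward FLOPs.
    # The line quoted above uses a fixed factor of 3 here (forward + backward only);
    # the question in this issue is whether this should instead be
    # checkpoint_activations_factor.
    lm_head_flops = 3 * 2 * batch_size * seq_len * hidden_size * vocab_size

    return transformer_flops + lm_head_flops
```

If the LM head's forward pass is indeed rerun during activation recomputation, the `lm_head_flops` line would use `checkpoint_activations_factor` instead of the hard-coded 3; if the head's activations are kept and not recomputed, the current 3 would be the right factor.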