[QUESTION] how to control GPU memory layout for 70B LLM model? #1059

@wangdaw2023

Description

I am training a 70B Megatron LLM on a 32-node A800 cluster; each node has 8 × A800 GPUs and 4 × 200 Gb/s RoCE NICs. I find that the 70B model's MFU of 20% is much lower than the 32B model's MFU of 47%. I also see that GPU memory usage is about 70 GB on some nodes but only about 50 GB on others. I would like to balance memory usage across nodes so I can use a bigger micro batch size and improve MFU. This involves controlling which LLM layers are placed on which rank. Is there any documentation on this topic?

32B LLM, TP=8, PP=1, MFU=47%

70B LLM, TP=8, PP=2, MFU=20%
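For context on why the two PP=2 stages can differ in memory: the first pipeline stage typically holds the input embedding in addition to its transformer layers, and under a 1F1B schedule it also keeps activations for more in-flight microbatches than the last stage. Depending on the Megatron-LM version, the layer-to-stage split can be influenced with options such as `--num-layers-per-virtual-pipeline-stage` (interleaved schedule). The sketch below is only an illustration of the weight-memory side of the imbalance, not Megatron's actual placement code; all model dimensions (`hidden`, `ffn`, `vocab`, layer count) are assumed values for a generic Llama-70B-class config, not taken from the question.

```python
# Illustration (not Megatron-LM code): rough per-rank fp16 weight memory
# for one pipeline stage, and how an uneven layer split shifts it.
# All model dimensions below are assumptions for a 70B-class model.

def stage_param_bytes(num_layers, hidden=8192, ffn=28672, vocab=32000,
                      tp=8, bytes_per_param=2, has_embedding=False):
    """Rough fp16 parameter bytes held by one pipeline stage on one TP rank."""
    # Per transformer layer: attention ~ 4*h*h, gated MLP ~ 3*h*ffn,
    # both sharded across the tensor-parallel group.
    per_layer = (4 * hidden * hidden + 3 * hidden * ffn) // tp
    params = num_layers * per_layer
    if has_embedding:
        # The embedding table lives on the first (and/or last) stage only.
        params += vocab * hidden // tp
    return params * bytes_per_param

total_layers = 80  # assumed for a 70B-class model
for split in [(40, 40), (36, 44)]:  # even split vs. a hand-tuned uneven one
    s0 = stage_param_bytes(split[0], has_embedding=True)   # stage 0: + embedding
    s1 = stage_param_bytes(split[1], has_embedding=True)   # stage 1: + output proj
    print(split, f"stage0={s0 / 2**30:.1f} GiB", f"stage1={s1 / 2**30:.1f} GiB")
```

Running this shows that moving a few layers off the embedding-bearing stage evens out weight memory; activation memory under 1F1B favors the same direction, since earlier stages buffer more microbatches.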
