I am training a 70B LLM with Megatron-LM on a 32-node A800 cluster; each node has 8 × A800 GPUs and 4 × 200 Gb/s RoCE NICs. I find the 70B model's MFU (20%) is much lower than the 32B model's (47%). I also see that GPU memory usage is about 70 GB on some nodes but only about 50 GB on others. I would like to balance memory usage across ranks so I can raise the micro-batch size and improve MFU. This comes down to controlling which transformer layers are placed on which pipeline rank. Is there any documentation on this topic?
32B LLM, TP=8, PP=1, MFU=47%
70B LLM, TP=8, PP=2, MFU=20%
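For context on the imbalance I'm seeing: with PP=2, my understanding is that under 1F1B scheduling the first pipeline stage keeps more in-flight microbatch activations than the last (roughly PP − stage_index of them), and it also holds the embedding table, so an even 40/40 layer split leaves stage 0 heavier. A rough sketch of this reasoning, assuming 70B-class dimensions (80 layers; the numbers are illustrative, not measured):

```python
# Sketch: why pipeline stage 0 tends to use more GPU memory than the
# last stage under 1F1B scheduling with PP=2. Dimensions are assumed
# (80-layer 70B-class model), not taken from a real run.

PP = 2
NUM_LAYERS = 80                      # assumed depth for a 70B-class model
layers_per_stage = NUM_LAYERS // PP  # even split: 40 / 40

for stage in range(PP):
    # Under 1F1B, stage i keeps activations for (PP - i) in-flight
    # microbatches during steady state; stage 0 additionally holds the
    # input embedding, the last stage holds the LM head.
    in_flight = PP - stage
    extras = "embedding" if stage == 0 else "LM head + loss"
    print(f"stage {stage}: {layers_per_stage} layers, "
          f"{in_flight} in-flight microbatch activation set(s), plus {extras}")
```

So one fix I'm considering is an uneven split that gives stage 0 fewer layers (e.g. 38/42). I believe recent Megatron-LM versions expose arguments for this (something like `--decoder-first-pipeline-num-layers` / `--decoder-last-pipeline-num-layers`), but I'm not sure which version introduced them; pointers to the relevant docs would be appreciated.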