I am training a 70B LLM with Megatron-LM on a 32-node A800 cluster; each node has 8 × A800 GPUs and 4 × 200 Gb/s RoCE NICs. I find the 70B model's MFU (20%) is much lower than the 32B model's (47%). I also see that GPU memory usage is about 70 GB on some nodes but only about 50 GB on others. I would like to balance memory usage across ranks so I can raise the micro-batch size and improve MFU. This comes down to controlling which transformer layers are placed on which pipeline rank. Is there any documentation on this topic?
32B LLM, TP=8, PP=1, MFU=47%
70B LLM, TP=8, PP=2, MFU=20%
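For context on the imbalance I'm seeing: with PP=2, my understanding is that under 1F1B scheduling the first pipeline stage keeps more in-flight microbatch activations than the last (roughly PP − stage_index of them), and it also holds the embedding table, so an even 40/40 layer split leaves stage 0 heavier. A rough sketch of this reasoning, assuming 70B-class dimensions (80 layers; the numbers are illustrative, not measured):

```python
# Sketch: why pipeline stage 0 tends to use more GPU memory than the
# last stage under 1F1B scheduling with PP=2. Dimensions are assumed
# (80-layer 70B-class model), not taken from a real run.

PP = 2
NUM_LAYERS = 80                      # assumed depth for a 70B-class model
layers_per_stage = NUM_LAYERS // PP  # even split: 40 / 40

for stage in range(PP):
    # Under 1F1B, stage i keeps activations for (PP - i) in-flight
    # microbatches during steady state; stage 0 additionally holds the
    # input embedding, the last stage holds the LM head.
    in_flight = PP - stage
    extras = "embedding" if stage == 0 else "LM head + loss"
    print(f"stage {stage}: {layers_per_stage} layers, "
          f"{in_flight} in-flight microbatch activation set(s), plus {extras}")
```

So one fix I'm considering is an uneven split that gives stage 0 fewer layers (e.g. 38/42). I believe recent Megatron-LM versions expose arguments for this (something like `--decoder-first-pipeline-num-layers` / `--decoder-last-pipeline-num-layers`), but I'm not sure which version introduced them; pointers to the relevant docs would be appreciated.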