
@lshpku lshpku commented Aug 21, 2025

PR types

Bug fixes

PR changes

Models

Description

Make pp_stream wait for attn_backward_dx, which fixes the slow loss decrease seen when overlap_p2p_comm is enabled.

The figure below shows the wait relationships before and after the fix.
[Image 1]
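
As a rough illustration of the wait relationship described above, here is a CPU-only analogy in which two Python threads stand in for the compute stream and pp_stream, and a `threading.Event` stands in for a CUDA event recorded after attn_backward_dx. This is a sketch under stated assumptions, not the code in this PR: `run_schedule`, `wait_for_dx`, `compute_stream`, and `dx_done` are hypothetical names, and no Paddle API is used.

```python
import threading
import time


def run_schedule(wait_for_dx: bool) -> list:
    """Simulate two 'streams' as threads and return the order in which work finished.

    Hypothetical analogy only: threads replace CUDA streams, and a
    threading.Event replaces a CUDA event.
    """
    log = []
    lock = threading.Lock()
    dx_done = threading.Event()  # stands in for an event recorded after attn_backward_dx

    def compute_stream():
        time.sleep(0.05)  # pretend attn_backward_dx takes a while
        with lock:
            log.append("attn_backward_dx")
        dx_done.set()  # analogous to recording the event on the compute stream

    def pp_stream():
        if wait_for_dx:
            dx_done.wait()  # analogous to making pp_stream wait on that event
        with lock:
            log.append("PP(F)")

    threads = [
        threading.Thread(target=compute_stream),
        threading.Thread(target=pp_stream),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print("with wait:   ", run_schedule(wait_for_dx=True))
    print("without wait:", run_schedule(wait_for_dx=False))
```

With `wait_for_dx=True`, PP(F) is always logged after attn_backward_dx. Without the wait, the order depends on timing, which mirrors why PP(F) is pushed later once the dependency is enforced.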

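To be honest, I don't know why adding this wait works. I only narrowed the problem down to PP(F) by bisection, then tried adding a wait, and the loss became normal. My guess is that it is related to cross-stream memory allocation: in unit tests I found that Paddle's cross-stream memory allocation is unsafe in some cases. The model does not appear to use any unsafe pattern, but it is hard to be sure, so I am being conservative here.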

This has some performance impact because PP(F) is pushed later; this PR still needs improvement.

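Normally, with a single-node configuration (29 Decoder + 1 MTP), the loss should drop to 7.3 after 200 steps. Before this PR, with overlap_p2p_comm enabled, the loss only dropped to 8.7. Now it drops to 7.3 whether or not overlap_p2p_comm is enabled.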


paddle-bot bot commented Aug 21, 2025

Thanks for your contribution!

@lshpku lshpku force-pushed the fix-pp-event-wait branch from b78af3d to b6e9841 Compare August 22, 2025 06:56
@lshpku lshpku force-pushed the fix-pp-event-wait branch from b6e9841 to 1b1e63a Compare August 22, 2025 06:57