[AutoParallel] GPT support shared parameters #10783

Closed
waliwali777 wants to merge 7 commits into PaddlePaddle:develop from waliwali777:dynamic_sync_param

Conversation

@waliwali777 (Contributor) commented on Jul 1, 2025:

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

Other

PR changes

Other

Description

The following optimizations are applied to the GPT-13B dynamic-graph semi-auto parallel model (a placement sketch follows the list):

  1. Support the shared-parameter optimization: under pipeline parallelism (pp), the embedding weight and the lm_head weight are shared, and their sharding states are made identical, both column-sharded.
  2. Change the sharding state of the embedding weight to [dist.Replicate(), dist.Shard(0)] to keep it consistent with the lm_head weight, removing the communication that inconsistent states would otherwise introduce when the parameters are shared.
  3. Change the output state of the embedding auto layer to [dist.Shard(0), dist.Replicate()], reducing the allgather communication introduced by the encoder layers under the optimal strategy.
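
For reference, here is a minimal sketch of the placements described above. It is illustrative only: the 2x2 ["pp", "mp"] mesh, the GPT-13B-like shapes, and the standalone tensors are assumptions, not the PR's actual model code.

import paddle
import paddle.distributed as dist

# Illustrative 2x2 mesh: first dim = pipeline stages (pp), second dim = tensor parallel (mp).
mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["pp", "mp"])

# (1)(2) Embedding weight replicated over pp and sharded along the vocab axis (dim 0)
# over mp, matching the lm_head weight so that sharing the parameter needs no resharding.
vocab_size, hidden_size = 50304, 5120  # assumed GPT-13B-like sizes
embedding_weight = dist.shard_tensor(
    paddle.randn([vocab_size, hidden_size]),
    mesh,
    [dist.Replicate(), dist.Shard(0)],
)
lm_head_weight = embedding_weight  # tied weight; in the real model the pp stages keep it in sync

# (3) Embedding output sharded on the batch axis and replicated over mp, so the following
# encoder layers do not need an extra allgather under the chosen strategy.
hidden_states = dist.shard_tensor(
    paddle.randn([8, 1024, hidden_size]),  # [batch, seq_len, hidden]
    mesh,
    [dist.Shard(0), dist.Replicate()],
)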

@paddle-bot (bot) commented on Jul 1, 2025:

Thanks for your contribution!

@codecov (bot) commented on Jul 1, 2025:

Codecov Report

❌ Patch coverage is 11.11111% with 64 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.73%. Comparing base (44eff1f) to head (2682061).
⚠️ Report is 40 commits behind head on develop.

⚠️ Current head 2682061 differs from pull request most recent head 20c3b84

Please upload reports for the commit 20c3b84 to get more accurate results.

Files with missing lines                       | Patch % | Lines
paddlenlp/transformers/gpt/modeling_auto_pp.py | 5.26%   | 36 Missing ⚠️
paddlenlp/trainer/auto_trainer.py              | 0.00%   | 16 Missing ⚠️
paddlenlp/trainer/trainer.py                   | 33.33%  | 12 Missing ⚠️

❌ Your patch status has failed because the patch coverage (11.11%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (46.73%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10783      +/-   ##
===========================================
- Coverage    46.75%   46.73%   -0.03%     
===========================================
  Files          802      802              
  Lines       133882   133795      -87     
===========================================
- Hits         62603    62529      -74     
+ Misses       71279    71266      -13     


waliwali777 changed the title from "[AutoParallel] init sync param" to "[AutoParallel] GPT support shared parameters" on Jul 18, 2025
Review thread on the following diff hunk:

self.word_embeddings.weight = dist.shard_tensor(
-    self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Replicate()]
+    self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Shard(0)]
)
A reviewer (Contributor) commented:

Question: why row-shard the weight here? The embedding op does not support a row-sharded weight, so in practice an allgather is still needed.
If row-sharding is required, could this be replaced with c_embedding?

waliwali777 (Contributor, Author) replied:

Because the lm_head layer must support parallel_cross_entropy, its weight has to be row-sharded; and since shared parameters must have identical sharding states, some communication is introduced when the parameters are synchronized.
After testing different combinations of sharding states for the lm_head weight and the embedding weight, the best performance is obtained when both are row-sharded: although the embedding computation then introduces an allgather, the total communication time is the lowest.
Replacing the lookup with c_embedding is feasible and can be supported in a later PR.
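
To make the trade-off above concrete, a rough sketch follows (illustrative mesh, shapes, and variable names; not the code in this PR): if the tied weights carry different placements, every parameter synchronization has to reshard one of them, whereas identical [Replicate(), Shard(0)] placements avoid that traffic and instead pay one allgather in the embedding forward.

import paddle
import paddle.distributed as dist

mesh = dist.ProcessMesh([[0, 1], [2, 3]], dim_names=["pp", "mp"])  # assumed 2x2 mesh

# lm_head weight stays vocab-sharded (Shard(0) over mp), as parallel_cross_entropy expects.
lm_head_w = dist.shard_tensor(
    paddle.randn([50304, 5120]), mesh, [dist.Replicate(), dist.Shard(0)]
)

# Mismatched placements: a fully replicated embedding weight would have to be resharded
# whenever the shared parameter is synchronized with the lm_head weight.
emb_w_replicated = dist.shard_tensor(
    paddle.randn([50304, 5120]), mesh, [dist.Replicate(), dist.Replicate()]
)
emb_w_for_sync = dist.reshard(emb_w_replicated, mesh, [dist.Replicate(), dist.Shard(0)])

# Matching placements (this PR): no reshard is needed for the sync; the embedding forward
# introduces one allgather instead, which the author measured to be cheaper overall.
emb_w_sharded = dist.shard_tensor(
    paddle.randn([50304, 5120]), mesh, [dist.Replicate(), dist.Shard(0)]
)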

Review thread on the following script diff:

mem=-1
echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5"
- loss_base=10.5657959 # output of dropout is different after supporting spmd
+ loss_base=10.49585533 # output of dropout is different after supporting spmd
The reviewer (Contributor) commented:

Is the loss difference really this large after enabling shared parameters?

waliwali777 (Contributor, Author) replied:

What I observed is that when the number of hidden layers is small, the loss is noticeably affected by shared parameters combined with the other sharding-state changes.

waliwali777 closed this on Sep 4, 2025