[AutoParallel] GPT support shared parameters #10783
waliwali777 wants to merge 7 commits into PaddlePaddle:develop
Codecov Report
Your patch status has failed because the patch coverage (11.11%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
Please upload reports for the commit 20c3b84 to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## develop #10783 +/- ##
===========================================
- Coverage 46.75% 46.73% -0.03%
===========================================
Files 802 802
Lines 133882 133795 -87
===========================================
- Hits 62603 62529 -74
+ Misses 71279 71266 -13
bdc3f5a to 90db0e7
694512e to 20c3b84
  )
  self.word_embeddings.weight = dist.shard_tensor(
-     self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Replicate()]
+     self.word_embeddings.weight, get_mesh(), [dist.Replicate(), dist.Shard(0)]
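Question: why shard by row here? The embedding operator does not support row sharding, so in practice an Allgather is still required.
If row sharding is needed, could this be replaced with c_embedding?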
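The lmhead layer's weight must be row-sharded in order to support parallel_cross_entropy. Shared parameters also need a consistent sharding state, so parameter synchronization introduces some communication.
We tested combinations of different sharding states for the lmhead weight and the embedding weight and found that performance is best when both are row-sharded. The embedding computation then introduces an allgather, but the total communication time is the lowest of the combinations tested.
Replacing it with c_embedding is feasible and can be supported in a later PR.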
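To make the trade-off above concrete, here is a minimal plain-Python sketch of one weight table shared by the embedding and the lmhead while being split by rows across ranks. It is only a simulation under stated assumptions: it uses no Paddle APIs, the helper names (`shard_rows`, `embedding_via_allgather`, `lm_head_local_logits`) are hypothetical, and the list concatenation merely stands in for an allgather.

```python
# Plain-Python simulation; no Paddle APIs are used and all names are hypothetical.

def shard_rows(weight, num_ranks):
    """Split a [vocab, hidden] table into equal row blocks, one per rank (like Shard(0))."""
    rows_per_rank = len(weight) // num_ranks
    return [weight[r * rows_per_rank:(r + 1) * rows_per_rank] for r in range(num_ranks)]

def embedding_via_allgather(shards, token_ids):
    """Embedding lookup: collect every row block first, then index the full table."""
    full_table = [row for shard in shards for row in shard]  # stands in for allgather
    return [full_table[t] for t in token_ids]

def lm_head_local_logits(shard, hidden):
    """lmhead on one rank: logits only for the vocabulary rows that rank owns."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in shard]

vocab, hidden_size, num_ranks = 4, 2, 2
# Shared weight: [[0, 1], [2, 3], [4, 5], [6, 7]] as floats.
weight = [[float(v * hidden_size + j) for j in range(hidden_size)] for v in range(vocab)]
shards = shard_rows(weight, num_ranks)

# The embedding side needs the whole table, hence the gather.
embedded = embedding_via_allgather(shards, [3, 0])

# The lmhead side works directly on the local rows and yields a vocabulary slice.
hidden = [1.0, 2.0]
local_logits = [lm_head_local_logits(s, hidden) for s in shards]
```

In this sketch the lmhead never needs the full table, since each rank produces the logits for its own vocabulary rows, while the embedding lookup is the side that pays for the gather.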
  mem=-1
  echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5"
- loss_base=10.5657959 # output of dropout is different after supporting spmd
+ loss_base=10.49585533 # output of dropout is different after supporting spmd
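What I found on my side is that when the number of hidden layers is small, shared parameters combined with the other sharding-state changes affect the loss considerably.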
Before submitting
tests folder. If there are codecov issues, please add test cases first.
PR types
Other
PR changes
Other
Description
The following optimizations are applied to the gpt-13b dynamic-graph semi-auto-parallel model: