[Distributed] Use Tensor Parallel instead of Sequence Parallel #1160

kwen2501 · 2024-09-18T06:23:25Z

Sequence Parallel (SP) goes one step further than Tensor Parallel (TP) in that it also distributes the layer norm computation.

There are a few reasons to turn the dial back to pure TP:
(1) LLM inference has a prefill phase and a decoding phase, which have different seqlen. The decoding phase has a seqlen of 1, to which SP cannot be applied. We don't want to create two models and apply SP and TP to them separately.
(2) The major motivation for SP is to reduce the activation memory footprint. While this is important in training (because backward needs those intermediates), we are using torch.no_grad() in inference, in which case activations are not kept anyway.
(3) While it is true that AllReduce = AllGather + ReduceScatter, latency dominates at small message sizes, so launching one collective may be better than launching two.
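
To make point (3) concrete, here is a minimal pure-Python sketch, not code from this PR. It simulates the collectives on plain lists (one list per rank) to show that a ReduceScatter followed by an AllGather gives the same result as one AllReduce, and it uses a deliberately simplified cost model in which every collective pays a fixed launch latency plus a bandwidth term. The function names, the `launch_s` and `per_byte_s` parameters, and the cost formula are all assumptions made for illustration; real collective costs depend on the backend and topology.

```python
from typing import List


def all_reduce(bufs: List[List[int]]) -> List[List[int]]:
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*bufs)]
    return [list(total) for _ in bufs]


def reduce_scatter(bufs: List[List[int]]) -> List[List[int]]:
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(bufs)
    n = len(bufs[0])
    if n % world != 0:
        raise ValueError("buffer length must be divisible by the world size")
    chunk = n // world
    total = [sum(vals) for vals in zip(*bufs)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(chunks: List[List[int]]) -> List[List[int]]:
    """Every rank ends up with the concatenation of all ranks' chunks."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


def one_collective_time(nbytes: int, world: int,
                        launch_s: float, per_byte_s: float) -> float:
    """Assumed model: one launch latency plus a ring-style bandwidth term."""
    return launch_s + 2 * (world - 1) / world * nbytes * per_byte_s


def two_collective_time(nbytes: int, world: int,
                        launch_s: float, per_byte_s: float) -> float:
    """Same bytes moved in total, but the launch latency is paid twice."""
    return 2 * launch_s + 2 * (world - 1) / world * nbytes * per_byte_s


if __name__ == "__main__":
    bufs = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four elements each
    # Functional equivalence of the two communication patterns.
    assert all_gather(reduce_scatter(bufs)) == all_reduce(bufs)
    # With a tiny message, the extra launch makes up most of the difference.
    print(one_collective_time(1024, 2, launch_s=1.0, per_byte_s=0.5))
    print(two_collective_time(1024, 2, launch_s=1.0, per_byte_s=0.5))
```

Under this assumed model the two patterns differ by exactly one launch latency, which matters most when the bandwidth term is small, as it is for decode-sized messages.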

Removed SP feature from model
Removed // sp_degree in PP activation size
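
As a rough illustration of the second change, the sketch below contrasts the activation shape passed between pipeline stages with and without the `// sp_degree` division. It is not the torchchat code: the function names and the `(batch_size, seqlen, dim)` layout are assumptions made for the example. It also shows why a decode step with a seqlen of 1 cannot be sharded along the sequence dimension.

```python
from typing import Tuple


def tp_activation_shape(batch_size: int, seqlen: int,
                        dim: int) -> Tuple[int, int, int]:
    """Pure TP: the inter-stage activation keeps the full sequence length."""
    return (batch_size, seqlen, dim)


def sp_activation_shape(batch_size: int, seqlen: int, dim: int,
                        sp_degree: int) -> Tuple[int, int, int]:
    """SP: the inter-stage activation is sharded along the sequence dim."""
    if seqlen % sp_degree != 0:
        # Hypothetical guard: e.g. decode with seqlen == 1 and sp_degree > 1.
        raise ValueError(
            f"seqlen {seqlen} is not divisible by sp_degree {sp_degree}"
        )
    return (batch_size, seqlen // sp_degree, dim)


if __name__ == "__main__":
    # Prefill: both layouts are valid, SP just holds a shorter slice per rank.
    print(tp_activation_shape(1, 128, 4096))
    print(sp_activation_shape(1, 128, 4096, sp_degree=4))
    # Decode: a single token cannot be split across four ranks.
    try:
        sp_activation_shape(1, 1, 4096, sp_degree=4)
    except ValueError as err:
        print(err)
```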

pytorch-bot · 2024-09-18T06:23:28Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1160

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 611de83 with merge base 4774eaf:

NEW FAILURE - The following job has failed:

pull / test-mps / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

lessw2020

Thanks for the detailed justification for this. Looks good!

kwen2501 requested a review from lessw2020 September 18, 2024 06:23

facebook-github-bot added the CLA Signed label (this label is managed by the Meta Open Source bot) Sep 18, 2024

kwen2501 force-pushed the tp_not_sp branch 2 times, most recently from b2a6bb1 to 3b896db September 18, 2024 08:35

kwen2501 changed the base branch from main to arg_change September 18, 2024 08:37

lessw2020 approved these changes Sep 18, 2024

[Distributed] Use Tensor Parallel instead of Sequence Parallel

611de83

kwen2501 force-pushed the tp_not_sp branch from 3b896db to 611de83 September 20, 2024 05:25

kwen2501 changed the base branch from arg_change to main September 20, 2024 05:25

kwen2501 merged commit e27e162 into main Sep 20, 2024
50 of 51 checks passed

4 participants
