This repository was archived by the owner on Sep 10, 2025. It is now read-only.

Conversation

@kwen2501 (Contributor) commented on Sep 18, 2024

SP goes one step further than TP in that it also distributes the layer norm computation (a pure-TP plan is sketched after the change summary below, for contrast).

There are a couple of reasons to turn the dial back to pure TP:
(1) LLM inference has a prefill phase and a decoding phase with different sequence lengths. The decoding phase has a seqlen of 1, to which SP cannot be applied, and we don't want to create two models just to apply SP and TP separately.
(2) The major motivation for SP is to reduce activation memory. That matters in training, because the backward pass needs those intermediates, but inference runs under torch.no_grad(), so activations are not kept anyway.
(3) While it is true that AllReduce = AllGather + ReduceScatter, in the small-message regime latency dominates, so launching one collective can be better than launching two (a rough micro-benchmark sketch follows this list).
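
A rough micro-benchmark sketch for point (3), assuming a torchrun launch with NCCL; the hidden size of 4096 and the tensor shapes are illustrative, not taken from this PR. It is not the exact SP dataflow (SP reduce-scatters into the norm and all-gathers back out); it only compares the launch cost of one small collective against two of the same total volume:

```python
# Sketch only: time one all-reduce vs. an all-gather + reduce-scatter pair
# on a decode-sized activation. Launch with torchrun (NCCL backend).
# Assumes the hidden size (4096) is divisible by the world size.
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)

x = torch.randn(1, 1, 4096, device="cuda")            # batch=1, seqlen=1 (decode)
shard = x.view(-1).chunk(world)[rank].contiguous()    # one rank's shard
full = torch.empty(world * shard.numel(), device="cuda")

def timed(fn, iters=100):
    for _ in range(10):                               # warm-up
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters            # ms per call

def one_collective():                                 # pure TP: a single all-reduce
    dist.all_reduce(x)

def two_collectives():                                # SP-style: two collective launches
    dist.all_gather_into_tensor(full, shard)
    dist.reduce_scatter_tensor(shard, full)

t1 = timed(one_collective)
t2 = timed(two_collectives)
if rank == 0:
    print(f"all_reduce:                  {t1:.3f} ms")
    print(f"all_gather + reduce_scatter: {t2:.3f} ms")
```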

  • Removed SP feature from model
  • Removed the // sp_degree division from the PP activation-size calculation
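
For contrast, here is a minimal sketch of a TP-only sharding plan using torch.distributed.tensor.parallel. The module names (attention.wq/wk/wv/wo, feed_forward.w1/w2/w3) assume a Llama-style transformer block, and `model` stands in for an already-constructed model with a process group initialized; none of this is copied from the torchchat code.

```python
# Minimal sketch of a pure-TP plan: note there are no SequenceParallel()
# entries for the norms -- layer norm stays replicated on every rank, and
# each RowwiseParallel output is combined with a single all-reduce.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Assumes dist.init_process_group() has already been called (e.g. by torchrun).
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

tp_plan = {
    # Attention: shard the q/k/v projections column-wise, the output row-wise.
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(),
    # MLP: same column/row pairing.
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w3": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(),
}

for block in model.layers:  # `model`: an already-built Transformer (hypothetical)
    parallelize_module(block, mesh, tp_plan)
```

Because TP shards the hidden dimension rather than the sequence, the same plan serves both prefill (seqlen > 1) and decode (seqlen = 1), which is what point (1) above relies on.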

@pytorch-bot (bot) commented on Sep 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchchat/1160

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 611de83 with merge base 4774eaf:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@kwen2501 kwen2501 requested a review from lessw2020 September 18, 2024 06:23
@facebook-github-bot added the CLA Signed label Sep 18, 2024
@kwen2501 force-pushed the tp_not_sp branch 2 times, most recently from b2a6bb1 to 3b896db, September 18, 2024 08:35
@kwen2501 kwen2501 changed the base branch from main to arg_change September 18, 2024 08:37
@lessw2020 (Contributor) left a comment


Thanks for the detailed justification for this. Looks good!

@kwen2501 kwen2501 changed the base branch from arg_change to main September 20, 2024 05:25
@kwen2501 kwen2501 merged commit e27e162 into main Sep 20, 2024
50 of 51 checks passed

Labels: CLA Signed


4 participants