[BE] Move NoParallel to torchtitan.distributed #1641
Conversation
I thought more and I would still recommend we put `NoParallel` into `tensor_parallel.py`. Here's my reasoning:

We use `NoParallel` in the following cases:
- https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/llama4/infra/parallelize.py#L432-L444 In the MoE layer, `moe.router.gate`, together with `moe.shared_experts`, is outside the EP region (`moe.experts`). The only place we apply `NoParallel` is when TP is used for non-EP params (including `moe.router.gate`, `moe.shared_experts`, and the attention / mlp layers). So it has nothing to do with `ExpertParallel`.
- In DeepSeek V3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/deepseek_v3/infra/parallelize.py#L208) and Qwen3 (https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/qwen3/infra/parallelize.py#L200), it is used in the TP plan.

Technically, I wrote the `NoParallel` class to deal with the mesh mismatch between such modules and other TP'lized modules, so the only reason we have them is TP.

For EP, the mesh mismatch exists anyway and cannot be solved by `NoParallel`. In gradient clipping, we treat the EP mesh and the non-EP mesh separately, and `NoParallel` helps align all non-EP params on their TP mesh.
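To make the above concrete, here is a minimal sketch of how `NoParallel` sits next to regular TP styles in a layer plan. The module names, dimensions, and mesh size are illustrative assumptions rather than the actual torchtitan plan, and the `NoParallel` import path shown is the one this PR proposes; run under `torchrun` with a world size matching the `tp` mesh dimension.

```python
# Illustrative sketch only: ToyBlock and its module names are assumptions,
# not the real torchtitan model; NoParallel's import path is what this PR moves.
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torchtitan.distributed import NoParallel  # proposed new location


class ToyBlock(nn.Module):
    """Stand-in for a transformer block with an MoE router gate."""

    def __init__(self, dim: int = 64, num_experts: int = 4):
        super().__init__()
        self.attention = nn.ModuleDict(
            {"wq": nn.Linear(dim, dim), "wo": nn.Linear(dim, dim)}
        )
        self.moe = nn.ModuleDict(
            {"router": nn.ModuleDict({"gate": nn.Linear(dim, num_experts)})}
        )


# TP mesh of size 2; adjust to the launched world size.
tp_mesh = init_device_mesh("cuda", (2,), mesh_dim_names=("tp",))

layer_plan = {
    # Attention / MLP weights are sharded across the TP mesh as usual.
    "attention.wq": ColwiseParallel(),
    "attention.wo": RowwiseParallel(),
    # The router gate is not sharded, but NoParallel still wraps its parameters
    # as DTensors on the same TP mesh, so all non-EP params (gate, shared
    # experts, attention / mlp) live on one mesh, e.g. for gradient clipping.
    "moe.router.gate": NoParallel(),
}

parallelize_module(ToyBlock().cuda(), tp_mesh, layer_plan)
```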
Just for my understanding: this is only needed because we break in and out of DTensor at the TP boundaries, right? Remind me why we do this. Was it to avoid missing DTensor operators, or for performance, or ...?
@fegin I'm OK either way. Up to you.
`NoParallel` should not belong to `expert_parallel.py`. This PR moves it to `torchtitan.distributed.__init__.py`.
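For reference, a hedged sketch of what the change looks like at call sites; the old import path is inferred from the discussion above, and the exact re-exports may differ after merge.

```python
# Before this PR (path assumed from the discussion): NoParallel lived next to
# the expert-parallel styles.
# from torchtitan.distributed.expert_parallel import NoParallel

# After this PR: it is exposed from torchtitan/distributed/__init__.py.
from torchtitan.distributed import NoParallel
```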