Adopt Local Map Wrapper for Inner Attention #2557
Conversation
```python
    v, DTensor
), "q, k, v should all be DTensors"
if self._local_map_fn is None:
    self._local_map_fn = local_map(
```
@fegin I had this local_map wrapping around the CP pre-forward and post-forward hooks, so that local_map handles the conversion of DTensor to plain tensor before the CP pre-forward hook fires. If it's the other way around, I hit this assertion error: https://github.com/pytorch/pytorch/blame/main/torch/distributed/tensor/experimental/_context_parallel/_attention.py#L1394
Not sure if this is the best way; maybe you already have a plan to rewrite the CP API for full DTensor support.
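The ordering constraint above can be sketched with a pure-Python mock (all names here are illustrative stand-ins, not the real PyTorch or torchtitan APIs): the DTensor-to-local conversion must wrap outside the CP pre-forward hook, so the hook only ever sees plain tensors and its assertion passes.

```python
# Hypothetical mock of the hook-ordering problem; FakeDTensor, cp_pre_hook,
# and local_map_convert are illustrative, not real PyTorch APIs.

class FakeDTensor:
    def __init__(self, data):
        self.data = data

    def to_local(self):
        return self.data


def cp_pre_hook(args):
    # Mirrors the assertion in _context_parallel/_attention.py:
    # CP expects plain (local) tensors by the time it runs.
    assert not any(isinstance(a, FakeDTensor) for a in args), "CP hook received a DTensor"
    return args


def local_map_convert(args):
    # Stand-in for what local_map does on entry: unwrap DTensors.
    return tuple(a.to_local() if isinstance(a, FakeDTensor) else a for a in args)


def forward(q, k, v):
    return (q, k, v)


def call_with_correct_order(*args):
    # local_map wraps *outside* the CP pre-forward hook, so the hook
    # only ever sees already-unwrapped local tensors.
    args = local_map_convert(args)
    args = cp_pre_hook(args)
    return forward(*args)


out = call_with_correct_order(FakeDTensor(1), FakeDTensor(2), FakeDTensor(3))
print(out)  # (1, 2, 3)
```

If the two wrappers were nested the other way around, `cp_pre_hook` would see `FakeDTensor` instances and trip its assertion, which is the failure mode linked above.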
Yes, this makes sense. Some comments:

- Instead of `_InnerAttentionLocalMap`, we should just implement this logic into `Module`. This is something we will need anyway for MoE. The implementation can be less generic at this point; leave some TODOs so that once we integrate the config system with the sharding spec, the implementation can be made more generic.
- We will have to ensure `torch.compile` doesn't break. So try with `--compile.enable`.
I realized that these inner wrappers are not `Module` yet. You can just wrap them with `Module`, with `Config` left empty for now.
Or can we inherit `Module` and call it `LocalMapModule`?
Compile is not working with this `super().__call__` approach. Claude tells me:

Why it breaks compile: when dynamo encounters `self.inner_attention(xq, xk, xv, ...)`, it detects that `LocalMapModule` has a custom `__call__` (not the standard `nn.Module.__call__`). Dynamo then tries to trace through the custom `__call__` as a regular Python function. Inside `__call__`, the `super().__call__(q, k, v, **kwargs)` call bypasses dynamo's special `nn.Module` call handling (which normally inlines `forward()` and properly handles hooks). This causes FX nodes to be created without proper `meta["val"]` metadata, leading to the inductor `KeyError: 'val'` error.

Not sure how six handles compile with this `__call__` override.
I think `to_local` and `from_local` on the module boundary is fine, because we can still force a pair of them within a clear boundary. Would be good to know if it works with `spmd_types`, which is a context manager.
For context managers, we can manually call `__enter__()` and `__exit__()`.
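For reference, a minimal sketch of driving a context manager manually with `__enter__`/`__exit__` instead of a `with` block, e.g. entering it in a forward pre-hook and exiting in a forward post-hook (`spmd_mode` and `trace` are illustrative names, not the real `spmd_types` API):

```python
# Manually driving a context manager across a hook boundary.
from contextlib import contextmanager

trace = []

@contextmanager
def spmd_mode():  # hypothetical stand-in for a context manager like spmd_types
    trace.append("enter")
    try:
        yield
    finally:
        trace.append("exit")

cm = spmd_mode()
cm.__enter__()                  # e.g. inside a forward pre-hook
trace.append("forward")         # the wrapped forward runs here
cm.__exit__(None, None, None)   # e.g. inside a forward post-hook

print(trace)  # ['enter', 'forward', 'exit']
```

The caveat raised below still applies: this pattern only works if tracing machinery (e.g. `__torch_function__` interception) tolerates the split enter/exit.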
@xmfan there's some limitation due to `__torch_function__`, so we can't use `__enter__` and `__exit__`. This is the current entry point: https://fburl.com/code/ypffryqp
Directly replacing `forward` with the wrapped forward works: #2621.
```python
placements = module._placements
mesh = module._device_mesh
```
These may not exist if `args[0]` is not a DTensor?
```python
class VarlenAttentionWrapper(torch.nn.Module):

def _to_local(x: Any) -> Any:
```
Can this be defined inline? It doesn't need to be at root level.
I mimicked six's way in D83451817, but moving it inline sounds better.
```python
def __init__(self) -> None:
    super().__init__()
    self.register_forward_pre_hook(
        LocalMapModule._pre_hook, with_kwargs=True, prepend=True
```
Any reason why `prepend=True`?
@acisseJZhong Is there a way to use `__call__` with `torch.compile`? I am worried that hooks will cause ordering issues, which have happened many times before. We didn't have a good way to do this because we could not change `nn.Module`. But since we now have our own `Module`, I would prefer using `__call__` if possible. `torch.compile` is definitely the key blocker we need to fix.
cc @xmfan
@tianyu-l in fact this is the reason I also prefer to have local_map wrapping `super().__call__()`, and to solve any compile issues there. Otherwise, for this forward pre-hook and for the CP post-forward hook, I need `prepend=True` to make sure that the hooks I add wrap around the CP hooks.
I would prefer rolling back to my last commit and fixing the compile problem. 26cdceb @xmfan
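The `prepend=True` concern can be sketched in plain Python (illustrative names only, not the real `nn.Module` hook machinery): pre-hooks run front-to-back, so prepending puts our to-local conversion ahead of a CP hook that was registered earlier.

```python
# Pure-Python mock of pre-hook ordering with prepend semantics.
order = []

def cp_pre_hook(args):
    order.append("cp")
    return args

def to_local_pre_hook(args):
    order.append("to_local")
    return args

pre_hooks = [cp_pre_hook]  # CP registered its pre-hook first

# prepend=True semantics: insert our hook at the front of the list,
# so the DTensor -> local conversion runs before the CP hook.
pre_hooks.insert(0, to_local_pre_hook)

args = ()
for hook in pre_hooks:
    args = hook(args)

print(order)  # ['to_local', 'cp']
```

Without the prepend, `cp` would run first and see unconverted inputs, which is exactly the assertion failure discussed earlier in the thread.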
```python
return x
```
Suggested change:

```diff
-class LocalMapModule(Module):
+class LocalMapAttention(Module):
```
The implementation is very restricted, in the sense that

- input placements and output placements must be the same
- grad placements of inputs must stay the same. From a typing perspective, this is only true for `Shard`.

Taking Flex all-gather based CP as an example:

- on the TP mesh, things are Shard on the head dim. This is fine.
- on the CP mesh, q is Shard and kv are Replicate. The reason you can assume kv grads are also Replicate is that the CP API internally does this reduce-scatter for us (https://github.com/pytorch/pytorch/blob/main/torch/distributed/tensor/experimental/_context_parallel/_cp_custom_ops.py#L37), which is fine for now.

Let's restrict the scope to Attention, since this is not a general local map module.
Please add a detailed docstring. (Ask me / Claude if anything is not clear.)
The scope of this `LocalMapModule` is restricted to Attention because right now we don't have a way to configure local_map, which is why this module is put here with everything fixed.
However, I think this is a good start to demonstrate how to do module-level local_map, and to extend it to a general implementation once our config includes local_map.
> on CP mesh, q is Shard and kv are Replicate

When I specify local_map / to_local / from_local, q, k, v should all be Shard on the TP mesh.
And later on the CP mesh, q, k, v are all Shard, and this is because of the CP pre-hook and post-hook? https://github.com/pytorch/pytorch/blob/4d01cdb5b2a633c45471bdaf8d8d544c4bb2572a/torch/distributed/tensor/experimental/_context_parallel/_attention.py#L1396
Oh, that's right. If that's the case, can we assert every placement to be Shard?
Yes, this is what I am doing in the local_map approach. Let me roll back and add an explicit assertion to Shard for now.
```python
self.register_forward_hook(LocalMapModule._post_hook)

@staticmethod
def _pre_hook(module, args, kwargs):
```
Give these more meaningful names, e.g. `_inputs_to_local`, `_outputs_from_local`.
```python
# TODO: cuDNN SDPA backward has a stride mismatch bug with CP.
# Exclude cuDNN until PyTorch fix lands. See https://github.com/pytorch/pytorch/issues/176915.
if attn_backend == "sdpa":
```
I don't know how I feel about this workaround.
If we have to do it, don't do it per model. Instead, do it in the SDPA module, e.g. maybe by overriding the pre-forward hook and subtracting cuDNN from `self.sdpa_backends`.
Force-pushed 9da4e39 to 844c107.
Force-pushed 844c107 to 2d06f2e.
This is a continuation of work in #2480 by @pianpwk
Summary
- `__call__` wraps `nn.Module.__call__` with local_map, converting TP DTensor inputs to local tensors before any `forward_pre_hook` fires, and wrapping outputs back to DTensor after all `forward_hook`s complete. Placements and device mesh are inferred from the input DTensors at runtime.
- `LocalMapModule` handles the TP/CP boundary. Qwen3 requires this because it uses `use_local_output=False` on wq/wk/wv (needed for QK norms with SequenceParallel), producing DTensors that CP hooks cannot directly consume.
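A toy sketch of the `__call__`-wrapping idea described in the summary, not the actual torchtitan implementation: `Wrapped` stands in for DTensor so the example runs on plain CPU tensors, and the pre-hook records what type the hooks actually observe.

```python
# Overriding __call__ so unwrapping happens outside all hooks: inputs are
# converted to local tensors before any forward_pre_hook fires, and the
# output is re-wrapped after all forward_hooks complete.
import torch
import torch.nn as nn

class Wrapped:  # illustrative stand-in for DTensor
    def __init__(self, t):
        self.t = t
    def to_local(self):
        return self.t

class LocalMapModule(nn.Module):
    def __call__(self, *args):
        local = tuple(a.to_local() if isinstance(a, Wrapped) else a for a in args)
        out = super().__call__(*local)  # hooks and forward() see plain tensors
        return Wrapped(out)             # re-wrap after all hooks have run

class Attn(LocalMapModule):
    def forward(self, q, k, v):
        return q + k + v

m = Attn()
seen = []
m.register_forward_pre_hook(lambda mod, args: seen.append(type(args[0]).__name__))

out = m(Wrapped(torch.ones(2)), Wrapped(torch.ones(2)), Wrapped(torch.ones(2)))
print(seen, isinstance(out, Wrapped))  # ['Tensor'] True
```

This is also the exact pattern the compile discussion above flags: a custom `__call__` calling `super().__call__` works in eager mode but currently trips up dynamo's module-call handling.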
Test