
Only apply grouped GEMM padding for MXFP8 and FP8 non-HybridEP cases#2620

Open
danielvegamyhre wants to merge 4 commits intomainfrom
paddingupdate

Conversation

@danielvegamyhre
Contributor

@danielvegamyhre danielvegamyhre commented Mar 18, 2026

Context

  • BF16 grouped GEMM no longer requires padding, so we can remove it from the BF16 path and apply it only for FP8 and MXFP8 grouped GEMMs.
  • TorchTitan will now contain only a torch-native "rank major to expert major" permutation impl for BF16 grouped GEMM, and no extra per-group padding kernels/logic for FP8/MXFP8 (these will live in torchao, which, as the quantization library, is a better home for them).

Summary

There are 7 cases to handle:

  • Case 1: BF16 + No EP
    • (do nothing)
  • Case 2: BF16 + EP
    • Torch native impl handles permute from rank major to expert major (no padding)
  • Case 3: MXFP8 + No EP
    • Handled with pad/unpad kernels in torchao
  • Case 4: MXFP8 + Standard EP
    • torchao permute_and_pad() is called in the ExpertParallel implementation when token_group_alignment_size > 0
  • Case 5: MXFP8 + HybridEP
    • HybridEP handles token group padding for MXFP8 grouped GEMM as part of the all2all dispatch
  • Case 6: FP8 + No EP
    • Same as case 3
  • Case 7: FP8 + EP
    • Same as case 4
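The seven cases above reduce to one question: does this path need per-group padding at all? A minimal sketch of that decision, assuming hypothetical names (`Quant`, `needs_group_padding`) that do not appear in torchtitan itself:

```python
from enum import Enum


class Quant(Enum):
    BF16 = "bf16"
    FP8 = "fp8"
    MXFP8 = "mxfp8"


def needs_group_padding(quant: Quant, hybrid_ep: bool) -> bool:
    # BF16 grouped GEMM no longer needs padding at all (cases 1-2).
    # For FP8/MXFP8, HybridEP pads inside its all2all dispatch (case 5),
    # so only the non-HybridEP paths pad explicitly, via torchao kernels
    # (cases 3, 6) or the ExpertParallel implementation (cases 4, 7).
    if quant is Quant.BF16:
        return False
    return not hybrid_ep
```

This is only a summary of the case table, not the actual dispatch code in the PR.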

Misc changes

  • Delete kernels.py
  • Delete tests for those kernels
  • Remove pad_token_groups_for_grouped_mm option from MXFP8ConverterConfig, since we can set it correctly automatically
  • Added debug models for float8 and mxfp8 to config registry to speed up future development

Tests

FP8 tests were done with fp8 grouped mm only, not fp8 linear. Using both, I get this weird tyro error:

[rank0]:│ model-converters.converters.0:config was not a match because:                │
[rank0]:│ • Default value Config(enable_fsdp_float8_all_gather=False,                  │
[rank0]:│   precompute_float8_dynamic_scale_for_fsdp=False, recipe_name=None,          │
[rank0]:│   filter_fqns=['output', 'router.gate'], emulate=False) with type Config     │
[rank0]:│   does not match type <class 'torchtitan.components.quantization.float8.Floa │
[rank0]:│   t8GroupedMMConverter.Config'>         

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Mar 18, 2026
@danielvegamyhre
Contributor Author

Have not tested yet because of devgpu issues, but if you want to take a look, feel free @tianyu-l

@danielvegamyhre danielvegamyhre force-pushed the paddingupdate branch 3 times, most recently from f906b3d to f79fff4 Compare March 20, 2026 04:14
@danielvegamyhre
Contributor Author

I finished testing, @tianyu-l. This is ready for review.

from torchtitan.tools.utils import _round_up

from .kernels import generate_permute_indices
TOKEN_GROUP_ALIGN_SIZE_M = 0
Contributor
we should remove this -- setting global variables is error-prone

we should move logic to parallelize functions for various combinations

Contributor Author

Makes sense, I am working on this. The changes are straightforward for standard EP, but for Hybrid EP it seems it will require either (1) refactoring the custom ops, DispatchState, etc. to pass around the quantization type used, or (2) a module-level variable storing the quantization type, similar to _buffer. I think (2) is the less invasive change, wdyt?

Comment on lines +61 to +65
def maybe_align_num_tokens_for_mxfp8(num_tokens: int) -> int:
    """Round up token count only when MXFP8 group alignment is active."""
    if TOKEN_GROUP_ALIGN_SIZE_M != MXFP8_GROUP_ALIGNMENT_SIZE:
        return num_tokens
    return _round_up(num_tokens, MXFP8_GROUP_ALIGNMENT_SIZE)
Contributor

move this logic to hybridep.py, including _round_up (as an inline function) which is currently only used once in this repo

# FP8/MXFP8 require groups to be permuted to expert major order AND padded to
# `alignment_size`.
# Otherwise, we only need to permute to expert major order.
if self.token_group_alignment > 0:
Contributor

IMO the proper way is to create e.g. FP8ExpertParallel and dispatch to it in parallelize function, instead of making if-else in existing ExpertParallel.

Also the condition should be whether quantization is used, not the token_group_alignment size set from somewhere.

Contributor Author

That works, did a refactor



# Source: https://github.com/pytorch/torchtitan/pull/2255
def _generate_permute_indices(
Contributor

could you verify that before vs. after, we get bitwise identical results under same seed and determinism?

TOKEN_GROUP_ALIGN_SIZE_M = 8
ValidTokenGroupAlignmentSize = Literal[8, 16, 32]

def indices_padding_wrapper(func: Callable) -> Callable:
Contributor

I don't think we need this function any more. Please remove this and simplify

# NOTE: If EP is not used, we need to pad the indices
# to prepare for grouped_mm;
# otherwise, EP will handle the padding.
if (
    not isinstance(self.w1, DTensor)
    # pyrefly: ignore[not-iterable]
    or "ep" not in self.w1.device_mesh.mesh_dim_names
):
    run_experts_fn = indices_padding_wrapper(_run_experts_grouped_mm)
else:
    run_experts_fn = _run_experts_grouped_mm
return run_experts_fn(w1, w2, w3, x, num_tokens_per_expert)

@@ -45,10 +45,9 @@ def backward(ctx, grad_output):

def indices_padding_wrapper(func: Callable) -> Callable:
Contributor

same

num_tokens_per_expert_group,
ep_degree,
num_local_experts,
FLOAT8_GROUP_ALIGNMENT_SIZE,
Contributor

why do you need two different classes? You could just init with different quantization type, which can be used to determine the alignment size, e.g. based on a static dict.

Comment on lines +35 to 36
FLOAT8_GROUP_ALIGNMENT_SIZE = 16
MXFP8_GROUP_ALIGNMENT_SIZE = 32
Contributor

make this a dict from quantization type to alignment size
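The suggested change would collapse the two constants into one mapping, e.g. (a sketch; `GROUP_ALIGNMENT_SIZE` is a hypothetical name, and `QuantizationType` is assumed to be the enum introduced in this PR):

```python
from enum import Enum, auto


class QuantizationType(Enum):
    FLOAT8 = auto()
    MXFP8 = auto()


# Single source of truth replacing the separate
# FLOAT8_GROUP_ALIGNMENT_SIZE / MXFP8_GROUP_ALIGNMENT_SIZE constants.
GROUP_ALIGNMENT_SIZE = {
    QuantizationType.FLOAT8: 16,
    QuantizationType.MXFP8: 32,
}
```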

Comment on lines +31 to +34
if find_float8_grouped_mm_config(model_converters):
    return QuantizationType.FLOAT8
elif config := find_mxfp8_config(model_converters):
    if routed_experts_in_fqns(config.fqns):
Contributor

no need to modularize into multiple small functions which are not used elsewhere -- we can make everything in a single util function for now

from torchtitan.protocols import ModelConverter


class QuantizationType(Enum):
Contributor

# as part of the EP implementation.
# Otherwise, if EP is not enabled, we need TorchAO to pad the token groups.
self.pad_token_groups_for_grouped_mm = not parallel_dims.ep_enabled
logger.warning(
Contributor

why it's a warning? sounds like a comment to me, especially when both hybridEP is used this warning would still be there

group: ProcessGroup,
score_before_experts: bool = True,
non_blocking_expert_capacity_factor: float | None = None,
quantization_type: QuantizationType | None = None,
Contributor

hybridep module doesn't need to know the quantization_type. All it needs to know is pad multiple size.
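In other words, the hybridep interface can take just an integer pad multiple. A minimal sketch of what that padding amounts to (`pad_group_sizes` is a hypothetical helper, not a torchtitan function):

```python
def pad_group_sizes(num_tokens_per_group: list[int], pad_multiple: int) -> list[int]:
    # HybridEP only needs the pad multiple, not the quantization type:
    # round each group's token count up so every group is divisible by it.
    if pad_multiple <= 1:
        return list(num_tokens_per_group)
    return [
        ((n + pad_multiple - 1) // pad_multiple) * pad_multiple
        for n in num_tokens_per_group
    ]
```

The caller (which knows the quantization type) maps it to a multiple, e.g. 32 for MXFP8, and passes only that number down.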

)


class Float8ExpertParallel(BaseExpertParallel):
Contributor

can you inherit ExpertParallel instead of BaseExpertParallel, which can save a lot of code?
