
Conversation

@Autumn1998
Contributor

Description

In some corner cases (number of experts > 2048), router_fusion throws an invalid argument error. This PR fixes that issue.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

Please list the changes introduced in this PR:

  • Added cudaFuncSetAttribute calls with cudaFuncAttributeMaxDynamicSharedMemorySize before the kernel launches in fused_score_for_moe_aux_loss.cu and fused_topk_with_score_function.cu
  • Updated tests/pytorch/test_fused_router.py to use a large expert count (num_experts=3000)

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes


greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a CUDA "invalid argument" error that occurred when using router fusion with a large number of experts (>1024).

Root Cause: The router fusion kernels allocate dynamic shared memory proportional to the number of experts, and CUDA imposes a default limit of 48KB on dynamic shared memory. Kernel launches fail with an "invalid argument" error once the number of experts exceeds roughly:

  • ~1024 experts for backward kernels (which need ~52 bytes per expert)
  • ~3072 experts for forward kernels (which need ~16 bytes per expert)

Solution: The fix adds cudaFuncSetAttribute calls with cudaFuncAttributeMaxDynamicSharedMemorySize before launching each kernel. This explicitly requests the CUDA runtime to allow the calculated amount of dynamic shared memory, even when it exceeds the 48KB default limit.
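
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of opting in to a larger dynamic shared memory limit before a launch. The kernel and launch shape are illustrative stand-ins, not the PR's actual code:

#include <cuda_runtime.h>

// Illustrative kernel using dynamically sized shared memory (a stand-in,
// not one of the router fusion kernels).
__global__ void demo_kernel(float *out, int n) {
  extern __shared__ float smem[];  // size supplied at launch time
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    smem[threadIdx.x] = static_cast<float>(i);
    out[i] = smem[threadIdx.x];
  }
}

void launch_demo(float *out, int n, int shared_memory_size) {
  // Without this opt-in, a launch requesting more than the 48KB default of
  // dynamic shared memory fails with cudaErrorInvalidValue ("invalid argument").
  cudaFuncSetAttribute(demo_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       shared_memory_size);
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  demo_kernel<<<blocks, threads, shared_memory_size>>>(out, n);
}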

Changes:

  • Added cudaFuncSetAttribute calls in 4 kernel launchers (forward/backward for both fused_score_for_moe_aux_loss and fused_topk_with_score_function)
  • Updated test to use 3000 experts (previously 128) to validate the fix works for large expert counts

The fix follows the established CUDA best practice for requesting dynamic shared memory beyond the default limit.

Confidence Score: 5/5

  • Safe to merge - fix correctly addresses the root cause of CUDA shared memory limit errors
  • The fix correctly identifies and resolves the CUDA shared memory limitation issue by calling cudaFuncSetAttribute before kernel launches. The approach is technically sound and follows standard CUDA practices. The test validates the fix works with 3000 experts. Only minor style improvements (adding conditional checks) were suggested, and they don't affect correctness.
  • No files require special attention - all changes are straightforward and correct

Important Files Changed

File Analysis

  • transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu (score 5/5): Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels
  • transformer_engine/common/fused_router/fused_topk_with_score_function.cu (score 5/5): Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels
  • tests/pytorch/test_fused_router.py (score 5/5): Updates test to use 3000 experts (previously 128) to validate the fix for large expert counts

Sequence Diagram

sequenceDiagram
    participant User
    participant Router as Router Fusion Kernel
    participant CUDA as CUDA Runtime
    
    User->>Router: Launch with num_experts > 1024
    Note over Router: Calculate shared memory:<br/>backward: ~52 bytes * num_experts<br/>forward: ~16 bytes * num_experts
    
    alt Before Fix (shared_mem > 48KB)
        Router->>CUDA: Launch kernel with dynamic shared memory
        CUDA-->>Router: Error: Invalid argument<br/>(exceeds default 48KB limit)
        Router-->>User: Failure
    end
    
    alt After Fix
        Router->>CUDA: cudaFuncSetAttribute(..., MaxDynamicSharedMemorySize, shared_mem)
        CUDA-->>Router: OK (limit increased)
        Router->>CUDA: Launch kernel with dynamic shared memory
        CUDA-->>Router: Success
        Router-->>User: Success
    end

Comment on lines +150 to +151
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +288 to +289
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +256 to +257
cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +449 to +450
cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.


greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes an "invalid argument" error that occurs when using the router fusion functions with a large number of experts (>2048).

Root Cause

When num_experts exceeds ~2048, the CUDA kernels require more than 48 KB of dynamic shared memory (the default CUDA limit). For example, with 3000 experts:

  • Forward kernels need ~47-48 KB
  • Backward kernels need ~152 KB

Without explicitly setting the maximum dynamic shared memory size via cudaFuncSetAttribute, kernel launches fail with an invalid argument error.
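
As a quick sanity check on those figures (the per-expert byte counts are the review's estimates, not values read out of the kernels), a compile-time sketch:

// Back-of-the-envelope shared memory math for num_experts = 3000.
// The per-expert footprints are assumptions taken from the review above.
constexpr long kNumExperts = 3000;
constexpr long kFwdBytesPerExpert = 16;  // assumed forward footprint
constexpr long kBwdBytesPerExpert = 52;  // assumed backward footprint
static_assert(kNumExperts * kFwdBytesPerExpert == 48000,
              "~47 KB: right at the 48 KB default");
static_assert(kNumExperts * kBwdBytesPerExpert == 156000,
              "~152 KB: far beyond the default");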

Changes Made

The fix adds cudaFuncSetAttribute calls before launching 4 CUDA kernels:

  1. fused_score_for_moe_aux_loss_forward_kernel
  2. fused_score_for_moe_aux_loss_backward_kernel
  3. fused_topk_with_score_function_forward_kernel
  4. fused_topk_with_score_function_backward_kernel

This tells CUDA to allow dynamic shared memory allocation beyond the default 48 KB limit, up to the device's maximum (typically 96-164 KB depending on compute capability).
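
The exact ceiling is device dependent and can be queried at runtime; a small standalone sketch (not part of the PR):

#include <cstdio>
#include <cuda_runtime.h>

// Prints the opt-in per-block dynamic shared memory ceiling: the largest value
// that cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, ...)
// can raise a kernel to on the current device.
int main() {
  int device = 0;
  cudaGetDevice(&device);
  int max_optin_bytes = 0;
  cudaDeviceGetAttribute(&max_optin_bytes, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  std::printf("Opt-in dynamic shared memory limit: %d bytes (~%d KB)\n",
              max_optin_bytes, max_optin_bytes / 1024);
  return 0;
}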

Test Coverage

The test file was updated to test with num_experts=3000, but only exercises test_topk_softmax. The test_fused_scores_for_aux_loss function is not tested with large expert counts, leaving the backward kernel in fused_score_for_moe_aux_loss.cu without explicit test coverage for this scenario.

Issues Identified

  • Missing error handling: All 4 cudaFuncSetAttribute calls lack NVTE_CHECK_CUDA() wrappers for error handling, which is inconsistent with the codebase pattern seen in other files (e.g., ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu)
  • Incomplete test coverage: Only one of the two affected code paths is tested with large expert counts

Confidence Score: 4/5

  • This PR is safe to merge with minor improvements needed for error handling and test coverage
  • The core fix is correct and addresses the root cause (dynamic shared memory limits for large expert counts). The implementation is consistent across all 4 affected kernel launchers. However, the score is 4/5 rather than 5/5 because: (1) all cudaFuncSetAttribute calls lack error handling wrappers (NVTE_CHECK_CUDA) that are standard in the codebase, and (2) test coverage only validates one of two affected code paths with large expert counts. These are best practice improvements rather than critical bugs.
  • All three files need minor attention: the two CUDA files need error handling wrappers added, and the test file could benefit from more comprehensive test coverage for the aux_loss functions with large expert counts

Important Files Changed

File Analysis

  • transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu (score 4/5): Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048); missing NVTE_CHECK_CUDA error handling wrappers
  • transformer_engine/common/fused_router/fused_topk_with_score_function.cu (score 4/5): Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048); missing NVTE_CHECK_CUDA error handling wrappers
  • tests/pytorch/test_fused_router.py (score 3/5): Updated test to use num_experts=3000, but only exercises one code path (topk_softmax); missing coverage for the aux_loss functions with large expert counts

Sequence Diagram

sequenceDiagram
    participant User as PyTorch User Code
    participant Router as Router Function
    participant Launcher as Kernel Launcher
    participant CUDA as CUDA Runtime
    participant Kernel as CUDA Kernel
    
    User->>Router: Call fused_topk_with_score_function<br/>or fused_compute_score_for_moe_aux_loss<br/>(with num_experts > 2048)
    Router->>Launcher: Invoke kernel_launcher<br/>(num_experts, num_tokens, topk, etc.)
    
    Note over Launcher: Calculate shared_memory_size<br/>= num_experts * num_token_per_block * sizeof(DataType) * N<br/>(N=1-4 buffers depending on kernel)
    
    Note over Launcher: For num_experts=3000:<br/>Forward: ~47 KB<br/>Backward: ~152 KB
    
    Launcher->>CUDA: cudaFuncSetAttribute(<br/>kernel,<br/>cudaFuncAttributeMaxDynamicSharedMemorySize,<br/>shared_memory_size)
    
    Note over CUDA: Sets max dynamic shared memory limit<br/>(default 48 KB → up to 164 KB based on device)
    
    CUDA-->>Launcher: Success (or error if exceeds device limit)
    
    Launcher->>CUDA: Launch kernel<<<grid, block, shared_memory_size>>>
    CUDA->>Kernel: Execute kernel with allocated shared memory
    
    Note over Kernel: Each warp processes 1 token<br/>Uses allocated shared memory for:<br/>- logits/scores buffer<br/>- topk results<br/>- intermediate computations
    
    Kernel-->>CUDA: Kernel completes
    CUDA->>Launcher: cudaGetLastError()
    Launcher-->>Router: Return results
    Router-->>User: Return probs/scores and routing_map

Comment on lines +150 to +151
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling (see examples in ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu, and other files).

If cudaFuncSetAttribute fails (e.g., if the requested shared memory size exceeds device limits), the error should be caught and reported.

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Comment on lines +288 to +289
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +256 to +257
cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Comment on lines +449 to +450
cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


greptile-apps bot commented Jan 9, 2026

Additional Comments (1)

tests/pytorch/test_fused_router.py
The test in __main__ only exercises test_topk_softmax with num_experts=3000, but doesn't test test_fused_scores_for_aux_loss with a large expert count.

Since the PR fixes apply to both fused_topk_with_score_function.cu AND fused_score_for_moe_aux_loss.cu, the test coverage should include both code paths to verify the fix works for all affected kernels.

Consider adding a test call for test_fused_scores_for_aux_loss with num_experts=3000 to ensure the backward kernel in fused_score_for_moe_aux_loss.cu is also properly tested (which requires ~152 KB of shared memory for 3000 experts).

@ptrendx requested a review from denera on January 9, 2026 at 22:03