Make router_fusion to adapt for the large num_of_expert(>2048) #2582
base: main
Conversation
Greptile Overview
Greptile Summary
This PR fixes a CUDA "invalid argument" error that occurred when using router fusion with a large number of experts (>1024).
Root Cause: The router fusion kernels allocate dynamic shared memory proportional to the number of experts, and CUDA caps dynamic shared memory at 48KB by default. Kernel launches fail with an "invalid argument" error once that limit is exceeded, which happens at approximately:
- ~1024 experts for the backward kernels (which need ~52 bytes per expert)
- ~3072 experts for the forward kernels (which need ~16 bytes per expert)
Solution: The fix adds cudaFuncSetAttribute calls with cudaFuncAttributeMaxDynamicSharedMemorySize before launching each kernel. This explicitly requests the CUDA runtime to allow the calculated amount of dynamic shared memory, even when it exceeds the 48KB default limit.
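For illustration, here is a minimal, self-contained sketch of this pattern; the kernel name, per-expert sizing, and launch configuration are placeholders, not the actual Transformer Engine code.

```cpp
// Sketch of the fix pattern (illustrative only, not the PR's kernels).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void router_kernel_sketch(float *out, int num_experts) {
  extern __shared__ float scores[];  // dynamic shared memory, sized at launch
  for (int e = threadIdx.x; e < num_experts; e += blockDim.x) {
    scores[e] = static_cast<float>(e);
  }
  __syncthreads();
  if (threadIdx.x == 0) out[blockIdx.x] = scores[num_experts - 1];
}

int main() {
  const int num_experts = 3000;
  // Request ~59 KB of dynamic shared memory (5 floats per expert), above the 48 KB default.
  const size_t shared_memory_size = num_experts * 5 * sizeof(float);

  // The fix: opt in to the larger dynamic shared memory size before launching.
  cudaFuncSetAttribute(router_kernel_sketch, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       shared_memory_size);

  float *out = nullptr;
  cudaMalloc(&out, sizeof(float));
  router_kernel_sketch<<<1, 256, shared_memory_size>>>(out, num_experts);
  // Without the cudaFuncSetAttribute call above, this launch reports "invalid argument".
  printf("launch status: %s\n", cudaGetErrorString(cudaGetLastError()));
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```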
Changes:
- Added cudaFuncSetAttribute calls in 4 kernel launchers (forward/backward for both fused_score_for_moe_aux_loss and fused_topk_with_score_function)
- Updated the test to use 3000 experts (previously 128) to validate the fix for large expert counts
The fix follows the established CUDA best practice for requesting dynamic shared memory beyond the default limit.
Confidence Score: 5/5
- Safe to merge - fix correctly addresses the root cause of CUDA shared memory limit errors
- The fix correctly identifies and resolves the CUDA shared memory limitation issue by calling cudaFuncSetAttribute before kernel launches. The approach is technically sound and follows standard CUDA practices. The test validates the fix works with 3000 experts. Only minor style improvements suggested (adding conditional checks) that don't affect correctness.
- No files require special attention - all changes are straightforward and correct
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu | 5/5 | Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels |
| transformer_engine/common/fused_router/fused_topk_with_score_function.cu | 5/5 | Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels |
| tests/pytorch/test_fused_router.py | 5/5 | Updates test to use 3000 experts (was 128) to validate fix for large expert counts |
Sequence Diagram
sequenceDiagram
participant User
participant Router as Router Fusion Kernel
participant CUDA as CUDA Runtime
User->>Router: Launch with num_experts > 1024
Note over Router: Calculate shared memory:<br/>backward: ~52 bytes * num_experts<br/>forward: ~16 bytes * num_experts
alt Before Fix (shared_mem > 48KB)
Router->>CUDA: Launch kernel with dynamic shared memory
CUDA-->>Router: Error: Invalid argument<br/>(exceeds default 48KB limit)
Router-->>User: Failure
end
alt After Fix
Router->>CUDA: cudaFuncSetAttribute(..., MaxDynamicSharedMemorySize, shared_mem)
CUDA-->>Router: OK (limit increased)
Router->>CUDA: Launch kernel with dynamic shared memory
CUDA-->>Router: Success
Router-->>User: Success
end
    cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:
    - cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + if (shared_memory_size >= 48 * 1024) {
    +   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
    +                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + }
This attribute only needs to be set when exceeding the 48KB default limit.
    cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:
    - cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + if (shared_memory_size >= 48 * 1024) {
    +   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
    +                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + }
This attribute only needs to be set when exceeding the 48KB default limit.
    cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:
    - cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + if (shared_memory_size >= 48 * 1024) {
    +   cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
    +                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + }
This attribute only needs to be set when exceeding the 48KB default limit.
    cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:
    - cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + if (shared_memory_size >= 48 * 1024) {
    +   cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
    +                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + }
This attribute only needs to be set when exceeding the 48KB default limit.
Greptile Overview
Greptile Summary
This PR fixes an "invalid argument" error that occurs when using the router fusion functions with a large number of experts (>2048).
Root Cause
When num_experts exceeds ~2048, the CUDA kernels require more than 48 KB of dynamic shared memory (the default CUDA limit). For example, with 3000 experts:
- Forward kernels need ~47-48 KB
- Backward kernels need ~152 KB
Without explicitly setting the maximum dynamic shared memory size via cudaFuncSetAttribute, kernel launches fail with an invalid argument error.
Changes Made
The fix adds cudaFuncSetAttribute calls before launching 4 CUDA kernels:
- fused_score_for_moe_aux_loss_forward_kernel
- fused_score_for_moe_aux_loss_backward_kernel
- fused_topk_with_score_function_forward_kernel
- fused_topk_with_score_function_backward_kernel
This tells CUDA to allow dynamic shared memory allocation beyond the default 48 KB limit, up to the device's maximum (typically 96-164 KB depending on compute capability).
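As a side note (not part of the PR), both the default and the opt-in ceilings mentioned above can be queried at runtime with standard CUDA device attributes; a minimal sketch:

```cpp
// Sketch: query the default and opt-in shared memory limits that bound
// what cudaFuncSetAttribute can grant.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  int device = 0;
  cudaGetDevice(&device);
  int default_limit = 0, optin_limit = 0;
  cudaDeviceGetAttribute(&default_limit, cudaDevAttrMaxSharedMemoryPerBlock, device);
  cudaDeviceGetAttribute(&optin_limit, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  printf("default per-block shared memory limit: %d bytes\n", default_limit);
  printf("opt-in maximum via cudaFuncSetAttribute: %d bytes\n", optin_limit);
  // Requests above optin_limit (e.g., the ~152 KB backward case on some GPUs)
  // would still fail even after setting the attribute.
  return 0;
}
```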
Test Coverage
The test file was updated to test with num_experts=3000, but only exercises test_topk_softmax. The test_fused_scores_for_aux_loss function is not tested with large expert counts, leaving the backward kernel in fused_score_for_moe_aux_loss.cu without explicit test coverage for this scenario.
Issues Identified
- Missing error handling: All 4 cudaFuncSetAttribute calls lack NVTE_CHECK_CUDA() wrappers for error handling, which is inconsistent with the codebase pattern seen in other files (e.g., ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu); see the sketch after this list
- Incomplete test coverage: Only one of the two affected code paths is tested with large expert counts
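A minimal sketch of how both points could be addressed together; CHECK_CUDA below stands in for Transformer Engine's NVTE_CHECK_CUDA macro so the example is self-contained, and the kernel and per-expert sizing are illustrative only.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Stand-in for NVTE_CHECK_CUDA: abort with a message on any CUDA error.
#define CHECK_CUDA(call)                                                \
  do {                                                                  \
    cudaError_t status_ = (call);                                       \
    if (status_ != cudaSuccess) {                                       \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(status_)); \
      std::exit(EXIT_FAILURE);                                          \
    }                                                                   \
  } while (0)

__global__ void router_kernel_sketch(int num_experts) {
  extern __shared__ float buf[];  // dynamic shared memory
  for (int e = threadIdx.x; e < num_experts; e += blockDim.x) buf[e] = 0.0f;
}

void launch_router_kernel(int num_experts, cudaStream_t stream) {
  // Illustrative sizing: ~20 bytes of scratch per expert.
  const size_t shared_memory_size = static_cast<size_t>(num_experts) * 20;
  // Only opt in when the 48 KB default would actually be exceeded, and
  // surface any failure (e.g., exceeding the device ceiling) immediately.
  if (shared_memory_size >= 48 * 1024) {
    CHECK_CUDA(cudaFuncSetAttribute(router_kernel_sketch,
                                    cudaFuncAttributeMaxDynamicSharedMemorySize,
                                    shared_memory_size));
  }
  router_kernel_sketch<<<1, 256, shared_memory_size, stream>>>(num_experts);
  CHECK_CUDA(cudaGetLastError());
}

int main() {
  launch_router_kernel(3000, nullptr);  // 3000 experts -> ~59 KB of shared memory
  CHECK_CUDA(cudaDeviceSynchronize());
  return 0;
}
```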
Confidence Score: 4/5
- This PR is safe to merge with minor improvements needed for error handling and test coverage
- The core fix is correct and addresses the root cause (dynamic shared memory limits for large expert counts). The implementation is consistent across all 4 affected kernel launchers. However, the score is 4/5 rather than 5/5 because: (1) all cudaFuncSetAttribute calls lack error handling wrappers (NVTE_CHECK_CUDA) that are standard in the codebase, and (2) test coverage only validates one of two affected code paths with large expert counts. These are best practice improvements rather than critical bugs.
- All three files need minor attention: the two CUDA files need error handling wrappers added, and the test file could benefit from more comprehensive test coverage for the aux_loss functions with large expert counts
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu | 4/5 | Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048). Missing NVTE_CHECK_CUDA error handling wrappers. |
| transformer_engine/common/fused_router/fused_topk_with_score_function.cu | 4/5 | Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048). Missing NVTE_CHECK_CUDA error handling wrappers. |
| tests/pytorch/test_fused_router.py | 3/5 | Updated test to use num_experts=3000, but only tests one code path (topk_softmax). Missing test coverage for aux_loss functions with large expert counts. |
Sequence Diagram
sequenceDiagram
participant User as PyTorch User Code
participant Router as Router Function
participant Launcher as Kernel Launcher
participant CUDA as CUDA Runtime
participant Kernel as CUDA Kernel
User->>Router: Call fused_topk_with_score_function<br/>or fused_compute_score_for_moe_aux_loss<br/>(with num_experts > 2048)
Router->>Launcher: Invoke kernel_launcher<br/>(num_experts, num_tokens, topk, etc.)
Note over Launcher: Calculate shared_memory_size<br/>= num_experts * num_token_per_block * sizeof(DataType) * N<br/>(N=1-4 buffers depending on kernel)
Note over Launcher: For num_experts=3000:<br/>Forward: ~47 KB<br/>Backward: ~152 KB
Launcher->>CUDA: cudaFuncSetAttribute(<br/>kernel,<br/>cudaFuncAttributeMaxDynamicSharedMemorySize,<br/>shared_memory_size)
Note over CUDA: Sets max dynamic shared memory limit<br/>(default 48 KB → up to 164 KB based on device)
CUDA-->>Launcher: Success (or error if exceeds device limit)
Launcher->>CUDA: Launch kernel<<<grid, block, shared_memory_size>>>
CUDA->>Kernel: Execute kernel with allocated shared memory
Note over Kernel: Each warp processes 1 token<br/>Uses allocated shared memory for:<br/>- logits/scores buffer<br/>- topk results<br/>- intermediate computations
Kernel-->>CUDA: Kernel completes
CUDA->>Launcher: cudaGetLastError()
Launcher-->>Router: Return results
Router-->>User: Return probs/scores and routing_map
    cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling (see examples in ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu, and other files).
If cudaFuncSetAttribute fails (e.g., if the requested shared memory size exceeds device limits), the error should be caught and reported.
    - cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
    +                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
    cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.
    - cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
    +                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
    cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.
    - cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
    +                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
    cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.
    - cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
    -                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
    + NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
    +                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Additional Comments (1)
Since the PR fixes apply to both fused_score_for_moe_aux_loss and fused_topk_with_score_function, consider adding a test call for the aux-loss path (e.g., test_fused_scores_for_aux_loss) with a large expert count as well.
Description
In some corner cases (number of experts > 2048), router_fusion throws an invalid argument error. This PR fixes that issue.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: