
Conversation

@Autumn1998
Contributor

Description

In some corner cases (number of experts > 2048), router_fusion throws an invalid argument error. This PR fixes that issue.

Fixes # (issue)

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Changes

Please list the changes introduced in this PR:

  • Added cudaFuncSetAttribute calls with cudaFuncAttributeMaxDynamicSharedMemorySize before the kernel launches in fused_score_for_moe_aux_loss.cu and fused_topk_with_score_function.cu
  • Updated tests/pytorch/test_fused_router.py to use a large expert count (num_experts=3000)

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes


greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes a CUDA "invalid argument" error that occurred when using router fusion with a large number of experts (>1024).

Root Cause: The router fusion kernels allocate dynamic shared memory proportional to the number of experts, and CUDA imposes a default limit of 48KB on dynamic shared memory. Kernel launches fail with an "invalid argument" error once the number of experts exceeds roughly:

  • ~1024 experts for backward kernels (which need ~52 bytes per expert)
  • ~3072 experts for forward kernels (which need ~16 bytes per expert)

Solution: The fix adds cudaFuncSetAttribute calls with cudaFuncAttributeMaxDynamicSharedMemorySize before launching each kernel. This explicitly requests the CUDA runtime to allow the calculated amount of dynamic shared memory, even when it exceeds the 48KB default limit.
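
For readers unfamiliar with the pattern, here is a minimal, self-contained sketch of opting in to a larger dynamic shared memory limit before a launch. The kernel and launch shape are illustrative stand-ins, not the PR's actual code:

#include <cuda_runtime.h>

// Illustrative kernel using dynamically sized shared memory (a stand-in,
// not one of the router fusion kernels).
__global__ void demo_kernel(float *out, int n) {
  extern __shared__ float smem[];  // size supplied at launch time
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    smem[threadIdx.x] = static_cast<float>(i);
    out[i] = smem[threadIdx.x];
  }
}

void launch_demo(float *out, int n, int shared_memory_size) {
  // Without this opt-in, a launch requesting more than the 48KB default of
  // dynamic shared memory fails with cudaErrorInvalidValue ("invalid argument").
  cudaFuncSetAttribute(demo_kernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                       shared_memory_size);
  int threads = 256;
  int blocks = (n + threads - 1) / threads;
  demo_kernel<<<blocks, threads, shared_memory_size>>>(out, n);
}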

Changes:

  • Added cudaFuncSetAttribute calls in 4 kernel launchers (forward/backward for both fused_score_for_moe_aux_loss and fused_topk_with_score_function)
  • Updated test to use 3000 experts (previously 128) to validate the fix works for large expert counts

The fix follows the established CUDA best practice for requesting dynamic shared memory beyond the default limit.

Confidence Score: 5/5

  • Safe to merge - fix correctly addresses the root cause of CUDA shared memory limit errors
  • The fix correctly identifies and resolves the CUDA shared memory limitation issue by calling cudaFuncSetAttribute before kernel launches. The approach is technically sound and follows standard CUDA practices. The test validates the fix works with 3000 experts. Only minor style improvements (adding conditional checks) were suggested, and they don't affect correctness.
  • No files require special attention - all changes are straightforward and correct

Important Files Changed

File Analysis

  • transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu (score 5/5): Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels
  • transformer_engine/common/fused_router/fused_topk_with_score_function.cu (score 5/5): Adds cudaFuncSetAttribute calls to enable >48KB dynamic shared memory for forward and backward kernels
  • tests/pytorch/test_fused_router.py (score 5/5): Updates test to use 3000 experts (previously 128) to validate the fix for large expert counts

Sequence Diagram

sequenceDiagram
    participant User
    participant Router as Router Fusion Kernel
    participant CUDA as CUDA Runtime
    
    User->>Router: Launch with num_experts > 1024
    Note over Router: Calculate shared memory:<br/>backward: ~52 bytes * num_experts<br/>forward: ~16 bytes * num_experts
    
    alt Before Fix (shared_mem > 48KB)
        Router->>CUDA: Launch kernel with dynamic shared memory
        CUDA-->>Router: Error: Invalid argument<br/>(exceeds default 48KB limit)
        Router-->>User: Failure
    end
    
    alt After Fix
        Router->>CUDA: cudaFuncSetAttribute(..., MaxDynamicSharedMemorySize, shared_mem)
        CUDA-->>Router: OK (limit increased)
        Router->>CUDA: Launch kernel with dynamic shared memory
        CUDA-->>Router: Success
        Router-->>User: Success
    end

Comment on lines +150 to +151
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +288 to +289
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +256 to +257
cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.

Comment on lines +449 to +450
cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Following the pattern used elsewhere in the codebase (e.g., ln_fwd_cuda_kernel.cu), this call should be conditional:

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ if (shared_memory_size >= 48 * 1024) {
+   cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
+                        cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ }

This attribute only needs to be set when exceeding the 48KB default limit.


greptile-apps bot left a comment


Greptile Overview

Greptile Summary

This PR fixes an "invalid argument" error that occurs when using the router fusion functions with a large number of experts (>2048).

Root Cause

When num_experts exceeds ~2048, the CUDA kernels require more than 48 KB of dynamic shared memory (the default CUDA limit). For example, with 3000 experts:

  • Forward kernels need ~47-48 KB
  • Backward kernels need ~152 KB

Without explicitly setting the maximum dynamic shared memory size via cudaFuncSetAttribute, kernel launches fail with an invalid argument error.
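
As a quick sanity check on those figures (the per-expert byte counts are the review's estimates, not values read out of the kernels), a compile-time sketch:

// Back-of-the-envelope shared memory math for num_experts = 3000.
// The per-expert footprints are assumptions taken from the review above.
constexpr long kNumExperts = 3000;
constexpr long kFwdBytesPerExpert = 16;  // assumed forward footprint
constexpr long kBwdBytesPerExpert = 52;  // assumed backward footprint
static_assert(kNumExperts * kFwdBytesPerExpert == 48000,
              "~47 KB: right at the 48 KB default");
static_assert(kNumExperts * kBwdBytesPerExpert == 156000,
              "~152 KB: far beyond the default");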

Changes Made

The fix adds cudaFuncSetAttribute calls before launching 4 CUDA kernels:

  1. fused_score_for_moe_aux_loss_forward_kernel
  2. fused_score_for_moe_aux_loss_backward_kernel
  3. fused_topk_with_score_function_forward_kernel
  4. fused_topk_with_score_function_backward_kernel

This tells CUDA to allow dynamic shared memory allocation beyond the default 48 KB limit, up to the device's maximum (typically 96-164 KB depending on compute capability).
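
The exact ceiling is device dependent and can be queried at runtime; a small standalone sketch (not part of the PR):

#include <cstdio>
#include <cuda_runtime.h>

// Prints the opt-in per-block dynamic shared memory ceiling: the largest value
// that cudaFuncSetAttribute(..., cudaFuncAttributeMaxDynamicSharedMemorySize, ...)
// can raise a kernel to on the current device.
int main() {
  int device = 0;
  cudaGetDevice(&device);
  int max_optin_bytes = 0;
  cudaDeviceGetAttribute(&max_optin_bytes, cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
  std::printf("Opt-in dynamic shared memory limit: %d bytes (~%d KB)\n",
              max_optin_bytes, max_optin_bytes / 1024);
  return 0;
}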

Test Coverage

The test file was updated to test with num_experts=3000, but only exercises test_topk_softmax. The test_fused_scores_for_aux_loss function is not tested with large expert counts, leaving the backward kernel in fused_score_for_moe_aux_loss.cu without explicit test coverage for this scenario.

Issues Identified

  • Missing error handling: All 4 cudaFuncSetAttribute calls lack NVTE_CHECK_CUDA() wrappers for error handling, which is inconsistent with the codebase pattern seen in other files (e.g., ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu)
  • Incomplete test coverage: Only one of the two affected code paths is tested with large expert counts

Confidence Score: 4/5

  • This PR is safe to merge with minor improvements needed for error handling and test coverage
  • The core fix is correct and addresses the root cause (dynamic shared memory limits for large expert counts). The implementation is consistent across all 4 affected kernel launchers. However, the score is 4/5 rather than 5/5 because: (1) all cudaFuncSetAttribute calls lack error handling wrappers (NVTE_CHECK_CUDA) that are standard in the codebase, and (2) test coverage only validates one of two affected code paths with large expert counts. These are best practice improvements rather than critical bugs.
  • All three files need minor attention: the two CUDA files need error handling wrappers added, and the test file could benefit from more comprehensive test coverage for the aux_loss functions with large expert counts

Important Files Changed

File Analysis

  • transformer_engine/common/fused_router/fused_score_for_moe_aux_loss.cu (score 4/5): Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048); missing NVTE_CHECK_CUDA error handling wrappers
  • transformer_engine/common/fused_router/fused_topk_with_score_function.cu (score 4/5): Added cudaFuncSetAttribute calls for forward and backward kernels to support large expert counts (>2048); missing NVTE_CHECK_CUDA error handling wrappers
  • tests/pytorch/test_fused_router.py (score 3/5): Updated test to use num_experts=3000, but only exercises one code path (topk_softmax); missing coverage for the aux_loss functions with large expert counts

Sequence Diagram

sequenceDiagram
    participant User as PyTorch User Code
    participant Router as Router Function
    participant Launcher as Kernel Launcher
    participant CUDA as CUDA Runtime
    participant Kernel as CUDA Kernel
    
    User->>Router: Call fused_topk_with_score_function<br/>or fused_compute_score_for_moe_aux_loss<br/>(with num_experts > 2048)
    Router->>Launcher: Invoke kernel_launcher<br/>(num_experts, num_tokens, topk, etc.)
    
    Note over Launcher: Calculate shared_memory_size<br/>= num_experts * num_token_per_block * sizeof(DataType) * N<br/>(N=1-4 buffers depending on kernel)
    
    Note over Launcher: For num_experts=3000:<br/>Forward: ~47 KB<br/>Backward: ~152 KB
    
    Launcher->>CUDA: cudaFuncSetAttribute(<br/>kernel,<br/>cudaFuncAttributeMaxDynamicSharedMemorySize,<br/>shared_memory_size)
    
    Note over CUDA: Sets max dynamic shared memory limit<br/>(default 48 KB → up to 164 KB based on device)
    
    CUDA-->>Launcher: Success (or error if exceeds device limit)
    
    Launcher->>CUDA: Launch kernel<<<grid, block, shared_memory_size>>>
    CUDA->>Kernel: Execute kernel with allocated shared memory
    
    Note over Kernel: Each warp processes 1 token<br/>Uses allocated shared memory for:<br/>- logits/scores buffer<br/>- topk results<br/>- intermediate computations
    
    Kernel-->>CUDA: Kernel completes
    CUDA->>Launcher: cudaGetLastError()
    Launcher-->>Router: Return results
    Router-->>User: Return probs/scores and routing_map

Comment on lines +150 to +151
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling (see examples in ln_fwd_cuda_kernel.cu, rmsnorm_fwd_cuda_kernel.cu, and other files).

If cudaFuncSetAttribute fails (e.g., if the requested shared memory size exceeds device limits), the error should be caught and reported.

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_forward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Comment on lines +288 to +289
cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_score_for_moe_aux_loss_backward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +256 to +257
cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_forward_kernel<DataType, BiasType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Comment on lines +449 to +450
cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
                     cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);

Missing error handling for cudaFuncSetAttribute. The codebase pattern is to wrap CUDA API calls with NVTE_CHECK_CUDA() for proper error handling.

Suggested change
- cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
-                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size);
+ NVTE_CHECK_CUDA(cudaFuncSetAttribute(fused_topk_with_score_function_backward_kernel<DataType>,
+                                      cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size));

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!


greptile-apps bot commented Jan 9, 2026

Additional Comments (1)

tests/pytorch/test_fused_router.py
The test in __main__ only exercises test_topk_softmax with num_experts=3000, but doesn't test test_fused_scores_for_aux_loss with a large expert count.

Since the PR fixes apply to both fused_topk_with_score_function.cu AND fused_score_for_moe_aux_loss.cu, the test coverage should include both code paths to verify the fix works for all affected kernels.

Consider adding a test call for test_fused_scores_for_aux_loss with num_experts=3000 to ensure the backward kernel in fused_score_for_moe_aux_loss.cu is also properly tested (which requires ~152 KB of shared memory for 3000 experts).

@ptrendx requested a review from denera on January 9, 2026 at 22:03