
Add tests for fusedmoe 2694938553 #2566

Draft

shaunkotek wants to merge 11 commits into flashinfer-ai:main from shaunkotek:add-tests-for-fusedmoe-2694938553

Conversation

@shaunkotek

📌 Description

Add more test cases to the Cutlass fusedMoE kernel to cover Nemotron use cases, in order to make sure they do not break.

🚀 Pull Request Checklist

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

For larger cases, I got a lot of rounding error in the quantized tests, so I scaled all the numbers down even further (previously by 10, now by 100). I checked that this does not create degenerate matrices, but if you have any suggestions on how to do it better, I'm open.
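
A minimal sketch of that scaling change, assuming the tests draw random expert weights with torch.randn (the shapes, tensor names, and flat layout here are illustrative, not the PR's exact code):

import torch

# Illustrative Nemotron-like shapes; only the divisor change matters here.
NUM_EXPERTS, HIDDEN_SIZE, INTERMEDIATE_SIZE = 8, 1024, 2048

# Dividing the random weights by 100 (previously 10) keeps the FP8/NVFP4
# matmul outputs in a range where per-tensor scales lose less precision,
# while torch.randn still yields full-rank (non-degenerate) matrices
# with probability 1.
w31 = torch.randn(
    NUM_EXPERTS, 2 * INTERMEDIATE_SIZE, HIDDEN_SIZE, device="cuda"
) / 100.0
w2 = torch.randn(
    NUM_EXPERTS, HIDDEN_SIZE, INTERMEDIATE_SIZE, device="cuda"
) / 100.0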

@coderabbitai
Contributor

coderabbitai bot commented Feb 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.



@gemini-code-assist
Contributor

Summary of Changes

Hello @shaunkotek, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of the fusedMoE kernel by expanding its test suite. The changes introduce new test configurations and activation types, specifically Relu2, to better align with Nemotron use cases. Furthermore, it addresses numerical precision issues in quantized tests by fine-tuning scaling factors, ensuring the kernel's stability and accuracy under various operational conditions.

Highlights

  • Expanded Test Coverage: Added comprehensive test cases for the fusedMoE kernel, specifically incorporating Relu2 activation type support across various quantization and parallelism scenarios.
  • Test Configuration Refactoring: Introduced a namedtuple (MoeConfig) to streamline and organize test parameters, making test definitions cleaner and more scalable for different MoE configurations (a sketch follows this list).
  • Quantization Precision Improvements: Adjusted scaling factors in FP8 and NVFP4 quantized tests to mitigate rounding errors, enhancing the numerical stability and accuracy of the tests.
  • Device Placement Consistency: Ensured all newly created tensors within the test suite are explicitly placed on the CUDA device, promoting consistent and correct GPU utilization.
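
As a rough illustration of the MoeConfig refactor described above (the field names follow the parameters listed in the changelog below; MOE_CONFIGS, the example shapes, and test_fused_moe are hypothetical):

from collections import namedtuple

import pytest

MoeConfig = namedtuple(
    "MoeConfig",
    ["batch_size", "hidden_size", "num_experts", "top_k", "intermediate_size"],
)

# Illustrative configurations: one object per scenario replaces five
# separate parametrize axes.
MOE_CONFIGS = [
    MoeConfig(batch_size=4, hidden_size=1024, num_experts=8, top_k=2,
              intermediate_size=2048),
    MoeConfig(batch_size=128, hidden_size=2048, num_experts=16, top_k=4,
              intermediate_size=4096),
]

@pytest.mark.parametrize("config", MOE_CONFIGS)
def test_fused_moe(config):
    ...  # body elided; each test reads config.hidden_size and friends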


Changelog
  • tests/moe/test_trtllm_cutlass_fused_moe.py
    • Introduced namedtuple for structured MoE test configurations.
    • Refactored test parameterization to use MoeConfig objects, consolidating batch_size, hidden_size, num_experts, top_k, and intermediate_size.
    • Expanded test coverage to include Relu2 activation type across various MoE tests (FP8, NVFP4, expert parallel, tensor parallel, and combined parallel).
    • Modified the compute_with_experts reference function to support different activation_type and fc2_quant parameters (a sketch follows this list).
    • Adjusted scaling factors for weights in FP8 and NVFP4 quantization tests to improve numerical stability and reduce rounding errors.
    • Ensured explicit CUDA device placement for newly created tensors in test setups.
    • Updated parallelism tests (expert_parallel, tensor_parallel, tensor_expert_parallel) to correctly handle Relu2 activation type and use the new configuration structure.
    • Added conditional logic for TP_SIZES to include 8 if enough CUDA devices are available.
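
For context, a sketch of how a reference implementation can dispatch on the activation type (the function name, string-based dispatch, and gate/up split convention are assumptions for illustration; the PR's compute_with_experts may differ):

import torch
import torch.nn.functional as F

def apply_activation(x: torch.Tensor, activation_type: str) -> torch.Tensor:
    if activation_type == "swiglu":
        # Gated path: split the fused w3/w1 FC1 output in half; the gate
        # half goes through SiLU and scales the linear half.
        gate, up = x.chunk(2, dim=-1)
        return F.silu(gate) * up
    if activation_type == "relu2":
        # Nemotron-style squared ReLU: relu(x) ** 2, no gating, so FC1
        # produces intermediate_size columns instead of 2x.
        return F.relu(x).square()
    raise ValueError(f"unsupported activation_type: {activation_type}")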
Activity
  • The author has completed all pre-commit checks.
  • All existing and new tests are passing.
  • The author identified and addressed rounding errors in larger quantized tests by scaling down numbers, confirming this does not create degenerate matrices.

Contributor

@gemini-code-assist bot left a comment


Code Review

The pull request significantly improves the test coverage for the fusedmoe kernel by adding support for various configurations and activation types (SwiGLU and ReLU2). It also addresses numerical stability issues in quantized tests by adjusting the scaling factors. My feedback focuses on improving the efficiency and correctness of quantization scale calculations in the tests, as well as removing redundant operations and commented-out code.

else:
    w1_scales = w31_scales
x_quant, hidden_states_scale = dynamic_per_tensor_fp8_quant(x)
hidden_states_scale = torch.tensor(hidden_states_scale[0], device="cuda")

medium

The hidden_states_scale tensor returned by dynamic_per_tensor_fp8_quant is already on the same device as the input x (which is CUDA). Re-creating it with torch.tensor and explicitly moving it to CUDA is redundant and slightly less efficient.

Suggested change
- hidden_states_scale = torch.tensor(hidden_states_scale[0], device="cuda")
+ hidden_states_scale = hidden_states_scale[0]

Comment on lines 597 to 600
a1_gs = (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX) / torch.abs(x).max().to(
    torch.float32
).cuda()
a1_gs = torch.tensor(1.0, device="cuda", dtype=torch.float32)
# a1_gs = torch.tensor(1.0, device="cuda", dtype=torch.float32)

medium

The .cuda() call is redundant here as x is already on the CUDA device. Additionally, the commented-out code should be removed to keep the test file clean.

Suggested change
- a1_gs = (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX) / torch.abs(x).max().to(
-     torch.float32
- ).cuda()
- a1_gs = torch.tensor(1.0, device="cuda", dtype=torch.float32)
- # a1_gs = torch.tensor(1.0, device="cuda", dtype=torch.float32)
+ a1_gs = (FLOAT8_E4M3_MAX * FLOAT4_E2M1_MAX) / torch.abs(x).max().to(
+     torch.float32
+ )

@shaunkotek force-pushed the add-tests-for-fusedmoe-2694938553 branch from 48a9c59 to 2460343 on February 15, 2026 at 09:34