
feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron#2304

Merged
yzh119 merged 25 commits into flashinfer-ai:main from amitz-nv:fused-moe-non-gated-fp8
Jan 30, 2026

Conversation

@amitz-nv (Contributor) commented Jan 7, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Expanded activation options (Gelu, Relu, Silu, Swiglu, Geglu, SwigluBias, Relu2, Identity) and exposed ActivationType throughout the CLI and APIs.
    • DeepSeek routing supports larger top‑K and a configurable top‑experts dimension.
    • Added post‑GEMM element‑wise activation option and a CLI flag to select activation type.
  • Breaking Changes

    • ActivationType replaces the previous gated-activation enum in public APIs and tests; callers must use ActivationType values.
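
As a rough illustration of the breaking change above, a caller now selects an ActivationType member instead of the old gated-activation enum. The snippet below is a minimal sketch: the import path follows the walkthrough (flashinfer.fused_moe exposes ActivationType), while the commented runner call is hypothetical and not copied from the real API.

  from flashinfer.fused_moe import ActivationType

  # Gated activation (previous behaviour): GEMM1 produces 2 * intermediate_size
  # columns that are combined by the gate before GEMM2.
  act = ActivationType.Swiglu

  # Non-gated activation added by this PR: GEMM1 produces intermediate_size
  # columns and the element-wise activation is applied directly.
  act = ActivationType.Relu2

  # Hypothetical call site; the actual signature lives in flashinfer/fused_moe/core.py.
  # runner = MoERunner(..., activation_type=act)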


@coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

Replaces GatedActType with ActivationType and threads activation configuration through Python APIs, benchmarks, tests, C++/CUDA launchers, routing kernels, and batched-GEMM runners; extends DeepSeek top-K/top-expert handling and adds eltwise activation and related options.

Changes

Cohort / File(s) Summary
Python Public API & Core
flashinfer/__init__.py, flashinfer/fused_moe/__init__.py, flashinfer/fused_moe/core.py
Expose ActivationType, remove GatedActType; update MoERunner and op signatures to accept activation_type; propagate enum through Python→C++ callsites and update defaults/docs.
Benchmarks & CLI
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py, benchmarks/routines/moe.py, benchmarks/routines/flashinfer_benchmark_utils.py
Add --activation-type CLI, enum_type() argparse helper, thread activation_type into FP8/FP4 autotuner paths and add runtime validation for incompatible combos.
Tests & Test Utils
tests/moe/*, tests/moe/utils.py
Replace GatedActType imports/uses with ActivationType; add is_gated_activation() and NON_GATED_ACTIVATION_SUPPORTED_QUANT_MODES; update skip_checks() signature and test parametrizations.
C++ MoE Launchers & Runners
csrc/trtllm_fused_moe_kernel_launcher.cu, csrc/trtllm_fused_moe_runner.cu, csrc/trtllm_fused_moe_routing_deepseek.cu
Replace gated-act wiring with ActivationType across launcher/runner APIs; add activation→gated/eltwise conversion helpers; increase DeepSeek top-K and introduce Max/Default top-expert constants; thread new top-experts parameter through routing.
C++ Kernel, Macros & Headers
include/.../KernelRunner.h, include/.../DevKernel.h, include/.../RoutingKernel.h, include/.../runner.h
Add EltwiseActType and eltwiseActType option; rename useShuffledMatrixA → useShuffledMatrix; add MaxNumTopExperts_ template param and constexpr; expand DEEPSEEK launch macros to accept numTopExperts; introduce ActivationType enum and helpers.
Batched GEMM Runner
csrc/trtllm_batched_gemm_runner.cu, include/.../KernelRunner.h
Add eltwise activation type consistency checks, consolidate per-config filters, log activation-type fields, and pass scaleAct via scaleGateC.
MoE Kernel Entrypoints & Launch Signatures
csrc/... and include/... trtllm_fp4/trtllm_fp8 entrypoints
Update entry signatures to accept act_type / activation_type integers; thread activation enum values into CUDA kernels and launcher config resolution.
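
For orientation, here is a minimal Python sketch of the gated/non-gated split described above. The enum members are taken from the summary, but their numeric values and the helper name is_gated_activation are assumptions, not the repository's actual code.

  from enum import IntEnum

  class ActivationType(IntEnum):
      # Members listed in the summary; numeric values here are placeholders.
      Gelu = 0
      Relu = 1
      Silu = 2
      Swiglu = 3
      Geglu = 4
      SwigluBias = 5
      Relu2 = 6
      Identity = 7

  def is_gated_activation(act: ActivationType) -> bool:
      # Gated activations split the GEMM1 output into value/gate halves, so the
      # intermediate dimension is doubled (see the workspace-size comments below).
      # SwigluBias is included per the review note on isGatedActivation.
      return act in (ActivationType.Swiglu, ActivationType.Geglu, ActivationType.SwigluBias)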

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant PyCore as Python Core
    participant Launcher as C++ Launcher/Runner
    participant CUDA as CUDA Kernel
    participant Device as Device/Experts

    CLI->>PyCore: parse args (--activation-type, quant_mode, numTopExperts)
    PyCore->>Launcher: build config (activation_type.value, useShuffledMatrix, numTopExperts)
    Launcher->>Launcher: select valid configs / getValidConfigs(act_type, numTopExperts)
    Launcher->>CUDA: launch kernel(config_index, act_type, topK)
    CUDA->>Device: route tokens / apply activation (gated or eltwise)
    Device-->>CUDA: return results
    CUDA-->>Launcher: outputs
    Launcher-->>PyCore: surface results (activation_type, config)
    PyCore-->>CLI: report results / autotuner output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • joker-eph
  • aleozlx
  • cyx-6
  • djmmoss
  • IwakuraRein
  • nvmbreughe
  • bkryu
  • jiahanc
  • yzh119

Poem

🐇 I hopped from GatedAct to Activation bright,
Swiglu, Gelu, Relu juggled through the night.
Kernels launch, top-Ks reach further fields,
Experts route where tiny logic yields.
A rabbit cheers—new activation, delight.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 38.10%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the PR title accurately summarizes the main changes: support for non-gated Relu2 activation in NVFP4 & FP8 and Nemotron support.
  • Description check — ✅ Passed: the PR description covers the main changes (element-wise activation support, ActivationType enum replacement, Nemotron support, UseShuffledMatrixA removal) and includes completed pre-commit checks.



@gemini-code-assist bot (Contributor)

Summary of Changes

Hello @amitz-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of fused Mixture-of-Experts (MoE) operations by introducing a unified ActivationType enum, moving beyond just gated activations. The primary focus is on enabling non-gated activation functions like Relu2 for FP8 MoE, which required extensive changes to the underlying C++ kernels, Python bindings, and testing infrastructure. These changes ensure that the system can efficiently handle various activation types, with specific optimizations and checks for gated vs. non-gated behaviors, leading to broader applicability and improved performance for different model architectures.

Highlights

  • Unified Activation Type: The pull request replaces the specific GatedActType enum with a more general ActivationType enum across the codebase. This new enum supports a wider range of activation functions, including Gelu, Relu, Silu, Swiglu, Geglu, SwigluBias, Relu2, and Identity.
  • Relu2 Activation Support: Explicit support for the Relu2 activation function has been introduced in fused MoE operations, particularly for FP8. This involved adjusting intermediate tensor sizes and activation logic based on whether the activation is gated or non-gated.
  • Benchmarking and Testing Updates: Benchmark scripts and test suites have been modified to incorporate the new ActivationType parameter, enabling comprehensive testing and performance evaluation of different activation functions within the fused MoE framework.
  • TMA Descriptor Enhancements: The Tensor Memory Access (TMA) descriptor building logic has been updated to handle padding and swizzling more flexibly, especially for block-scaled formats and uniform token batching, improving robustness and efficiency.
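
To make the gated vs. non-gated distinction concrete, a per-expert reference forward might look like the sketch below. The gate/value ordering and weight shapes are illustrative only; the real reference lives in the tests, and the kernels operate on fused, quantized weights.

  import torch
  import torch.nn.functional as F

  def expert_forward(x, w1, w2, gated: bool):
      h = x @ w1.t()
      if gated:
          # Gated (e.g. Swiglu): w1 has 2 * intermediate_size rows; split and gate.
          gate, value = h.chunk(2, dim=-1)
          h = F.silu(gate) * value
      else:
          # Non-gated Relu2: w1 has intermediate_size rows; activate element-wise.
          h = F.relu(h) ** 2
      return h @ w2.t()

  x = torch.randn(4, 64)
  w2 = torch.randn(64, 32)
  y_swiglu = expert_forward(x, torch.randn(2 * 32, 64), w2, gated=True)
  y_relu2 = expert_forward(x, torch.randn(32, 64), w2, gated=False)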


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for non-gated activations, such as Relu2, into the Fused MoE FP8 kernels. The changes primarily involve refactoring GatedActType to a more general ActivationType enum and plumbing this through the Python wrappers, benchmarks, and C++/CUDA implementation. The refactoring is extensive and mostly well-executed.

I have identified a few issues that need attention. There are potential bugs in csrc/trtllm_fused_moe_runner.cu related to the calculation of workspace size and GEMM configuration validation, which do not correctly handle the doubled intermediate size for gated activations. Additionally, there's a minor code cleanup opportunity in benchmarks/routines/moe.py and a parameter name typo in tests/moe/test_dpsk_fused_moe_fp8.py. Addressing these issues will improve the correctness and clarity of the code.

Comment on lines 305 to 310
return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
numExperts, maxNumCtasInBatchDim, configIndex);

Severity: high

The intermediateSize passed to getWorkspaceSizeInBytes does not account for gated activations, where the intermediate dimension is doubled. This could lead to under-allocating workspace. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct workspace size calculation, similar to how it's done in the run method.

  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize, hiddenSize, {}, numTokens,
                                         numExperts, maxNumCtasInBatchDim, configIndex);

Comment on lines 314 to 315
return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {},
numTokens, numExperts, maxNumCtasInBatchDim);

Severity: high

The intermediateSize passed to getDefaultValidConfigIndex does not account for gated activations. This could lead to selecting a suboptimal or incorrect default configuration. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config selection.

  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize, hiddenSize, {},
                                            numTokens, numExperts, maxNumCtasInBatchDim);

Comment on lines 325 to 330
mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
numTokens, numExperts, maxNumCtasInBatchDim);

Severity: high

The intermediateSize passed to isValidConfigIndex does not account for gated activations. This could lead to incorrect validation of GEMM configurations. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config validation.

      mRunner.isValidConfigIndex(configIndex, numTokens, (isGatedActivation(mActType) ? 2 : 1) * intermediateSize, hiddenSize, {},
                                 numTokens, numExperts, maxNumCtasInBatchDim);

gemmData.mInputBuffers.mPtrScaleC = scaleC;
gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
// TODO amitz-nv: Do we want to pass scaleAct instead of using scaleGateC?
gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
@amitz-nv (Contributor, Author) commented Jan 7, 2026

Decide whether it's OK or fix in the future?

@amitz-nv amitz-nv changed the title Fused MoE FP8 non gated Relu2 Fused MoE non gated FP8 Relu2 Jan 7, 2026
@amitz-nv amitz-nv changed the title Fused MoE non gated FP8 Relu2 Fused MoE non gated Relu2 FP8 Jan 7, 2026
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 FP8 Fused MoE non gated Relu2 NVFP4 & FP8 Jan 12, 2026
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch 2 times, most recently from 9655f95 to 4c9fb49 Compare January 26, 2026 15:07
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 NVFP4 & FP8 Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron Jan 26, 2026
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch from 4c9fb49 to 9a1ffa0 Compare January 26, 2026 16:29
@amitz-nv amitz-nv marked this pull request as ready for review January 26, 2026 16:45
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron Jan 26, 2026
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

517-538: Guard topK > 8 for non‑Nemotron expert counts.
LAUNCH_ROUTING_DEEPSEEK still instantiates kernels with MaxNumTopExperts=DefaultMaxNumTopExperts for numExperts <= NumKimiK2Experts, but runImpl now allows topK up to 22. That can drive params.mTopK beyond KernelParams::MaxNumTopExperts and corrupt stack/shared buffers.

Consider either selecting MaxSupportedTopExperts whenever data.mTopK > DefaultMaxNumTopExperts, or explicitly rejecting such inputs for non‑Nemotron expert counts.

🔧 Suggested guard (minimal change)
@@
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                    data.mTopK);
+  if (data.mNumExperts < NumNemotronExperts) {
+    FLASHINFER_CHECK(
+        data.mTopK <= DefaultMaxNumTopExperts,
+        "For numExperts < %d, routing kernel supports topK <= %d, got %d",
+        NumNemotronExperts, DefaultMaxNumTopExperts, data.mTopK);
+  }

Also applies to: 560-573

benchmarks/routines/moe.py (1)

872-898: Fix stale args.gated_act access after the activation-type switch.
--gated_act was removed, so this path will raise AttributeError when output_path is used.

✅ Suggested fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = str(args.activation_type)
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/BatchedGemmOptions.h (1)

496-500: Guard optional mRouteSfsImpl in dumpOptions.
options.mRouteSfsImpl.value() will throw if dumpOptions() is called before mRouteSfsImpl is set (e.g., default options without checkAndUpdateBatchedGemmOptions). Guard the optional or emit nullopt.

🔧 Suggested fix
-  ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
-     << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+  if (options.mRouteSfsImpl.has_value()) {
+    ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
+       << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+  } else {
+    ss << "mRouteSfsImpl={nullopt}," << std::endl;
+  }
flashinfer/fused_moe/core.py (1)

210-236: Include gated/non‑gated flag in permute cache key.
permute0 now differs by is_gated_act_gemm, but the cache key ignores it, so a gated run can poison the cache for a non‑gated run (or vice‑versa), yielding incorrect permutations.

🔧 Suggested fix
-    cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+    cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm, num_elts_per_sf)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)

533-553: Scale FP8‑per‑tensor GEMM1 buffers by activation type.
Now that activation_type can be non‑gated, prepare_moe() still allocates GEMM1 output/scales for 2 * intermediate_size. When the runner uses M=intermediate_size, the row stride is wrong and outputs can overlap.

🔧 Suggested fix
   void prepare_moe(int64_t& moe_tactic) override {
     FusedMoeLauncher::prepare_moe_common(moe_tactic);

     int32_t max_num_padded_tokens_gemm1 = workspace.total_max_padded_tokens + args->num_experts;
     int32_t max_num_padded_tokens_gemm2 = workspace.total_max_padded_tokens;
+
+    int32_t const intermediate_size_factor =
+        tensorrt_llm::kernels::trtllmgen_moe::MoE::isGatedActivation(activation_type) ? 2 : 1;
+    int32_t const gemm1_out_dim = intermediate_size_factor * args->intermediate_size;

-    gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, 2 * args->intermediate_size},
+    gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, gemm1_out_dim},
                                 dl_uint8, hidden_states.device());
-    gemm1_output_scale =
-        alloc_tensor({2 * args->intermediate_size / 128, max_num_padded_tokens_gemm1}, dl_float32,
-                     hidden_states.device());
+    gemm1_output_scale =
+        alloc_tensor({gemm1_out_dim / 128, max_num_padded_tokens_gemm1}, dl_float32,
+                     hidden_states.device());
🤖 Fix all issues with AI agents
In `@flashinfer/fused_moe/core.py`:
- Around line 1533-1535: The fake-op function signatures that accept the unused
parameter activation_type should rename that parameter to _activation_type (or
prefix it with an underscore) to silence Ruff ARG001 while preserving signature
compatibility; update the parameter name in each fake-op signature (the
occurrences where activation_type is accepted but unused) and ensure any
internal references (if any) are adjusted accordingly so behavior is unchanged.

In `@include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmOptions.h`:
- Line 553: Update the debug/dump label to match the renamed field: replace the
literal "mUseShuffledMatrixA=" with "mUseShuffledMatrix=" where the code writes
out options.mUseShuffledMatrix (in the dump/ostream code that uses ss <<
"mUseShuffledMatrixA=" << options.mUseShuffledMatrix << ...), so the printed
label matches the actual member name.

In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation helper currently only checks
ActivationType::Swiglu and ActivationType::Geglu; update it to also treat
ActivationType::SwigluBias as gated so code paths that expect gated activations
(e.g., hidden-size handling) follow the correct branch—modify the
isGatedActivation(ActivationType activationType) function to return true for
ActivationType::SwigluBias in addition to Swiglu and Geglu.
♻️ Duplicate comments (2)
csrc/trtllm_batched_gemm_runner.cu (1)

226-227: Follow up on the scaleAct vs scaleGateC TODO.

This open question can affect non-gated activation scaling once those paths are exercised.

csrc/trtllm_fused_moe_runner.cu (1)

304-330: Apply gated size factor in workspace/config helpers.
run() scales intermediateSize for gated activations, but the workspace/config helpers still use the unscaled value, risking under‑allocation or invalid config selection for Swiglu/Geglu.

🔧 Suggested fix
 size_t Runner::getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize,
                                        int32_t numExperts, int32_t numTokens,
                                        int32_t configIndex) const {
   auto maxNumCtasInBatchDim =
       Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize,
+                                         hiddenSize, {}, numTokens,
                                          numExperts, maxNumCtasInBatchDim, configIndex);
 }
@@
 int32_t Runner::getDefaultValidConfigIndex(int32_t topK, int32_t hiddenSize,
                                            int32_t intermediateSize, int32_t numExperts,
                                            int32_t numTokens) const {
   auto maxNumCtasInBatchDim =
       Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize,
+                                            hiddenSize, {}, numTokens,
                                             numExperts, maxNumCtasInBatchDim);
 }
@@
   auto const isValid =
-      mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
+      mRunner.isValidConfigIndex(configIndex, numTokens,
+                                 (isGatedActivation(mActType) ? 2 : 1) * intermediateSize,
+                                 hiddenSize, {},
                                  numTokens, numExperts, maxNumCtasInBatchDim);
🧹 Nitpick comments (4)
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmGatedActOptions.h (1)

79-80: Consider naming ActType::None in getActTypeName.
Now that None is a concrete enum member, getActTypeName will return "Unknown type" for it, which can muddy diagnostics. A small switch update keeps logs clear.

♻️ Proposed update
 inline std::string getActTypeName(ActType type) {
   switch (type) {
     case ActType::SwiGlu:
       return "SwiGlu";
     case ActType::GeGlu:
       return "GeGlu";
+    case ActType::None:
+      return "None";
     default:
       return "Unknown type";
   }
 }
include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)

179-188: Guard MaxNumTopExperts against invalid instantiations.

Now that MaxNumTopExperts_ is part of the public template surface, a compile-time bound check helps prevent accidental topK > MaxNumExperts configurations from compiling and later causing bounds issues.

♻️ Proposed compile-time guard
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_, isPow2_, UsePdl_> {
   using InputT = InputT_;
   using BiasT = BiasT_;
   using OutputT = OutputT_;

   static constexpr bool UseGroups = UseGroups_;
   static constexpr int MaxNumTopExperts = MaxNumTopExperts_;
+  static_assert(MaxNumTopExperts_ > 0 && MaxNumTopExperts_ <= MaxNumExperts_,
+                "MaxNumTopExperts must be within [1, MaxNumExperts]");
include/flashinfer/trtllm/batched_gemm/KernelRunner.h (1)

50-71: Keep EltwiseActType synced with the canonical GEMM enum.

This enum is later compared to config enums via integer casts, so any drift would silently filter out valid configs. Consider aliasing the canonical enum from Enums.h or add compile-time guards to ensure numeric parity.

🔧 Suggested safeguard (adjust namespace if needed)
 enum class EltwiseActType {
   None = 0,
   Gelu,
   Relu2,
 };
+
+static_assert(
+    static_cast<int>(EltwiseActType::None) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::None),
+    "EltwiseActType::None must stay in sync with batchedGemm::gemm::EltwiseActType::None");
+static_assert(
+    static_cast<int>(EltwiseActType::Gelu) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Gelu),
+    "EltwiseActType::Gelu must stay in sync with batchedGemm::gemm::EltwiseActType::Gelu");
+static_assert(
+    static_cast<int>(EltwiseActType::Relu2) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Relu2),
+    "EltwiseActType::Relu2 must stay in sync with batchedGemm::gemm::EltwiseActType::Relu2");
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

357-364: Argparse enum parsing may be unintuitive.

type=ActivationType tends to accept numeric values only (e.g., --activation-type 3), which is easy to mis-use. Consider accepting enum names and mapping them explicitly for CLI ergonomics.

🛠️ Suggested CLI parsing tweak
+def parse_activation_type(value: str) -> ActivationType:
+    try:
+        return ActivationType[value]
+    except KeyError:
+        return ActivationType(int(value))
+
 ...
 parser.add_argument(
     "--activation-type",
-    type=ActivationType,
+    type=parse_activation_type,
     choices=list(ActivationType),
     required=False,
     default=ActivationType.Swiglu,
     help=f"Type of gated activation function: {list(ActivationType)}",
 )

Comment on lines 1533 to 1535
enable_pdl: Optional[bool] = None,
activation_type: int = ActivationType.Identity.value,
):

⚠️ Potential issue | 🟡 Minor

Silence unused activation_type in fake ops (Ruff ARG001).
The fake ops keep activation_type for signature compatibility but never use it.

🔧 Suggested fix
-    activation_type: int = ActivationType.Identity.value,
+    _activation_type: int = ActivationType.Identity.value,
-        activation_type: int,
+        _activation_type: int,

Also applies to: 1911-1913

🧰 Tools
🪛 Ruff (0.14.13)

1533-1533: Unused function argument: enable_pdl

(ARG001)


1534-1534: Unused function argument: activation_type

(ARG001)

🤖 Prompt for AI Agents
In `@flashinfer/fused_moe/core.py` around lines 1533 - 1535, The fake-op function
signatures that accept the unused parameter activation_type should rename that
parameter to _activation_type (or prefix it with an underscore) to silence Ruff
ARG001 while preserving signature compatibility; update the parameter name in
each fake-op signature (the occurrences where activation_type is accepted but
unused) and ensure any internal references (if any) are adjusted accordingly so
behavior is unchanged.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

518-538: Prevent topK > DefaultMaxNumTopExperts on non‑Nemotron paths (OOB risk).

data.mTopK is validated against MaxSupportedTopExperts globally, but non‑Nemotron branches still instantiate kernels with DefaultMaxNumTopExperts. If data.mTopK > DefaultMaxNumTopExperts on Deepseek/Kimi (or smaller expert counts), the kernel will index beyond the fixed‑size topScores/topExperts arrays.

A minimal fix is to tighten validation for those branches so mTopK never exceeds the instantiated compile‑time bound.

🐛 Proposed fix (tighten validation for non‑Nemotron)
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                    data.mTopK);
+  if (data.mNumExperts <= NumKimiK2Experts) {
+    FLASHINFER_CHECK(data.mTopK <= DefaultMaxNumTopExperts,
+                     "Routing kernel expects topK experts <= %d for %d experts, got %d",
+                     DefaultMaxNumTopExperts, data.mNumExperts, data.mTopK);
+  }

Also applies to: 560-562

🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 568-573: The two validation checks using FLASHINFER_CHECK
incorrectly enforce a minimum expert count by comparing data.mNumExperts >=
MaxSupportedTopExperts; change the logic to validate that the requested top-K
does not exceed available experts (e.g., check topK <= data.mNumExperts or
data.mTopK <= data.mNumExperts) and keep the other check that data.mNumExperts
<= MaxSupportedExpertCount; update the error message to reflect "topK must be <=
`#experts`" and reference FLASHINFER_CHECK, data.mNumExperts,
MaxSupportedTopExperts, MaxSupportedExpertCount, and the top-K variable (topK or
data.mTopK) so the routing kernel accepts small valid expert counts.

Comment on lines +568 to +573
 FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
                  "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
                  data.mNumExperts);
-FLASHINFER_CHECK(data.mNumExperts <= NumKimiK2Experts,
+FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                  "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
-                 NumKimiK2Experts);
+                 MaxSupportedExpertCount);

⚠️ Potential issue | 🟠 Major

Validation rejects valid small expert counts.

data.mNumExperts >= MaxSupportedTopExperts enforces a minimum expert count of 22, which can block supported configurations (e.g., <= topk::MaxNumExpertsUnit). This check should be about topK <= numExperts, not a hard minimum expert count.

✅ Proposed fix
-  FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                   "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                   data.mNumExperts);
+  FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
+                   "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
+                   data.mNumExperts);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                 "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                 data.mNumExperts);
-FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
-                 "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
-                 MaxSupportedExpertCount);
+FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
+                 "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
+                 data.mNumExperts);
+FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
+                 "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
+                 MaxSupportedExpertCount);
🤖 Prompt for AI Agents
In `@csrc/trtllm_fused_moe_routing_deepseek.cu` around lines 568 - 573, The two
validation checks using FLASHINFER_CHECK incorrectly enforce a minimum expert
count by comparing data.mNumExperts >= MaxSupportedTopExperts; change the logic
to validate that the requested top-K does not exceed available experts (e.g.,
check topK <= data.mNumExperts or data.mTopK <= data.mNumExperts) and keep the
other check that data.mNumExperts <= MaxSupportedExpertCount; update the error
message to reflect "topK must be <= `#experts`" and reference FLASHINFER_CHECK,
data.mNumExperts, MaxSupportedTopExperts, MaxSupportedExpertCount, and the top-K
variable (topK or data.mTopK) so the routing kernel accepts small valid expert
counts.

@yzh119 (Collaborator) commented Jan 27, 2026

/bot run

@yzh119 (Collaborator) commented Jan 27, 2026

@flashinfer-bot run

…spaceSizeInBytes, getDefaultValidConfigIndex, isValidConfigIndex

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch from 20b4ba7 to e63e17d Compare January 28, 2026 15:06
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv (Contributor, Author) commented Jan 28, 2026

@aleozlx

> do we intend to touch the GemmOptions.h in gemm?

I don't think so. I just rebased, so main now includes the required changes in trtllmGen_bmm_export, which removes those changes from this PR.

> seeing errors like
>
> error: namespace "batchedGemm::trtllm::gen" has no member "Sparsity"
>   , trtllm::gen::Sparsity(0)
>
> in the ci run

I believe the rebase should solve this as well.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
benchmarks/routines/moe.py (1)

872-898: Output export still references args.gated_act (now removed).

This will raise AttributeError when --output-path is used. Please switch to activation_type.

🛠️ Proposed fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = (
+            args.activation_type.name
+            if isinstance(args.activation_type, ActivationType)
+            else str(args.activation_type)
+        )
flashinfer/fused_moe/core.py (1)

215-240: Cache key missing is_gated_act_gemm, risking incorrect permutation indices.

The cache key ("w3_w1", dst_w3_w1_weight.shape) does not include is_gated_act_gemm. If the same weight tensor is used with both gated and non-gated activations, the cached permute indices from the first call will be incorrectly reused for the second.

🐛 Proposed fix
 def _maybe_get_cached_w3_w1_permute_indices(
     _cache_permute_indices,
     dst_w3_w1_weight: torch.Tensor,
     epilogue_tile_m: int,
     num_elts_per_sf: Union[None, int] = None,
     is_gated_act_gemm: bool = True,
 ) -> torch.Tensor:
     # Create a unique cache key (weight_type, weight_shape)
-    cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+    cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm)
     if cache_key not in _cache_permute_indices:
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 357-364: The argparse configuration uses type=ActivationType which
passes raw strings to the ActivationType constructor and fails for IntEnum;
update the parser call (the parser.add_argument for "--activation-type") to use
a small custom parsing function that maps input strings to ActivationType via
bracket notation (e.g., ActivationType[input_str]) or accepts already-matching
enum members, validate choices using list(ActivationType), and set default to
ActivationType.Swiglu; locate the parser.add_argument where "--activation-type"
is declared and replace type=ActivationType with this custom parser function
(referencing ActivationType and the parser.add_argument invocation).

In `@benchmarks/routines/moe.py`:
- Around line 179-185: Argparse is currently using type=ActivationType which
only accepts integer enum values, causing names like "Swiglu" to fail; change
the add_argument call for "--activation-type" to accept enum names by replacing
type=ActivationType with a converter that maps strings to the enum (e.g., use a
lambda or small function that does ActivationType[item] if input is str or
ActivationType(int(item)) if numeric) and keep choices=list(ActivationType) and
default=ActivationType.Swiglu so both name and numeric inputs work; reference
the ActivationType enum and the "--activation-type" add_argument in the parser
to locate where to apply this change.

In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation function incorrectly omits
SwigluBias from its gated activation check; update the function
(isGatedActivation in runner.h) to treat ActivationType::SwigluBias as a gated
activation alongside ActivationType::Swiglu and ActivationType::Geglu so its
behavior matches the implementation in moe_gemm_kernels.h.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
benchmarks/routines/moe.py (1)

897-897: Bug: args.gated_act no longer exists.

This line references args.gated_act which was renamed to args.activation_type. This will cause an AttributeError at runtime when args.output_path is set.

🐛 Proposed fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = str(activation_type)
flashinfer/fused_moe/core.py (1)

1423-1509: Normalize activation_type before using .value.
trtllm_fp8_per_tensor_scale_moe_op can be called with activation_type as an int (the public API default is ActivationType.Swiglu.value). In that case, .value raises AttributeError. Coerce to ActivationType (or use int(...)) before the .value access.

🐛 Suggested fix
 def trtllm_fp8_per_tensor_scale_moe_op(
@@
-    activation_type: ActivationType = ActivationType.Swiglu,
+    activation_type: ActivationType = ActivationType.Swiglu,
 ) -> torch.Tensor:
+    activation_type = ActivationType(activation_type)
@@
-            activation_type=activation_type.value,
+            activation_type=int(activation_type),
@@
-            activation_type.value,
+            int(activation_type),
csrc/trtllm_fused_moe_kernel_launcher.cu (2)

379-402: Validate activation_type before storing.

activation_type is an external input; guard against invalid enum values to prevent undefined kernel paths.

🛠️ Proposed fix
   TVM_FFI_ICHECK(0 <= weight_layout && weight_layout <= 2)
       << "the value of weight_layout is not recognized";
+  auto act_type = static_cast<int64_t>(activation_type);
+  TVM_FFI_ICHECK(act_type >= 0 &&
+                 act_type < static_cast<int64_t>(ActivationType::InvalidType))
+      << "activation_type is not recognized";
   this->weight_layout = static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout);
   this->activation_type = activation_type;

487-503: BF16 config generation should ignore non‑Swiglu act_type.

The BF16 runtime path hardcodes Swiglu; generating configs for other activations can yield mismatched configs.

🛠️ Proposed fix
   std::set<int32_t> selected_tile_nums =
       computeSelectedTileN(supported_tile_nums, num_tokens, top_k, num_local_experts);
 
+  TVM_FFI_ICHECK(static_cast<ActivationType>(act_type) == ActivationType::Swiglu)
+      << "BF16 MoE supports only Swiglu activation.";
+
   for (int32_t tile_N : selected_tile_nums) {
     auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
         btg::Dtype::Bfloat16,  // dtype_act
         btg::Dtype::Bfloat16,  // dtype_weights
         false,                 // useDeepSeekFp8
-        tile_N, static_cast<ActivationType>(act_type), use_shuffled_weight,
+        tile_N, ActivationType::Swiglu, use_shuffled_weight,
         static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
🤖 Fix all issues with AI agents
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 1955-1962: The activation_type_to_func lookup can KeyError on
unsupported ActivationType values; update the logic around activation_type,
activation_type_to_func and activation_func to explicitly handle unknown enums
by either expanding the map to include the remaining ActivationType members or
performing a guarded lookup (e.g., check activation_type in
activation_type_to_func or use dict.get) and raise a clear ValueError listing
supported activation types when not found; ensure the error mentions
ActivationType and activation_type variable so callers see which value was
invalid.
🧹 Nitpick comments (3)
csrc/trtllm_batched_gemm_runner.cu (1)

226-227: Comment typo and potential design clarification needed.

The comment has a typo: "For simplicity pass set scaleAct to scaleGateC" should likely be "For simplicity, set scaleAct to scaleGateC".

More importantly, reusing scaleGateC for scaleAct may be a simplification that works for current use cases, but consider adding a brief note explaining when this assumption holds (e.g., for specific activation types).

✏️ Suggested comment fix
-  // For simplicity pass set scaleAct to scaleGateC
+  // For simplicity, set scaleAct to scaleGateC (valid when activation scaling matches gate scaling)
   gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
tests/moe/test_trtllm_gen_routed_fused_moe.py (1)

186-186: Consider parameterizing activation types for broader coverage.

The test currently only exercises ActivationType.Swiglu. Since this PR adds support for non-gated activations like Relu2, consider adding a parametrize decorator to test at least one non-gated activation type (e.g., ActivationType.Relu2) to validate the new functionality.

Also applies to: 239-239
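
A minimal sketch of such a parametrization, assuming the test accepts an activation_type argument (the test name and the rest of its signature are hypothetical):

  import pytest
  from flashinfer.fused_moe import ActivationType

  @pytest.mark.parametrize(
      "activation_type", [ActivationType.Swiglu, ActivationType.Relu2]
  )
  def test_routed_fused_moe_activation(activation_type):
      # ... build inputs, run the routed fused-MoE path with activation_type,
      # then compare against the reference for that activation ...
      assert activation_type in list(ActivationType)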

benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

379-390: CLI --activation-type argument is ignored for FP4 path.

The --activation-type argument is added to the CLI but not passed to bench_trtllm_gen_fused_moe_autotuner_fp4. If FP4 supports non-Swiglu activations, consider adding the parameter. If FP4 only supports Swiglu, consider either:

  1. Validating and warning when a non-Swiglu activation is specified with an FP4 quant mode, or
  2. Documenting this limitation in the help text.
🛠️ Option 1: Pass activation_type to FP4 (if supported)
     else:
         bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
             args.iterations,
+            args.activation_type,
         )
🛠️ Option 2: Warn if non-Swiglu specified for FP4
     else:
+        if args.activation_type != ActivationType.Swiglu:
+            print(f"[WARNING] FP4 path only supports Swiglu activation. Ignoring --activation-type={args.activation_type}")
         bench_trtllm_gen_fused_moe_autotuner_fp4(
             ...
         )

Comment on lines +1955 to 1962
activation_type = args.activation_type
activation_type_to_func = {
ActivationType.Swiglu: F.silu,
ActivationType.Geglu: F.gelu,
ActivationType.Relu2: lambda x: F.relu(x) ** 2,
}
gated_act_func = gated_act_type_to_func[gated_act_type]
activation_func = activation_type_to_func[activation_type]


⚠️ Potential issue | 🟡 Minor

Guard unsupported ActivationType values in the reference activation map.
The reference path now accepts ActivationType, but the mapping only covers Swiglu, Geglu, and Relu2. Any other enum value (e.g., Gelu, Relu, Silu, Identity, SwigluBias) will currently throw a KeyError. Consider expanding the map and/or adding a clear error for unsupported activations.

🔧 Suggested hardening
 activation_type = args.activation_type
 activation_type_to_func = {
     ActivationType.Swiglu: F.silu,
     ActivationType.Geglu: F.gelu,
     ActivationType.Relu2: lambda x: F.relu(x) ** 2,
+    ActivationType.Gelu: F.gelu,
+    ActivationType.Relu: F.relu,
+    ActivationType.Silu: F.silu,
+    ActivationType.Identity: lambda x: x,
 }
-activation_func = activation_type_to_func[activation_type]
+activation_func = activation_type_to_func.get(activation_type)
+if activation_func is None:
+    raise NotImplementedError(
+        f"ActivationType {activation_type} not supported in reference path yet."
+    )
🤖 Prompt for AI Agents
In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 1955 - 1962, The
activation_type_to_func lookup can KeyError on unsupported ActivationType
values; update the logic around activation_type, activation_type_to_func and
activation_func to explicitly handle unknown enums by either expanding the map
to include the remaining ActivationType members or performing a guarded lookup
(e.g., check activation_type in activation_type_to_func or use dict.get) and
raise a clear ValueError listing supported activation types when not found;
ensure the error mentions ActivationType and activation_type variable so callers
see which value was invalid.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

270-299: FP4 autotuner ignores the requested activation type.

The CLI accepts --activation-type, but the FP4 path hard‑codes Swiglu, so Relu2/other activations can’t be benchmarked. Consider threading the argument through (or explicitly rejecting non‑Swiglu for FP4).

🛠️ Proposed fix
-def bench_trtllm_gen_fused_moe_autotuner_fp4(
+def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
     iterations: int,
+    activation_type: ActivationType,
 ):
 ...
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
         None,
         num_tokens if tune_max_num_tokens is None else tune_max_num_tokens,
     )
-        bench_trtllm_gen_fused_moe_autotuner_fp4(
+        bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
             args.iterations,
+            args.activation_type,
         )
🤖 Fix all issues with AI agents
In `@benchmarks/routines/flashinfer_benchmark_utils.py`:
- Around line 458-470: The converter inside enum_type incorrectly lowercases all
but the first char (in function converter), causing camelCase names like
SwigluBias to be mangled and rejected; update converter to perform a
case-insensitive lookup by comparing the incoming string (value) to enum member
names in a casefold/lower-insensitive way (and accept numeric indices/values
where appropriate) so that enum_type and callers like ActivationType accept
"SwigluBias", "swiglubias", or numeric inputs; implement this by normalizing
value (e.g., casefold()) and matching against member.name.casefold() or trying
int(value) fallback before raising argparse.ArgumentTypeError listing valid
options.

In `@benchmarks/routines/moe.py`:
- Line 628: The code still reads args.gated_act (which no longer exists) when
constructing the output/result export, causing a crash; replace all uses of
args.gated_act with args.activation_type (e.g., update variables like
activation_type = args.activation_type and any places that populate the output
CSV/dict) and, if you must keep the original column name, map
args.activation_type into the existing 'gated_act' output field when building
the results (ensure references in result-building code and the export writer use
activation_type/args.activation_type instead of args.gated_act).

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 358-365: The current parser.add_argument for "--activation-type"
uses type=enum_type(ActivationType) (which returns enum members) but choices are
strings, causing validation to always fail; update the argument so type and
choices match — either make choices a list of enum members (e.g.,
choices=list(ActivationType)) when keeping type=enum_type(ActivationType), or
keep choices=[e.name for e in ActivationType] and change the converter to parse
names (e.g., type=lambda s: ActivationType[s]); adjust the parser.add_argument
call for "--activation-type" accordingly so both type and choices use the same
representation.

In `@benchmarks/routines/moe.py`:
- Around line 180-185: The argparse setup for "--activation-type" mixes enum
members (from type=enum_type(ActivationType)) with string choices, causing valid
enum inputs to be rejected; update the choices to be enum members (e.g.,
choices=list(ActivationType) or [e for e in ActivationType]) so they match the
converter returned by enum_type(ActivationType), keep default as
ActivationType.Swiglu, and adjust the help text (e.g., show [e.name for e in
ActivationType]) if you want human-readable names.
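
Putting the two prompts together, one way the enum_type helper could look is sketched below. This is only an illustration under the stated assumptions: the real converter lives in benchmarks/routines/flashinfer_benchmark_utils.py, and the stand-in enum values are placeholders.

  import argparse
  from enum import IntEnum

  def enum_type(enum_cls):
      # Return an argparse converter that accepts member names (case-insensitively)
      # or numeric values and yields enum members, so choices=list(enum_cls) matches.
      def convert(value):
          if isinstance(value, enum_cls):
              return value
          text = str(value)
          for member in enum_cls:
              if member.name.casefold() == text.casefold():
                  return member
          try:
              return enum_cls(int(text))
          except ValueError:
              valid = ", ".join(m.name for m in enum_cls)
              raise argparse.ArgumentTypeError(
                  f"invalid {enum_cls.__name__} {text!r}; choose from: {valid}"
              )
      return convert

  class ActivationType(IntEnum):
      # Stand-in for the flashinfer enum; values here are placeholders.
      Swiglu = 3
      Relu2 = 6

  parser = argparse.ArgumentParser()
  parser.add_argument(
      "--activation-type",
      type=enum_type(ActivationType),
      choices=list(ActivationType),  # enum members, matching what the converter returns
      default=ActivationType.Swiglu,
      help=f"Activation type: {[m.name for m in ActivationType]}",
  )
  args = parser.parse_args(["--activation-type", "relu2"])
  assert args.activation_type is ActivationType.Relu2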

…hmarks

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

270-301: FP4 path ignores user-supplied activation type.
--activation-type is parsed but the FP4 benchmark hardcodes Swiglu, so Relu2 (and other types) can’t be exercised in FP4 modes. Consider threading the CLI value into the FP4 path.

💡 Suggested fix (thread activation_type into FP4 path)
-def bench_trtllm_gen_fused_moe_autotuner_fp4(
+def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
-    iterations: int,
+    iterations: int,
+    activation_type: ActivationType,
 ):
@@
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
@@
-        bench_trtllm_gen_fused_moe_autotuner_fp4(
+        bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
-            args.iterations,
+            args.iterations,
+            args.activation_type,
         )
🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)

1742-1765: Consider adding activation_type to result export for consistency.

testTrtllmFp4BlockScaleMoe attempts to export activation info (though buggy at line 898), but testTrtllmFp8PerTensorScaleMoe omits it entirely. For consistent benchmark output, consider adding:

         cur_res["input_dtype"] = input_dtype
         cur_res["weight_dtype"] = weight_dtype
+        cur_res["activation_type"] = args.activation_type.name
         res.append(cur_res)

…fix)

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…tions

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

101-130: FP8 block-scale kernel is hardcoded to Swiglu but other activations are not rejected.

The trtllm_fp8_block_scale_moe kernel function does not accept an activation_type parameter and is intentionally hardcoded to Swiglu (confirmed in the kernel implementation: constexpr ActivationType activation_type = ActivationType::Swiglu; // not exposed in api for now). However, the code only rejects Relu2, allowing any other activation type to pass through and silently use Swiglu, which will produce incorrect results.

✅ Fix: Enforce Swiglu-only for block-scale
-    if is_block_scale:
-        if activation_type == ActivationType.Relu2:
+    if is_block_scale:
+        if activation_type != ActivationType.Swiglu:
             raise ValueError(
-                "Relu2 activation is not supported for FP8 block scale MoE."
+                "Only Swiglu is supported for FP8 block scale MoE."
             )

…py for trtllm_fp8_block_scale_moe

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/routines/moe.py (1)

390-417: Adjust TFLOPS calculation to account for non-gated activation weight dimensions.

The TFLOPS calculation always assumes 2 * intermediate_size for the first GEMM, which is correct for gated activations (Swiglu, Geglu) but overstates the computational work for non-gated activations like Relu2, which use only intermediate_size. The test reference implementation confirms this: Relu2 computes F.relu(a @ weight.t()) ** 2 using the full weight tensor (shape [num_experts, intermediate_size, hidden_size]), while Swiglu splits the weight into two halves.

Update calculate_moe_tflops to branch based on activation type:

  • For gated activations: keep current 2 * intermediate_size calculation
  • For non-gated activations (Relu2, Identity): use intermediate_size

Alternatively, if the function cannot be made activation-aware, the function signature should be updated to accept activation_type as a parameter.
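
A rough sketch of the activation-aware FLOP count described above (this mirrors the comment's reasoning rather than the benchmark's actual calculate_moe_tflops helper):

  def moe_gemm_flops(num_tokens, top_k, hidden_size, intermediate_size, gated: bool):
      # GEMM1: hidden_size -> intermediate_size (doubled when gated);
      # GEMM2: intermediate_size -> hidden_size.
      # 2 FLOPs per multiply-accumulate; every routed token visits top_k experts.
      gemm1_cols = (2 if gated else 1) * intermediate_size
      per_token = 2 * hidden_size * gemm1_cols + 2 * intermediate_size * hidden_size
      return num_tokens * top_k * per_token

  # Example: non-gated Relu2 performs roughly 2/3 of the GEMM work of gated Swiglu.
  swiglu_flops = moe_gemm_flops(1024, 2, 4096, 14336, gated=True)
  relu2_flops = moe_gemm_flops(1024, 2, 4096, 14336, gated=False)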

🧹 Nitpick comments (2)
benchmarks/routines/moe.py (2)

1507-1532: Note: activation_type not included in FP8 block scale results.

Per the PR scope (NVFP4 and FP8PerTensor only), testTrtllmFp8BlockScaleMoe doesn't support activation_type. However, this creates a column inconsistency when combining benchmark results from all three routines with --output-path.

Consider adding a placeholder for consistency:

         cur_res["input_dtype"] = input_dtype
         cur_res["weight_dtype"] = weight_dtype
+        cur_res["activation_type"] = "N/A"  # FP8 block scale doesn't support activation_type yet
         res.append(cur_res)

1236-1262: Same observation: activation_type not in CUTLASS results.

For output consistency across all MOE benchmark routines, consider adding a placeholder entry similar to the suggestion for FP8 block scale.

@yzh119
Collaborator

yzh119 commented Jan 29, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !267 has been updated with the latest changes, and the CI pipeline #42839544 is currently running. I'll report back once the pipeline job completes.

@yzh119
Collaborator

yzh119 commented Jan 30, 2026

Hi @amitz-nv, there seem to be some abnormal output values on the GitLab CI, e.g.

tests/moe/test_trtllm_gen_fused_moe.py:2531: in run_moe_test
    check_accuracy(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
a = tensor([[-443841.5625,  169349.1250,   -5101.0029,  ...,  -99210.4688,
          -97816.4062, -375885.8125],
        [... [ 157589.8594, -114925.8906,  -31947.7305,  ...,  -96621.3672,
          144419.6562,  -20856.9512]], device='cuda:0')
b = tensor([[  39424.,   99328., -157696.,  ..., -251904.,   93184.,   15104.],
        [ 207872.,  409600.,  191488.,  .....  992.,  -35328.],
        [ -84992.,   14528.,  -26752.,  ..., -124416.,  -88576., -149504.]],
       device='cuda:0')
atol = 0.1, rtol = 0.85, percent = 0.925
    def check_accuracy(a, b, atol, rtol, percent):
        """Unified accuracy checking function with detailed error reporting."""
        if not torch.isfinite(a).all():
            raise Exception("Non-finite values in reference output")
        if not torch.isfinite(b).all():
            raise Exception("Non-finite values in actual output")
        assert a.shape == b.shape, f"Shape mismatch: {a.shape} vs {b.shape}"
    
        close = torch.isclose(a, b, atol=atol, rtol=rtol)
        match_ratio = close.float().mean()
        if match_ratio >= percent:
            return
    
        mismatch_percent = 1.0 - match_ratio.item()
        if mismatch_percent > 1 - percent:
>           raise Exception(
                f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
                f"(threshold: {1 - percent:.4f})"
            )
E           Exception: Mismatch percentage is 0.6653 for rtol 0.85 (threshold: 0.0750)

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42839544: 7/20 passed

@yzh119 yzh119 merged commit 83cdea3 into flashinfer-ai:main Jan 30, 2026
16 of 23 checks passed
yzh119 pushed a commit that referenced this pull request Feb 1, 2026
…rt Nemotron" (#2451)

Reverts #2304

It introduces a regression in a unit test and no longer allows the trtllm deepseek routing kernel to run with fewer than 22 experts.


## Summary by CodeRabbit

## Release Notes

* **Refactor**
  * Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
  * Unified intermediate size calculations to consistently use 2x configuration.
  * Streamlined routing logic for improved clarity and maintainability.

* **Breaking Changes**
  * CLI argument `--activation-type` renamed to `--gated-act` with values "swiglu" or "geglu".
  * API parameter names updated from `activation_type` to `gated_act_type` across public interfaces.


yzh119 pushed a commit that referenced this pull request Feb 4, 2026
…ron, fixed (#2462)


## 📌 Description

- Support element-wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use the new ActivationType enum class instead of GatedActType.
- Support Nemotron in deepseek routing as in NVIDIA/TensorRT-LLM#9792.
- Remove the 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of #2304, which was merged and reverted.
- Replaced the problematic condition in deepseek routing that required `NumExperts >= MaxSupportedTopExperts` with `topK <= numExperts`.
  - DeepSeek R1 works with it (tested with VLLM).
- Removed irrelevant test cases.


## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Refactor**
  * Replaced old gated-activation API with a unified ActivationType enum (many activation kinds supported).
  * Propagated activation_type across MoE workflows and kernels.

* **New Features**
  * Added CLI option --activation-type to select activation kind for MoE benchmarks.

* **Bug Fixes**
  * Enforced activation compatibility and validation for FP8/FP4 paths.

* **Tests**
  * Updated and expanded tests to cover new activation types and compatibility scenarios.

---------

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…rt Nemotron" (flashinfer-ai#2451)

raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…ron, fixed (flashinfer-ai#2462)
