
feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron#2304

Merged
yzh119 merged 25 commits into flashinfer-ai:main from amitz-nv:fused-moe-non-gated-fp8
Jan 30, 2026

Conversation

@amitz-nv (Contributor) commented Jan 7, 2026

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Expanded activation options (Gelu, Relu, Silu, Swiglu, Geglu, SwigluBias, Relu2, Identity) and exposed ActivationType throughout the CLI and APIs.
    • DeepSeek routing supports larger top‑K and a configurable top‑experts dimension.
    • Added post‑GEMM element‑wise activation option and a CLI flag to select activation type.
  • Breaking Changes

    • ActivationType replaces the previous gated-activation enum in public APIs and tests; callers must use ActivationType values.
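
As a rough illustration of the breaking change above, a caller now selects an ActivationType member instead of the old gated-activation enum. The snippet below is a minimal sketch: the import path follows the walkthrough (flashinfer.fused_moe exposes ActivationType), while the commented runner call is hypothetical and not copied from the real API.

  from flashinfer.fused_moe import ActivationType

  # Gated activation (previous behaviour): GEMM1 produces 2 * intermediate_size
  # columns that are combined by the gate before GEMM2.
  act = ActivationType.Swiglu

  # Non-gated activation added by this PR: GEMM1 produces intermediate_size
  # columns and the element-wise activation is applied directly.
  act = ActivationType.Relu2

  # Hypothetical call site; the actual signature lives in flashinfer/fused_moe/core.py.
  # runner = MoERunner(..., activation_type=act)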


@coderabbitai bot (Contributor) commented Jan 7, 2026

📝 Walkthrough

Replaces GatedActType with ActivationType and threads activation configuration through Python APIs, benchmarks, tests, C++/CUDA launchers, routing kernels, and batched-GEMM runners; extends DeepSeek top-K/top-expert handling and adds eltwise activation and related options.

Changes

Cohort / File(s) Summary
Python Public API & Core
flashinfer/__init__.py, flashinfer/fused_moe/__init__.py, flashinfer/fused_moe/core.py
Expose ActivationType, remove GatedActType; update MoERunner and op signatures to accept activation_type; propagate enum through Python→C++ callsites and update defaults/docs.
Benchmarks & CLI
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py, benchmarks/routines/moe.py, benchmarks/routines/flashinfer_benchmark_utils.py
Add --activation-type CLI, enum_type() argparse helper, thread activation_type into FP8/FP4 autotuner paths and add runtime validation for incompatible combos.
Tests & Test Utils
tests/moe/*, tests/moe/utils.py
Replace GatedActType imports/uses with ActivationType; add is_gated_activation() and NON_GATED_ACTIVATION_SUPPORTED_QUANT_MODES; update skip_checks() signature and test parametrizations.
C++ MoE Launchers & Runners
csrc/trtllm_fused_moe_kernel_launcher.cu, csrc/trtllm_fused_moe_runner.cu, csrc/trtllm_fused_moe_routing_deepseek.cu
Replace gated-act wiring with ActivationType across launcher/runner APIs; add activation→gated/eltwise conversion helpers; increase DeepSeek top-K and introduce Max/Default top-expert constants; thread new top-experts parameter through routing.
C++ Kernel, Macros & Headers
include/.../KernelRunner.h, include/.../DevKernel.h, include/.../RoutingKernel.h, include/.../runner.h
Add EltwiseActType and eltwiseActType option; rename useShuffledMatrixA → useShuffledMatrix; add MaxNumTopExperts_ template param and constexpr; expand DEEPSEEK launch macros to accept numTopExperts; introduce ActivationType enum and helpers.
Batched GEMM Runner
csrc/trtllm_batched_gemm_runner.cu, include/.../KernelRunner.h
Add eltwise activation type consistency checks, consolidate per-config filters, log activation-type fields, and pass scaleAct via scaleGateC.
MoE Kernel Entrypoints & Launch Signatures
csrc/... and include/... trtllm_fp4/trtllm_fp8 entrypoints
Update entry signatures to accept act_type / activation_type integers; thread activation enum values into CUDA kernels and launcher config resolution.
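
For orientation, here is a minimal Python sketch of the gated/non-gated split described above. The enum members are taken from the summary, but their numeric values and the helper name is_gated_activation are assumptions, not the repository's actual code.

  from enum import IntEnum

  class ActivationType(IntEnum):
      # Members listed in the summary; numeric values here are placeholders.
      Gelu = 0
      Relu = 1
      Silu = 2
      Swiglu = 3
      Geglu = 4
      SwigluBias = 5
      Relu2 = 6
      Identity = 7

  def is_gated_activation(act: ActivationType) -> bool:
      # Gated activations split the GEMM1 output into value/gate halves, so the
      # intermediate dimension is doubled (see the workspace-size comments below).
      # SwigluBias is included per the review note on isGatedActivation.
      return act in (ActivationType.Swiglu, ActivationType.Geglu, ActivationType.SwigluBias)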

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant PyCore as Python Core
    participant Launcher as C++ Launcher/Runner
    participant CUDA as CUDA Kernel
    participant Device as Device/Experts

    CLI->>PyCore: parse args (--activation-type, quant_mode, numTopExperts)
    PyCore->>Launcher: build config (activation_type.value, useShuffledMatrix, numTopExperts)
    Launcher->>Launcher: select valid configs / getValidConfigs(act_type, numTopExperts)
    Launcher->>CUDA: launch kernel(config_index, act_type, topK)
    CUDA->>Device: route tokens / apply activation (gated or eltwise)
    Device-->>CUDA: return results
    CUDA-->>Launcher: outputs
    Launcher-->>PyCore: surface results (activation_type, config)
    PyCore-->>CLI: report results / autotuner output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • joker-eph
  • aleozlx
  • cyx-6
  • djmmoss
  • IwakuraRein
  • nvmbreughe
  • bkryu
  • jiahanc
  • yzh119

Poem

🐇 I hopped from GatedAct to Activation bright,
Swiglu, Gelu, Relu juggled through the night.
Kernels launch, top-Ks reach further fields,
Experts route where tiny logic yields.
A rabbit cheers—new activation, delight.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: docstring coverage is 38.10%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the PR title accurately summarizes the main changes: support for non-gated Relu2 activation in NVFP4 & FP8 and Nemotron support.
  • Description check — ✅ Passed: the PR description covers the main changes (element-wise activation support, ActivationType enum replacement, Nemotron support, UseShuffledMatrixA removal) and includes completed pre-commit checks.



@gemini-code-assist bot (Contributor)

Summary of Changes

Hello @amitz-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility of fused Mixture-of-Experts (MoE) operations by introducing a unified ActivationType enum, moving beyond just gated activations. The primary focus is on enabling non-gated activation functions like Relu2 for FP8 MoE, which required extensive changes to the underlying C++ kernels, Python bindings, and testing infrastructure. These changes ensure that the system can efficiently handle various activation types, with specific optimizations and checks for gated vs. non-gated behaviors, leading to broader applicability and improved performance for different model architectures.

Highlights

  • Unified Activation Type: The pull request replaces the specific GatedActType enum with a more general ActivationType enum across the codebase. This new enum supports a wider range of activation functions, including Gelu, Relu, Silu, Swiglu, Geglu, SwigluBias, Relu2, and Identity.
  • Relu2 Activation Support: Explicit support for the Relu2 activation function has been introduced in fused MoE operations, particularly for FP8. This involved adjusting intermediate tensor sizes and activation logic based on whether the activation is gated or non-gated.
  • Benchmarking and Testing Updates: Benchmark scripts and test suites have been modified to incorporate the new ActivationType parameter, enabling comprehensive testing and performance evaluation of different activation functions within the fused MoE framework.
  • TMA Descriptor Enhancements: The Tensor Memory Access (TMA) descriptor building logic has been updated to handle padding and swizzling more flexibly, especially for block-scaled formats and uniform token batching, improving robustness and efficiency.
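
To make the gated vs. non-gated distinction concrete, a per-expert reference forward might look like the sketch below. The gate/value ordering and weight shapes are illustrative only; the real reference lives in the tests, and the kernels operate on fused, quantized weights.

  import torch
  import torch.nn.functional as F

  def expert_forward(x, w1, w2, gated: bool):
      h = x @ w1.t()
      if gated:
          # Gated (e.g. Swiglu): w1 has 2 * intermediate_size rows; split and gate.
          gate, value = h.chunk(2, dim=-1)
          h = F.silu(gate) * value
      else:
          # Non-gated Relu2: w1 has intermediate_size rows; activate element-wise.
          h = F.relu(h) ** 2
      return h @ w2.t()

  x = torch.randn(4, 64)
  w2 = torch.randn(64, 32)
  y_swiglu = expert_forward(x, torch.randn(2 * 32, 64), w2, gated=True)
  y_relu2 = expert_forward(x, torch.randn(32, 64), w2, gated=False)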


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces support for non-gated activations, such as Relu2, into the Fused MoE FP8 kernels. The changes primarily involve refactoring GatedActType to a more general ActivationType enum and plumbing this through the Python wrappers, benchmarks, and C++/CUDA implementation. The refactoring is extensive and mostly well-executed.

I have identified a few issues that need attention. There are potential bugs in csrc/trtllm_fused_moe_runner.cu related to the calculation of workspace size and GEMM configuration validation, which do not correctly handle the doubled intermediate size for gated activations. Additionally, there's a minor code cleanup opportunity in benchmarks/routines/moe.py and a parameter name typo in tests/moe/test_dpsk_fused_moe_fp8.py. Addressing these issues will improve the correctness and clarity of the code.

Comment on lines 305 to 310
return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
numExperts, maxNumCtasInBatchDim, configIndex);

Severity: high

The intermediateSize passed to getWorkspaceSizeInBytes does not account for gated activations, where the intermediate dimension is doubled. This could lead to under-allocating workspace. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct workspace size calculation, similar to how it's done in the run method.

  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize, hiddenSize, {}, numTokens,
                                         numExperts, maxNumCtasInBatchDim, configIndex);

Comment on lines 314 to 315
return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {},
numTokens, numExperts, maxNumCtasInBatchDim);

Severity: high

The intermediateSize passed to getDefaultValidConfigIndex does not account for gated activations. This could lead to selecting a suboptimal or incorrect default configuration. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config selection.

  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize, hiddenSize, {},
                                            numTokens, numExperts, maxNumCtasInBatchDim);

Comment on lines 325 to 330
mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
numTokens, numExperts, maxNumCtasInBatchDim);

Severity: high

The intermediateSize passed to isValidConfigIndex does not account for gated activations. This could lead to incorrect validation of GEMM configurations. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config validation.

      mRunner.isValidConfigIndex(configIndex, numTokens, (isGatedActivation(mActType) ? 2 : 1) * intermediateSize, hiddenSize, {},
                                 numTokens, numExperts, maxNumCtasInBatchDim);

gemmData.mInputBuffers.mPtrScaleC = scaleC;
gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
// TODO amitz-nv: Do we want to pass scaleAct instead of using scaleGateC?
gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
@amitz-nv (Contributor, Author) commented Jan 7, 2026

Decide whether it's OK or fix in the future?

@amitz-nv amitz-nv changed the title Fused MoE FP8 non gated Relu2 Fused MoE non gated FP8 Relu2 Jan 7, 2026
@amitz-nv amitz-nv changed the title Fused MoE non gated FP8 Relu2 Fused MoE non gated Relu2 FP8 Jan 7, 2026
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 FP8 Fused MoE non gated Relu2 NVFP4 & FP8 Jan 12, 2026
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch 2 times, most recently from 9655f95 to 4c9fb49 Compare January 26, 2026 15:07
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 NVFP4 & FP8 Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron Jan 26, 2026
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch from 4c9fb49 to 9a1ffa0 Compare January 26, 2026 16:29
@amitz-nv amitz-nv marked this pull request as ready for review January 26, 2026 16:45
@amitz-nv amitz-nv changed the title Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron feat: Support Fused MoE non gated Relu2 NVFP4 & FP8 and support Nemotron Jan 26, 2026
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

517-538: Guard topK > 8 for non‑Nemotron expert counts.
LAUNCH_ROUTING_DEEPSEEK still instantiates kernels with MaxNumTopExperts=DefaultMaxNumTopExperts for numExperts <= NumKimiK2Experts, but runImpl now allows topK up to 22. That can drive params.mTopK beyond KernelParams::MaxNumTopExperts and corrupt stack/shared buffers.

Consider either selecting MaxSupportedTopExperts whenever data.mTopK > DefaultMaxNumTopExperts, or explicitly rejecting such inputs for non‑Nemotron expert counts.

🔧 Suggested guard (minimal change)
@@
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                    data.mTopK);
+  if (data.mNumExperts < NumNemotronExperts) {
+    FLASHINFER_CHECK(
+        data.mTopK <= DefaultMaxNumTopExperts,
+        "For numExperts < %d, routing kernel supports topK <= %d, got %d",
+        NumNemotronExperts, DefaultMaxNumTopExperts, data.mTopK);
+  }

Also applies to: 560-573

benchmarks/routines/moe.py (1)

872-898: Fix stale args.gated_act access after the activation-type switch.
--gated_act was removed, so this path will raise AttributeError when output_path is used.

✅ Suggested fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = str(args.activation_type)
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/BatchedGemmOptions.h (1)

496-500: Guard optional mRouteSfsImpl in dumpOptions.
options.mRouteSfsImpl.value() will throw if dumpOptions() is called before mRouteSfsImpl is set (e.g., default options without checkAndUpdateBatchedGemmOptions). Guard the optional or emit nullopt.

🔧 Suggested fix
-  ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
-     << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+  if (options.mRouteSfsImpl.has_value()) {
+    ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
+       << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+  } else {
+    ss << "mRouteSfsImpl={nullopt}," << std::endl;
+  }
flashinfer/fused_moe/core.py (1)

210-236: Include gated/non‑gated flag in permute cache key.
permute0 now differs by is_gated_act_gemm, but the cache key ignores it, so a gated run can poison the cache for a non‑gated run (or vice‑versa), yielding incorrect permutations.

🔧 Suggested fix
-    cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+    cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm, num_elts_per_sf)
csrc/trtllm_fused_moe_kernel_launcher.cu (1)

533-553: Scale FP8‑per‑tensor GEMM1 buffers by activation type.
Now that activation_type can be non‑gated, prepare_moe() still allocates GEMM1 output/scales for 2 * intermediate_size. When the runner uses M=intermediate_size, the row stride is wrong and outputs can overlap.

🔧 Suggested fix
   void prepare_moe(int64_t& moe_tactic) override {
     FusedMoeLauncher::prepare_moe_common(moe_tactic);

     int32_t max_num_padded_tokens_gemm1 = workspace.total_max_padded_tokens + args->num_experts;
     int32_t max_num_padded_tokens_gemm2 = workspace.total_max_padded_tokens;
+
+    int32_t const intermediate_size_factor =
+        tensorrt_llm::kernels::trtllmgen_moe::MoE::isGatedActivation(activation_type) ? 2 : 1;
+    int32_t const gemm1_out_dim = intermediate_size_factor * args->intermediate_size;

-    gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, 2 * args->intermediate_size},
+    gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, gemm1_out_dim},
                                 dl_uint8, hidden_states.device());
-    gemm1_output_scale =
-        alloc_tensor({2 * args->intermediate_size / 128, max_num_padded_tokens_gemm1}, dl_float32,
-                     hidden_states.device());
+    gemm1_output_scale =
+        alloc_tensor({gemm1_out_dim / 128, max_num_padded_tokens_gemm1}, dl_float32,
+                     hidden_states.device());
🤖 Fix all issues with AI agents
In `@flashinfer/fused_moe/core.py`:
- Around line 1533-1535: The fake-op function signatures that accept the unused
parameter activation_type should rename that parameter to _activation_type (or
prefix it with an underscore) to silence Ruff ARG001 while preserving signature
compatibility; update the parameter name in each fake-op signature (the
occurrences where activation_type is accepted but unused) and ensure any
internal references (if any) are adjusted accordingly so behavior is unchanged.

In `@include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmOptions.h`:
- Line 553: Update the debug/dump label to match the renamed field: replace the
literal "mUseShuffledMatrixA=" with "mUseShuffledMatrix=" where the code writes
out options.mUseShuffledMatrix (in the dump/ostream code that uses ss <<
"mUseShuffledMatrixA=" << options.mUseShuffledMatrix << ...), so the printed
label matches the actual member name.

In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation helper currently only checks
ActivationType::Swiglu and ActivationType::Geglu; update it to also treat
ActivationType::SwigluBias as gated so code paths that expect gated activations
(e.g., hidden-size handling) follow the correct branch—modify the
isGatedActivation(ActivationType activationType) function to return true for
ActivationType::SwigluBias in addition to Swiglu and Geglu.
♻️ Duplicate comments (2)
csrc/trtllm_batched_gemm_runner.cu (1)

226-227: Follow up on the scaleAct vs scaleGateC TODO.

This open question can affect non-gated activation scaling once those paths are exercised.

csrc/trtllm_fused_moe_runner.cu (1)

304-330: Apply gated size factor in workspace/config helpers.
run() scales intermediateSize for gated activations, but the workspace/config helpers still use the unscaled value, risking under‑allocation or invalid config selection for Swiglu/Geglu.

🔧 Suggested fix
 size_t Runner::getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize,
                                        int32_t numExperts, int32_t numTokens,
                                        int32_t configIndex) const {
   auto maxNumCtasInBatchDim =
       Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize,
+                                         hiddenSize, {}, numTokens,
                                          numExperts, maxNumCtasInBatchDim, configIndex);
 }
@@
 int32_t Runner::getDefaultValidConfigIndex(int32_t topK, int32_t hiddenSize,
                                            int32_t intermediateSize, int32_t numExperts,
                                            int32_t numTokens) const {
   auto maxNumCtasInBatchDim =
       Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize,
+                                            hiddenSize, {}, numTokens,
                                             numExperts, maxNumCtasInBatchDim);
 }
@@
   auto const isValid =
-      mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
+      mRunner.isValidConfigIndex(configIndex, numTokens,
+                                 (isGatedActivation(mActType) ? 2 : 1) * intermediateSize,
+                                 hiddenSize, {},
                                  numTokens, numExperts, maxNumCtasInBatchDim);
🧹 Nitpick comments (4)
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmGatedActOptions.h (1)

79-80: Consider naming ActType::None in getActTypeName.
Now that None is a concrete enum member, getActTypeName will return "Unknown type" for it, which can muddy diagnostics. A small switch update keeps logs clear.

♻️ Proposed update
 inline std::string getActTypeName(ActType type) {
   switch (type) {
     case ActType::SwiGlu:
       return "SwiGlu";
     case ActType::GeGlu:
       return "GeGlu";
+    case ActType::None:
+      return "None";
     default:
       return "Unknown type";
   }
 }
include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)

179-188: Guard MaxNumTopExperts against invalid instantiations.

Now that MaxNumTopExperts_ is part of the public template surface, a compile-time bound check helps prevent accidental topK > MaxNumExperts configurations from compiling and later causing bounds issues.

♻️ Proposed compile-time guard
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_, isPow2_, UsePdl_> {
   using InputT = InputT_;
   using BiasT = BiasT_;
   using OutputT = OutputT_;

   static constexpr bool UseGroups = UseGroups_;
   static constexpr int MaxNumTopExperts = MaxNumTopExperts_;
+  static_assert(MaxNumTopExperts_ > 0 && MaxNumTopExperts_ <= MaxNumExperts_,
+                "MaxNumTopExperts must be within [1, MaxNumExperts]");
include/flashinfer/trtllm/batched_gemm/KernelRunner.h (1)

50-71: Keep EltwiseActType synced with the canonical GEMM enum.

This enum is later compared to config enums via integer casts, so any drift would silently filter out valid configs. Consider aliasing the canonical enum from Enums.h or add compile-time guards to ensure numeric parity.

🔧 Suggested safeguard (adjust namespace if needed)
 enum class EltwiseActType {
   None = 0,
   Gelu,
   Relu2,
 };
+
+static_assert(
+    static_cast<int>(EltwiseActType::None) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::None),
+    "EltwiseActType::None must stay in sync with batchedGemm::gemm::EltwiseActType::None");
+static_assert(
+    static_cast<int>(EltwiseActType::Gelu) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Gelu),
+    "EltwiseActType::Gelu must stay in sync with batchedGemm::gemm::EltwiseActType::Gelu");
+static_assert(
+    static_cast<int>(EltwiseActType::Relu2) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Relu2),
+    "EltwiseActType::Relu2 must stay in sync with batchedGemm::gemm::EltwiseActType::Relu2");
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

357-364: Argparse enum parsing may be unintuitive.

type=ActivationType tends to accept numeric values only (e.g., --activation-type 3), which is easy to mis-use. Consider accepting enum names and mapping them explicitly for CLI ergonomics.

🛠️ Suggested CLI parsing tweak
+def parse_activation_type(value: str) -> ActivationType:
+    try:
+        return ActivationType[value]
+    except KeyError:
+        return ActivationType(int(value))
+
 ...
 parser.add_argument(
     "--activation-type",
-    type=ActivationType,
+    type=parse_activation_type,
     choices=list(ActivationType),
     required=False,
     default=ActivationType.Swiglu,
     help=f"Type of gated activation function: {list(ActivationType)}",
 )

Comment on lines 1533 to 1535
enable_pdl: Optional[bool] = None,
activation_type: int = ActivationType.Identity.value,
):

⚠️ Potential issue | 🟡 Minor

Silence unused activation_type in fake ops (Ruff ARG001).
The fake ops keep activation_type for signature compatibility but never use it.

🔧 Suggested fix
-    activation_type: int = ActivationType.Identity.value,
+    _activation_type: int = ActivationType.Identity.value,
-        activation_type: int,
+        _activation_type: int,

Also applies to: 1911-1913

🧰 Tools
🪛 Ruff (0.14.13)

1533-1533: Unused function argument: enable_pdl

(ARG001)


1534-1534: Unused function argument: activation_type

(ARG001)

🤖 Prompt for AI Agents
In `@flashinfer/fused_moe/core.py` around lines 1533 - 1535, The fake-op function
signatures that accept the unused parameter activation_type should rename that
parameter to _activation_type (or prefix it with an underscore) to silence Ruff
ARG001 while preserving signature compatibility; update the parameter name in
each fake-op signature (the occurrences where activation_type is accepted but
unused) and ensure any internal references (if any) are adjusted accordingly so
behavior is unchanged.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)

518-538: Prevent topK > DefaultMaxNumTopExperts on non‑Nemotron paths (OOB risk).

data.mTopK is validated against MaxSupportedTopExperts globally, but non‑Nemotron branches still instantiate kernels with DefaultMaxNumTopExperts. If data.mTopK > DefaultMaxNumTopExperts on Deepseek/Kimi (or smaller expert counts), the kernel will index beyond the fixed‑size topScores/topExperts arrays.

A minimal fix is to tighten validation for those branches so mTopK never exceeds the instantiated compile‑time bound.

🐛 Proposed fix (tighten validation for non‑Nemotron)
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d", MaxSupportedTopExperts,
                    data.mTopK);
+  if (data.mNumExperts <= NumKimiK2Experts) {
+    FLASHINFER_CHECK(data.mTopK <= DefaultMaxNumTopExperts,
+                     "Routing kernel expects topK experts <= %d for %d experts, got %d",
+                     DefaultMaxNumTopExperts, data.mNumExperts, data.mTopK);
+  }

Also applies to: 560-562

🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 568-573: The two validation checks using FLASHINFER_CHECK
incorrectly enforce a minimum expert count by comparing data.mNumExperts >=
MaxSupportedTopExperts; change the logic to validate that the requested top-K
does not exceed available experts (e.g., check topK <= data.mNumExperts or
data.mTopK <= data.mNumExperts) and keep the other check that data.mNumExperts
<= MaxSupportedExpertCount; update the error message to reflect "topK must be <=
`#experts`" and reference FLASHINFER_CHECK, data.mNumExperts,
MaxSupportedTopExperts, MaxSupportedExpertCount, and the top-K variable (topK or
data.mTopK) so the routing kernel accepts small valid expert counts.

Comment on lines +568 to +573
 FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
                  "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
                  data.mNumExperts);
-FLASHINFER_CHECK(data.mNumExperts <= NumKimiK2Experts,
+FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                  "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
-                 NumKimiK2Experts);
+                 MaxSupportedExpertCount);

⚠️ Potential issue | 🟠 Major

Validation rejects valid small expert counts.

data.mNumExperts >= MaxSupportedTopExperts enforces a minimum expert count of 22, which can block supported configurations (e.g., <= topk::MaxNumExpertsUnit). This check should be about topK <= numExperts, not a hard minimum expert count.

✅ Proposed fix
-  FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                   "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                   data.mNumExperts);
+  FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
+                   "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
+                   data.mNumExperts);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                 "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                 data.mNumExperts);
-FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
-                 "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
-                 MaxSupportedExpertCount);
+FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
+                 "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
+                 data.mNumExperts);
+FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
+                 "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
+                 MaxSupportedExpertCount);
🤖 Prompt for AI Agents
In `@csrc/trtllm_fused_moe_routing_deepseek.cu` around lines 568 - 573, The two
validation checks using FLASHINFER_CHECK incorrectly enforce a minimum expert
count by comparing data.mNumExperts >= MaxSupportedTopExperts; change the logic
to validate that the requested top-K does not exceed available experts (e.g.,
check topK <= data.mNumExperts or data.mTopK <= data.mNumExperts) and keep the
other check that data.mNumExperts <= MaxSupportedExpertCount; update the error
message to reflect "topK must be <= `#experts`" and reference FLASHINFER_CHECK,
data.mNumExperts, MaxSupportedTopExperts, MaxSupportedExpertCount, and the top-K
variable (topK or data.mTopK) so the routing kernel accepts small valid expert
counts.

@yzh119 (Collaborator) commented Jan 27, 2026

/bot run

@yzh119 (Collaborator) commented Jan 27, 2026

@flashinfer-bot run

…spaceSizeInBytes, getDefaultValidConfigIndex, isValidConfigIndex

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv amitz-nv force-pushed the fused-moe-non-gated-fp8 branch from 20b4ba7 to e63e17d Compare January 28, 2026 15:06
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@amitz-nv (Contributor, Author) commented Jan 28, 2026

@aleozlx

> do we intend to touch the GemmOptions.h in gemm?

I don't think so. I just rebased, so main now includes the required changes in trtllmGen_bmm_export, which removes those changes from this PR.

> seeing errors like
>
> error: namespace "batchedGemm::trtllm::gen" has no member "Sparsity"
>   , trtllm::gen::Sparsity(0)
>
> in the ci run

I believe the rebase should solve this as well.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
benchmarks/routines/moe.py (1)

872-898: Output export still references args.gated_act (now removed).

This will raise AttributeError when --output-path is used. Please switch to activation_type.

🛠️ Proposed fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = (
+            args.activation_type.name
+            if isinstance(args.activation_type, ActivationType)
+            else str(args.activation_type)
+        )
flashinfer/fused_moe/core.py (1)

215-240: Cache key missing is_gated_act_gemm, risking incorrect permutation indices.

The cache key ("w3_w1", dst_w3_w1_weight.shape) does not include is_gated_act_gemm. If the same weight tensor is used with both gated and non-gated activations, the cached permute indices from the first call will be incorrectly reused for the second.

🐛 Proposed fix
 def _maybe_get_cached_w3_w1_permute_indices(
     _cache_permute_indices,
     dst_w3_w1_weight: torch.Tensor,
     epilogue_tile_m: int,
     num_elts_per_sf: Union[None, int] = None,
     is_gated_act_gemm: bool = True,
 ) -> torch.Tensor:
     # Create a unique cache key (weight_type, weight_shape)
-    cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+    cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm)
     if cache_key not in _cache_permute_indices:
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 357-364: The argparse configuration uses type=ActivationType which
passes raw strings to the ActivationType constructor and fails for IntEnum;
update the parser call (the parser.add_argument for "--activation-type") to use
a small custom parsing function that maps input strings to ActivationType via
bracket notation (e.g., ActivationType[input_str]) or accepts already-matching
enum members, validate choices using list(ActivationType), and set default to
ActivationType.Swiglu; locate the parser.add_argument where "--activation-type"
is declared and replace type=ActivationType with this custom parser function
(referencing ActivationType and the parser.add_argument invocation).

In `@benchmarks/routines/moe.py`:
- Around line 179-185: Argparse is currently using type=ActivationType which
only accepts integer enum values, causing names like "Swiglu" to fail; change
the add_argument call for "--activation-type" to accept enum names by replacing
type=ActivationType with a converter that maps strings to the enum (e.g., use a
lambda or small function that does ActivationType[item] if input is str or
ActivationType(int(item)) if numeric) and keep choices=list(ActivationType) and
default=ActivationType.Swiglu so both name and numeric inputs work; reference
the ActivationType enum and the "--activation-type" add_argument in the parser
to locate where to apply this change.

In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation function incorrectly omits
SwigluBias from its gated activation check; update the function
(isGatedActivation in runner.h) to treat ActivationType::SwigluBias as a gated
activation alongside ActivationType::Swiglu and ActivationType::Geglu so its
behavior matches the implementation in moe_gemm_kernels.h.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
benchmarks/routines/moe.py (1)

897-897: Bug: args.gated_act no longer exists.

This line references args.gated_act which was renamed to args.activation_type. This will cause an AttributeError at runtime when args.output_path is set.

🐛 Proposed fix
-        cur_res["gated_act"] = args.gated_act
+        cur_res["activation_type"] = str(activation_type)
flashinfer/fused_moe/core.py (1)

1423-1509: Normalize activation_type before using .value.
trtllm_fp8_per_tensor_scale_moe_op can be called with activation_type as an int (the public API default is ActivationType.Swiglu.value). In that case, .value raises AttributeError. Coerce to ActivationType (or use int(...)) before the .value access.

🐛 Suggested fix
 def trtllm_fp8_per_tensor_scale_moe_op(
@@
-    activation_type: ActivationType = ActivationType.Swiglu,
+    activation_type: ActivationType = ActivationType.Swiglu,
 ) -> torch.Tensor:
+    activation_type = ActivationType(activation_type)
@@
-            activation_type=activation_type.value,
+            activation_type=int(activation_type),
@@
-            activation_type.value,
+            int(activation_type),
csrc/trtllm_fused_moe_kernel_launcher.cu (2)

379-402: Validate activation_type before storing.

activation_type is an external input; guard against invalid enum values to prevent undefined kernel paths.

🛠️ Proposed fix
   TVM_FFI_ICHECK(0 <= weight_layout && weight_layout <= 2)
       << "the value of weight_layout is not recognized";
+  auto act_type = static_cast<int64_t>(activation_type);
+  TVM_FFI_ICHECK(act_type >= 0 &&
+                 act_type < static_cast<int64_t>(ActivationType::InvalidType))
+      << "activation_type is not recognized";
   this->weight_layout = static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout);
   this->activation_type = activation_type;

487-503: BF16 config generation should ignore non‑Swiglu act_type.

The BF16 runtime path hardcodes Swiglu; generating configs for other activations can yield mismatched configs.

🛠️ Proposed fix
   std::set<int32_t> selected_tile_nums =
       computeSelectedTileN(supported_tile_nums, num_tokens, top_k, num_local_experts);
 
+  TVM_FFI_ICHECK(static_cast<ActivationType>(act_type) == ActivationType::Swiglu)
+      << "BF16 MoE supports only Swiglu activation.";
+
   for (int32_t tile_N : selected_tile_nums) {
     auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
         btg::Dtype::Bfloat16,  // dtype_act
         btg::Dtype::Bfloat16,  // dtype_weights
         false,                 // useDeepSeekFp8
-        tile_N, static_cast<ActivationType>(act_type), use_shuffled_weight,
+        tile_N, ActivationType::Swiglu, use_shuffled_weight,
         static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
🤖 Fix all issues with AI agents
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 1955-1962: The activation_type_to_func lookup can KeyError on
unsupported ActivationType values; update the logic around activation_type,
activation_type_to_func and activation_func to explicitly handle unknown enums
by either expanding the map to include the remaining ActivationType members or
performing a guarded lookup (e.g., check activation_type in
activation_type_to_func or use dict.get) and raise a clear ValueError listing
supported activation types when not found; ensure the error mentions
ActivationType and activation_type variable so callers see which value was
invalid.
🧹 Nitpick comments (3)
csrc/trtllm_batched_gemm_runner.cu (1)

226-227: Comment typo and potential design clarification needed.

The comment has a typo: "For simplicity pass set scaleAct to scaleGateC" should likely be "For simplicity, set scaleAct to scaleGateC".

More importantly, reusing scaleGateC for scaleAct may be a simplification that works for current use cases, but consider adding a brief note explaining when this assumption holds (e.g., for specific activation types).

✏️ Suggested comment fix
-  // For simplicity pass set scaleAct to scaleGateC
+  // For simplicity, set scaleAct to scaleGateC (valid when activation scaling matches gate scaling)
   gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
tests/moe/test_trtllm_gen_routed_fused_moe.py (1)

186-186: Consider parameterizing activation types for broader coverage.

The test currently only exercises ActivationType.Swiglu. Since this PR adds support for non-gated activations like Relu2, consider adding a parametrize decorator to test at least one non-gated activation type (e.g., ActivationType.Relu2) to validate the new functionality.

Also applies to: 239-239
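
A minimal sketch of such a parametrization, assuming the test accepts an activation_type argument (the test name and the rest of its signature are hypothetical):

  import pytest
  from flashinfer.fused_moe import ActivationType

  @pytest.mark.parametrize(
      "activation_type", [ActivationType.Swiglu, ActivationType.Relu2]
  )
  def test_routed_fused_moe_activation(activation_type):
      # ... build inputs, run the routed fused-MoE path with activation_type,
      # then compare against the reference for that activation ...
      assert activation_type in list(ActivationType)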

benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

379-390: CLI --activation-type argument is ignored for FP4 path.

The --activation-type argument is added to the CLI but not passed to bench_trtllm_gen_fused_moe_autotuner_fp4. If FP4 supports non-Swiglu activations, consider adding the parameter. If FP4 only supports Swiglu, consider either:

  1. Validating and warning when a non-Swiglu activation is specified with an FP4 quant mode, or
  2. Documenting this limitation in the help text.
🛠️ Option 1: Pass activation_type to FP4 (if supported)
     else:
         bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
             args.iterations,
+            args.activation_type,
         )
🛠️ Option 2: Warn if non-Swiglu specified for FP4
     else:
+        if args.activation_type != ActivationType.Swiglu:
+            print(f"[WARNING] FP4 path only supports Swiglu activation. Ignoring --activation-type={args.activation_type}")
         bench_trtllm_gen_fused_moe_autotuner_fp4(
             ...
         )

Comment on lines +1955 to 1962
activation_type = args.activation_type
activation_type_to_func = {
ActivationType.Swiglu: F.silu,
ActivationType.Geglu: F.gelu,
ActivationType.Relu2: lambda x: F.relu(x) ** 2,
}
gated_act_func = gated_act_type_to_func[gated_act_type]
activation_func = activation_type_to_func[activation_type]


⚠️ Potential issue | 🟡 Minor

Guard unsupported ActivationType values in the reference activation map.
The reference path now accepts ActivationType, but the mapping only covers Swiglu, Geglu, and Relu2. Any other enum value (e.g., Gelu, Relu, Silu, Identity, SwigluBias) will currently throw a KeyError. Consider expanding the map and/or adding a clear error for unsupported activations.

🔧 Suggested hardening
 activation_type = args.activation_type
 activation_type_to_func = {
     ActivationType.Swiglu: F.silu,
     ActivationType.Geglu: F.gelu,
     ActivationType.Relu2: lambda x: F.relu(x) ** 2,
+    ActivationType.Gelu: F.gelu,
+    ActivationType.Relu: F.relu,
+    ActivationType.Silu: F.silu,
+    ActivationType.Identity: lambda x: x,
 }
-activation_func = activation_type_to_func[activation_type]
+activation_func = activation_type_to_func.get(activation_type)
+if activation_func is None:
+    raise NotImplementedError(
+        f"ActivationType {activation_type} not supported in reference path yet."
+    )
🤖 Prompt for AI Agents
In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 1955 - 1962, The
activation_type_to_func lookup can KeyError on unsupported ActivationType
values; update the logic around activation_type, activation_type_to_func and
activation_func to explicitly handle unknown enums by either expanding the map
to include the remaining ActivationType members or performing a guarded lookup
(e.g., check activation_type in activation_type_to_func or use dict.get) and
raise a clear ValueError listing supported activation types when not found;
ensure the error mentions ActivationType and activation_type variable so callers
see which value was invalid.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

270-299: FP4 autotuner ignores the requested activation type.

The CLI accepts --activation-type, but the FP4 path hard‑codes Swiglu, so Relu2/other activations can’t be benchmarked. Consider threading the argument through (or explicitly rejecting non‑Swiglu for FP4).

🛠️ Proposed fix
-def bench_trtllm_gen_fused_moe_autotuner_fp4(
+def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
     iterations: int,
+    activation_type: ActivationType,
 ):
 ...
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
         None,
         num_tokens if tune_max_num_tokens is None else tune_max_num_tokens,
     )
-        bench_trtllm_gen_fused_moe_autotuner_fp4(
+        bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
             args.iterations,
+            args.activation_type,
         )
🤖 Fix all issues with AI agents
In `@benchmarks/routines/flashinfer_benchmark_utils.py`:
- Around line 458-470: The converter inside enum_type incorrectly lowercases all
but the first char (in function converter), causing camelCase names like
SwigluBias to be mangled and rejected; update converter to perform a
case-insensitive lookup by comparing the incoming string (value) to enum member
names in a casefold/lower-insensitive way (and accept numeric indices/values
where appropriate) so that enum_type and callers like ActivationType accept
"SwigluBias", "swiglubias", or numeric inputs; implement this by normalizing
value (e.g., casefold()) and matching against member.name.casefold() or trying
int(value) fallback before raising argparse.ArgumentTypeError listing valid
options.

In `@benchmarks/routines/moe.py`:
- Line 628: The code still reads args.gated_act (which no longer exists) when
constructing the output/result export, causing a crash; replace all uses of
args.gated_act with args.activation_type (e.g., update variables like
activation_type = args.activation_type and any places that populate the output
CSV/dict) and, if you must keep the original column name, map
args.activation_type into the existing 'gated_act' output field when building
the results (ensure references in result-building code and the export writer use
activation_type/args.activation_type instead of args.gated_act).

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 358-365: The current parser.add_argument for "--activation-type"
uses type=enum_type(ActivationType) (which returns enum members) but choices are
strings, causing validation to always fail; update the argument so type and
choices match — either make choices a list of enum members (e.g.,
choices=list(ActivationType)) when keeping type=enum_type(ActivationType), or
keep choices=[e.name for e in ActivationType] and change the converter to parse
names (e.g., type=lambda s: ActivationType[s]); adjust the parser.add_argument
call for "--activation-type" accordingly so both type and choices use the same
representation.

In `@benchmarks/routines/moe.py`:
- Around line 180-185: The argparse setup for "--activation-type" mixes enum
members (from type=enum_type(ActivationType)) with string choices, causing valid
enum inputs to be rejected; update the choices to be enum members (e.g.,
choices=list(ActivationType) or [e for e in ActivationType]) so they match the
converter returned by enum_type(ActivationType), keep default as
ActivationType.Swiglu, and adjust the help text (e.g., show [e.name for e in
ActivationType]) if you want human-readable names.
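
Putting the two prompts together, one way the enum_type helper could look is sketched below. This is only an illustration under the stated assumptions: the real converter lives in benchmarks/routines/flashinfer_benchmark_utils.py, and the stand-in enum values are placeholders.

  import argparse
  from enum import IntEnum

  def enum_type(enum_cls):
      # Return an argparse converter that accepts member names (case-insensitively)
      # or numeric values and yields enum members, so choices=list(enum_cls) matches.
      def convert(value):
          if isinstance(value, enum_cls):
              return value
          text = str(value)
          for member in enum_cls:
              if member.name.casefold() == text.casefold():
                  return member
          try:
              return enum_cls(int(text))
          except ValueError:
              valid = ", ".join(m.name for m in enum_cls)
              raise argparse.ArgumentTypeError(
                  f"invalid {enum_cls.__name__} {text!r}; choose from: {valid}"
              )
      return convert

  class ActivationType(IntEnum):
      # Stand-in for the flashinfer enum; values here are placeholders.
      Swiglu = 3
      Relu2 = 6

  parser = argparse.ArgumentParser()
  parser.add_argument(
      "--activation-type",
      type=enum_type(ActivationType),
      choices=list(ActivationType),  # enum members, matching what the converter returns
      default=ActivationType.Swiglu,
      help=f"Activation type: {[m.name for m in ActivationType]}",
  )
  args = parser.parse_args(["--activation-type", "relu2"])
  assert args.activation_type is ActivationType.Relu2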

…hmarks

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

270-301: FP4 path ignores user-supplied activation type.
--activation-type is parsed but the FP4 benchmark hardcodes Swiglu, so Relu2 (and other types) can’t be exercised in FP4 modes. Consider threading the CLI value into the FP4 path.

💡 Suggested fix (thread activation_type into FP4 path)
-def bench_trtllm_gen_fused_moe_autotuner_fp4(
+def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
-    iterations: int,
+    iterations: int,
+    activation_type: ActivationType,
 ):
@@
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
@@
-        bench_trtllm_gen_fused_moe_autotuner_fp4(
+        bench_trtllm_gen_fused_moe_autotuner_fp4(
             args.tune_max_num_tokens,
             args.quant_mode,
             args.num_tokens,
             args.num_experts,
             args.hidden_size,
             args.intermediate_size,
             args.top_k,
             args.warmups,
-            args.iterations,
+            args.iterations,
+            args.activation_type,
         )
🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)

1742-1765: Consider adding activation_type to result export for consistency.

testTrtllmFp4BlockScaleMoe attempts to export activation info (though buggy at line 898), but testTrtllmFp8PerTensorScaleMoe omits it entirely. For consistent benchmark output, consider adding:

         cur_res["input_dtype"] = input_dtype
         cur_res["weight_dtype"] = weight_dtype
+        cur_res["activation_type"] = args.activation_type.name
         res.append(cur_res)

…fix)

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…tions

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)

101-130: FP8 block-scale kernel is hardcoded to Swiglu but other activations are not rejected.

The trtllm_fp8_block_scale_moe kernel function does not accept an activation_type parameter and is intentionally hardcoded to Swiglu (confirmed in the kernel implementation: constexpr ActivationType activation_type = ActivationType::Swiglu; // not exposed in api for now). However, the code only rejects Relu2, allowing any other activation type to pass through and silently use Swiglu, which will produce incorrect results.

✅ Fix: Enforce Swiglu-only for block-scale
-    if is_block_scale:
-        if activation_type == ActivationType.Relu2:
+    if is_block_scale:
+        if activation_type != ActivationType.Swiglu:
             raise ValueError(
-                "Relu2 activation is not supported for FP8 block scale MoE."
+                "Only Swiglu is supported for FP8 block scale MoE."
             )

…py for trtllm_fp8_block_scale_moe

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
benchmarks/routines/moe.py (1)

390-417: Adjust TFLOPS calculation to account for non-gated activation weight dimensions.

The TFLOPS calculation always assumes 2 * intermediate_size for the first GEMM, which is correct for gated activations (Swiglu, Geglu) but overstates the computational work for non-gated activations like Relu2, which use only intermediate_size. The test reference implementation confirms this: Relu2 computes F.relu(a @ weight.t()) ** 2 using the full weight tensor (shape [num_experts, intermediate_size, hidden_size]), while Swiglu splits the weight into two halves.

Update calculate_moe_tflops to branch based on activation type:

  • For gated activations: keep current 2 * intermediate_size calculation
  • For non-gated activations (Relu2, Identity): use intermediate_size

Alternatively, if the function cannot be made activation-aware, the function signature should be updated to accept activation_type as a parameter.
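
A rough sketch of the activation-aware FLOP count described above (this mirrors the comment's reasoning rather than the benchmark's actual calculate_moe_tflops helper):

  def moe_gemm_flops(num_tokens, top_k, hidden_size, intermediate_size, gated: bool):
      # GEMM1: hidden_size -> intermediate_size (doubled when gated);
      # GEMM2: intermediate_size -> hidden_size.
      # 2 FLOPs per multiply-accumulate; every routed token visits top_k experts.
      gemm1_cols = (2 if gated else 1) * intermediate_size
      per_token = 2 * hidden_size * gemm1_cols + 2 * intermediate_size * hidden_size
      return num_tokens * top_k * per_token

  # Example: non-gated Relu2 performs roughly 2/3 of the GEMM work of gated Swiglu.
  swiglu_flops = moe_gemm_flops(1024, 2, 4096, 14336, gated=True)
  relu2_flops = moe_gemm_flops(1024, 2, 4096, 14336, gated=False)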

🧹 Nitpick comments (2)
benchmarks/routines/moe.py (2)

1507-1532: Note: activation_type not included in FP8 block scale results.

Per the PR scope (NVFP4 and FP8PerTensor only), testTrtllmFp8BlockScaleMoe doesn't support activation_type. However, this creates a column inconsistency when combining benchmark results from all three routines with --output-path.

Consider adding a placeholder for consistency:

         cur_res["input_dtype"] = input_dtype
         cur_res["weight_dtype"] = weight_dtype
+        cur_res["activation_type"] = "N/A"  # FP8 block scale doesn't support activation_type yet
         res.append(cur_res)

1236-1262: Same observation: activation_type not in CUTLASS results.

For output consistency across all MOE benchmark routines, consider adding a placeholder entry similar to the suggestion for FP8 block scale.

@yzh119
Collaborator

yzh119 commented Jan 29, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !267 has been updated with the latest changes, and the CI pipeline #42839544 is currently running. I'll report back once the pipeline job completes.

@yzh119
Collaborator

yzh119 commented Jan 30, 2026

Hi @amitz-nv, there seem to be some abnormal output values on the GitLab CI, e.g.

tests/moe/test_trtllm_gen_fused_moe.py:2531: in run_moe_test
    check_accuracy(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
a = tensor([[-443841.5625,  169349.1250,   -5101.0029,  ...,  -99210.4688,
          -97816.4062, -375885.8125],
        [... [ 157589.8594, -114925.8906,  -31947.7305,  ...,  -96621.3672,
          144419.6562,  -20856.9512]], device='cuda:0')
b = tensor([[  39424.,   99328., -157696.,  ..., -251904.,   93184.,   15104.],
        [ 207872.,  409600.,  191488.,  .....  992.,  -35328.],
        [ -84992.,   14528.,  -26752.,  ..., -124416.,  -88576., -149504.]],
       device='cuda:0')
atol = 0.1, rtol = 0.85, percent = 0.925
    def check_accuracy(a, b, atol, rtol, percent):
        """Unified accuracy checking function with detailed error reporting."""
        if not torch.isfinite(a).all():
            raise Exception("Non-finite values in reference output")
        if not torch.isfinite(b).all():
            raise Exception("Non-finite values in actual output")
        assert a.shape == b.shape, f"Shape mismatch: {a.shape} vs {b.shape}"
    
        close = torch.isclose(a, b, atol=atol, rtol=rtol)
        match_ratio = close.float().mean()
        if match_ratio >= percent:
            return
    
        mismatch_percent = 1.0 - match_ratio.item()
        if mismatch_percent > 1 - percent:
>           raise Exception(
                f"Mismatch percentage is {mismatch_percent:.4f} for rtol {rtol} "
                f"(threshold: {1 - percent:.4f})"
            )
E           Exception: Mismatch percentage is 0.6653 for rtol 0.85 (threshold: 0.0750)

@flashinfer-bot
Collaborator

[FAILED] Pipeline #42839544: 7/20 passed

@yzh119 yzh119 merged commit 83cdea3 into flashinfer-ai:main Jan 30, 2026
16 of 23 checks passed
yzh119 pushed a commit that referenced this pull request Feb 1, 2026
…rt Nemotron" (#2451)

Reverts #2304

It introduces a regression in a unit test and no longer allows the trtllm deepseek routing kernel to run with fewer than 22 experts.


## Summary by CodeRabbit

## Release Notes

* **Refactor**
  * Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
  * Unified intermediate size calculations to consistently use 2x configuration.
  * Streamlined routing logic for improved clarity and maintainability.

* **Breaking Changes**
  * CLI argument `--activation-type` renamed to `--gated-act` with values "swiglu" or "geglu".
  * API parameter names updated from `activation_type` to `gated_act_type` across public interfaces.


yzh119 pushed a commit that referenced this pull request Feb 4, 2026
…ron, fixed (#2462)


## 📌 Description

- Support element-wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use the new ActivationType enum class instead of GatedActType.
- Support Nemotron in deepseek routing as in NVIDIA/TensorRT-LLM#9792.
- Remove the 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of #2304, which was merged and reverted.
- Replaced the problematic condition in deepseek routing that required `NumExperts >= MaxSupportedTopExperts` with `topK <= numExperts`.
  - DeepSeek R1 works with it (tested with VLLM).
- Removed irrelevant test cases.


## 🔍 Related Issues


## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes



## Summary by CodeRabbit

* **Refactor**
  * Replaced old gated-activation API with a unified ActivationType enum (many activation kinds supported).
  * Propagated activation_type across MoE workflows and kernels.

* **New Features**
  * Added CLI option --activation-type to select activation kind for MoE benchmarks.

* **Bug Fixes**
  * Enforced activation compatibility and validation for FP8/FP4 paths.

* **Tests**
  * Updated and expanded tests to cover new activation types and compatibility scenarios.

---------

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…rt Nemotron" (flashinfer-ai#2451)

raayandhar pushed a commit to raayandhar/flashinfer that referenced this pull request Feb 5, 2026
…ron, fixed (flashinfer-ai#2462)
