feat: Support Fused MoE non-gated Relu2 NVFP4 & FP8 and support Nemotron #2304
yzh119 merged 25 commits into flashinfer-ai:main
Conversation
📝 Walkthrough
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI
    participant PyCore as Python Core
    participant Launcher as C++ Launcher/Runner
    participant CUDA as CUDA Kernel
    participant Device as Device/Experts
    CLI->>PyCore: parse args (--activation-type, quant_mode, numTopExperts)
    PyCore->>Launcher: build config (activation_type.value, useShuffledMatrix, numTopExperts)
    Launcher->>Launcher: select valid configs / getValidConfigs(act_type, numTopExperts)
    Launcher->>CUDA: launch kernel(config_index, act_type, topK)
    CUDA->>Device: route tokens / apply activation (gated or eltwise)
    Device-->>CUDA: return results
    CUDA-->>Launcher: outputs
    Launcher-->>PyCore: surface results (activation_type, config)
    PyCore-->>CLI: report results / autotuner output
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Summary of Changes: Hello @amitz-nv, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the flexibility of fused Mixture-of-Experts (MoE) operations by introducing a unified ActivationType abstraction in place of the previous gated-activation-only handling.
Code Review
This pull request introduces support for non-gated activations, such as Relu2, into the Fused MoE FP8 kernels. The changes primarily involve refactoring GatedActType to a more general ActivationType enum and plumbing this through the Python wrappers, benchmarks, and C++/CUDA implementation. The refactoring is extensive and mostly well-executed.
I have identified a few issues that need attention. There are potential bugs in csrc/trtllm_fused_moe_runner.cu related to the calculation of workspace size and GEMM configuration validation, which do not correctly handle the doubled intermediate size for gated activations. Additionally, there's a minor code cleanup opportunity in benchmarks/routines/moe.py and a parameter name typo in tests/moe/test_dpsk_fused_moe_fp8.py. Addressing these issues will improve the correctness and clarity of the code.
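For readers skimming the review, here is a minimal Python-side sketch of the unified enum being discussed; the member names and values are inferred from the comments in this thread and may not match the final flashinfer definition exactly.

```python
from enum import IntEnum


class ActivationType(IntEnum):
    # Member set assembled from the review comments below; ordering/values are assumptions.
    Identity = 0
    Swiglu = 1
    SwigluBias = 2
    Geglu = 3
    Relu2 = 4
    Gelu = 5
    Relu = 6
    Silu = 7


# Gated activations fuse w1/w3, doubling the GEMM1 output dimension; this is why
# several comments below multiply intermediateSize by 2 only for these members.
GATED_ACTIVATION_TYPES = {
    ActivationType.Swiglu,
    ActivationType.SwigluBias,
    ActivationType.Geglu,
}
```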
csrc/trtllm_fused_moe_runner.cu (Outdated)
```cpp
return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
                                       numExperts, maxNumCtasInBatchDim, configIndex);
```
The intermediateSize passed to getWorkspaceSizeInBytes does not account for gated activations, where the intermediate dimension is doubled. This could lead to under-allocating workspace. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct workspace size calculation, similar to how it's done in the run method.
```cpp
int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize,
                                       hiddenSize, {}, numTokens, numExperts,
                                       maxNumCtasInBatchDim, configIndex);
```
csrc/trtllm_fused_moe_runner.cu (Outdated)
```cpp
return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {},
                                          numTokens, numExperts, maxNumCtasInBatchDim);
```
The intermediateSize passed to getDefaultValidConfigIndex does not account for gated activations. This could lead to selecting a suboptimal or incorrect default configuration. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config selection.
```cpp
int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize,
                                          hiddenSize, {}, numTokens, numExperts,
                                          maxNumCtasInBatchDim);
```
csrc/trtllm_fused_moe_runner.cu (Outdated)
```cpp
mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
                           numTokens, numExperts, maxNumCtasInBatchDim);
```
The intermediateSize passed to isValidConfigIndex does not account for gated activations. This could lead to incorrect validation of GEMM configurations. It should be multiplied by (isGatedActivation(mActType) ? 2 : 1) to ensure correct config validation.
```cpp
mRunner.isValidConfigIndex(configIndex, numTokens,
                           (isGatedActivation(mActType) ? 2 : 1) * intermediateSize, hiddenSize, {},
                           numTokens, numExperts, maxNumCtasInBatchDim);
```
```cpp
gemmData.mInputBuffers.mPtrScaleC = scaleC;
gemmData.mInputBuffers.mPtrScaleGate = scaleGateC;
// TODO amitz-nv: Do we want to pass scaleAct instead of using scaleGateC?
gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
```
Decide whether it's OK or fix in the future?
Force-pushed 9655f95 to 4c9fb49 (compare)
Force-pushed 4c9fb49 to 9a1ffa0 (compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)
517-538: Guard topK > 8 for non‑Nemotron expert counts.
`LAUNCH_ROUTING_DEEPSEEK` still instantiates kernels with `MaxNumTopExperts=DefaultMaxNumTopExperts` for `numExperts <= NumKimiK2Experts`, but `runImpl` now allows `topK` up to 22. That can drive `params.mTopK` beyond `KernelParams::MaxNumTopExperts` and corrupt stack/shared buffers. Consider either selecting `MaxSupportedTopExperts` whenever `data.mTopK > DefaultMaxNumTopExperts`, or explicitly rejecting such inputs for non-Nemotron expert counts.

🔧 Suggested guard (minimal change)

```diff
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d",
                    MaxSupportedTopExperts, data.mTopK);
+  if (data.mNumExperts < NumNemotronExperts) {
+    FLASHINFER_CHECK(
+        data.mTopK <= DefaultMaxNumTopExperts,
+        "For numExperts < %d, routing kernel supports topK <= %d, got %d",
+        NumNemotronExperts, DefaultMaxNumTopExperts, data.mTopK);
+  }
```

Also applies to: 560-573
benchmarks/routines/moe.py (1)
872-898: Fix stale `args.gated_act` access after the activation-type switch.

`--gated_act` was removed, so this path will raise `AttributeError` when `output_path` is used.

✅ Suggested fix

```diff
- cur_res["gated_act"] = args.gated_act
+ cur_res["activation_type"] = str(args.activation_type)
```

include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/BatchedGemmOptions.h (1)
496-500: Guard optional `mRouteSfsImpl` in `dumpOptions`.

`options.mRouteSfsImpl.value()` will throw if `dumpOptions()` is called before `mRouteSfsImpl` is set (e.g., default options without `checkAndUpdateBatchedGemmOptions`). Guard the optional or emit `nullopt`.

🔧 Suggested fix

```diff
- ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
-    << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+ if (options.mRouteSfsImpl.has_value()) {
+   ss << "mRouteSfsImpl={batchedGemm::RouteImpl("
+      << static_cast<int32_t>(options.mRouteSfsImpl.value()) << ")}," << std::endl;
+ } else {
+   ss << "mRouteSfsImpl={nullopt}," << std::endl;
+ }
```

flashinfer/fused_moe/core.py (1)
210-236: Include gated/non-gated flag in permute cache key.

`permute0` now differs by `is_gated_act_gemm`, but the cache key ignores it, so a gated run can poison the cache for a non-gated run (or vice-versa), yielding incorrect permutations.

🔧 Suggested fix

```diff
- cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+ cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm, num_elts_per_sf)
```

csrc/trtllm_fused_moe_kernel_launcher.cu (1)
533-553: Scale FP8-per-tensor GEMM1 buffers by activation type.

Now that `activation_type` can be non-gated, `prepare_moe()` still allocates GEMM1 output/scales for `2 * intermediate_size`. When the runner uses `M=intermediate_size`, the row stride is wrong and outputs can overlap.

🔧 Suggested fix

```diff
 void prepare_moe(int64_t& moe_tactic) override {
   FusedMoeLauncher::prepare_moe_common(moe_tactic);
   int32_t max_num_padded_tokens_gemm1 = workspace.total_max_padded_tokens + args->num_experts;
   int32_t max_num_padded_tokens_gemm2 = workspace.total_max_padded_tokens;
+
+  int32_t const intermediate_size_factor =
+      tensorrt_llm::kernels::trtllmgen_moe::MoE::isGatedActivation(activation_type) ? 2 : 1;
+  int32_t const gemm1_out_dim = intermediate_size_factor * args->intermediate_size;
-  gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, 2 * args->intermediate_size},
+  gemm1_output = alloc_tensor({max_num_padded_tokens_gemm1, gemm1_out_dim},
                               dl_uint8, hidden_states.device());
-  gemm1_output_scale =
-      alloc_tensor({2 * args->intermediate_size / 128, max_num_padded_tokens_gemm1}, dl_float32,
-                   hidden_states.device());
+  gemm1_output_scale =
+      alloc_tensor({gemm1_out_dim / 128, max_num_padded_tokens_gemm1}, dl_float32,
+                   hidden_states.device());
```
🤖 Fix all issues with AI agents
In `@flashinfer/fused_moe/core.py`:
- Around line 1533-1535: The fake-op function signatures that accept the unused
parameter activation_type should rename that parameter to _activation_type (or
prefix it with an underscore) to silence Ruff ARG001 while preserving signature
compatibility; update the parameter name in each fake-op signature (the
occurrences where activation_type is accepted but unused) and ensure any
internal references (if any) are adjusted accordingly so behavior is unchanged.
In `@include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmOptions.h`:
- Line 553: Update the debug/dump label to match the renamed field: replace the
literal "mUseShuffledMatrixA=" with "mUseShuffledMatrix=" where the code writes
out options.mUseShuffledMatrix (in the dump/ostream code that uses ss <<
"mUseShuffledMatrixA=" << options.mUseShuffledMatrix << ...), so the printed
label matches the actual member name.
In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation helper currently only checks
ActivationType::Swiglu and ActivationType::Geglu; update it to also treat
ActivationType::SwigluBias as gated so code paths that expect gated activations
(e.g., hidden-size handling) follow the correct branch—modify the
isGatedActivation(ActivationType activationType) function to return true for
ActivationType::SwigluBias in addition to Swiglu and Geglu.
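To make the first item above concrete, here is a minimal, hypothetical sketch of the fake-op rename; the function name and parameter list are placeholders rather than the actual flashinfer fake-op registration.

```python
from typing import Optional

# ActivationType.Identity.value is assumed to be 0, matching the enum sketched earlier.


def _fake_trtllm_fp8_per_tensor_scale_moe(  # hypothetical fake-op name
    *tensor_args,
    enable_pdl: Optional[bool] = None,
    _activation_type: int = 0,  # leading underscore silences Ruff ARG001
):
    # Fake ops only mirror the real op's signature for tracing/shape inference;
    # the activation type is intentionally unused here, and positional calls are unaffected.
    ...
```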
♻️ Duplicate comments (2)
csrc/trtllm_batched_gemm_runner.cu (1)
226-227: Follow up on the scaleAct vs scaleGateC TODO. This open question can affect non-gated activation scaling once those paths are exercised.
csrc/trtllm_fused_moe_runner.cu (1)
304-330: Apply gated size factor in workspace/config helpers.
`run()` scales `intermediateSize` for gated activations, but the workspace/config helpers still use the unscaled value, risking under-allocation or invalid config selection for Swiglu/Geglu.

🔧 Suggested fix

```diff
 size_t Runner::getWorkspaceSizeInBytes(int32_t topK, int32_t hiddenSize, int32_t intermediateSize,
                                        int32_t numExperts, int32_t numTokens,
                                        int32_t configIndex) const {
   auto maxNumCtasInBatchDim = Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getWorkspaceSizeInBytes(numTokens, intermediateSizeFactor * intermediateSize,
+                                         hiddenSize, {}, numTokens,
                                          numExperts, maxNumCtasInBatchDim, configIndex);
 }
@@
 int32_t Runner::getDefaultValidConfigIndex(int32_t topK, int32_t hiddenSize, int32_t intermediateSize,
                                            int32_t numExperts, int32_t numTokens) const {
   auto maxNumCtasInBatchDim = Routing::getMaxNumCtasInBatchDim(numTokens, topK, numExperts, mTileTokensDim);
-  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSize, hiddenSize, {}, numTokens,
+  int32_t const intermediateSizeFactor = isGatedActivation(mActType) ? 2 : 1;
+  return mRunner.getDefaultValidConfigIndex(numTokens, intermediateSizeFactor * intermediateSize,
+                                            hiddenSize, {}, numTokens,
                                             numExperts, maxNumCtasInBatchDim);
 }
@@
   auto const isValid =
-      mRunner.isValidConfigIndex(configIndex, numTokens, intermediateSize, hiddenSize, {},
+      mRunner.isValidConfigIndex(configIndex, numTokens,
+                                 (isGatedActivation(mActType) ? 2 : 1) * intermediateSize,
+                                 hiddenSize, {},
                                  numTokens, numExperts, maxNumCtasInBatchDim);
```
🧹 Nitpick comments (4)
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmGatedActOptions.h (1)
79-80: Consider naming `ActType::None` in `getActTypeName`.

Now that `None` is a concrete enum member, `getActTypeName` will return "Unknown type" for it, which can muddy diagnostics. A small switch update keeps logs clear.

♻️ Proposed update

```diff
 inline std::string getActTypeName(ActType type) {
   switch (type) {
     case ActType::SwiGlu:
       return "SwiGlu";
     case ActType::GeGlu:
       return "GeGlu";
+    case ActType::None:
+      return "None";
     default:
       return "Unknown type";
   }
 }
```

include/flashinfer/trtllm/fused_moe/RoutingKernel.h (1)
179-188: Guard MaxNumTopExperts against invalid instantiations.

Now that `MaxNumTopExperts_` is part of the public template surface, a compile-time bound check helps prevent accidental `topK > MaxNumExperts` configurations from compiling and later causing bounds issues.

♻️ Proposed compile-time guard

```diff
 struct KernelParams : public KernelParamsBase<InputT_, OutputT_, MaxNumExperts_, isPow2_, UsePdl_> {
   using InputT = InputT_;
   using BiasT = BiasT_;
   using OutputT = OutputT_;
   static constexpr bool UseGroups = UseGroups_;
   static constexpr int MaxNumTopExperts = MaxNumTopExperts_;
+  static_assert(MaxNumTopExperts_ > 0 && MaxNumTopExperts_ <= MaxNumExperts_,
+                "MaxNumTopExperts must be within [1, MaxNumExperts]");
```

include/flashinfer/trtllm/batched_gemm/KernelRunner.h (1)
50-71: Keep `EltwiseActType` synced with the canonical GEMM enum.

This enum is later compared to config enums via integer casts, so any drift would silently filter out valid configs. Consider aliasing the canonical enum from `Enums.h` or add compile-time guards to ensure numeric parity.

🔧 Suggested safeguard (adjust namespace if needed)

```diff
 enum class EltwiseActType {
   None = 0,
   Gelu,
   Relu2,
 };
+
+static_assert(
+    static_cast<int>(EltwiseActType::None) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::None),
+    "EltwiseActType::None must stay in sync with batchedGemm::gemm::EltwiseActType::None");
+static_assert(
+    static_cast<int>(EltwiseActType::Gelu) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Gelu),
+    "EltwiseActType::Gelu must stay in sync with batchedGemm::gemm::EltwiseActType::Gelu");
+static_assert(
+    static_cast<int>(EltwiseActType::Relu2) ==
+        static_cast<int>(batchedGemm::gemm::EltwiseActType::Relu2),
+    "EltwiseActType::Relu2 must stay in sync with batchedGemm::gemm::EltwiseActType::Relu2");
```

benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
357-364: Argparse enum parsing may be unintuitive.
`type=ActivationType` tends to accept numeric values only (e.g., `--activation-type 3`), which is easy to mis-use. Consider accepting enum names and mapping them explicitly for CLI ergonomics.

🛠️ Suggested CLI parsing tweak

```diff
+def parse_activation_type(value: str) -> ActivationType:
+    try:
+        return ActivationType[value]
+    except KeyError:
+        return ActivationType(int(value))
+
 ...
 parser.add_argument(
     "--activation-type",
-    type=ActivationType,
+    type=parse_activation_type,
     choices=list(ActivationType),
     required=False,
     default=ActivationType.Swiglu,
     help=f"Type of gated activation function: {list(ActivationType)}",
 )
```
```python
    enable_pdl: Optional[bool] = None,
    activation_type: int = ActivationType.Identity.value,
):
```
Silence unused activation_type in fake ops (Ruff ARG001).
The fake ops keep activation_type for signature compatibility but never use it.
🔧 Suggested fix
```diff
- activation_type: int = ActivationType.Identity.value,
+ _activation_type: int = ActivationType.Identity.value,
```

```diff
- activation_type: int,
+ _activation_type: int,
```

Also applies to: 1911-1913
🧰 Tools
🪛 Ruff (0.14.13)
1533-1533: Unused function argument: enable_pdl
(ARG001)
1534-1534: Unused function argument: activation_type
(ARG001)
🤖 Prompt for AI Agents
In `@flashinfer/fused_moe/core.py` around lines 1533 - 1535, The fake-op function
signatures that accept the unused parameter activation_type should rename that
parameter to _activation_type (or prefix it with an underscore) to silence Ruff
ARG001 while preserving signature compatibility; update the parameter name in
each fake-op signature (the occurrences where activation_type is accepted but
unused) and ensure any internal references (if any) are adjusted accordingly so
behavior is unchanged.
include/flashinfer/trtllm/batched_gemm/trtllmGen_bmm_export/GemmOptions.h (Outdated; comment resolved)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
csrc/trtllm_fused_moe_routing_deepseek.cu (1)
518-538: Prevent topK > DefaultMaxNumTopExperts on non‑Nemotron paths (OOB risk).
`data.mTopK` is validated against `MaxSupportedTopExperts` globally, but non-Nemotron branches still instantiate kernels with `DefaultMaxNumTopExperts`. If `data.mTopK > DefaultMaxNumTopExperts` on Deepseek/Kimi (or smaller expert counts), the kernel will index beyond the fixed-size `topScores`/`topExperts` arrays.

A minimal fix is to tighten validation for those branches so `mTopK` never exceeds the instantiated compile-time bound.

🐛 Proposed fix (tighten validation for non-Nemotron)

```diff
   FLASHINFER_CHECK(data.mTopK <= MaxSupportedTopExperts,
                    "Routing kernel expects topK experts <= %d, got %d",
                    MaxSupportedTopExperts, data.mTopK);
+  if (data.mNumExperts <= NumKimiK2Experts) {
+    FLASHINFER_CHECK(data.mTopK <= DefaultMaxNumTopExperts,
+                     "Routing kernel expects topK experts <= %d for %d experts, got %d",
+                     DefaultMaxNumTopExperts, data.mNumExperts, data.mTopK);
+  }
```

Also applies to: 560-562
🤖 Fix all issues with AI agents
In `@csrc/trtllm_fused_moe_routing_deepseek.cu`:
- Around line 568-573: The two validation checks using FLASHINFER_CHECK
incorrectly enforce a minimum expert count by comparing data.mNumExperts >=
MaxSupportedTopExperts; change the logic to validate that the requested top-K
does not exceed available experts (e.g., check topK <= data.mNumExperts or
data.mTopK <= data.mNumExperts) and keep the other check that data.mNumExperts
<= MaxSupportedExpertCount; update the error message to reflect "topK must be <=
`#experts`" and reference FLASHINFER_CHECK, data.mNumExperts,
MaxSupportedTopExperts, MaxSupportedExpertCount, and the top-K variable (topK or
data.mTopK) so the routing kernel accepts small valid expert counts.
```diff
   FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
                    "Routing kernel expects %d to be at most #experts %d", MaxSupportedTopExperts,
                    data.mNumExperts);
-  FLASHINFER_CHECK(data.mNumExperts <= NumKimiK2Experts,
+  FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                    "Routing kernel expects #experts %d <= #threads %d", data.mNumExperts,
-                   NumKimiK2Experts);
+                   MaxSupportedExpertCount);
```
Validation rejects valid small expert counts.
data.mNumExperts >= MaxSupportedTopExperts enforces a minimum expert count of 22, which can block supported configurations (e.g., <= topk::MaxNumExpertsUnit). This check should be about topK <= numExperts, not a hard minimum expert count.
✅ Proposed fix
```diff
- FLASHINFER_CHECK(data.mNumExperts >= MaxSupportedTopExperts,
-                  "Routing kernel expects %d to be at most `#experts` %d", MaxSupportedTopExperts,
-                  data.mNumExperts);
+ FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
+                  "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
+                  data.mNumExperts);
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
FLASHINFER_CHECK(data.mTopK <= data.mNumExperts,
                 "Routing kernel expects topK %d to be <= `#experts` %d", data.mTopK,
                 data.mNumExperts);
FLASHINFER_CHECK(data.mNumExperts <= MaxSupportedExpertCount,
                 "Routing kernel expects `#experts` %d <= `#threads` %d", data.mNumExperts,
                 MaxSupportedExpertCount);
```
/bot run

@flashinfer-bot run
…spaceSizeInBytes, getDefaultValidConfigIndex, isValidConfigIndex Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Force-pushed 20b4ba7 to e63e17d (compare)
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
I don't think so. I just rebased so now main includes the required changes in
I believe the rebase should solve this as well
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
benchmarks/routines/moe.py (1)
872-898: Output export still references `args.gated_act` (now removed).

This will raise `AttributeError` when `--output-path` is used. Please switch to `activation_type`.

🛠️ Proposed fix

```diff
- cur_res["gated_act"] = args.gated_act
+ cur_res["activation_type"] = (
+     args.activation_type.name
+     if isinstance(args.activation_type, ActivationType)
+     else str(args.activation_type)
+ )
```

flashinfer/fused_moe/core.py (1)
215-240: Cache key missing `is_gated_act_gemm`, risking incorrect permutation indices.

The cache key `("w3_w1", dst_w3_w1_weight.shape)` does not include `is_gated_act_gemm`. If the same weight tensor is used with both gated and non-gated activations, the cached permute indices from the first call will be incorrectly reused for the second.

🐛 Proposed fix

```diff
 def _maybe_get_cached_w3_w1_permute_indices(
     _cache_permute_indices,
     dst_w3_w1_weight: torch.Tensor,
     epilogue_tile_m: int,
     num_elts_per_sf: Union[None, int] = None,
     is_gated_act_gemm: bool = True,
 ) -> torch.Tensor:
     # Create a unique cache key (weight_type, weight_shape)
-    cache_key = ("w3_w1", dst_w3_w1_weight.shape)
+    cache_key = ("w3_w1", dst_w3_w1_weight.shape, is_gated_act_gemm)
     if cache_key not in _cache_permute_indices:
```
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 357-364: The argparse configuration uses type=ActivationType which
passes raw strings to the ActivationType constructor and fails for IntEnum;
update the parser call (the parser.add_argument for "--activation-type") to use
a small custom parsing function that maps input strings to ActivationType via
bracket notation (e.g., ActivationType[input_str]) or accepts already-matching
enum members, validate choices using list(ActivationType), and set default to
ActivationType.Swiglu; locate the parser.add_argument where "--activation-type"
is declared and replace type=ActivationType with this custom parser function
(referencing ActivationType and the parser.add_argument invocation).
In `@benchmarks/routines/moe.py`:
- Around line 179-185: Argparse is currently using type=ActivationType which
only accepts integer enum values, causing names like "Swiglu" to fail; change
the add_argument call for "--activation-type" to accept enum names by replacing
type=ActivationType with a converter that maps strings to the enum (e.g., use a
lambda or small function that does ActivationType[item] if input is str or
ActivationType(int(item)) if numeric) and keep choices=list(ActivationType) and
default=ActivationType.Swiglu so both name and numeric inputs work; reference
the ActivationType enum and the "--activation-type" add_argument in the parser
to locate where to apply this change.
In `@include/flashinfer/trtllm/fused_moe/runner.h`:
- Around line 176-178: The isGatedActivation function incorrectly omits
SwigluBias from its gated activation check; update the function
(isGatedActivation in runner.h) to treat ActivationType::SwigluBias as a gated
activation alongside ActivationType::Swiglu and ActivationType::Geglu so its
behavior matches the implementation in moe_gemm_kernels.h.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
benchmarks/routines/moe.py (1)
897-897: Bug: `args.gated_act` no longer exists.

This line references `args.gated_act`, which was renamed to `args.activation_type`. This will cause an `AttributeError` at runtime when `args.output_path` is set.

🐛 Proposed fix

```diff
- cur_res["gated_act"] = args.gated_act
+ cur_res["activation_type"] = str(activation_type)
```

flashinfer/fused_moe/core.py (1)
1423-1509: Normalize `activation_type` before using `.value`.

`trtllm_fp8_per_tensor_scale_moe_op` can be called with `activation_type` as an `int` (the public API default is `ActivationType.Swiglu.value`). In that case, `.value` raises `AttributeError`. Coerce to `ActivationType` (or use `int(...)`) before the `.value` access.

🐛 Suggested fix

```diff
 def trtllm_fp8_per_tensor_scale_moe_op(
@@
     activation_type: ActivationType = ActivationType.Swiglu,
 ) -> torch.Tensor:
+    activation_type = ActivationType(activation_type)
@@
-        activation_type=activation_type.value,
+        activation_type=int(activation_type),
@@
-        activation_type.value,
+        int(activation_type),
```

csrc/trtllm_fused_moe_kernel_launcher.cu (2)
379-402: Validate activation_type before storing.
`activation_type` is an external input; guard against invalid enum values to prevent undefined kernel paths.

🛠️ Proposed fix

```diff
   TVM_FFI_ICHECK(0 <= weight_layout && weight_layout <= 2)
       << "the value of weight_layout is not recognized";
+  auto act_type = static_cast<int64_t>(activation_type);
+  TVM_FFI_ICHECK(act_type >= 0 &&
+                 act_type < static_cast<int64_t>(ActivationType::InvalidType))
+      << "activation_type is not recognized";
   this->weight_layout = static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout);
   this->activation_type = activation_type;
```
487-503: BF16 config generation should ignore non-Swiglu act_type.

The BF16 runtime path hardcodes Swiglu; generating configs for other activations can yield mismatched configs.

🛠️ Proposed fix

```diff
   std::set<int32_t> selected_tile_nums =
       computeSelectedTileN(supported_tile_nums, num_tokens, top_k, num_local_experts);
+  TVM_FFI_ICHECK(static_cast<ActivationType>(act_type) == ActivationType::Swiglu)
+      << "BF16 MoE supports only Swiglu activation.";
+
   for (int32_t tile_N : selected_tile_nums) {
     auto moe_runner = std::make_unique<tensorrt_llm::kernels::trtllmgen_moe::MoE::Runner>(
         btg::Dtype::Bfloat16,  // dtype_act
         btg::Dtype::Bfloat16,  // dtype_weights
         false,                 // useDeepSeekFp8
-        tile_N, static_cast<ActivationType>(act_type), use_shuffled_weight,
+        tile_N, ActivationType::Swiglu, use_shuffled_weight,
         static_cast<batchedGemm::gemm::MatrixLayout>(weight_layout));
```
🤖 Fix all issues with AI agents
In `@tests/moe/test_trtllm_gen_fused_moe.py`:
- Around line 1955-1962: The activation_type_to_func lookup can KeyError on
unsupported ActivationType values; update the logic around activation_type,
activation_type_to_func and activation_func to explicitly handle unknown enums
by either expanding the map to include the remaining ActivationType members or
performing a guarded lookup (e.g., check activation_type in
activation_type_to_func or use dict.get) and raise a clear ValueError listing
supported activation types when not found; ensure the error mentions
ActivationType and activation_type variable so callers see which value was
invalid.
🧹 Nitpick comments (3)
csrc/trtllm_batched_gemm_runner.cu (1)
226-227: Comment typo and potential design clarification needed.

The comment has a typo: "For simplicity pass set scaleAct to scaleGateC" should likely be "For simplicity, set scaleAct to scaleGateC".

More importantly, reusing `scaleGateC` for `scaleAct` may be a simplification that works for current use cases, but consider adding a brief note explaining when this assumption holds (e.g., for specific activation types).

✏️ Suggested comment fix

```diff
- // For simplicity pass set scaleAct to scaleGateC
+ // For simplicity, set scaleAct to scaleGateC (valid when activation scaling matches gate scaling)
  gemmData.mInputBuffers.mPtrScaleAct = scaleGateC;
```

tests/moe/test_trtllm_gen_routed_fused_moe.py (1)
186-186: Consider parameterizing activation types for broader coverage.

The test currently only exercises `ActivationType.Swiglu`. Since this PR adds support for non-gated activations like `Relu2`, consider adding a parametrize decorator to test at least one non-gated activation type (e.g., `ActivationType.Relu2`) to validate the new functionality.

Also applies to: 239-239
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
379-390: CLI `--activation-type` argument is ignored for FP4 path.

The `--activation-type` argument is added to the CLI but not passed to `bench_trtllm_gen_fused_moe_autotuner_fp4`. If FP4 supports non-Swiglu activations, consider adding the parameter. If FP4 only supports Swiglu, consider either:
- Validating and warning when a non-Swiglu activation is specified with an FP4 quant mode, or
- Documenting this limitation in the help text.
🛠️ Option 1: Pass activation_type to FP4 (if supported)
```diff
 else:
     bench_trtllm_gen_fused_moe_autotuner_fp4(
         args.tune_max_num_tokens,
         args.quant_mode,
         args.num_tokens,
         args.num_experts,
         args.hidden_size,
         args.intermediate_size,
         args.top_k,
         args.warmups,
         args.iterations,
+        args.activation_type,
     )
```

🛠️ Option 2: Warn if non-Swiglu specified for FP4

```diff
 else:
+    if args.activation_type != ActivationType.Swiglu:
+        print(f"[WARNING] FP4 path only supports Swiglu activation. Ignoring --activation-type={args.activation_type}")
     bench_trtllm_gen_fused_moe_autotuner_fp4(
         ...
     )
```
```diff
 activation_type = args.activation_type
 activation_type_to_func = {
     ActivationType.Swiglu: F.silu,
     ActivationType.Geglu: F.gelu,
     ActivationType.Relu2: lambda x: F.relu(x) ** 2,
 }
-gated_act_func = gated_act_type_to_func[gated_act_type]
+activation_func = activation_type_to_func[activation_type]
```
Guard unsupported ActivationType values in the reference activation map.
The reference path now accepts ActivationType, but the mapping only covers Swiglu, Geglu, and Relu2. Any other enum value (e.g., Gelu, Relu, Silu, Identity, SwigluBias) will currently throw a KeyError. Consider expanding the map and/or adding a clear error for unsupported activations.
🔧 Suggested hardening
```diff
 activation_type = args.activation_type
 activation_type_to_func = {
     ActivationType.Swiglu: F.silu,
     ActivationType.Geglu: F.gelu,
     ActivationType.Relu2: lambda x: F.relu(x) ** 2,
+    ActivationType.Gelu: F.gelu,
+    ActivationType.Relu: F.relu,
+    ActivationType.Silu: F.silu,
+    ActivationType.Identity: lambda x: x,
 }
-activation_func = activation_type_to_func[activation_type]
+activation_func = activation_type_to_func.get(activation_type)
+if activation_func is None:
+    raise NotImplementedError(
+        f"ActivationType {activation_type} not supported in reference path yet."
+    )
```

🤖 Prompt for AI Agents
In `@tests/moe/test_trtllm_gen_fused_moe.py` around lines 1955 - 1962, The
activation_type_to_func lookup can KeyError on unsupported ActivationType
values; update the logic around activation_type, activation_type_to_func and
activation_func to explicitly handle unknown enums by either expanding the map
to include the remaining ActivationType members or performing a guarded lookup
(e.g., check activation_type in activation_type_to_func or use dict.get) and
raise a clear ValueError listing supported activation types when not found;
ensure the error mentions ActivationType and activation_type variable so callers
see which value was invalid.
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
270-299: FP4 autotuner ignores the requested activation type.

The CLI accepts `--activation-type`, but the FP4 path hard-codes `Swiglu`, so `Relu2`/other activations can't be benchmarked. Consider threading the argument through (or explicitly rejecting non-Swiglu for FP4).

🛠️ Proposed fix

```diff
 def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
     iterations: int,
+    activation_type: ActivationType,
 ):
     ...
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
         None,
         num_tokens if tune_max_num_tokens is None else tune_max_num_tokens,
     )
@@
     bench_trtllm_gen_fused_moe_autotuner_fp4(
         args.tune_max_num_tokens,
         args.quant_mode,
         args.num_tokens,
         args.num_experts,
         args.hidden_size,
         args.intermediate_size,
         args.top_k,
         args.warmups,
         args.iterations,
+        args.activation_type,
     )
```
🤖 Fix all issues with AI agents
In `@benchmarks/routines/flashinfer_benchmark_utils.py`:
- Around line 458-470: The converter inside enum_type incorrectly lowercases all
but the first char (in function converter), causing camelCase names like
SwigluBias to be mangled and rejected; update converter to perform a
case-insensitive lookup by comparing the incoming string (value) to enum member
names in a casefold/lower-insensitive way (and accept numeric indices/values
where appropriate) so that enum_type and callers like ActivationType accept
"SwigluBias", "swiglubias", or numeric inputs; implement this by normalizing
value (e.g., casefold()) and matching against member.name.casefold() or trying
int(value) fallback before raising argparse.ArgumentTypeError listing valid
options.
In `@benchmarks/routines/moe.py`:
- Line 628: The code still reads args.gated_act (which no longer exists) when
constructing the output/result export, causing a crash; replace all uses of
args.gated_act with args.activation_type (e.g., update variables like
activation_type = args.activation_type and any places that populate the output
CSV/dict) and, if you must keep the original column name, map
args.activation_type into the existing 'gated_act' output field when building
the results (ensure references in result-building code and the export writer use
activation_type/args.activation_type instead of args.gated_act).
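A possible shape for the case-insensitive converter described in the `flashinfer_benchmark_utils.py` prompt above (a sketch only; the real `enum_type` helper may differ):

```python
import argparse


def enum_type(enum_cls):
    """Return an argparse converter accepting enum member names (any case) or integer values."""

    def converter(value):
        if isinstance(value, enum_cls):  # already a member (e.g. a default)
            return value
        for member in enum_cls:  # case-insensitive name lookup: "SwigluBias", "swiglubias", ...
            if member.name.casefold() == str(value).casefold():
                return member
        try:  # numeric fallback, e.g. "--activation-type 4"
            return enum_cls(int(value))
        except ValueError:
            raise argparse.ArgumentTypeError(
                f"invalid {enum_cls.__name__}: {value!r}; "
                f"valid options: {[m.name for m in enum_cls]}"
            )

    return converter
```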
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@benchmarks/bench_trtllm_gen_fused_moe_autotuner.py`:
- Around line 358-365: The current parser.add_argument for "--activation-type"
uses type=enum_type(ActivationType) (which returns enum members) but choices are
strings, causing validation to always fail; update the argument so type and
choices match — either make choices a list of enum members (e.g.,
choices=list(ActivationType)) when keeping type=enum_type(ActivationType), or
keep choices=[e.name for e in ActivationType] and change the converter to parse
names (e.g., type=lambda s: ActivationType[s]); adjust the parser.add_argument
call for "--activation-type" accordingly so both type and choices use the same
representation.
In `@benchmarks/routines/moe.py`:
- Around line 180-185: The argparse setup for "--activation-type" mixes enum
members (from type=enum_type(ActivationType)) with string choices, causing valid
enum inputs to be rejected; update the choices to be enum members (e.g.,
choices=list(ActivationType) or [e for e in ActivationType]) so they match the
converter returned by enum_type(ActivationType), keep default as
ActivationType.Swiglu, and adjust the help text (e.g., show [e.name for e in
ActivationType]) if you want human-readable names.
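For illustration, one self-contained way to keep the converter and the choices in the same representation (member names and values here are trimmed assumptions, not the real enum):

```python
import argparse
from enum import IntEnum


class ActivationType(IntEnum):  # trimmed member set for the example
    Swiglu = 1
    Geglu = 3
    Relu2 = 4


def enum_type(enum_cls):
    def converter(value):
        try:
            return enum_cls[str(value)]  # accept member names
        except KeyError:
            return enum_cls(int(value))  # fall back to numeric values

    return converter


parser = argparse.ArgumentParser()
parser.add_argument(
    "--activation-type",
    type=enum_type(ActivationType),  # converter returns enum members...
    choices=list(ActivationType),    # ...so choices must also be members, not name strings
    default=ActivationType.Swiglu,
    help=f"Activation type: {[e.name for e in ActivationType]}",
)

args = parser.parse_args(["--activation-type", "Relu2"])
assert args.activation_type is ActivationType.Relu2
```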
…hmarks Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
270-301: FP4 path ignores user-supplied activation type.
`--activation-type` is parsed but the FP4 benchmark hardcodes `Swiglu`, so Relu2 (and other types) can't be exercised in FP4 modes. Consider threading the CLI value into the FP4 path.

💡 Suggested fix (thread activation_type into FP4 path)

```diff
 def bench_trtllm_gen_fused_moe_autotuner_fp4(
     tune_max_num_tokens: Optional[int],
     quant_mode: Literal["NvFP4xNvFP4", "MxFP4xMxFP8", "MxFP4xBf16"],
     num_tokens: int,
     num_experts: int,
     hidden_size: int,
     intermediate_size: int,
     top_k: int,
     warmups: int,
-    iterations: int,
+    iterations: int,
+    activation_type: ActivationType,
 ):
@@
-        ActivationType.Swiglu.value,  # act_type
+        activation_type.value,  # act_type
@@
     bench_trtllm_gen_fused_moe_autotuner_fp4(
         args.tune_max_num_tokens,
         args.quant_mode,
         args.num_tokens,
         args.num_experts,
         args.hidden_size,
         args.intermediate_size,
         args.top_k,
         args.warmups,
-        args.iterations,
+        args.iterations,
+        args.activation_type,
     )
```
🧹 Nitpick comments (1)
benchmarks/routines/moe.py (1)
1742-1765: Consider adding `activation_type` to result export for consistency.

`testTrtllmFp4BlockScaleMoe` attempts to export activation info (though buggy at line 898), but `testTrtllmFp8PerTensorScaleMoe` omits it entirely. For consistent benchmark output, consider adding:

```diff
     cur_res["input_dtype"] = input_dtype
     cur_res["weight_dtype"] = weight_dtype
+    cur_res["activation_type"] = args.activation_type.name
     res.append(cur_res)
```
…fix) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
…tions Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/bench_trtllm_gen_fused_moe_autotuner.py (1)
101-130: FP8 block-scale kernel is hardcoded to Swiglu but other activations are not rejected.

The `trtllm_fp8_block_scale_moe` kernel function does not accept an `activation_type` parameter and is intentionally hardcoded to Swiglu (confirmed in the kernel implementation: `constexpr ActivationType activation_type = ActivationType::Swiglu; // not exposed in api for now`). However, the code only rejects `Relu2`, allowing any other activation type to pass through and silently use Swiglu, which will produce incorrect results.

✅ Fix: Enforce Swiglu-only for block-scale

```diff
 if is_block_scale:
-    if activation_type == ActivationType.Relu2:
+    if activation_type != ActivationType.Swiglu:
         raise ValueError(
-            "Relu2 activation is not supported for FP8 block scale MoE."
+            "Only Swiglu is supported for FP8 block scale MoE."
         )
```
…py for trtllm_fp8_block_scale_moe Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
benchmarks/routines/moe.py (1)
390-417: Adjust TFLOPS calculation to account for non-gated activation weight dimensions.

The TFLOPS calculation always assumes `2 * intermediate_size` for the first GEMM, which is correct for gated activations (Swiglu, Geglu) but overstates the computational work for non-gated activations like Relu2, which use only `intermediate_size`. The test reference implementation confirms this: Relu2 computes `F.relu(a @ weight.t()) ** 2` using the full weight tensor (shape [num_experts, intermediate_size, hidden_size]), while Swiglu splits the weight into two halves.

Update `calculate_moe_tflops` to branch based on activation type (see the sketch after this comment):

- For gated activations: keep the current `2 * intermediate_size` calculation
- For non-gated activations (Relu2, Identity): use `intermediate_size`

Alternatively, if the function cannot be made activation-aware, the function signature should be updated to accept `activation_type` as a parameter.
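As referenced above, a sketch of what the activation-aware branch could look like; the function name and argument list are illustrative, not the actual `calculate_moe_tflops` signature:

```python
# ActivationType as sketched earlier in this thread; member names are assumptions.
def gemm1_flops(num_tokens, top_k, hidden_size, intermediate_size, activation_type):
    """FLOPs for the first MoE GEMM, counted as 2 * M * N * K over routed tokens.

    Gated activations (Swiglu, SwigluBias, Geglu) project to 2 * intermediate_size
    because w1 and w3 are fused; non-gated activations (Relu2, Identity) project
    to intermediate_size only.
    """
    gated = activation_type in (
        ActivationType.Swiglu,
        ActivationType.SwigluBias,
        ActivationType.Geglu,
    )
    gemm1_n = (2 if gated else 1) * intermediate_size
    return 2 * num_tokens * top_k * hidden_size * gemm1_n
```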
🧹 Nitpick comments (2)
benchmarks/routines/moe.py (2)
1507-1532: Note: `activation_type` not included in FP8 block scale results.

Per the PR scope (NVFP4 and FP8PerTensor only), `testTrtllmFp8BlockScaleMoe` doesn't support `activation_type`. However, this creates a column inconsistency when combining benchmark results from all three routines with `--output-path`. Consider adding a placeholder for consistency:

```diff
     cur_res["input_dtype"] = input_dtype
     cur_res["weight_dtype"] = weight_dtype
+    cur_res["activation_type"] = "N/A"  # FP8 block scale doesn't support activation_type yet
     res.append(cur_res)
```
1236-1262: Same observation: `activation_type` not in CUTLASS results. For output consistency across all MOE benchmark routines, consider adding a placeholder entry similar to the suggestion for FP8 block scale.
/bot run

Hi @amitz-nv, seems there are some abnormal output values on GitLab CI, e.g.
[FAILED] Pipeline #42839544: 7/20 passed |
…rt Nemotron" (#2451)

Reverts #2304, as it introduces a regression on unit tests and no longer allows a number of experts lower than 22 to run the trtllm deepseek routing kernel.

## Summary by CodeRabbit

## Release Notes

* **Refactor**
  * Consolidated gated activation type handling across MoE implementations with simplified parameter names and enum naming.
  * Unified intermediate size calculations to consistently use 2x configuration.
  * Streamlined routing logic for improved clarity and maintainability.
* **Breaking Changes**
  * CLI argument `--activation-type` renamed to `--gated-act` with values "swiglu" or "geglu".
  * API parameter names updated from `activation_type` to `gated_act_type` across public interfaces.
…ron, fixed (#2462)

## 📌 Description

- Support element wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use new ActivationType enum class instead of GatedActType.
- Support Nemotron in deepseek routing as in NVIDIA/TensorRT-LLM#9792
- Remove 'A' suffix from UseShuffledMatrixA.

NOTE: This is the fixed version of #2304 that was merged and reverted.

- Replaced the problematic condition in deepseek routing that required `NumExperts >= MaxSupportedTopExperts` with `topK<=numExperts` - DeepSeek R1 works with it (tested with VLLM).
- Removed irrelevant test cases.

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

## Summary by CodeRabbit

* **Refactor**
  * Replaced old gated-activation API with a unified ActivationType enum (many activation kinds supported).
  * Propagated activation_type across MoE workflows and kernels.
* **New Features**
  * Added CLI option --activation-type to select activation kind for MoE benchmarks.
* **Bug Fixes**
  * Enforced activation compatibility and validation for FP8/FP4 paths.
* **Tests**
  * Updated and expanded tests to cover new activation types and compatibility scenarios.

Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com>
📌 Description
- Support element wise activation (relu^2) in fused MoE in NVFP4 and in FP8PerTensor.
- Use new ActivationType enum class instead of GatedActType.
- Support Nemotron in deepseek routing as in NVIDIA/TensorRT-LLM#9792
- Remove 'A' suffix from UseShuffledMatrixA.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Breaking Changes