Tiny optimizations for moe #1717
Conversation
This reverts commit 9178d57.
Summary of Changes
Hello @fzyzcjy, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request focuses on refining and optimizing the Mixture-of-Experts (MoE) implementation within the FlashInfer library. It introduces an experimental, warp-optimized CUDA kernel to potentially speed up element counting, alongside significant enhancements to the MoE benchmarking script. These enhancements include dynamic configuration generation, temporary expert masking for testing, and the integration of CUDA profiling hooks, all aimed at facilitating deeper performance analysis and identifying further optimization opportunities. Additionally, debug logging has been improved in key areas to aid development and understanding of execution flow.
Highlights
- **MoE Kernel Optimization Experimentation**: Introduced an experimental, warp-optimized CUDA kernel (`findTotalEltsLessThanTarget_v2`) for counting elements, alongside the existing binary search approach, suggesting exploration into performance improvements for Mixture-of-Experts (MoE) operations.
- **Benchmarking Enhancements**: Added temporary "HACK" code to the MoE benchmark script for masking selected experts and dynamically setting `tune_max_num_tokens`, along with integrating `cudaProfilerStart` and `cudaProfilerStop` calls for detailed performance analysis.
- **Dynamic Benchmark Configuration**: Modified the `bench_cutlass_fused_moe.py` script to dynamically generate `num_experts` based on `num_ranks` and updated the default `--num-tokens` argument, making benchmark configurations more flexible.
- **Debug Logging Additions**: Incorporated new debug print statements in `flashinfer/fused_moe/core.py` and `flashinfer/autotuner.py` to provide more visibility into tensor shapes and autotuner cache hits during execution.
- **Pyproject.toml Changes**: Commented out license-related fields in `pyproject.toml`, which might be a temporary measure or part of a larger licensing review.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point by creating a comment using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces several optimizations for MoE, but it is clearly a work-in-progress and contains a significant amount of temporary code for debugging and profiling. This includes `if 1:` blocks, "HACK" print statements, and hardcoded CUDA profiler calls in the benchmark file. The C++ code includes an experimental, non-generic function with `asm("trap;")`, which is unsafe. Furthermore, the project's license information has been commented out in `pyproject.toml`, which is a critical issue. All temporary and debug code must be removed, the unsafe C++ function needs to be made robust or removed, and the license information must be restored before this PR can be merged.
```cpp
__device__ inline int64_t findTotalEltsLessThanTarget_v2(T const* sorted_indices, int64_t const arr_length, T const target) {
  constexpr int ARR_LENGTH_CONST = 128;
  if (arr_length != ARR_LENGTH_CONST) {
    asm("trap;");
  }
```
The function `findTotalEltsLessThanTarget_v2` has a hardcoded `ARR_LENGTH_CONST` and uses `asm("trap;")` if the input array length does not match. This makes the function non-generic and unsafe for general use, as it will cause a crash for any other input size. This experimental implementation should be made more robust or removed if it's not ready for production.
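For comparison, both the hardcoded length and the trap can be avoided with a strided warp count. The sketch below is a hypothetical alternative, not code from this PR; it assumes the kernel calls it from a full, converged warp (as the warp-optimized v2 appears to) and handles any `arr_length`:

```cuda
// Hypothetical generic variant of the warp-optimized count (not from this PR).
// Counts elements of sorted_indices that are strictly less than target.
// Assumes all 32 lanes of a converged warp call it with the same arguments.
template <typename T>
__device__ inline int64_t findTotalEltsLessThanTarget_warp(
    T const* sorted_indices, int64_t const arr_length, T const target) {
  int const lane = threadIdx.x % 32;
  int64_t local_count = 0;
  // Each lane inspects a strided subset, so any array length is handled.
  for (int64_t i = lane; i < arr_length; i += 32) {
    local_count += (sorted_indices[i] < target) ? 1 : 0;
  }
  // Butterfly shuffle reduction sums the per-lane counts; after the loop,
  // every lane holds the full total.
  for (int offset = 16; offset > 0; offset >>= 1) {
    local_count += __shfl_xor_sync(0xffffffffu, local_count, offset);
  }
  return local_count;
}
```

Because every lane returns the same total, such a variant could slot in wherever `findTotalEltsLessThanTarget` is dispatched; whether it actually beats the binary search depends on array length and occupancy, which is presumably what this PR's benchmarks are probing.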
```diff
 requires-python = ">=3.9,<4.0"
 authors = [{ name = "FlashInfer team" }]
-license = "Apache-2.0"
+#license = "Apache-2.0"
```
```diff
 urls = { Homepage = "https://github.com/flashinfer-ai/flashinfer" }
 dynamic = ["dependencies", "version"]
-license-files = ["LICENSE", "licenses/*"]
+#license-files = ["LICENSE", "licenses/*"]
```
```python
if 1:
    print("HACK: mask some selected_experts")
    selected_experts[torch.randn(selected_experts.shape) > 1 / num_ranks] = 9999999
```

```python
tune_max_num_tokens = batch_size
print(f"HACK: {tune_max_num_tokens=}")
```
```cpp
__device__ inline int64_t findTotalEltsLessThanTarget(T const* sorted_indices, int64_t const arr_length, T const target) {
  return findTotalEltsLessThanTarget_v1(sorted_indices, arr_length, target);

  // return findTotalEltsLessThanTarget_v2(sorted_indices, arr_length, target);

  // int64_t out_v1 = findTotalEltsLessThanTarget_v1(sorted_indices, arr_length, target);
  // int64_t out_v2 = findTotalEltsLessThanTarget_v2(sorted_indices, arr_length, target);
  // if (out_v1 != out_v2) {
  //   printf("different output! v1=%lld v2=%lld\n", out_v1, out_v2);
  //   asm("trap;");
  // }
  // return out_v1;
}
```
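If the v1/v2 cross-check in the commented-out block is worth keeping during development, it could sit behind a compile-time flag so release builds never reach the unconditional trap. A minimal sketch, reusing the `_v1`/`_v2` helpers from this diff; `MOE_DEBUG_CHECKS` is a hypothetical macro, not an existing FlashInfer flag:

```cuda
// Hypothetical debug-only dispatcher (not from this PR): cross-checks the
// experimental v2 kernel against the binary-search v1 only when
// MOE_DEBUG_CHECKS is defined, and compiles down to plain v1 otherwise.
template <typename T>
__device__ inline int64_t findTotalEltsLessThanTargetChecked(
    T const* sorted_indices, int64_t const arr_length, T const target) {
#ifdef MOE_DEBUG_CHECKS
  int64_t const out_v1 = findTotalEltsLessThanTarget_v1(sorted_indices, arr_length, target);
  int64_t const out_v2 = findTotalEltsLessThanTarget_v2(sorted_indices, arr_length, target);
  if (out_v1 != out_v2) {
    printf("findTotalEltsLessThanTarget mismatch: v1=%lld v2=%lld\n",
           (long long)out_v1, (long long)out_v2);
    __trap();  // abort the kernel, but only in debug builds
  }
  return out_v1;
#else
  return findTotalEltsLessThanTarget_v1(sorted_indices, arr_length, target);
#endif
}
```

One caveat: as written in the PR, `_v2` itself traps whenever `arr_length != 128`, so the debug path would also need a length guard (or a generic variant like the one sketched above) before the comparison is meaningful.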
```python
]
hidden_states = x
hidden_states, input_sf = fp4_quantize(x, a1_gs)
print(f"{hidden_states.shape=}")
```
```python
else:
    # NOTE ADD
    logger.debug(
        f"[AutoTunner]: HACK ADD cache hit {custom_op=} {input_shapes=}"
    )
    return runner, tactic
```
`flashinfer/fused_moe/core.py` (Outdated)
```python
print(
    "hi flashinfer cutlass_fused_moe "
    f"{input.shape=} {input.dtype=} "
    f"{token_selected_experts.shape=}"
)
```
This reverts commit f31b592.
This reverts commit 348a536.
…t. Restore mm_fp4 API behavior (flashinfer-ai#1706)". This reverts commit e8f5460.
This reverts commit bc42393. (Conflicts: tests/test_mm_fp4.py)
This reverts commit d83a3cb.
📌 Description
Has speedup but still WIP, need to sleep now.

EDIT: some of the speedup is in the commit history; it does speed things up but introduces complexity, so I reverted those commits.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- All tests are passing (`unittest`, etc.).

Reviewer Notes