[RFC] MXFP8 autotune IMA at some batch size #2800
charlotte12l wants to merge 1 commit into flashinfer-ai:main from
Conversation
📝 Walkthrough
This PR fixes an out-of-bounds read issue in the MoE GEMM kernel when using MxFp8 quantization. The fix reconstructs the `current_hidden_states_scale` tensor at the expected size when the buffer supplied during autotuning is too small.
Summary of Changes (Gemini Code Assist)
This pull request provides a provisional solution to a critical CUDA 'illegal memory access' error encountered during the autotuning process for MXFP8 quantization within vLLM, particularly affecting smaller batch sizes. The change ensures that the scale tensors used in the Mixture-of-Experts (MoE) GEMM kernel are allocated with the correct dimensions, thereby preventing runtime failures. While this fix resolves the error and maintains evaluation accuracy, the author notes a performance regression, indicating that further optimization or a more fundamental solution may be required.
Activity
Code Review
This pull request addresses a `CUDA error: an illegal memory access was encountered` failure that occurs during autotuning for MXFP8 with small batch sizes. The fix correctly identifies that the scale tensor created by `DynamicTensorSpec` can be undersized and resizes it to the correct dimension, which effectively prevents the crash. My feedback focuses on improving this fix to address the performance regression mentioned in the pull request description. I suggest using more realistic random data for the recreated scale tensor instead of filling it with ones, which should help the autotuner select a more optimal and performant kernel.
```python
if current_hidden_states_scale.numel() < sf_size:
    current_hidden_states_scale = torch.ones(
        (sf_size,),
        dtype=torch.uint8,
        device=hidden_states.device,
    )
```
This is a good catch to fix the illegal memory access error during autotuning. The logic to check and resize the current_hidden_states_scale tensor is correct.
However, as you noted, this fix may cause a performance regression. This is likely because filling the tensor with torch.ones does not provide realistic scale values for the autotuner. The autotuner might be selecting a suboptimal kernel based on this non-representative data.
To potentially resolve the performance regression, I suggest initializing the tensor with random data, which better simulates real-world scale values. This should help the autotuner find a more performant kernel.
```diff
 if current_hidden_states_scale.numel() < sf_size:
-    current_hidden_states_scale = torch.ones(
-        (sf_size,),
+    current_hidden_states_scale = torch.randint(
+        0, 256, (sf_size,),
         dtype=torch.uint8,
         device=hidden_states.device,
     )
```
🧹 Nitpick comments (1)
flashinfer/fused_moe/core.py (1)
1160-1173: UE8M0 scale value encoding: byte 1 ≠ 1.0 scale.

The fix correctly prevents out-of-bounds reads during autotuner profiling. However, `torch.ones(..., dtype=torch.uint8)` creates a tensor filled with byte value `1`, which in UE8M0 format represents `2^(1-127) = 2^(-126) ≈ 0`, not a scale of 1.0. For UE8M0, byte value `127` represents `2^(127-127) = 1.0`. While this may not cause crashes, using near-zero scales during profiling could affect tactic selection accuracy.

Suggested fix to use correct UE8M0 encoding for 1.0:
```diff
 if current_hidden_states_scale.numel() < sf_size:
-    current_hidden_states_scale = torch.ones(
+    current_hidden_states_scale = torch.full(
         (sf_size,),
+        127,  # UE8M0 encoding for scale = 1.0
         dtype=torch.uint8,
         device=hidden_states.device,
     )
```

Based on learnings: In FlashInfer's quantization code, `torch.float8_e4m3fn` is used as a carrier dtype for 1-byte scale factors (UE8M0, etc.) — the raw bytes are interpreted by C++ kernels according to the actual format semantics.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `flashinfer/fused_moe/core.py` around lines 1160-1173: the created padding scale uses `torch.ones(..., dtype=torch.uint8)`, which sets raw byte value 1 (not 1.0 in UE8M0). Replace that with a uint8 tensor filled with the UE8M0 encoding for 1.0 (byte value 127) so `current_hidden_states_scale` has correct 1.0 scale bytes; update the creation site that constructs `current_hidden_states_scale` (the `torch.ones` call using `sf_size`, `dtype=torch.uint8`, `device=hidden_states.device`) to create a tensor filled with 127 instead (or use the established float8 carrier path used elsewhere) to ensure profiling uses true 1.0 scales.
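To make the encoding concrete, here is a minimal decoding sketch (the helper name is hypothetical; UE8M0 stores only an 8-bit biased exponent, i.e. value = 2^(byte - 127)):

```python
import torch

def ue8m0_to_float(b: torch.Tensor) -> torch.Tensor:
    # UE8M0 stores an 8-bit biased exponent only: value = 2^(byte - 127)
    return torch.pow(2.0, b.to(torch.float32) - 127.0)

scales = torch.tensor([1, 126, 127, 128], dtype=torch.uint8)
print(ue8m0_to_float(scales))
# tensor([1.1755e-38, 5.0000e-01, 1.0000e+00, 2.0000e+00])
```

This is why a buffer of byte-value-1 "ones" behaves like near-zero scales during profiling, while byte 127 decodes to exactly 1.0.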
📒 Files selected for processing (1)
flashinfer/fused_moe/core.py
```python
current_hidden_states_scale = torch.ones(
    (sf_size,),
    dtype=torch.uint8,
    device=hidden_states.device,
)
```
Considering the e8m0 data, maybe use 126-127 (0.5-2).
```diff
 if current_hidden_states_scale.numel() < sf_size:
-    current_hidden_states_scale = torch.ones(
-        (sf_size,),
+    current_hidden_states_scale = torch.randint(
+        126, 128, (sf_size,),
         dtype=torch.uint8,
         device=hidden_states.device,
     )
```
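For reference, a quick check (assuming the same UE8M0 decoding as in the nitpick above) shows what this suggested initializer produces; note that the upper bound of `torch.randint` is exclusive, so the bytes are 126 or 127:

```python
import torch

s = torch.randint(126, 128, (8,), dtype=torch.uint8)  # bytes in {126, 127}
print(torch.pow(2.0, s.float() - 127.0))              # decoded scales in {0.5, 1.0}
```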
```diff
@@ -1157,7 +1157,20 @@ def forward(
     )
 elif self.fp8_quantization_type == Fp8QuantizationType.MxFp8:
     current_hidden_states_scale = extra_inputs[0]
```
```diff
-current_hidden_states_scale = extra_inputs[0]
+current_hidden_states_scale = hidden_states_scale
```
… (#2725)

## Summary

SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

**Changes:**
- `jit/fused_moe.py`: Add major version 12 to `supported_major_versions`
- `csrc/trtllm_fused_moe_kernel_launcher.cu`: `ICHECK_EQ(major, 10)` -> `ICHECK_GE(major, 10)`

**Benchmark** (Qwen3.5-397B on 4x RTX PRO 6000 SM120):

| Config | tok/s | Output |
|--------|-------|--------|
| compute_120f (CUDA 13.0) | 39.0 | Correct |
| compute_120a (CUDA 12.8) | 14.6 | Correct (slow fallback) |
| Marlin W4A16 | 46-49 | Correct |

**Root cause:** All TMA WS grouped GEMM autotuner tactics fail on `compute_120a`, requiring `compute_120f` (CUDA 13.0). CuTe DSL `admissible_archs` in vendored CUTLASS also needs `sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py, etc).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577

## Summary by CodeRabbit

* **Bug Fixes**
  * Broadened GPU architecture checks to accept additional modern compute capabilities (SM 10.x and 12.x), improving compatibility and clearer SM reporting.
  * Improved compute-capability detection and encoding, preserving user-provided architecture suffixes and more accurately generating nvcc architecture flags.
  * Expanded JIT module generation to include additional CUDA majors so fused-MoE kernels run on more recent GPUs.

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Co-authored-by: Brandon Music <brandonmmusic-max@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Music <brandonmusic@pop-os.tail8674da.ts.net>
📌 Description
We are running vLLM with MXFP8 and hit the flashinfer autotune error below with `max_batch_size=64`; however, `max_batch_size=128` or `max_batch_size=256` is fine. We tried with `CUDA_LAUNCH_BLOCKING=1`, but the stack trace is the same. The current PR avoids the error and eval accuracy is fine; however, we doubt it is the correct fix, since perf regressed.
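For reviewers who want to reason about the shape logic, below is a minimal, self-contained sketch of the guard this PR adds. The `sf_size` formula and helper name are assumptions for illustration only; the real kernel may apply additional padding or swizzling to the scale-factor layout:

```python
import math
import torch

SF_BLOCK_SIZE = 32  # assumption: one UE8M0 scale byte per 32-element MXFP8 block

def pad_hidden_states_scale(scale: torch.Tensor, num_tokens: int,
                            hidden_size: int) -> torch.Tensor:
    # Recreate the scale tensor at full size if the buffer handed over
    # during autotuning is too small for the current batch shape.
    sf_size = num_tokens * math.ceil(hidden_size / SF_BLOCK_SIZE)
    if scale.numel() < sf_size:
        # 127 is the UE8M0 byte for scale 1.0; torch.ones would give byte 1,
        # i.e. 2**-126, as discussed in the review comments above.
        scale = torch.full((sf_size,), 127, dtype=torch.uint8,
                           device=scale.device)
    return scale
```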
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
Release Notes