
[RFC] MXFP8 autotune IMA at some batch size #2800

Open

charlotte12l wants to merge 1 commit into flashinfer-ai:main from charlotte12l:autotune

Conversation

@charlotte12l
Contributor

@charlotte12l charlotte12l commented Mar 17, 2026

📌 Description

We are running vLLM with MXFP8 and hit the flashinfer autotune error below with max_batch_size=64; max_batch_size=128 and max_batch_size=256 are fine. Running with CUDA_LAUNCH_BLOCKING=1 produces the same stack trace.

The current PR avoids the error and eval accuracy is fine, but we doubt it is the correct fix, since performance regressed.

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,483 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,812 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [8, 10], due to failure while profiling: CUDA error: an illegal memory access was encountered

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) Device-side assertion tracking was not enabled by user.

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,813 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [8, 11], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,814 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [8, 14], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,814 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [8, 15], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,814 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [16, 10], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,815 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [16, 11], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,815 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [16, 14], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,815 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [16, 15], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,815 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [32, 10], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,816 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [32, 11], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,816 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [32, 14], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,816 - WARNING - autotuner.py:490 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_trtllm_moe_sm100_module.<locals>.MoERunner object at 0xffde1df4af30> [32, 15], due to failure while profiling: Error in function 'run' at <REDACTED_PATH>/trtllm_fused_moe_routing_renormalize.cu:463: Got CUDA error. See above for details.

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) 2026-03-13 12:01:42,820 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] WorkerProc hit an exception.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] Traceback (most recent call last):
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/multiproc_executor.py", line 927, in worker_busy_loop
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     output = func(*args, **kwargs)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/gpu_worker.py", line 594, in compile_or_warm_up_model
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     kernel_warmup(self)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/kernel_warmup.py", line 46, in kernel_warmup
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     flashinfer_autotune(worker.model_runner)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/kernel_warmup.py", line 103, in flashinfer_autotune
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     runner._dummy_run(
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/_contextlib.py", line 124, in decorate_context
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     return func(*args, **kwargs)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/gpu_model_runner.py", line 5182, in _dummy_run
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     outputs = self.model(
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/cuda_graph.py", line 241, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     return self.runnable(*args, **kwargs)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/decorators.py", line 492, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/wrapper.py", line 225, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     return self._call_with_optional_nvtx_range(
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/wrapper.py", line 119, in _call_with_optional_nvtx_range
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     return callable_fn(*args, **kwargs)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/model.py", line 975, in forward
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/eval_frame.py", line 1266, in _fn
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/output_graph.py", line 2587, in _tf_disabled_wrapper
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/caching.py", line 206, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/graph_module.py", line 949, in call_wrapped
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/graph_module.py", line 461, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/module.py", line 1778, in _wrapped_call_impl
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/module.py", line 1789, in _call_impl
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<eval_with_key>.74", line 459, in forward
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     submod_4 = self.submod_4(...);  ... = None

(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/cuda_graph.py", line 241, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/piecewise_backend.py", line 379, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/standalone_compile.py", line 122, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/eval_frame.py", line 1266, in _fn
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/aot_autograd.py", line 1210, in forward
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/runtime_wrappers.py", line 582, in runtime_wrapper
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/utils.py", line 138, in call_func_at_runtime_with_args
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/runtime_wrappers.py", line 2311, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/runtime_wrappers.py", line 785, in wrapper
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/runtime_wrappers.py", line 989, in inner_fn
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/output_code.py", line 673, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/utils.py", line 3439, in run
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/<torchinductor_generated>.py", line 1413, in call
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     buf12 = torch.ops.vllm.moe_forward.default(buf11, buf10, None, arg10_1)
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/_ops.py", line 871, in __call__
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/default_moe_runner.py", line 85, in _moe_forward
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/default_moe_runner.py", line 678, in forward_impl
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/mxfp8_moe_method.py", line 234, in apply_monolithic
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/moe_post_norm_fi_trtllm.py", line 348, in run_fi_trtllm_moe_mxfp8_with_post_norm
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/flashinfer_utils.py", line 424, in flashinfer_fused_moe_mxfp8_routed
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/flashinfer_utils.py", line 151, in flashinfer_trtllm_fp8_block_scale_routed_moe
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/core.py", line 2618, in trtllm_fp8_block_scale_routed_moe
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/core.py", line 1709, in trtllm_fp8_block_scale_moe_op
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/autotuner.py", line 469, in choose_one
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/autotuner.py", line 775, in _prepare_input_tensors
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/autotuner.py", line 764, in _create_tensor_like
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]   File "<REDACTED_PATH>/core.py", line 969, in <lambda>
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932]     lambda shapes, dtype, device: torch.randn(shapes, device=device).to(
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] Device-side assertion tracking was not enabled by user.
(Worker pid=4178717) (Worker_DP0_EP0 pid=4178717) ERROR 03-13 12:01:42 [multiproc_executor.py:932] Traceback (most recent call last):

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Fixed an out-of-bounds read issue in the FP8 quantization path that could impact inference stability and correctness during MoE model execution.

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

Walkthrough

This PR fixes an out-of-bounds read issue in the MoE GEMM kernel when using MxFp8 quantization. The fix reconstructs the current_hidden_states_scale tensor to the correct size (sf_size) by replacing it with a ones tensor when needed, preventing runtime crashes during forward passes.

Changes

Cohort / File(s) Summary
FP8 MxFp8 Scale Reconstruction
flashinfer/fused_moe/core.py
Added logic to compute padded_k and sf_size, then reconstruct the scale tensor with appropriate dimensions to prevent out-of-bounds reads in GEMM kernel. Removed inline comment in trtllm_fp8_block_scale_moe path.
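The sizing arithmetic behind that reconstruction can be sketched in plain Python. This is a hypothetical illustration of the index math only: the real code in flashinfer/fused_moe/core.py operates on torch tensors, the function name here is invented, and the one-scale-byte-per-32-element-block convention is assumed from the MXFP8 format.

```python
def scale_factor_count(num_tokens: int, hidden_dim: int,
                       sf_block_size: int = 32) -> int:
    """Number of per-block scale bytes needed: hidden_dim is padded up to a
    multiple of the block size, and each 32-element block gets one byte."""
    padded_k = (hidden_dim + sf_block_size - 1) // sf_block_size * sf_block_size
    return num_tokens * (padded_k // sf_block_size)

# A dummy scale tensor with fewer elements than this lets the GEMM kernel
# read past the end of the buffer: the illegal memory access seen above.
print(scale_factor_count(64, 7168))  # 64 tokens x 224 blocks = 14336
```

If the autotuner's DynamicTensorSpec allocates the profiling scale tensor from the unpadded shape, its element count can fall below this value, which is what the PR's size check guards against.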

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • #2412: Updates FP8/block-layout docstrings and comments in the same MoE quantization code path, addressing documentation alongside the current runtime fix.

Suggested labels

op: moe

Suggested reviewers

  • yzh119
  • nv-yunzheq

Poem

🐰 Scales were dancing out of bounds,
Causing kernels to crash with sounds,
A rabbit fixed the MoE way,
With proper sizing, all is gay!
FP8 quantization now runs true. ✨

🚥 Pre-merge checks | ❌ 3

❌ Failed checks (1 warning, 2 inconclusive)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check ❓ Inconclusive: The title is vague and lacks clarity about the specific fix; '[RFC]MXFP8 autotune IMA at some batch size' uses non-specific terms and does not clearly convey the core change. Resolution: clarify the title to describe what fix was applied, e.g., 'Fix MXFP8 autotune scale reconstruction for batch size 64' or similar.
  • Description check ❓ Inconclusive: The PR description provides context about the CUDA error but lacks structured information and fails to clearly explain the fix or address checklist items. Resolution: add a detailed description of what the fix does, why it addresses the root cause, and address the PR checklist items (pre-commit checks, tests). Clarify whether the performance regression is acceptable or requires further investigation.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request provides a provisional solution to a critical CUDA 'illegal memory access' error encountered during the autotuning process for MXFP8 quantization within vLLM, particularly affecting smaller batch sizes. The change ensures that the scale tensors used in the Mixture-of-Experts (MoE) GEMM kernel are allocated with the correct dimensions, thereby preventing runtime failures. While this fix resolves the error and maintains evaluation accuracy, the author notes a performance regression, indicating that further optimization or a more fundamental solution may be required.

Highlights

  • MXFP8 Autotuning Fix: Addressed an 'illegal memory access' CUDA error that occurred during MXFP8 autotuning in vLLM, specifically at max_batch_size=64, by ensuring the scale tensor is correctly sized.
  • Scale Tensor Resizing: Implemented logic to detect and correct undersized 1D scale tensors created by DynamicTensorSpec during autotuner profiling, preventing out-of-bounds reads in the MoE GEMM kernel.


Changelog
  • flashinfer/fused_moe/core.py
    • Implemented a check and re-initialization for current_hidden_states_scale to ensure its size is correct during MXFP8 autotuning, preventing CUDA illegal memory access errors.
    • Removed a redundant comment from the tuner.choose_one call related to FP8 block-scale usage.
Activity
  • The author, charlotte12l, opened this pull request as an RFC (Request For Comments) to address a specific CUDA error during MXFP8 autotuning.
  • The PR description includes a detailed stack trace of the 'illegal memory access' error encountered when running vLLM with MXFP8 at max_batch_size=64.
  • The author has identified that the proposed fix resolves the error and maintains evaluation accuracy, but also noted a performance regression, inviting further discussion and review.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a CUDA error: an illegal memory access was encountered that occurs during autotuning for MXFP8 with small batch sizes. The fix correctly identifies that the scale tensor created by DynamicTensorSpec can be undersized and resizes it to the correct dimension, which effectively prevents the crash. My feedback focuses on improving this fix to address the performance regression mentioned in the pull request description. I suggest using more realistic random data for the recreated scale tensor instead of filling it with ones. This should help the autotuner select a more optimal and performant kernel.

Comment on lines +1168 to +1173

    if current_hidden_states_scale.numel() < sf_size:
        current_hidden_states_scale = torch.ones(
            (sf_size,),
            dtype=torch.uint8,
            device=hidden_states.device,
        )
Contributor


medium

This is a good catch to fix the illegal memory access error during autotuning. The logic to check and resize the current_hidden_states_scale tensor is correct.

However, as you noted, this fix may cause a performance regression. This is likely because filling the tensor with torch.ones does not provide realistic scale values for the autotuner. The autotuner might be selecting a suboptimal kernel based on this non-representative data.

To potentially resolve the performance regression, I suggest initializing the tensor with random data, which better simulates real-world scale values. This should help the autotuner find a more performant kernel.

Suggested change

-    if current_hidden_states_scale.numel() < sf_size:
-        current_hidden_states_scale = torch.ones(
-            (sf_size,),
-            dtype=torch.uint8,
-            device=hidden_states.device,
-        )
+    if current_hidden_states_scale.numel() < sf_size:
+        current_hidden_states_scale = torch.randint(
+            0, 256, (sf_size,),
+            dtype=torch.uint8,
+            device=hidden_states.device,
+        )

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
flashinfer/fused_moe/core.py (1)

1160-1173: UE8M0 scale value encoding: byte 1 ≠ 1.0 scale.

The fix correctly prevents out-of-bounds reads during autotuner profiling. However, torch.ones(..., dtype=torch.uint8) creates a tensor filled with byte value 1, which in UE8M0 format represents 2^(1-127) = 2^(-126) ≈ 0, not a scale of 1.0.

For UE8M0, byte value 127 represents 2^(127-127) = 1.0. While this may not cause crashes, using near-zero scales during profiling could affect tactic selection accuracy.

Suggested fix to use correct UE8M0 encoding for 1.0
                         if current_hidden_states_scale.numel() < sf_size:
-                            current_hidden_states_scale = torch.ones(
+                            current_hidden_states_scale = torch.full(
                                 (sf_size,),
+                                127,  # UE8M0 encoding for scale = 1.0
                                 dtype=torch.uint8,
                                 device=hidden_states.device,
                             )

Based on learnings: In FlashInfer's quantization code, torch.float8_e4m3fn is used as a carrier dtype for 1-byte scale factors (UE8M0, etc.) — the raw bytes are interpreted by C++ kernels according to the actual format semantics.
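The decode rule described in this comment can be checked numerically. A minimal sketch, assuming the bias-127 power-of-two interpretation of UE8M0 stated above (the helper name is invented):

```python
def ue8m0_to_float(byte: int) -> float:
    """Decode a UE8M0 scale byte as 2**(byte - 127), the bias-127
    exponent-only reading described in the review comment."""
    return 2.0 ** (byte - 127)

print(ue8m0_to_float(127))  # 1.0: the byte the comment suggests for unit scale
print(ue8m0_to_float(1))    # ~1.2e-38: the near-zero scale torch.ones yields
```

Under this reading, profiling with byte value 1 everywhere scales activations toward zero, which is why the comment argues it could skew tactic selection even though it cannot crash.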

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/core.py` around lines 1160 - 1173, The created padding
scale uses torch.ones(..., dtype=torch.uint8) which sets raw byte value 1 (not
1.0 in UE8M0); replace that with a uint8 tensor filled with the UE8M0 encoding
for 1.0 (byte value 127) so current_hidden_states_scale has correct 1.0 scale
bytes; update the creation site that constructs current_hidden_states_scale (the
torch.ones call using sf_size, dtype=torch.uint8, device=hidden_states.device)
to create a tensor filled with 127 instead (or use the established float8
carrier path used elsewhere) to ensure profiling uses true 1.0 scales.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efbff30c-41da-4fa7-aa5d-e6813d2ddae6

📥 Commits

Reviewing files that changed from the base of the PR and between e4dc66f and c22bf0e.

📒 Files selected for processing (1)
  • flashinfer/fused_moe/core.py

Comment on lines +1169 to +1173

    current_hidden_states_scale = torch.ones(
        (sf_size,),
        dtype=torch.uint8,
        device=hidden_states.device,
    )
Collaborator


Considering the e8m0 data, maybe use 126-127 (0.5-2).

Suggested change

-    current_hidden_states_scale = torch.ones(
-        (sf_size,),
-        dtype=torch.uint8,
-        device=hidden_states.device,
-    )
+    if current_hidden_states_scale.numel() < sf_size:
+        current_hidden_states_scale = torch.randint(
+            126, 128, (sf_size,),
+            dtype=torch.uint8,
+            device=hidden_states.device,
+        )
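For reference, torch.randint's upper bound is exclusive, so the range suggested here draws only bytes 126 and 127; under the same bias-127 UE8M0 decode assumption discussed in this thread, those correspond to scales of 0.5 and 1.0:

```python
# Bytes drawn by randint(126, 128, ...) and their assumed UE8M0 decodes.
decoded = {b: 2.0 ** (b - 127) for b in range(126, 128)}
print(decoded)  # {126: 0.5, 127: 1.0}
```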

@@ -1157,7 +1157,20 @@ def forward(
            )
        elif self.fp8_quantization_type == Fp8QuantizationType.MxFp8:
            current_hidden_states_scale = extra_inputs[0]
Collaborator


Suggested change

-    current_hidden_states_scale = extra_inputs[0]
+    current_hidden_states_scale = hidden_states_scale

aleozlx pushed a commit that referenced this pull request Mar 20, 2026
…2725)

## Summary
SM120 desktop Blackwell GPUs (RTX PRO 6000, RTX 5090) are blocked from
NVFP4 MoE grouped GEMM due to hardcoded SM100-only checks.

**Changes:**
- `jit/fused_moe.py`: Add major version 12 to `supported_major_versions`
- `csrc/trtllm_fused_moe_kernel_launcher.cu`: `ICHECK_EQ(major, 10)` ->
`ICHECK_GE(major, 10)`

**Benchmark** (Qwen3.5-397B on 4x RTX PRO 6000 SM120):
| Config | tok/s | Output |
|--------|-------|--------|
| compute_120f (CUDA 13.0) | 39.0 | Correct |
| compute_120a (CUDA 12.8) | 14.6 | Correct (slow fallback) |
| Marlin W4A16 | 46-49 | Correct |

**Root cause:** All TMA WS grouped GEMM autotuner tactics fail on
`compute_120a`, requiring `compute_120f` (CUDA 13.0).

CuTe DSL `admissible_archs` in vendored CUTLASS also needs
`sm_120a`/`sm_120f` (cpasync/copy.py, tcgen05/mma.py, arch/mbar.py,
etc).

Related: CUTLASS #2820, #2800; vLLM #33416, #33333; FlashInfer #2577

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Broadened GPU architecture checks to accept additional modern compute
capabilities (SM 10.x and 12.x), improving compatibility and clearer SM
reporting.
* Improved compute-capability detection and encoding, preserving
user-provided architecture suffixes and more accurately generating nvcc
architecture flags.
* Expanded JIT module generation to include additional CUDA majors so
fused-MoE kernels run on more recent GPUs.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Brandon Music <brandon.m.music@gmail.com>
Co-authored-by: Brandon Music <brandonmmusic-max@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Brandon Music <brandonmusic@pop-os.tail8674da.ts.net>