[None][feat] Support nvfp4 for gptoss #8956
Conversation
/bot run
📝 Walkthrough

This PR introduces NVFP4 quantization support for GPT-OSS models by extending the FP4 block-scale MoE kernel interfaces with optional bias and activation parameters, refactoring weight alignment and padding logic across the quantization utilities, and adding NVFP4-specific weight-loading paths with per-expert scale handling.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
319-338: Update top-k constraint indices to match the new FP4 inputs order

After adding the extra gemm1/gemm2 parameters, the positions of `topk_weights` and `topk_ids` in `FP4BlockScaleMoEInputs` (and in `input_tensors_for_tuner`) shifted to the end of the list:

- Index 16: `topk_weights`
- Index 17: `topk_ids`

`get_constraint_specs` is still using `TOPK_WEIGHTS_IDX = 11` and `TOPK_IDS_IDX = 12`, which now refer to `gemm2_weights_scale` and `gemm2_bias`. This mis-constrains those tensors based on the token count and leaves the actual top-k tensors unconstrained during tuning. Recommend updating the indices:

```diff
- ROUTER_LOGITS_IDX = 0
- CONSTRAINED_RL_DIM = 0
- TOPK_WEIGHTS_IDX = 11
- TOPK_IDS_IDX = 12
+ ROUTER_LOGITS_IDX = 0
+ CONSTRAINED_RL_DIM = 0
+ TOPK_WEIGHTS_IDX = 16
+ TOPK_IDS_IDX = 17
```

This restores the intended coupling between `num_tokens` and the top-k tensors in the autotuner.
🧹 Nitpick comments (13)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
464-490: Avoid mutating `mSupportedTileN` inside `getValidConfigs`

The WAR that `pop_back()`s tileN 256 when `hiddenSize == 3072` permanently removes 256 from `mSupportedTileN` for the lifetime of the runner instance. That means:

- Subsequent `getValidConfigs` calls for other hidden sizes will never consider tile 256, even if it would be valid.
- `mRunners` still contains a runner for tile 256 that will no longer be used.

Consider keeping `mSupportedTileN` immutable and instead:

- Build a local `std::vector<int32_t> supportedTileN = mSupportedTileN;` and conditionally remove 256 from that local copy, or
- Gate the WAR at the call site of `computeSelectedTileN` without mutating the member.

This keeps the WAR scoped to the problematic configuration without silently degrading tactics elsewhere.
Also applies to: 536-537
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
205-212: NVFP4 integration looks coherent; consider minor padding robustness

Overall the NVFP4 wiring in `TRTLLMGenFusedMoE` is consistent:

- `_check_configs` correctly treats nvfp4 / w4a8_nvfp4_fp8 as supported quant modes and restricts bias/swiglu usage to nvfp4 and mxfp4 variants only.
- `create_weights` auto-creates zero `w3_w1_bias` / `w2_bias` for nvfp4 and mxfp4 families when `bias == False`, ensuring the new C++ optional pointers are always valid without changing logical bias semantics.
- `_quantize_for_post_quant_comm`'s nvfp4 branch correctly:
  - Short-circuits when `x` is already an `Fp4QuantizedTensor` (no re-quantization, just wiring tensor + scaling factor).
  - Pads unquantized `x` up to `w3_w1_weight.shape[-1] * 2` only when `pad_size > 0` before calling `fp4_quantize`, avoiding negative-pad issues.
- The nvfp4 `forward_impl` branch mirrors the existing w4a16/mxfp4 pattern:
  - Handles the no-post-quant-comm path by padding + quantizing, and uses `intermediate_size_per_partition_padded = self.w3_w1_weight.shape[-2] // 2`.
  - Passes the new bias and swiglu parameters through to `fp4_block_scale_moe_runner`.
  - Defensively slices `final_hidden_states` to `hidden_size` if the kernel returns a padded last dimension.

One small robustness suggestion: in the nvfp4 `forward_impl` path for `not post_quant_comm` you unconditionally call

```python
pad_size = self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
x = torch.nn.functional.pad(x, (0, pad_size))
```

whereas `_quantize_for_post_quant_comm` guards this with `if pad_size > 0`. If any caller ever passes an `x` whose last dimension exceeds the padded hidden size, this will raise from `F.pad` instead of failing a clear assertion. For symmetry and clearer diagnostics, consider:

```python
pad_size = self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
if pad_size < 0:
    raise ValueError(
        f"Hidden dim ({x.shape[-1]}) exceeds NVFP4 padded size ({self.w3_w1_weight.shape[-1] * 2})")
elif pad_size > 0:
    x = torch.nn.functional.pad(x, (0, pad_size))
```

This keeps behavior aligned with `_quantize_for_post_quant_comm` while making any unexpected shape mismatch explicit.

Also applies to: 245-258, 281-313, 549-610
tensorrt_llm/_torch/models/modeling_gpt_oss.py (1)
610-619: Guard `quant_config` and clarify NVFP4 `exclude_modules` override semantics

Here you fully overwrite `quant_config.exclude_modules` when `quant_algo == "NVFP4"`, and then remap names via `params_map_reverse`. That seems intentional, but it also assumes `model_config.quant_config` is always non-None.

To make this code more robust and explicit, consider:

- Guarding against `quant_config is None` before dereferencing it.
- Documenting (or asserting) that for NVFP4 we intentionally ignore any user-provided `exclude_modules` instead of merging with them.

For example:

```diff
- quant_config = self.model_config.quant_config
- if quant_config.exclude_modules:
+ quant_config = self.model_config.quant_config
+ if quant_config is not None and quant_config.exclude_modules:
      if quant_config.quant_algo == "NVFP4":
          quant_config.exclude_modules = [
              "block.*.attn.qkv",
              "block.*.attn.out",
              "block.*.mlp.gate",
              "embedding",
              "unembedding",
          ]
```

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
496-525: Unused fake-kernel parameters are acceptable; consider underscore-prefix to appease Ruff

The `@fp4_block_scale_moe_runner.register_fake` stub must mirror the real op's signature, so the new `gemm1_*` and `gemm2_*` parameters are necessarily unused in the fake implementation. That's fine functionally.

If you want to silence Ruff's `ARG001` warnings without changing behaviour, you could underscore-prefix them, e.g.:

```diff
 def _(routing_logits,
-      routing_bias,
-      hidden_states,
-      hidden_states_scale,
-      gemm1_weights,
-      gemm1_weights_scale,
-      gemm1_bias,
-      gemm1_alpha,
-      gemm1_beta,
-      gemm1_clamp_limit,
-      gemm2_weights,
-      gemm2_weights_scale,
-      gemm2_bias,
+      routing_bias,
+      hidden_states,
+      hidden_states_scale,
+      gemm1_weights,
+      gemm1_weights_scale,
+      _gemm1_bias,
+      _gemm1_alpha,
+      _gemm1_beta,
+      _gemm1_clamp_limit,
+      _gemm2_weights,
+      _gemm2_weights_scale,
+      _gemm2_bias,
       output1_scale_scalar,
       ...
```

Purely cosmetic; no functional change.
tests/unittest/_torch/thop/serial/test_moe.py (1)
1163-1199: NVFP4 GPT-OSS style test and `run_moe_fp4_test` gptoss path look sound; tighten types and invariants

The new GPT-OSS style NVFP4 coverage and gptoss wiring are well done:

- `test_gptoss_style_nvfp4` parametrizes over `swiglu_alpha`, `swiglu_beta`, and `swiglu_limit` and drives `run_moe_fp4_test(..., gptoss_style=True, ...)`, giving decent coverage of the new activation options.
- In `run_moe_fp4_test`, when `gptoss_style` is true you:
  - Create per-expert `gemm1_bias` / `gemm2_bias` and per-expert `swiglu_*_tensor`.
  - Propagate them into `moe_args` and the fp4 reference path.
  - Reorder/shuffle biases alongside weights/scales.
  - Rescale `swiglu_beta`, `swiglu_limit`, and both biases with the appropriate global scales before calling `fp4_block_scale_moe_runner`, mirroring the existing mxe2m1 fp8 path.
  - Loosen tolerances only for the gptoss_style branch, leaving existing tests unchanged.

Two minor improvements:

1. Type hints for swiglu parameters (RUF013): the signature currently has plain `float` defaults set to `None`, which Ruff flags. You can clarify intent and satisfy lint by using explicit optionals:

```diff
-def run_moe_fp4_test(..., gptoss_style: bool = False, swiglu_alpha: float = None, swiglu_beta: float = None, swiglu_limit: float = None) -> None:
+def run_moe_fp4_test(..., gptoss_style: bool = False, swiglu_alpha: float | None = None, swiglu_beta: float | None = None, swiglu_limit: float | None = None) -> None:
```

2. Assert the `swiglu_*` parameters are provided when `gptoss_style` is enabled: today only `test_gptoss_style_nvfp4` calls `gptoss_style=True`, so the tensors are always set, but future callers could forget to pass them. A quick guard would make the contract explicit:

```python
if gptoss_style:
    assert swiglu_alpha is not None and swiglu_beta is not None and swiglu_limit is not None
    ...
```

Functionally, the new path and tolerances look correct.
Also applies to: 1263-1273, 1341-1373, 1530-1541, 1547-1557, 1589-1596, 1615-1627
tests/unittest/_torch/modules/test_fused_moe.py (2)
1348-1539: NVFP4 test runner is structurally sound; one subtle behavior to keep in mind

- The `run_fused_moe_nvfp4` helper cleanly centralizes NVFP4 test setup: per-expert weights, global/input scales, NVFP4 block scales, and backend selection for `create_moe`. The reuse by both generic NVFP4 tests and GPT-OSS style tests looks correct.
- For GPT-OSS style, per-expert `w1/w2/w3.bias` are created as `torch.float32` and passed with `bias=True`; the TRTLLM-Gen side uses `bias_dtype=torch.float32`, so this matches expectations.
- SwiGLU config is passed as:
  - Tensors (`swiglu_alpha_tensor`, etc., shape `(NUM_EXPERTS,)`) into `create_moe` for the fused path.
  - Scalar floats (`swiglu_alpha`, etc.) into `RefGatedMLPFusedMoE` for the reference path.
  This separation is reasonable as long as the model side treats module-level tensors vs ref-module scalars consistently.
- Tolerances split by style (`check_accuracy` with relaxed `(rtol=0.1, atol=0.1, percent=0.95)` for GPT-OSS vs strict `assert_close` for the non-GPT-OSS path) are a sensible compromise given different numerics.

No obvious correctness issues stand out in the setup; just ensure the backend weight loaders (CUTLASS/CUTEDSL/TRTLLM) all expect the same NVFP4 weight/scale key layout used here (`*.weight`, `*.weight_scale`, `*.input_scale`, `*.weight_scale_2`).

If you want an extra safety net, you could add a lightweight shape assertion block (e.g., assert that every `w*_sf_block_unswizzled` has `.shape == (INTERMEDIATE_SIZE, HIDDEN_SIZE // SCALING_VECTOR_SIZE)` / the analogous H dimension) right after quantization to catch accidental API changes in `fp4_quantize` early, as in the sketch below.
1565-1593: New NVFP4 tests cover both generic and GPT‑OSS paths appropriately
- `test_fused_moe_nvfp4` parametrizes `{dtype ∈ {fp16, bf16}, backend ∈ {TRTLLM, CUTLASS, CUTEDSL}}` and routes everything through `run_fused_moe_nvfp4`, which keeps logic DRY and aligned.
- The GPT-OSS-style test drives a realistic large configuration (`hidden_size/intermediate_size ∈ {512, 2880}`, `num_experts=32`, `top_k=4`, `seq_len=8192`) and sweeps `swiglu_alpha/beta/limit`; this should give good coverage of the new activation and scaling plumbing.

Given how heavy these configs are, you might want to keep an eye on runtime in CI, but there is no correctness issue visible in the test wiring.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (6)
143-152: `maybe_pad_for_weights` helper is correct for 1D/2D use; consider documenting rank expectations

- The padding logic is straightforward: the last dim is padded to `col_alignment`, the second-last to the optional `row_alignment`. This matches all current uses (2D weights/scales with both alignments; 1D biases/scales with column-only padding).
- For 1D tensors, the branch without `row_alignment` is used, so `F.pad(weight, (0, col_pad_size))` is safe.

To avoid accidental misuse later, it might be worth a short docstring stating this is intended for rank-1 or rank-2 tensors and that `row_alignment` must only be provided for rank-2 inputs.
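For readers without the diff handy, a minimal sketch with the semantics described above (an illustration only, not the PR's exact implementation):

```python
from typing import Optional

import torch
import torch.nn.functional as F


def maybe_pad_for_weights(weight: torch.Tensor,
                          col_alignment: int,
                          row_alignment: Optional[int] = None) -> torch.Tensor:
    """Pad the last dim up to col_alignment and, optionally, the second-last
    dim up to row_alignment. Intended for rank-1 or rank-2 tensors."""
    col_pad = -weight.shape[-1] % col_alignment
    if row_alignment is None:
        return F.pad(weight, (0, col_pad)) if col_pad else weight
    row_pad = -weight.shape[-2] % row_alignment
    if col_pad == 0 and row_pad == 0:
        return weight
    return F.pad(weight, (0, col_pad, 0, row_pad))
```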
1568-1644: `NVFP4FusedMoEMethod`: padded shapes and scale tensors align with vectorized layouts

- Local `round_up` plus `weight_alignment` / `input_hidden_alignment` yields `intermediate_size_per_partition_padded`, `w3_w1_hidden_size_padded`, and `w2_hidden_size_padded`, which are then consistently used for both weight and scale tensor shapes.
- Weight shapes divide the padded hidden dim by `weight_vec_size`, and block-scale shapes divide hidden/intermediate by `scaling_vector_size * block_scales_vec_size`, matching the "pack 16 FP4s into int64 / 4 FP8s into int32" comments.
- Passing `bias_dtype` down to `FusedMoEMethodBase.create_weights` allows backends to override bias precision without touching this core logic.

No issues spotted; this is a clean centralization of NVFP4 padding behavior.

If `round_up` is now duplicated across several classes (NVFP4 and MXFP4), you could later factor it into a small module-level helper to avoid repetition.
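A sketch of what such a shared helper and the padded-size arithmetic look like (the concrete sizes below are illustrative, not taken from the PR):

```python
def round_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# Illustrative numbers: a GPT-OSS-like intermediate size with 256-alignment.
intermediate_size_per_partition = 2880
weight_alignment = 256
intermediate_size_per_partition_padded = round_up(
    intermediate_size_per_partition, weight_alignment)   # -> 3072
```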
2068-2084: NVFP4 TRTLLM-Gen w2 path: bias normalization by `tp_size` is correct but implicit

- For 2D weights, you pad to `alignment // 2` in the packed dimension and `self.weight_alignment` in the other before sharding; for 1D biases, you:
  - pad to `self.weight_alignment`, and
  - divide by `module.tp_size` so that per-rank biases sum correctly after TP accumulation.

This matches the usual TP bias convention. It might be worth a brief comment near the division noting that the kernel performs "sum across TP ranks after applying per-rank bias", to avoid future confusion if someone changes the TP strategy.
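A toy check of that convention, assuming each rank adds its scaled bias locally and the partial outputs are then summed across ranks (all-reduce):

```python
import torch

tp_size = 4
bias = torch.randn(2880)

# Each rank applies bias / tp_size; the all-reduce sum then reconstructs the
# full bias exactly once in the final output.
per_rank_contributions = [bias / tp_size for _ in range(tp_size)]
reconstructed = torch.stack(per_rank_contributions).sum(dim=0)

torch.testing.assert_close(reconstructed, bias)
```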
2361-2421: `MXFP4WeightFusedMoEMethod`: alignment-aware weight/bias shapes look correct

- The new `create_weights` mirrors the NVFP4 logic:
  - pads intermediate and hidden dimensions to `weight_alignment` / `input_hidden_alignment`,
  - shapes packed weights by dividing padded dims by `weight_vec_size`,
  - shapes block scales according to `scaling_vector_size * block_scales_vec_size`.
- Explicit `w3_w1_bias_shape` and `w2_bias_shape` use the padded dims, ensuring biases align with the packed layouts.

This sets a clean, unified contract for all MXFP4 backends.

You might eventually want to share the `round_up` helper between the NVFP4 and MXFP4 base classes to avoid subtle divergence.
2835-2853: `MXFP4WeightTRTLLMGenFusedMoEMethod.post_load_weights`: lean shape computation uses the new helper correctly

- The proxy `w1_weight` of unpadded shape is padded with `maybe_pad_for_weights` using `input_hidden_alignment` and `_get_weight_alignment(...)`, then passed through `load_weight_shard(..., return_slice_indices=True)` to derive shard slice bounds.
- `intermediate_size_per_partition_lean` is computed by clamping the shard slices to the original (unpadded) dims, giving a per-shard "lean" intermediate size that kernels can use to avoid bandwidth waste on trailing padding.

This is a neat way to reuse the same alignment logic for both real and proxy weights without duplicating slice math.

Consider documenting `intermediate_size_per_partition_lean` in the module/class docstring, since it's a key contract for kernels.
336-355: `_get_weight_alignment` correctly enforces per-shard alignment constraints

- Starts from `lcm(weight_alignment, scaling_vector_size, tp_size)` to avoid fractional scaling factors.
- Computes a padded dimension that is a multiple of this alignment and derives the per-shard size.
- If the per-shard size is not a multiple of `weight_alignment`, it bumps `alignment` so that:
  - `padded_dim` remains a multiple of the new alignment, and
  - `per_shard` becomes a multiple of `weight_alignment`.

This matches the comments and fixes the "fractional number of scale blocks per shard" issue described in several docstrings and comments.

You might consider adding a tiny assert (or debug-only check) that the final `per_shard % weight_alignment == 0` before returning to guard future changes.
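A rough sketch of the constraint being enforced, under the assumptions above (this is not the PR's `_get_weight_alignment`, just an illustration of the bump logic):

```python
import math


def get_weight_alignment_sketch(weight_alignment: int,
                                scaling_vector_size: int,
                                tp_size: int) -> int:
    """Return an alignment such that a dim padded to it splits into tp_size
    shards whose size is a whole multiple of weight_alignment."""
    alignment = math.lcm(weight_alignment, scaling_vector_size, tp_size)
    per_shard_unit = alignment // tp_size
    if per_shard_unit % weight_alignment != 0:
        # Bump so every shard of the padded dim stays weight_alignment-aligned.
        alignment = math.lcm(alignment, weight_alignment * tp_size)
    return alignment
```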
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (8 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/models/modeling_gpt_oss.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (23 hunks)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunk)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunk)
- tests/unittest/_torch/modules/test_fused_moe.py (12 hunks)
- tests/unittest/_torch/thop/serial/test_moe.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tests/unittest/_torch/thop/serial/test_moe.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tests/unittest/_torch/thop/serial/test_moe.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
🧠 Learnings (18)
📓 Common learnings
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tests/unittest/_torch/modules/test_fused_moe.py
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/models/modeling_gpt_oss.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (4)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (5): has_nvfp4 (589-592), has_w4a16_mxfp4 (613-616), has_w4a8_mxfp4_fp8 (601-604), has_w4a8_mxfp4_mxfp8 (607-610), has_w4a8_nvfp4_fp8 (595-598)
- tensorrt_llm/_torch/modules/linear.py (3): has_nvfp4 (2008-2011), has_w4a8_mxfp4_fp8 (2038-2041), has_w4a8_nvfp4_fp8 (2032-2035)
- tensorrt_llm/_torch/utils.py (1): shape (131-132)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (4): top_k (275-275), n_group (276-276), topk_group (278-278), hidden_size (271-271)

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (8): num_experts (269-269), top_k (275-275), n_group (276-276), topk_group (278-278), intermediate_size (281-281), local_expert_offset (282-282), local_num_experts (283-283), do_finalize (304-305)
- tensorrt_llm/_torch/modules/fused_moe/routing.py (6): routing_method_type (180-181), routing_method_type (213-214), routing_method_type (226-227), routing_method_type (261-262), routing_method_type (281-282), routing_method_type (419-420)

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (3)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (8): num_experts (269-269), top_k (275-275), n_group (276-276), topk_group (278-278), intermediate_size (281-281), local_expert_offset (282-282), local_num_experts (283-283), do_finalize (304-305)
- cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2): topK (496-498), topK (588-590)
- cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1): topK (332-333)

tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (1): has_nvfp4 (589-592)
- tensorrt_llm/module.py (1): named_modules (102-114)

tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
- tests/unittest/_torch/modules/test_fused_moe.py (1): round_up (75-76)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2): create_weights (201-222), create_weights (391-432)

tests/unittest/_torch/thop/serial/test_moe.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2): intermediate_size (281-281), top_k (275-275)
- tensorrt_llm/functional.py (1): stack (1961-2007)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
503-503: Unused function argument: gemm1_bias
(ARG001)
504-504: Unused function argument: gemm1_alpha
(ARG001)
505-505: Unused function argument: gemm1_beta
(ARG001)
506-506: Unused function argument: gemm1_clamp_limit
(ARG001)
507-507: Unused function argument: gemm2_weights
(ARG001)
508-508: Unused function argument: gemm2_weights_scale
(ARG001)
509-509: Unused function argument: gemm2_bias
(ARG001)
tests/unittest/_torch/thop/serial/test_moe.py
1271-1271: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1272-1272: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1273-1273: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
264-270: GPT-OSS/20B-NVFP4 reference entries look consistent

The new GSM8K accuracy entries for `GPT-OSS/20B-NVFP4` follow the existing schema (baseline, quant_algo, kv_cache_quant_algo combinations) and align with other GPT-OSS blocks.

tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
785-883: NVFP4 MoE loader wiring looks consistent; please double-check checkpoint key invariants

The new `load_nvfp4_weights` path mirrors the HF loader structure and wires the NVFP4-specific MoE layout:

- Deinterleaving `gate_up_proj` / `gate_up_proj_weight_scale` via `deinterleave` and splitting per expert.
- Forcing MoE biases to fp32 for NVFP4 (`gate_up_bias` / `down_bias`).
- Passing through extra NVFP4 globals like `*_weight_scale_2` and `*_input_scale` untouched.

Two minor points to verify (no required change if invariants hold):

- This code assumes `gate_up_proj`, `down_proj`, `gate_up_proj_bias`, and `down_proj_bias` are always present for NVFP4 MoE: `deinterleave(gate_up)` and `gate_up_bias.dtype` will fail if any is missing. If some NVFP4 ckpts can omit these, guarding or using direct indexing (`[...]`) plus a clearer error may be safer.
- The deinterleave pattern for `gate_up_proj_weight_scale` (operating on the last dimension) should match the NVFP4 kernel's expected layout; it differs from the MXFP4 path in `load_hf_weights` (which deinterleaves along `dim=-2`), so it's worth re-confirming against the C++ runner layout.

Functionally the plumbing into `module.load_weights(weights=[moe_weights])` looks good.
634-647: I'll help you verify this review comment. Let me start by examining the actual code and understanding the context around the `load_weights` method.

```bash
# First, check the file size and then read the relevant section
wc -l tensorrt_llm/_torch/models/modeling_gpt_oss.py

# Read the code around lines 634-647
sed -n '620,660p' tensorrt_llm/_torch/models/modeling_gpt_oss.py | cat -n

# Search for model_config definition and quant_config
rg -n "class.Config|quant_config\s=" tensorrt_llm/_torch/models/modeling_gpt_oss.py | head -30

# Look for quant_config initialization and default values
rg -n "quant_config" tensorrt_llm/_torch/models/modeling_gpt_oss.py -A 2 -B 2 | head -40
```

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
151-167: FP4 BlockScale MoE NVFP4 parameter plumbing is coherent

The FP4 path now consistently threads the extra NVFP4 tensors:

- `FP4BlockScaleMoEInputs` includes `gemm1_bias/alpha/beta/clamp_limit` and `gemm2_bias`.
- `FP4BlockScaleMoERunner.forward` passes these directly to `kernel_runner.run_moe` in the expected order.
- The `fp4_block_scale_moe_runner` custom op signature and both `input_tensors_for_tuner` and the final `input_tensors` lists include the new tensors, keeping `topk_weights` / `topk_ids` as the last two entries so the `prepare_dummy_topk_and_hook` logic still applies.

This wiring looks internally consistent with the C++ runner expectations and the autotuner flow.
Also applies to: 225-247, 355-370, 414-459
tests/unittest/_torch/thop/serial/test_moe.py (1)
550-566: Reference paths correctly extended with bias/activation parameters

The extensions of `moe_args_dequant` and all the `run_moe_reference_*` helpers look consistent:

- `run_moe_reference_fp4` now forwards `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` into `moe_args_dequant`, which then feeds `run_moe_dequant`'s activation and Gemm2 bias logic.
- The mxe4m3/mxe2m1, e4m3/mxe2m1, and bf16/mxe2m1 reference builders also pass these new fields through, preserving existing behaviour where they were already set on `args`.

This keeps the Python reference implementations in sync with the new NVFP4 kernel features while maintaining backward compatibility for modes that don't use the extra fields.
Also applies to: 642-699
tests/unittest/_torch/modules/test_fused_moe.py (2)
14-14: Import of `torch.nn.functional as F` is appropriate and scoped correctly

The new `RefGatedMLPFusedMoE` activation path and helper functions rely on `F.silu`, so this import is necessary and consistent with other usages.

2685-2823: `RefGatedMLPFusedMoE`: SwiGLU integration and NVFP4 / W4A8-NVFP4 weight loading look coherent

- Extending `RefGatedMLPFusedMoE.__init__` with `bias` and `swiglu_*` matches how tests construct both fused and reference modules. The `bias` flag is correctly stored and used during `load_weights`.
- The `custom_swiglu` closure captures `swiglu_alpha/beta/limit` and handles:
  - optional clamping on `gate` and `value`,
  - default `alpha=1.0`, `beta=0.0`,
  - expert-wise activation via per-expert `GatedMLP(activation=...)`.
  This is consistent with the GPT-OSS-style activation behavior described in the PR summary.
- The new NVFP4 / `W4A8_NVFP4_FP8` branch in `load_weights` correctly:
  - propagates `weight_scale`, `input_scale`, and `weight_scale_2` to `gate_up_proj` and `down_proj`,
  - keeps existing FP8/FP8-block-scales paths untouched.

Assuming `GatedMLP`'s `activation` contract matches what `custom_swiglu` expects (which appears to be the case from the other code in this PR), this reference path should behave consistently with the fused kernels.

If you haven't already, it's worth sanity-checking one GPT-OSS config by directly comparing `GatedMLP` + `custom_swiglu` against the original GPT-OSS implementation (outside MoE) to confirm the per-token outputs match numerically within the same tolerances used in these tests.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (13)
218-237: Bias shape handling in `FusedMoEMethodBase.create_weights` is robust to padding

- Defaulting `w3_w1_bias_shape` and `w2_bias_shape` to the first two dims of the corresponding weight shapes ensures bias tensors inherit any padding applied to the weight layouts.
- Using `bias_dtype or module.dtype` correctly allows backends like TRTLLM-Gen to request `float32` bias even when activations/weights are lower precision.

This change looks sound and keeps all existing callers compatible.
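A condensed, hypothetical illustration of that defaulting pattern (the real `create_weights` has many more responsibilities; names mirror the review, not the PR's exact code):

```python
import torch


def create_bias_params(module, w3_w1_weight_shape, w2_weight_shape,
                       w3_w1_bias_shape=None, w2_bias_shape=None,
                       bias_dtype=None):
    """Illustrative only: biases default to the first two dims of the
    (possibly padded) weight shapes, so weight padding carries over."""
    w3_w1_bias_shape = w3_w1_bias_shape or tuple(w3_w1_weight_shape[:2])
    w2_bias_shape = w2_bias_shape or tuple(w2_weight_shape[:2])
    dtype = bias_dtype or module.dtype
    w3_w1_bias = torch.nn.Parameter(torch.zeros(w3_w1_bias_shape, dtype=dtype),
                                    requires_grad=False)
    w2_bias = torch.nn.Parameter(torch.zeros(w2_bias_shape, dtype=dtype),
                                 requires_grad=False)
    return w3_w1_bias, w2_bias
```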
1961-1984: `NVFP4TRTLLMGenFusedMoEMethod`: alignment parameters and bias dtype look correct

- Setting `weight_alignment = input_hidden_alignment = 256` and threading both into `create_weights` aligns with the comment about TRTLLM-Gen alignment requirements.
- Using `bias_dtype=torch.float32` matches how tests construct `w*_bias` and ensures high-precision accumulation for biases even with low-precision weights.

This setup seems consistent with both the kernel expectations and the new NVFP4 tests.
If kernel constraints change later (e.g., relaxed to 128‑alignment), updating these class attributes should be sufficient; worth keeping this in sync with the underlying TRT kernels’ documentation.
2125-2137: NVFP4 TRTLLM-Gen w3/w1 block-scale padding matches alignment logic

- Alignment is recomputed for the scale rows and `maybe_pad_for_weights` is applied with:
  - columns aligned to `input_hidden_alignment / scaling_vector_size`,
  - rows aligned to the same `alignment` used for weights.
- This mirrors the "pad before sharding" rationale used for MXFP4 and avoids fractional numbers of scaling vectors per shard.

Given how sensitive these layouts are, this is the right place to centralize the padding instead of sprinkling manual `F.pad` calls.

Consider adding a small unit test that feeds an adversarial `intermediate_size_per_partition` (e.g., 2880 with various `tp_size`) and asserts that per-shard scale rows are multiples of `weight_alignment` and `scaling_vector_size`.
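One possible shape for that test, reusing the alignment sketch above as a stand-in for the real `_get_weight_alignment` (constants and helper logic are illustrative):

```python
import math

import pytest


@pytest.mark.parametrize("tp_size", [1, 2, 4, 8])
def test_per_shard_scale_alignment(tp_size):
    weight_alignment = 256        # illustrative TRTLLM-Gen constraint
    scaling_vector_size = 16      # NVFP4 scale block size
    intermediate_size = 2880      # adversarial GPT-OSS-like value

    alignment = math.lcm(weight_alignment, scaling_vector_size, tp_size)
    if (alignment // tp_size) % weight_alignment != 0:
        # Mirror of the "bump the alignment" step described in the review.
        alignment = math.lcm(alignment, weight_alignment * tp_size)

    padded = (intermediate_size + alignment - 1) // alignment * alignment
    per_shard = padded // tp_size

    assert per_shard % weight_alignment == 0
    assert per_shard % scaling_vector_size == 0
```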
2195-2202: NVFP4 TRTLLM-Gen w2 block-scale padding is consistent with w3/w1

- Column alignment is computed as `alignment // module.scaling_vector_size`, with `alignment` including `tp_size` via `_get_weight_alignment`, so per-shard columns remain a multiple of `weight_alignment`.
- Row alignment reuses `self.weight_alignment`, so the sharded w2 scales line up with the padded w2 weights.

This keeps the scale layout coherent with the weight layout for GEMM2.
2245-2265: Post-quantization normalization of biases and SwiGLU params is logically consistent

- After `super().load_quant_scales`, you:
  - set `fc31_scale_c = fc2_input_scale * fc31_alpha`, and
  - normalize: `w3_w1_bias` by `fc31_alpha`, `w2_bias` by `fc2_alpha`, `swiglu_beta` and `swiglu_limit` by `fc31_alpha`.
- This matches the comment that kernels expect biases and SwiGLU parameters pre-scaled by the same global factors that apply to activations/weights.

Shape-wise this is safe as long as:

- `w*_bias` have shape `(expert_size_per_partition, *)`, and
- `swiglu_*` are 1D per-expert tensors,

which matches how tests construct them. No bug evident here; just rely on `create_moe` to always define `swiglu_beta` / `swiglu_limit` attributes (set to `None` when unused), which your `if ... is not None` checks assume.
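A compact sketch of the normalization being described, assuming `fc31_alpha`, `fc2_alpha`, and `fc2_input_scale` are tensors already registered on the module with broadcast-compatible shapes (names follow the review, not the PR's exact code, and the `getattr` guard suggested later in this review is folded in):

```python
import torch


@torch.no_grad()
def normalize_after_load_quant_scales(module) -> None:
    # Illustrative only: fold the global dequant scales into the bias and
    # SwiGLU parameters so the kernel can treat them as already scaled.
    module.fc31_scale_c.data.copy_(module.fc2_input_scale.data * module.fc31_alpha.data)
    module.w3_w1_bias.data.div_(module.fc31_alpha.data)
    module.w2_bias.data.div_(module.fc2_alpha.data)
    if getattr(module, "swiglu_beta", None) is not None:
        module.swiglu_beta.data.div_(module.fc31_alpha.data)
    if getattr(module, "swiglu_limit", None) is not None:
        module.swiglu_limit.data.div_(module.fc31_alpha.data)
```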
2539-2556: `MXFP4WeightCutlassFusedMoEMethod`: pre-shard padding matches alignment rationale

- For `w1/w3`, you pad:
  - columns to `weight_alignment // 2` (because two MXFP4s per byte),
  - rows to the `_get_weight_alignment(...)` result to keep per-shard rows a multiple of `weight_alignment`.
- For biases (1D), you pad to the same `alignment` to keep shard shapes consistent.

This implements the "pad before sharding to avoid fractional scaling factors per shard" comment accurately.
2582-2590: `MXFP4WeightCutlassFusedMoEMethod`: w2 padding is consistent with w1/w3

- `shard_w2_weight_dim` uses the packed K dimension (`2 * w2_weight.shape[1]` for 2D weights), and `_get_weight_alignment` is applied similarly.
- Padding uses `alignment // 2` in the packed dimension and `self.weight_alignment` in the other, with biases padded only on the single dimension.

This keeps both GEMM2 weights and biases aligned to the same constraints used for GEMM1.
2609-2615: `MXFP4WeightCutlassFusedMoEMethod`: scale padding mirrors weight padding correctly

- `w1_weight_scale` and `w3_weight_scale` are padded with:
  - columns aligned to `weight_alignment / scaling_vector_size`,
  - rows aligned to the same `_get_weight_alignment(...)` used for weights.
- After sharding and concatenation, you interleave via `block_scale_interleave`, which is consistent with the weight layout.

The same pattern is used for w2 scales below, so scale and weight layouts remain in sync.

A small unit test that checks per-shard scale tensors' last dimensions are multiples of `module.scaling_vector_size` and `weight_alignment` would make regressions here very obvious.
2890-2902: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w1/w3 padding matches TRTLLM-Gen alignment policy

- Uses the same `_get_weight_alignment` routine as Cutlass, but pads columns to `input_hidden_alignment // 2` (512/2 = 256) instead of `weight_alignment // 2`, maximizing kernel choice as per the comment.
- 1D biases are padded and converted to `float32`, matching TRTLLM-Gen's expectation for bias dtype.

This appears consistent with the stated intent for the TRTLLM-Gen MXFP4 path.
2955-2957: `MXFP4WeightTRTLLMGenFusedMoEMethod`: per-rank w2 bias scaling is explicit and correct

- For 1D w2 biases, you pad to `weight_alignment`, cast to `float32`, and divide by `tp_size`, which matches the TP convention that each rank applies its own bias before an all-reduce.

No issue here; the comment is clear enough to avoid misunderstandings about why the division is done.
2996-3004: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w3/w1 scale padding aligns with hidden alignment

- Scales are padded on:
  - columns to `input_hidden_alignment / scaling_vector_size`, and
  - rows to `_get_weight_alignment(...)`,
  mirroring the weight padding and ensuring integer numbers of scale vectors per shard.

This symmetry between weights and scales is exactly what you want for MXFP4 TRTLLM-Gen.
3066-3068: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w2 scale padding mirrors w2 weight alignment

- Column alignment is `alignment / scaling_vector_size`, row alignment is `weight_alignment`, so sharded w2 scales line up with the padded w2 weights and meet both TMA and kernel constraints.

Looks consistent with the rest of the MXFP4 alignment scheme.
2004-2043: NVFP4 TRTLLM-Gen w1/w3 weight padding and sharding logic is sound; dtype asserts are appropriate

The code correctly handles NVFP4 weights as packed `torch.uint8` tensors (two FP4 values per byte). Alignment computation via `_get_weight_alignment()` is applied correctly: 2D packed weights are padded on columns to `input_hidden_alignment // 2` and rows to `alignment`, while 1D biases are padded to `alignment` and cast to `float32`. Sharding via `load_weight_shard(..., TensorParallelMode.COLUMN, device=device)` followed by a write to the `dst_w3_w1_weight` halves is consistent with the packed weight format. The `torch.uint8` dtype assertions for 2D weights are appropriate and necessary, as standard NVFP4 quantization workflows (via TensorRT-LLM's QuantMode and model-optimizer) produce uint8-packed tensors by design.
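To make the "two FP4 values per byte" layout concrete, a small illustration of packing 4-bit codes into `torch.uint8` (this shows the general nibble-packing idea, not the exact code path used by the quantizer):

```python
import torch

# Eight 4-bit codes (values 0..15), packed pairwise into four uint8 bytes:
# the even-indexed code goes into the low nibble, the odd-indexed into the high.
codes = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.uint8)
packed = codes[0::2] | (codes[1::2] << 4)          # shape [4], dtype uint8

# Unpacking recovers the original codes, so a [E, N, K // 2] uint8 weight
# tensor really holds K FP4 values per row.
low = packed & 0x0F
high = packed >> 4
unpacked = torch.stack([low, high], dim=-1).flatten()
assert torch.equal(unpacked, codes)
```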
PR_Github #25789 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This PR adds NVFP4 quantization support to the MoE runtime, extending FP4 kernel interfaces with optional GEMM bias and SwiGlu gating parameters. It introduces weight padding and alignment utilities, implements NVFP4-specific model loading for GPT-OSS, and adds corresponding integration and unit tests to validate the new configuration pathways.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as Caller
participant Loader as load_nvfp4_weights
participant ExpertLoad as Expert Weight Loading
participant BiasLoad as Bias & Scale Loading
participant Dict as Weight Dict Construction
User->>Loader: weights Dict
Loader->>ExpertLoad: Per-expert gate_up_proj, down_proj
ExpertLoad->>ExpertLoad: Deinterleave gate/up weights
ExpertLoad->>BiasLoad: Extract per-expert biases
BiasLoad->>BiasLoad: Convert to FP32, apply scales
BiasLoad->>Dict: Accumulate expert weights & scales
Loader->>Dict: Handle qkv blocks & non-MoE modules
Dict-->>Loader: Per-expert weight structures
    Loader-->>User: Populated model state
```

```mermaid
sequenceDiagram
participant Test as Test Runner
participant MoE as Fused MoE Module
participant Quantize as Quantization Path
participant Padding as Padding & Alignment
participant Kernel as FP4 Kernel
Test->>MoE: forward(hidden_states, router_logits)
MoE->>Quantize: Quantize weights to FP4/NVFP4
Quantize->>Padding: maybe_pad_for_weights(weight_alignment)
Padding->>Padding: Round intermediate_size to alignment
Padding->>Quantize: Padded weight tensor
Quantize->>Kernel: Pass gemm1_bias, gemm1_alpha/beta/clamp_limit, gemm2_bias
Kernel->>Kernel: Apply per-expert SwiGlu gating
Kernel-->>MoE: MoE output
MoE->>Padding: Clip final_hidden_states to hidden_size
    Padding-->>Test: Output with consistent dimensions
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Key areas requiring extra attention:
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
321-331: Constraint indices for `topk_weights` / `topk_ids` in FP4 autotuning are out of sync with the new input layout.

After inserting the new GEMM1/GEMM2 tensors, `topk_weights_for_tuner` and `topk_ids_for_tuner` now sit at indices 16 and 17 in `input_tensors_for_tuner`, but `get_constraint_specs` still uses:

```python
TOPK_WEIGHTS_IDX = 11
TOPK_IDS_IDX = 12
```

With the current layout:

```python
input_tensors_for_tuner = [
    routing_logits_for_tuner,    # 0
    routing_bias,                # 1
    hidden_states,               # 2
    hidden_states_scale,         # 3
    gemm1_weights,               # 4
    gemm1_weights_scale,         # 5
    gemm1_bias,                  # 6
    gemm1_alpha,                 # 7
    gemm1_beta,                  # 8
    gemm1_clamp_limit,           # 9
    gemm2_weights,               # 10
    gemm2_weights_scale,         # 11  currently treated as TOPK_WEIGHTS
    gemm2_bias,                  # 12  currently treated as TOPK_IDS
    output1_scale_scalar,        # 13
    output1_scale_gate_scalar,   # 14
    output2_scale_scalar,        # 15
    topk_weights_for_tuner,      # 16
    topk_ids_for_tuner,          # 17
]
```

the token-count constraint specs are being applied to `gemm2_weights_scale` and `gemm2_bias` instead of to the `topk_*` tensors. This can distort shapes for those weight tensors during tuning and leaves `topk_*` unconstrained (relying only on the pre-hook to track `num_tokens`). You probably want:

```diff
- TOPK_WEIGHTS_IDX = 11
- TOPK_IDS_IDX = 12
+ TOPK_WEIGHTS_IDX = 16
+ TOPK_IDS_IDX = 17
```

to match the updated FP4 input ordering.
Also applies to: 415-434
🧹 Nitpick comments (10)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
3888-3925: GPT-OSS NVFP4 2-GPU test wiring is fine; confirm skip marker semantics and placeholder path before enabling

The new `test_w4_2gpus_nvfp4` mirrors the existing W4 2-GPU tests and correctly drives the GSM8K flow through the NVFP4 checkpoint once enabled. Two minor points before you eventually drop the in-body `pytest.skip`:

- With the class already gated to Blackwell SMs, adding `@pytest.mark.skip_blackwell` means this test will be skipped on the only architecture where NVFP4 is expected to run. Please double-check that this marker's behavior matches your intent.
- Consider switching `"./nvfp4ckpt"` to a `llm_models_root()`-based path (like the other GPT-OSS tests) so it's aligned with CI model layout when the real checkpoint is hooked up.
1163-1199: GPT‑OSS‑style NVFP4 MoE path and SwiGLU epilogue wiring look sound; consider a couple of small robustness and lint fixesThe new GPT‑OSS‑style NVFP4 test plus the
run_moe_fp4_testextensions appear logically consistent:
test_gptoss_style_nvfp4drives the NVFP4 MoE kernel with per‑expert SwiGLUalpha/beta/limitand per‑expert biases, and the reference side gets the unscaled parameters viamoe_args/run_moe_reference_fp4before you rescale them for the quantized kernel call.- Bias and SwiGLU data are shaped, interleaved (
reorder_rows_for_gated_act_gemm), and shuffled (shuffle_matrix_a/shuffle_matrix_sf_a) in the same way as the existing mxe2m1 path later in this file, so layout handling looks consistent.- Relaxed tolerances when
gptoss_style=True(higheratol/rtol, lowerpercent) are reasonable given the added non‑linearity and quantization noise.A few minor suggestions:
run_moe_fp4_testtreatsswiglu_alpha,swiglu_beta, andswiglu_limitas required whenevergptoss_style=True, but the signature allows them to beNone. Adding a small guard, e.g. assertions that these are notNonewhengptoss_styleis enabled, would fail fast if the helper is misused.- Ruff’s RUF013 warning about implicit
Optionalon lines 1271–1273 stems from annotating these asfloatwhile defaulting toNone. If you care about keeping linters quiet, consider changing the annotations tofloat | None/Optional[float]or dropping the type annotation on these kwargs in this test helper. (As per static analysis hints.)- When
gptoss_styleisFalse, all ofgemm1_bias_shuffled,gemm2_bias_shuffled, and the SwiGLU tensors areNoneat thefp4_block_scale_moe_runnercall site. Please double‑check the custom op’s Python binding indeed declares these as optional arguments and that older call sites don’t rely on positional ordering that could be broken by the new parameters.Also applies to: 1263-1273, 1341-1373, 1428-1450, 1463-1517, 1530-1557, 1589-1595, 1614-1627
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
143-152: `maybe_pad_for_weights` helper looks correct; consider guarding assumptions.

The padding helper correctly handles 1D/2D (and 3D via last two dims) and is reused consistently across NVFP4/MXFP4 paths. To make it more robust long-term, you might:

- Assert `col_alignment > 0` (and `row_alignment > 0` when provided) to avoid accidental modulo-by-zero.
- Optionally assert `weight.dim() <= 3` or document that only 1D–3D tensors are supported, since `F.pad` with a 4-element pad will fail for 4D+ inputs.

A sketch of both guards follows this list.
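A minimal sketch of the guards, assuming the parameter names quoted in this review (`weight`, `col_alignment`, optional `row_alignment`):

```python
from typing import Optional

import torch


def _validate_pad_args(weight: torch.Tensor,
                       col_alignment: int,
                       row_alignment: Optional[int] = None) -> None:
    """Guards suggested above; intended to run at the top of maybe_pad_for_weights."""
    assert col_alignment > 0, "col_alignment must be positive"
    if row_alignment is not None:
        assert row_alignment > 0, "row_alignment must be positive"
    assert weight.dim() <= 3, "only 1D-3D tensors are supported (F.pad uses a 4-element pad)"
```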
1568-1598: NVFP4 base `create_weights` alignment/padding is coherent with vectorization requirements.

The new alignment parameters and padded shapes for `w3_w1_weight`, `w2_weight`, and their block scales are internally consistent and match the later use of `_get_weight_alignment` and packed FP4/fp8 layouts. Two minor nits:

- The comment "Divide by 4 because we use int32 to pack 4 fp8 values" is no longer accurate for the TRTLLM-Gen path, where `block_scales_vec_size` can be `1`; consider rephrasing this to describe the general `block_scales_vec_size` logic instead of hard-coding "4".
- If these padded dimensions are semantically important to downstream code, exposing them as explicit module attributes (e.g., `module.intermediate_size_per_partition_padded`) could make debugging easier, but that's optional.

Also applies to: 1603-1619, 1639-1643
2125-2137: Scale padding and post-load normalization look correct; guard `swiglu_*` attributes for safety.

Padding `w1`/`w3`/`w2` block scales with the same alignment scheme as the weights before sharding, then applying permutation + `block_scale_interleave`, matches the NVFP4 kernel expectations and mirrors the MXFP4 path.

The later normalization:

- `fc31_scale_c = fc2_input_scale * fc31_alpha`
- dividing `w3_w1_bias` and `w2_bias` by `fc31_alpha`/`fc2_alpha`
- scaling `swiglu_beta`/`swiglu_limit` by `1 / fc31_alpha`

is consistent with pushing global scale factors into the parameters so the kernel can assume unscaled biases/activations (sketched below).
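A schematic restatement of that normalization, not the PR's exact code; the attribute names come from the review text and per-expert broadcasting/reshaping details are elided:

```python
import torch


def fold_global_scales(module: torch.nn.Module) -> None:
    """Schematic: push global quantization scales into biases/SwiGLU constants post-load."""
    # Broadcasting/reshaping of the per-expert scale tensors is elided in this sketch.
    module.fc31_scale_c.data.copy_(module.fc2_input_scale.data * module.fc31_alpha.data)
    module.w3_w1_bias.data.div_(module.fc31_alpha.data)
    module.w2_bias.data.div_(module.fc2_alpha.data)
    module.swiglu_beta.data.div_(module.fc31_alpha.data)
    module.swiglu_limit.data.div_(module.fc31_alpha.data)
```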
One robustness suggestion:
`module.swiglu_beta` / `module.swiglu_limit` are accessed directly and only checked for `is not None`, which will raise `AttributeError` if the attributes are absent. To make this method safer for other modules that might reuse `NVFP4TRTLLMGenFusedMoEMethod`, you could use `getattr`:

```python
swiglu_beta = getattr(module, "swiglu_beta", None)
if swiglu_beta is not None:
    swiglu_beta.data.div_(module.fc31_alpha.data)
swiglu_limit = getattr(module, "swiglu_limit", None)
if swiglu_limit is not None:
    swiglu_limit.data.div_(module.fc31_alpha.data)
```

Please double-check that all current users of `NVFP4TRTLLMGenFusedMoEMethod` always define `swiglu_beta` and `swiglu_limit`; if not, the current code will raise at load time.

Also applies to: 2195-2202, 2246-2265
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
464-471: `mSupportedTileN` WAR is global & persistent; consider avoiding in-place mutation.

The hidden-size-specific WAR:

```cpp
if (hiddenSize == 3072 && mSupportedTileN.back() == 256)
{
    mSupportedTileN.pop_back();
}
```

now mutates `mSupportedTileN` on the shared `FP4BlockScaleMoeRunner` instance. Once a 3072-hidden-size model is seen, tileN `256` is dropped for all subsequent calls, including other hidden sizes where `256` might be valid and beneficial.

A less surprising approach would be to keep `mSupportedTileN` immutable and apply the WAR to a local copy, e.g.:

```cpp
auto supportedTileN = mSupportedTileN;
if (hiddenSize == 3072)
{
    supportedTileN.erase(
        std::remove(supportedTileN.begin(), supportedTileN.end(), 256), supportedTileN.end());
}
// use supportedTileN for computeSelectedTileN and iteration
```

This preserves the workaround for the problematic hidden size without globally constraining other configurations.
Please confirm whether the intent is to disable tileN=256 only for GPT‑OSS 3072 or for all models once such a model is encountered in the process.
Also applies to: 536-537
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
503-509: Fake FP4 op has intentionally unused new arguments; consider silencing Ruff ARG001.

The `@fp4_block_scale_moe_runner.register_fake` implementation ignores the new GEMM1/GEMM2 parameters, which is fine for a fake kernel that only needs to synthesize shapes, but Ruff now flags them as unused.

If you want to keep lint clean without changing behavior, you can prefix these parameters with `_` in the fake function's signature, e.g.:

```python
def _(
    routing_logits,
    routing_bias,
    hidden_states,
    hidden_states_scale,
    gemm1_weights,
    gemm1_weights_scale,
    _gemm1_bias,
    _gemm1_alpha,
    _gemm1_beta,
    _gemm1_clamp_limit,
    gemm2_weights,
    gemm2_weights_scale,
    _gemm2_bias,
    ...
):
    ...
```

This keeps the positional API intact while satisfying the linter.
tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
806-811: Local function `deinterleave` is well-defined but could benefit from a brief docstring.

The function performs gate/up deinterleaving correctly. Consider adding a brief comment explaining the interleaved format being handled, for example:
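A hedged sketch of what such a docstring could look like; the exact signature and layout details belong to the PR's helper, not this comment:

```python
def deinterleave(weight):
    """Split gate/up values that the checkpoint stores interleaved along one dimension.

    Sketch only: the real helper defines the actual axis and chunking used for the
    interleaved gate/up layout.
    """
    ...
```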
785-882: Significant code duplication between `load_nvfp4_weights` and `load_hf_weights`.

The non-MoE weight loading logic (lines 861-882) is nearly identical to `load_hf_weights` (lines 762-783). Consider extracting the common logic into a helper method to reduce duplication and maintenance burden.

Example refactor approach:

```python
def _load_non_moe_weights(self, name, module, module_weights, weights):
    """Common weight loading logic for non-MoE modules."""
    if hasattr(module, "load_weights"):
        if 'qkv' in name:
            q_weight_bias = filter_weights(
                name.replace('qkv_proj', 'q_proj'), weights)
            k_weight_bias = filter_weights(
                name.replace('qkv_proj', 'k_proj'), weights)
            v_weight_bias = filter_weights(
                name.replace('qkv_proj', 'v_proj'), weights)
            module.load_weights(
                weights=[q_weight_bias, k_weight_bias, v_weight_bias])
        else:
            module.load_weights(weights=[module_weights])
    else:
        if 'next_layer_layernorm' in name:
            return
        for n, p in module._parameters.items():
            if p is not None:
                p.data.copy_(module_weights[n][:])
```

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
554-555: Duplicate padding logic should use a helper function.

The same padding calculation appears in both `_quantize_for_post_quant_comm` (lines 288-290) and `forward_impl` (lines 554-555). Consider extracting it to a helper method for consistency:

```python
def _get_nvfp4_input_pad_size(self, x: torch.Tensor) -> int:
    return self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (8 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/models/modeling_gpt_oss.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (23 hunks)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (12 hunks)
- tests/unittest/_torch/thop/serial/test_moe.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/models/modeling_gpt_oss.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/models/modeling_gpt_oss.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`If` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
🧠 Learnings (18)
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/models/modeling_gpt_oss.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
- tests/unittest/_torch/thop/serial/test_moe.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/models/modeling_gpt_oss.py (4)
tensorrt_llm/models/modeling_utils.py (1)
- quant_algo (550-551)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
- has_nvfp4 (589-592)
tensorrt_llm/_torch/modules/linear.py (1)
- has_nvfp4 (2008-2011)
tensorrt_llm/_torch/models/modeling_hunyuan_moe.py (1)
- filter_weights (347-353)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (5)
- has_nvfp4 (589-592)
- has_w4a16_mxfp4 (613-616)
- has_w4a8_mxfp4_fp8 (601-604)
- has_w4a8_mxfp4_mxfp8 (607-610)
- has_w4a8_nvfp4_fp8 (595-598)
tensorrt_llm/_torch/utils.py (1)
- shape (131-132)
tensorrt_llm/functional.py (1)
- pad (1327-1398)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (7)
tensorrt_llm/llmapi/llm_args.py (1)
- ep_size (318-322)
tensorrt_llm/_torch/models/modeling_llama.py (1)
- dtype (1098-1099)
tensorrt_llm/_utils.py (2)
- dtype (985-986)
- dtype (993-1003)
tensorrt_llm/llmapi/llm.py (1)
- LLM (1104-1120)
tensorrt_llm/_torch/auto_deploy/llm.py (1)
- LLM (103-152)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
- max_seq_len (636-646)
- max_seq_len (649-653)
tensorrt_llm/evaluate/lm_eval.py (2)
- GSM8K (455-510)
- evaluate (395-430)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
503-503: Unused function argument: gemm1_bias
(ARG001)
504-504: Unused function argument: gemm1_alpha
(ARG001)
505-505: Unused function argument: gemm1_beta
(ARG001)
506-506: Unused function argument: gemm1_clamp_limit
(ARG001)
507-507: Unused function argument: gemm2_weights
(ARG001)
508-508: Unused function argument: gemm2_weights_scale
(ARG001)
509-509: Unused function argument: gemm2_bias
(ARG001)
tests/unittest/_torch/thop/serial/test_moe.py
1271-1271: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1272-1272: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1273-1273: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (26)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
264-270: Reference entry for NVFP4 is properly structured and consistent with the file.

The new GPT-OSS/20B-NVFP4 entry follows the established pattern in the file, with appropriate configurations for baseline, NVFP4 quantization, and combined NVFP4 + FP8 KV cache quantization. YAML syntax is correct.
Verify that the accuracy reference values (85.0 for all configurations) match the actual integration test results for the NVFP4 implementation. These reference values will be used by the test framework to validate the new NVFP4 quantization pathway.
tests/unittest/_torch/thop/serial/test_moe.py (1)
550-566: FP4 reference now correctly sees bias and SwiGLU parameters

Forwarding `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` into `moe_args_dequant` ensures the FP4 reference path exercises the same epilogue (bias + SwiGLU gating) as the kernel. The wiring and argument ordering look consistent with the class definition and `run_moe_dequant` usage.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
220-225: Bias shape defaults correctly follow (possibly padded) weight shapes.

Using `w3_w1_weight_shape[:2]` and `w2_weight_shape[:2]` when explicit bias shapes are not provided is the right choice now that the weight dimensions can be padded; it keeps biases compatible with the actual allocated tensors.
2004-2024: NVFP4 TRTLLM-Gen weight/bias padding before sharding is aligned with scaling-factor logic.

Padding `w1`/`w3` (weights and biases) and `w2` (weights and biases) with `maybe_pad_for_weights` using `_get_weight_alignment` before `load_weight_shard` is consistent with the MXFP4 path and avoids fractional scaling vectors per shard. The 1D-bias branches (casting to `float32` and dividing the `w2` bias by `tp_size`) also match the intended row-parallel semantics.

Also applies to: 2068-2084
2543-2556: Consolidated MXFP4 padding via `maybe_pad_for_weights` is clean and behavior-preserving.

Refactoring the MXFP4 Cutlass/TRTLLM-Gen load paths to use `maybe_pad_for_weights` with `_get_weight_alignment` for both weights and scales removes duplicated padding logic and keeps the MXFP4 alignment rules in sync with the NVFP4 implementation. The chosen column/row alignments (accounting for packed MXFP4 elements and TP sharding) look consistent with the comments and previous behavior.

Also applies to: 2584-2590, 2610-2615, 2653-2655, 2835-2837, 2889-2896, 2901-2902, 2948-2950, 2955-2957, 2996-3003, 3067-3069
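For readers unfamiliar with the helper, a minimal reimplementation sketch of the padding behaviour this review describes (an assumption-laden stand-in, not the PR's `maybe_pad_for_weights`; the real helper also handles 1D tensors and takes its alignments from `_get_weight_alignment`):

```python
import torch
import torch.nn.functional as F


def pad_like_maybe_pad_for_weights(weight: torch.Tensor, col_alignment: int,
                                   row_alignment: int = 1) -> torch.Tensor:
    """Zero-pad the last dim (and optionally the second-to-last) up to the given alignments."""
    pad_cols = (-weight.shape[-1]) % col_alignment
    pad_rows = (-weight.shape[-2]) % row_alignment
    # F.pad pads from the last dimension backwards: (left, right, top, bottom).
    return F.pad(weight, (0, pad_cols, 0, pad_rows))
```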
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)
32-45: New GEMM1/GEMM2 optional parameters are validated and wired correctly.

The extended runner signature and arg wiring for `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` look consistent:

- Dtype/shape checks match kernel expectations: `[local_num_experts, 2 * intermediate_size]` for `gemm1_bias`, `[local_num_experts]` for the `gemm1_*` vectors, and `[local_num_experts, hidden_size]` for `gemm2_bias`.
- Pointers are stored as `float*` and defaulted to `nullptr` when the optionals are empty, matching the existing pattern for other args.

No functional issues spotted here.
Also applies to: 162-171, 321-351, 362-369
492-505: Python–C++ runner interop for new FP4 parameters is consistent.

The updated `FP4BlockScaleMoeRunner::run` and `FP8FP4BlockScaleMoeRunner::run` signatures, and their calls into `run_fp4_block_scale_moe_runner`, correctly pass the new optional GEMM arguments (or `std::nullopt` where not applicable) in a way that matches the Python bindings.

Also applies to: 525-530, 611-616
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
151-167: End-to-end wiring of new FP4 GEMM bias/alpha/beta/clamp_limit params is consistent.

The FP4 path now cleanly threads `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias`:

- Dataclass `FP4BlockScaleMoEInputs` matches the positional order used in `FP4BlockScaleMoERunner.forward`.
- Both `input_tensors_for_tuner` and the final `input_tensors` lists keep the same ordering, so autotuning and execution see identical argument layouts.
- The updated Python custom op signature aligns with the C++ runner's `run` method.

This looks correct and keeps the Python/C++ interfaces in sync.
Also applies to: 235-247, 355-384, 415-434, 445-464
tensorrt_llm/_torch/models/modeling_gpt_oss.py (5)
612-619: Exclusion modules list looks correct for NVFP4.

The modules excluded from quantization (QKV, attention output, MLP gate, embedding, unembedding) align with typical NVFP4 configurations where attention and embedding layers remain in higher precision.

640-645: NVFP4 loading path correctly prioritized over HF loading.

The branching logic correctly routes to `load_nvfp4_weights` when `is_nvfp4` is true and the model is not an original model format.

813-817: FP32 bias conversion is appropriate for NVFP4 MoE.

The explicit conversion of biases to FP32 aligns with the comment and the NVFP4 MoE requirements.

850-858: Global scale parameters correctly propagated for NVFP4.

The module-level globals for NVFP4 loaders (weight_scale_2, input_scale) are correctly copied from module_weights to moe_weights.

798-800: Assertion may fail unexpectedly if quant_config is None.

The assertion at line 799-800 accesses `module.quant_config.quant_mode.has_nvfp4()` without first checking if `module.quant_config` is not None. If the module doesn't have a quant_config, this will raise an `AttributeError` rather than the intended `AssertionError`.

Consider guarding the assertion:

```diff
     if isinstance(module, MoE):
-        assert getattr(module, "quant_config", None) is not None and \
-            module.quant_config.quant_mode.has_nvfp4()
+        assert getattr(module, "quant_config", None) is not None, \
+            "MoE module must have quant_config for NVFP4 loading"
+        assert module.quant_config.quant_mode.has_nvfp4(), \
+            "MoE module quant_config must have NVFP4 mode"
```

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (5)
210-211: Bias/swiglu assertion correctly extended for NVFP4.

The assertion now properly includes `self.has_nvfp4` alongside other MXFP4 variants that support bias/swiglu parameters.

245-257: Zero bias initialization for FP4 variants without explicit bias is appropriate.

When bias is not provided but the quantization method requires bias parameters, creating zero-filled FP32 bias tensors ensures the kernel interface is satisfied without affecting computation.
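A small illustrative sketch of the described fallback; the expert count and sizes below are made up, and the real shapes come from the module configuration:

```python
import torch

num_experts, intermediate_size, hidden_size = 8, 2880, 2880  # illustrative values only
w3_w1_bias = torch.zeros(num_experts, 2 * intermediate_size, dtype=torch.float32)
w2_bias = torch.zeros(num_experts, hidden_size, dtype=torch.float32)
```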
288-295: Padding calculation and application for NVFP4 quantization looks correct.

The padding ensures input dimensions align with the expected weight dimensions before quantization. The formula `self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]` correctly calculates the required padding based on the packed weight shape (see the arithmetic sketch below).
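Illustrative arithmetic only; the hidden size and padded width below are assumptions, not values from the PR. Since two FP4 values are packed per byte, the weight's last dimension is half of the padded width, so the formula recovers how many input columns need padding:

```python
hidden_size = 2880          # assumed unpadded activation width
packed_last_dim = 1472      # hypothetical padded width / 2 (two FP4 values per byte)
pad = packed_last_dim * 2 - hidden_size
print(pad)                  # -> 64 columns of zero padding before quantization
```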
566-590: New parameters correctly passed to fp4_block_scale_moe_runner.

The addition of `w3_w1_bias`, `swiglu_alpha`/`beta`/`limit`, `w2_bias`, and `intermediate_size_per_partition_padded` aligns with the extended NVFP4 kernel interface.

606-609: Output clipping ensures consistent dimensionality.

The clipping of `final_hidden_states` to `self.hidden_size` correctly handles cases where padding was applied during computation, ensuring the output matches the expected model hidden size.
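A tiny self-contained illustration of the clipping described above, with made-up sizes:

```python
import torch

hidden_size = 2880                          # unpadded model dimension (illustrative)
final_hidden_states = torch.zeros(4, 2944)  # hypothetical padded kernel output
final_hidden_states = final_hidden_states[..., :hidden_size]  # drop padded columns
assert final_hidden_states.shape[-1] == hidden_size
```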
tests/unittest/_torch/modules/test_fused_moe.py (8)

14-14: Import added appropriately for F.silu usage.

The `torch.nn.functional as F` import supports the new activation logic in `RefGatedMLPFusedMoE`.

1348-1358: Function signature refactored with sensible defaults.

The new parameters for `run_fused_moe_nvfp4` provide flexibility for testing different configurations while maintaining backward compatibility through default values.

1407-1419: Bias initialization for GPT-OSS style correctly added.

The FP32 bias tensors are created with appropriate shapes when `gptoss_style=True`, matching the NVFP4 MoE requirements.

1425-1438: Weight quantization updated to use unswizzled scales.

The change from `fp4_quantize` with swizzling to explicitly unswizzled scales (using `False` for is_scale_factor_swizzled) aligns with the NVFP4 loader expectations.
1523-1539: Tolerance values appropriately relaxed for GPT-OSS style.

The 95% match threshold with 0.1 rtol/atol is reasonable for NVFP4 quantized models with custom activation functions, given the accumulation of quantization errors.

1574-1593: Test parameterization covers key GPT-OSS configurations.

The test covers:
- Two hidden/intermediate sizes (512, 2880)
- Two swiglu_alpha values (1, 0.1)
- Two swiglu_beta values (0, 1)
- Two swiglu_limit values (inf, 1)
This provides good coverage of the activation function behavior.
2728-2730: Conditional activation selection is clean and readable.

Using the ternary to select between `custom_swiglu` and `F.silu` based on `swiglu_alpha` presence is clear and maintainable.
2707-2718: I'll help you verify whether the `custom_swiglu` implementation matches the production kernels. Let me search for the production implementation to compare the clamping order.

```bash
#!/bin/bash
# Search for swiglu implementations in C++ kernels
# Looking for swiglu and clamping patterns in production code
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.cu" \) -path "*/kernels/*" | head -20
```

```bash
#!/bin/bash
# Search for swiglu references in the codebase
rg -n "swiglu" --type cpp --type c -i -C 5 2>/dev/null | head -100
```
c4e53d3 to
7de4bdd
Compare
|
/bot run |
|
PR_Github #30374 [ run ] triggered by Bot. Commit: |
|
PR_Github #30374 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30379 [ run ] triggered by Bot. Commit: |
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30380 [ run ] triggered by Bot. Commit: |
|
PR_Github #30380 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30381 [ run ] triggered by Bot. Commit: |
|
PR_Github #30381 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30384 [ run ] triggered by Bot. Commit: |
|
PR_Github #30384 [ run ] completed with state |
08d1762 to
1045e6f
Compare
|
/bot run |
|
PR_Github #30389 [ run ] triggered by Bot. Commit: |
|
PR_Github #30389 [ run ] completed with state
|
Signed-off-by: Dongfeng Yu <[email protected]>
Signed-off-by: Dongfeng Yu <[email protected]>
1045e6f to
4082577
Compare
|
/bot run |
|
PR_Github #30417 [ run ] triggered by Bot. Commit: |
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
`docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.