[None][feat] Support nvfp4 for gptoss #8956
Conversation
/bot run
📝 Walkthrough

This PR introduces NVFP4 quantization support for GPT-OSS models by extending the FP4 block-scale MoE kernel interfaces with optional bias and activation parameters, refactoring weight alignment and padding logic across the quantization utilities, and adding NVFP4-specific weight-loading paths with per-expert scale handling.

Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
319-338: Update top-k constraint indices to match the new FP4 inputs order

After adding the extra gemm1/gemm2 parameters, the positions of `topk_weights` and `topk_ids` in `FP4BlockScaleMoEInputs` (and in `input_tensors_for_tuner`) shifted to the end of the list:

- Index 16: `topk_weights`
- Index 17: `topk_ids`

`get_constraint_specs` is still using `TOPK_WEIGHTS_IDX = 11` and `TOPK_IDS_IDX = 12`, which now refer to `gemm2_weights_scale` and `gemm2_bias`. This mis-constrains those tensors based on the token count and leaves the actual top-k tensors unconstrained during tuning. Recommend updating the indices:

```diff
- ROUTER_LOGITS_IDX = 0
- CONSTRAINED_RL_DIM = 0
- TOPK_WEIGHTS_IDX = 11
- TOPK_IDS_IDX = 12
+ ROUTER_LOGITS_IDX = 0
+ CONSTRAINED_RL_DIM = 0
+ TOPK_WEIGHTS_IDX = 16
+ TOPK_IDS_IDX = 17
```

This restores the intended coupling between `num_tokens` and the top-k tensors in the autotuner.
🧹 Nitpick comments (13)
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
464-490: Avoid mutating `mSupportedTileN` inside `getValidConfigs`

The WAR that `pop_back()`s tileN 256 when `hiddenSize == 3072` permanently removes 256 from `mSupportedTileN` for the lifetime of the runner instance. That means:

- Subsequent `getValidConfigs` calls for other hidden sizes will never consider tile 256, even if it would be valid.
- `mRunners` still contains a runner for tile 256 that will no longer be used.

Consider keeping `mSupportedTileN` immutable and instead:

- Build a local `std::vector<int32_t> supportedTileN = mSupportedTileN;` and conditionally remove 256 from that local copy, or
- Gate the WAR at the call site of `computeSelectedTileN` without mutating the member.

This keeps the WAR scoped to the problematic configuration without silently degrading tactics elsewhere.
Also applies to: 536-537
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
205-212: NVFP4 integration looks coherent; consider minor padding robustness

Overall the NVFP4 wiring in `TRTLLMGenFusedMoE` is consistent:

- `_check_configs` correctly treats nvfp4 / w4a8_nvfp4_fp8 as supported quant modes and restricts bias/swiglu usage to nvfp4 and mxfp4 variants only.
- `create_weights` auto-creates zero `w3_w1_bias` / `w2_bias` for nvfp4 and mxfp4 families when `bias == False`, ensuring the new C++ optional pointers are always valid without changing logical bias semantics.
- `_quantize_for_post_quant_comm`'s nvfp4 branch correctly:
  - Short-circuits when `x` is already an `Fp4QuantizedTensor` (no re-quantization, just wiring tensor + scaling factor).
  - Pads unquantized `x` up to `w3_w1_weight.shape[-1] * 2` only when `pad_size > 0` before calling `fp4_quantize`, avoiding negative-pad issues.
- The nvfp4 `forward_impl` branch mirrors the existing w4a16/mxfp4 pattern:
  - Handles the no-post-quant-comm path by padding + quantizing, and uses `intermediate_size_per_partition_padded = self.w3_w1_weight.shape[-2] // 2`.
  - Passes the new bias and swiglu parameters through to `fp4_block_scale_moe_runner`.
  - Defensively slices `final_hidden_states` to `hidden_size` if the kernel returns a padded last dimension.

One small robustness suggestion: in the nvfp4 `forward_impl` path for `not post_quant_comm` you unconditionally call

```python
pad_size = self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
x = torch.nn.functional.pad(x, (0, pad_size))
```

whereas `_quantize_for_post_quant_comm` guards this with `if pad_size > 0`. If any caller ever passes an `x` whose last dimension exceeds the padded hidden size, this will raise from `F.pad` instead of failing a clear assertion. For symmetry and clearer diagnostics, consider:

```python
pad_size = self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
if pad_size < 0:
    raise ValueError(
        f"Hidden dim ({x.shape[-1]}) exceeds NVFP4 padded size ({self.w3_w1_weight.shape[-1] * 2})")
elif pad_size > 0:
    x = torch.nn.functional.pad(x, (0, pad_size))
```

This keeps behavior aligned with `_quantize_for_post_quant_comm` while making any unexpected shape mismatch explicit.

Also applies to: 245-258, 281-313, 549-610
tensorrt_llm/_torch/models/modeling_gpt_oss.py (1)
610-619: Guard `quant_config` and clarify NVFP4 `exclude_modules` override semantics

Here you fully overwrite `quant_config.exclude_modules` when `quant_algo == "NVFP4"`, and then remap names via `params_map_reverse`. That seems intentional, but it also assumes `model_config.quant_config` is always non-None.

To make this code more robust and explicit, consider:

- Guarding against `quant_config is None` before dereferencing it.
- Documenting (or asserting) that for NVFP4 we intentionally ignore any user-provided `exclude_modules` instead of merging with them.

For example:

```diff
- quant_config = self.model_config.quant_config
- if quant_config.exclude_modules:
+ quant_config = self.model_config.quant_config
+ if quant_config is not None and quant_config.exclude_modules:
      if quant_config.quant_algo == "NVFP4":
          quant_config.exclude_modules = [
              "block.*.attn.qkv",
              "block.*.attn.out",
              "block.*.mlp.gate",
              "embedding",
              "unembedding",
          ]
```

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
496-525: Unused fake-kernel parameters are acceptable; consider underscore-prefix to appease Ruff

The `@fp4_block_scale_moe_runner.register_fake` stub must mirror the real op's signature, so the new `gemm1_*` and `gemm2_*` parameters are necessarily unused in the fake implementation. That's fine functionally.

If you want to silence Ruff's `ARG001` warnings without changing behaviour, you could underscore-prefix them, e.g.:

```diff
 def _(routing_logits,
-      routing_bias,
-      hidden_states,
-      hidden_states_scale,
-      gemm1_weights,
-      gemm1_weights_scale,
-      gemm1_bias,
-      gemm1_alpha,
-      gemm1_beta,
-      gemm1_clamp_limit,
-      gemm2_weights,
-      gemm2_weights_scale,
-      gemm2_bias,
+      routing_bias,
+      hidden_states,
+      hidden_states_scale,
+      gemm1_weights,
+      gemm1_weights_scale,
+      _gemm1_bias,
+      _gemm1_alpha,
+      _gemm1_beta,
+      _gemm1_clamp_limit,
+      _gemm2_weights,
+      _gemm2_weights_scale,
+      _gemm2_bias,
       output1_scale_scalar,
       ...
```

Purely cosmetic; no functional change.
tests/unittest/_torch/thop/serial/test_moe.py (1)
1163-1199: NVFP4 GPT-OSS style test and `run_moe_fp4_test` gptoss path look sound; tighten types and invariants

The new GPT-OSS style NVFP4 coverage and gptoss wiring are well done:

- `test_gptoss_style_nvfp4` parametrizes over `swiglu_alpha`, `swiglu_beta`, and `swiglu_limit` and drives `run_moe_fp4_test(..., gptoss_style=True, ...)`, giving decent coverage of the new activation options.
- In `run_moe_fp4_test`, when `gptoss_style` is true you:
  - Create per-expert `gemm1_bias` / `gemm2_bias` and per-expert `swiglu_*_tensor`.
  - Propagate them into `moe_args` and the fp4 reference path.
  - Reorder/shuffle biases alongside weights/scales.
  - Rescale `swiglu_beta`, `swiglu_limit`, and both biases with the appropriate global scales before calling `fp4_block_scale_moe_runner`, mirroring the existing mxe2m1 fp8 path.
  - Loosen tolerances only for the gptoss_style branch, leaving existing tests unchanged.

Two minor improvements:

1. Type hints for swiglu parameters (RUF013): the signature currently has plain `float` defaults set to `None`, which Ruff flags. You can clarify intent and satisfy lint by using explicit optionals:

```diff
-def run_moe_fp4_test(..., gptoss_style: bool = False, swiglu_alpha: float = None, swiglu_beta: float = None, swiglu_limit: float = None) -> None:
+def run_moe_fp4_test(..., gptoss_style: bool = False, swiglu_alpha: float | None = None, swiglu_beta: float | None = None, swiglu_limit: float | None = None) -> None:
```

2. Assert the `swiglu_*` parameters are provided when `gptoss_style` is enabled: today only `test_gptoss_style_nvfp4` calls `gptoss_style=True`, so the tensors are always set, but future callers could forget to pass them. A quick guard would make the contract explicit:

```python
if gptoss_style:
    assert swiglu_alpha is not None and swiglu_beta is not None and swiglu_limit is not None
    ...
```

Functionally, the new path and tolerances look correct.
Also applies to: 1263-1273, 1341-1373, 1530-1541, 1547-1557, 1589-1596, 1615-1627
tests/unittest/_torch/modules/test_fused_moe.py (2)
1348-1539: NVFP4 test runner is structurally sound; one subtle behavior to keep in mind

- The `run_fused_moe_nvfp4` helper cleanly centralizes NVFP4 test setup: per-expert weights, global/input scales, NVFP4 block scales, and backend selection for `create_moe`. The reuse by both generic NVFP4 tests and GPT-OSS style tests looks correct.
- For GPT-OSS style, per-expert `w1/w2/w3.bias` are created as `torch.float32` and passed with `bias=True`; the TRTLLM-Gen side uses `bias_dtype=torch.float32`, so this matches expectations.
- SwiGLU config is passed as:
  - Tensors (`swiglu_alpha_tensor`, etc., shape `(NUM_EXPERTS,)`) into `create_moe` for the fused path.
  - Scalar floats (`swiglu_alpha`, etc.) into `RefGatedMLPFusedMoE` for the reference path.
  This separation is reasonable as long as the model side treats module-level tensors vs ref-module scalars consistently.
- Tolerances split by style (`check_accuracy` with relaxed `(rtol=0.1, atol=0.1, percent=0.95)` for GPT-OSS vs strict `assert_close` for the non-GPT-OSS path) are a sensible compromise given different numerics.

No obvious correctness issues stand out in the setup; just ensure the backend weight loaders (CUTLASS/CUTEDSL/TRTLLM) all expect the same NVFP4 weight/scale key layout used here (`*.weight`, `*.weight_scale`, `*.input_scale`, `*.weight_scale_2`).

If you want an extra safety net, you could add a lightweight shape assertion block (e.g., assert that every `w*_sf_block_unswizzled` has `.shape == (INTERMEDIATE_SIZE, HIDDEN_SIZE // SCALING_VECTOR_SIZE)` / the analogous H dimension) right after quantization to catch accidental API changes in `fp4_quantize` early, as in the sketch below.
1565-1593: New NVFP4 tests cover both generic and GPT‑OSS paths appropriately
- `test_fused_moe_nvfp4` parametrizes `{dtype ∈ {fp16, bf16}, backend ∈ {TRTLLM, CUTLASS, CUTEDSL}}` and routes everything through `run_fused_moe_nvfp4`, which keeps logic DRY and aligned.
- The GPT-OSS-style test drives a realistic large configuration (`hidden_size/intermediate_size ∈ {512, 2880}`, `num_experts=32`, `top_k=4`, `seq_len=8192`) and sweeps `swiglu_alpha/beta/limit`; this should give good coverage of the new activation and scaling plumbing.

Given how heavy these configs are, you might want to keep an eye on runtime in CI, but there is no correctness issue visible in the test wiring.
tensorrt_llm/_torch/modules/fused_moe/quantization.py (6)
143-152: `maybe_pad_for_weights` helper is correct for 1D/2D use; consider documenting rank expectations

- The padding logic is straightforward: the last dim is padded to `col_alignment`, the second-last to the optional `row_alignment`. This matches all current uses (2D weights/scales with both alignments; 1D biases/scales with column-only padding).
- For 1D tensors, the branch without `row_alignment` is used, so `F.pad(weight, (0, col_pad_size))` is safe.

To avoid accidental misuse later, it might be worth a short docstring stating this is intended for rank-1 or rank-2 tensors and that `row_alignment` must only be provided for rank-2 inputs.
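For readers without the diff handy, a minimal sketch with the semantics described above (an illustration only, not the PR's exact implementation):

```python
from typing import Optional

import torch
import torch.nn.functional as F


def maybe_pad_for_weights(weight: torch.Tensor,
                          col_alignment: int,
                          row_alignment: Optional[int] = None) -> torch.Tensor:
    """Pad the last dim up to col_alignment and, optionally, the second-last
    dim up to row_alignment. Intended for rank-1 or rank-2 tensors."""
    col_pad = -weight.shape[-1] % col_alignment
    if row_alignment is None:
        return F.pad(weight, (0, col_pad)) if col_pad else weight
    row_pad = -weight.shape[-2] % row_alignment
    if col_pad == 0 and row_pad == 0:
        return weight
    return F.pad(weight, (0, col_pad, 0, row_pad))
```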
1568-1644: `NVFP4FusedMoEMethod`: padded shapes and scale tensors align with vectorized layouts

- Local `round_up` plus `weight_alignment` / `input_hidden_alignment` yields `intermediate_size_per_partition_padded`, `w3_w1_hidden_size_padded`, and `w2_hidden_size_padded`, which are then consistently used for both weight and scale tensor shapes.
- Weight shapes divide the padded hidden dim by `weight_vec_size`, and block-scale shapes divide hidden/intermediate by `scaling_vector_size * block_scales_vec_size`, matching the "pack 16 FP4s into int64 / 4 FP8s into int32" comments.
- Passing `bias_dtype` down to `FusedMoEMethodBase.create_weights` allows backends to override bias precision without touching this core logic.

No issues spotted; this is a clean centralization of NVFP4 padding behavior.

If `round_up` is now duplicated across several classes (NVFP4 and MXFP4), you could later factor it into a small module-level helper to avoid repetition.
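A sketch of what such a shared helper and the padded-size arithmetic look like (the concrete sizes below are illustrative, not taken from the PR):

```python
def round_up(value: int, alignment: int) -> int:
    """Round value up to the nearest multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


# Illustrative numbers: a GPT-OSS-like intermediate size with 256-alignment.
intermediate_size_per_partition = 2880
weight_alignment = 256
intermediate_size_per_partition_padded = round_up(
    intermediate_size_per_partition, weight_alignment)   # -> 3072
```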
2068-2084: NVFP4 TRTLLM-Gen w2 path: bias normalization by `tp_size` is correct but implicit

- For 2D weights, you pad to `alignment // 2` in the packed dimension and `self.weight_alignment` in the other before sharding; for 1D biases, you:
  - pad to `self.weight_alignment`, and
  - divide by `module.tp_size` so that per-rank biases sum correctly after TP accumulation.

This matches the usual TP bias convention. It might be worth a brief comment near the division noting that the kernel performs "sum across TP ranks after applying per-rank bias", to avoid future confusion if someone changes the TP strategy.
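A toy check of that convention, assuming each rank adds its scaled bias locally and the partial outputs are then summed across ranks (all-reduce):

```python
import torch

tp_size = 4
bias = torch.randn(2880)

# Each rank applies bias / tp_size; the all-reduce sum then reconstructs the
# full bias exactly once in the final output.
per_rank_contributions = [bias / tp_size for _ in range(tp_size)]
reconstructed = torch.stack(per_rank_contributions).sum(dim=0)

torch.testing.assert_close(reconstructed, bias)
```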
2361-2421: `MXFP4WeightFusedMoEMethod`: alignment-aware weight/bias shapes look correct

- The new `create_weights` mirrors the NVFP4 logic:
  - pads intermediate and hidden dimensions to `weight_alignment` / `input_hidden_alignment`,
  - shapes packed weights by dividing padded dims by `weight_vec_size`,
  - shapes block scales according to `scaling_vector_size * block_scales_vec_size`.
- Explicit `w3_w1_bias_shape` and `w2_bias_shape` use the padded dims, ensuring biases align with the packed layouts.

This sets a clean, unified contract for all MXFP4 backends.

You might eventually want to share the `round_up` helper between the NVFP4 and MXFP4 base classes to avoid subtle divergence.
2835-2853: `MXFP4WeightTRTLLMGenFusedMoEMethod.post_load_weights`: lean shape computation uses the new helper correctly

- The proxy `w1_weight` of unpadded shape is padded with `maybe_pad_for_weights` using `input_hidden_alignment` and `_get_weight_alignment(...)`, then passed through `load_weight_shard(..., return_slice_indices=True)` to derive shard slice bounds.
- `intermediate_size_per_partition_lean` is computed by clamping the shard slices to the original (unpadded) dims, giving a per-shard "lean" intermediate size that kernels can use to avoid bandwidth waste on trailing padding.

This is a neat way to reuse the same alignment logic for both real and proxy weights without duplicating slice math.

Consider documenting `intermediate_size_per_partition_lean` in the module/class docstring, since it's a key contract for kernels.
336-355: `_get_weight_alignment` correctly enforces per-shard alignment constraints

- Starts from `lcm(weight_alignment, scaling_vector_size, tp_size)` to avoid fractional scaling factors.
- Computes a padded dimension that is a multiple of this alignment and derives the per-shard size.
- If the per-shard size is not a multiple of `weight_alignment`, it bumps `alignment` so that:
  - `padded_dim` remains a multiple of the new alignment, and
  - `per_shard` becomes a multiple of `weight_alignment`.

This matches the comments and fixes the "fractional number of scale blocks per shard" issue described in several docstrings and comments.

You might consider adding a tiny assert (or debug-only check) that the final `per_shard % weight_alignment == 0` before returning to guard future changes.
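A rough sketch of the constraint being enforced, under the assumptions above (this is not the PR's `_get_weight_alignment`, just an illustration of the bump logic):

```python
import math


def get_weight_alignment_sketch(weight_alignment: int,
                                scaling_vector_size: int,
                                tp_size: int) -> int:
    """Return an alignment such that a dim padded to it splits into tp_size
    shards whose size is a whole multiple of weight_alignment."""
    alignment = math.lcm(weight_alignment, scaling_vector_size, tp_size)
    per_shard_unit = alignment // tp_size
    if per_shard_unit % weight_alignment != 0:
        # Bump so every shard of the padded dim stays weight_alignment-aligned.
        alignment = math.lcm(alignment, weight_alignment * tp_size)
    return alignment
```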
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (8 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/models/modeling_gpt_oss.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (23 hunks)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunk)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunk)
- tests/unittest/_torch/modules/test_fused_moe.py (12 hunks)
- tests/unittest/_torch/thop/serial/test_moe.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tests/unittest/_torch/thop/serial/test_moe.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tensorrt_llm/_torch/models/modeling_gpt_oss.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
tests/unittest/_torch/thop/serial/test_moe.py
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`if` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
🧠 Learnings (18)
📓 Common learnings
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tests/unittest/_torch/modules/test_fused_moe.py
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tests/unittest/_torch/modules/test_fused_moe.py
tensorrt_llm/_torch/modules/fused_moe/quantization.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/models/modeling_gpt_oss.py
🧬 Code graph analysis (6)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (4)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (5): has_nvfp4 (589-592), has_w4a16_mxfp4 (613-616), has_w4a8_mxfp4_fp8 (601-604), has_w4a8_mxfp4_mxfp8 (607-610), has_w4a8_nvfp4_fp8 (595-598)
- tensorrt_llm/_torch/modules/linear.py (3): has_nvfp4 (2008-2011), has_w4a8_mxfp4_fp8 (2038-2041), has_w4a8_nvfp4_fp8 (2032-2035)
- tensorrt_llm/_torch/utils.py (1): shape (131-132)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (4): top_k (275-275), n_group (276-276), topk_group (278-278), hidden_size (271-271)

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (8): num_experts (269-269), top_k (275-275), n_group (276-276), topk_group (278-278), intermediate_size (281-281), local_expert_offset (282-282), local_num_experts (283-283), do_finalize (304-305)
- tensorrt_llm/_torch/modules/fused_moe/routing.py (6): routing_method_type (180-181), routing_method_type (213-214), routing_method_type (226-227), routing_method_type (261-262), routing_method_type (281-282), routing_method_type (419-420)

cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (3)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (8): num_experts (269-269), top_k (275-275), n_group (276-276), topk_group (278-278), intermediate_size (281-281), local_expert_offset (282-282), local_num_experts (283-283), do_finalize (304-305)
- cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp (2): topK (496-498), topK (588-590)
- cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp (1): topK (332-333)

tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
- tensorrt_llm/_torch/modules/fused_moe/interface.py (1): has_nvfp4 (589-592)
- tensorrt_llm/module.py (1): named_modules (102-114)

tensorrt_llm/_torch/modules/fused_moe/quantization.py (2)
- tests/unittest/_torch/modules/test_fused_moe.py (1): round_up (75-76)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_triton.py (2): create_weights (201-222), create_weights (391-432)

tests/unittest/_torch/thop/serial/test_moe.py (2)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h (2): intermediate_size (281-281), top_k (275-275)
- tensorrt_llm/functional.py (1): stack (1961-2007)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
503-503: Unused function argument: gemm1_bias
(ARG001)
504-504: Unused function argument: gemm1_alpha
(ARG001)
505-505: Unused function argument: gemm1_beta
(ARG001)
506-506: Unused function argument: gemm1_clamp_limit
(ARG001)
507-507: Unused function argument: gemm2_weights
(ARG001)
508-508: Unused function argument: gemm2_weights_scale
(ARG001)
509-509: Unused function argument: gemm2_bias
(ARG001)
tests/unittest/_torch/thop/serial/test_moe.py
1271-1271: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1272-1272: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1273-1273: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (20)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
264-270: GPT-OSS/20B-NVFP4 reference entries look consistent

The new GSM8K accuracy entries for `GPT-OSS/20B-NVFP4` follow the existing schema (baseline, quant_algo, kv_cache_quant_algo combinations) and align with other GPT-OSS blocks.

tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
785-883: NVFP4 MoE loader wiring looks consistent; please double-check checkpoint key invariants

The new `load_nvfp4_weights` path mirrors the HF loader structure and wires the NVFP4-specific MoE layout:

- Deinterleaving `gate_up_proj` / `gate_up_proj_weight_scale` via `deinterleave` and splitting per expert.
- Forcing MoE biases to fp32 for NVFP4 (`gate_up_bias` / `down_bias`).
- Passing through extra NVFP4 globals like `*_weight_scale_2` and `*_input_scale` untouched.

Two minor points to verify (no required change if invariants hold):

- This code assumes `gate_up_proj`, `down_proj`, `gate_up_proj_bias`, and `down_proj_bias` are always present for NVFP4 MoE: `deinterleave(gate_up)` and `gate_up_bias.dtype` will fail if any is missing. If some NVFP4 ckpts can omit these, guarding or using direct indexing (`[...]`) plus a clearer error may be safer.
- The deinterleave pattern for `gate_up_proj_weight_scale` (operating on the last dimension) should match the NVFP4 kernel's expected layout; it differs from the MXFP4 path in `load_hf_weights` (which deinterleaves along `dim=-2`), so it's worth re-confirming against the C++ runner layout.

Functionally the plumbing into `module.load_weights(weights=[moe_weights])` looks good.
634-647: I'll help you verify this review comment. Let me start by examining the actual code and understanding the context around the `load_weights` method.

```bash
# First, check the file size and then read the relevant section
wc -l tensorrt_llm/_torch/models/modeling_gpt_oss.py

# Read the code around lines 634-647
sed -n '620,660p' tensorrt_llm/_torch/models/modeling_gpt_oss.py | cat -n

# Search for model_config definition and quant_config
rg -n "class.Config|quant_config\s=" tensorrt_llm/_torch/models/modeling_gpt_oss.py | head -30

# Look for quant_config initialization and default values
rg -n "quant_config" tensorrt_llm/_torch/models/modeling_gpt_oss.py -A 2 -B 2 | head -40
```

tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
151-167: FP4 BlockScale MoE NVFP4 parameter plumbing is coherent

The FP4 path now consistently threads the extra NVFP4 tensors:

- `FP4BlockScaleMoEInputs` includes `gemm1_bias/alpha/beta/clamp_limit` and `gemm2_bias`.
- `FP4BlockScaleMoERunner.forward` passes these directly to `kernel_runner.run_moe` in the expected order.
- The `fp4_block_scale_moe_runner` custom op signature and both `input_tensors_for_tuner` and the final `input_tensors` lists include the new tensors, keeping `topk_weights` / `topk_ids` as the last two entries so the `prepare_dummy_topk_and_hook` logic still applies.

This wiring looks internally consistent with the C++ runner expectations and the autotuner flow.
Also applies to: 225-247, 355-370, 414-459
tests/unittest/_torch/thop/serial/test_moe.py (1)
550-566: Reference paths correctly extended with bias/activation parameters

The extensions of `moe_args_dequant` and all the `run_moe_reference_*` helpers look consistent:

- `run_moe_reference_fp4` now forwards `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` into `moe_args_dequant`, which then feeds `run_moe_dequant`'s activation and Gemm2 bias logic.
- The mxe4m3/mxe2m1, e4m3/mxe2m1, and bf16/mxe2m1 reference builders also pass these new fields through, preserving existing behaviour where they were already set on `args`.

This keeps the Python reference implementations in sync with the new NVFP4 kernel features while maintaining backward compatibility for modes that don't use the extra fields.
Also applies to: 642-699
tests/unittest/_torch/modules/test_fused_moe.py (2)
14-14: Import of `torch.nn.functional as F` is appropriate and scoped correctly

The new `RefGatedMLPFusedMoE` activation path and helper functions rely on `F.silu`, so this import is necessary and consistent with other usages.

2685-2823: `RefGatedMLPFusedMoE`: SwiGLU integration and NVFP4 / W4A8-NVFP4 weight loading look coherent

- Extending `RefGatedMLPFusedMoE.__init__` with `bias` and `swiglu_*` matches how tests construct both fused and reference modules. The `bias` flag is correctly stored and used during `load_weights`.
- The `custom_swiglu` closure captures `swiglu_alpha/beta/limit` and handles:
  - optional clamping on `gate` and `value`,
  - default `alpha=1.0`, `beta=0.0`,
  - expert-wise activation via per-expert `GatedMLP(activation=...)`.
  This is consistent with the GPT-OSS-style activation behavior described in the PR summary.
- The new NVFP4 / `W4A8_NVFP4_FP8` branch in `load_weights` correctly:
  - propagates `weight_scale`, `input_scale`, and `weight_scale_2` to `gate_up_proj` and `down_proj`,
  - keeps existing FP8/FP8-block-scales paths untouched.

Assuming `GatedMLP`'s `activation` contract matches what `custom_swiglu` expects (which appears to be the case from the other code in this PR), this reference path should behave consistently with the fused kernels.

If you haven't already, it's worth sanity-checking one GPT-OSS config by directly comparing `GatedMLP` + `custom_swiglu` against the original GPT-OSS implementation (outside MoE) to confirm the per-token outputs match numerically within the same tolerances used in these tests.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (13)
218-237: Bias shape handling in `FusedMoEMethodBase.create_weights` is robust to padding

- Defaulting `w3_w1_bias_shape` and `w2_bias_shape` to the first two dims of the corresponding weight shapes ensures bias tensors inherit any padding applied to the weight layouts.
- Using `bias_dtype or module.dtype` correctly allows backends like TRTLLM-Gen to request `float32` bias even when activations/weights are lower precision.

This change looks sound and keeps all existing callers compatible.
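A condensed, hypothetical illustration of that defaulting pattern (the real `create_weights` has many more responsibilities; names mirror the review, not the PR's exact code):

```python
import torch


def create_bias_params(module, w3_w1_weight_shape, w2_weight_shape,
                       w3_w1_bias_shape=None, w2_bias_shape=None,
                       bias_dtype=None):
    """Illustrative only: biases default to the first two dims of the
    (possibly padded) weight shapes, so weight padding carries over."""
    w3_w1_bias_shape = w3_w1_bias_shape or tuple(w3_w1_weight_shape[:2])
    w2_bias_shape = w2_bias_shape or tuple(w2_weight_shape[:2])
    dtype = bias_dtype or module.dtype
    w3_w1_bias = torch.nn.Parameter(torch.zeros(w3_w1_bias_shape, dtype=dtype),
                                    requires_grad=False)
    w2_bias = torch.nn.Parameter(torch.zeros(w2_bias_shape, dtype=dtype),
                                 requires_grad=False)
    return w3_w1_bias, w2_bias
```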
1961-1984: `NVFP4TRTLLMGenFusedMoEMethod`: alignment parameters and bias dtype look correct

- Setting `weight_alignment = input_hidden_alignment = 256` and threading both into `create_weights` aligns with the comment about TRTLLM-Gen alignment requirements.
- Using `bias_dtype=torch.float32` matches how tests construct `w*_bias` and ensures high-precision accumulation for biases even with low-precision weights.

This setup seems consistent with both the kernel expectations and the new NVFP4 tests.
If kernel constraints change later (e.g., relaxed to 128‑alignment), updating these class attributes should be sufficient; worth keeping this in sync with the underlying TRT kernels’ documentation.
2125-2137: NVFP4 TRTLLM-Gen w3/w1 block-scale padding matches alignment logic

- Alignment is recomputed for the scale rows and `maybe_pad_for_weights` is applied with:
  - columns aligned to `input_hidden_alignment / scaling_vector_size`,
  - rows aligned to the same `alignment` used for weights.
- This mirrors the "pad before sharding" rationale used for MXFP4 and avoids fractional numbers of scaling vectors per shard.

Given how sensitive these layouts are, this is the right place to centralize the padding instead of sprinkling manual `F.pad` calls.

Consider adding a small unit test that feeds an adversarial `intermediate_size_per_partition` (e.g., 2880 with various `tp_size`) and asserts that per-shard scale rows are multiples of `weight_alignment` and `scaling_vector_size`.
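One possible shape for that test, reusing the alignment sketch above as a stand-in for the real `_get_weight_alignment` (constants and helper logic are illustrative):

```python
import math

import pytest


@pytest.mark.parametrize("tp_size", [1, 2, 4, 8])
def test_per_shard_scale_alignment(tp_size):
    weight_alignment = 256        # illustrative TRTLLM-Gen constraint
    scaling_vector_size = 16      # NVFP4 scale block size
    intermediate_size = 2880      # adversarial GPT-OSS-like value

    alignment = math.lcm(weight_alignment, scaling_vector_size, tp_size)
    if (alignment // tp_size) % weight_alignment != 0:
        # Mirror of the "bump the alignment" step described in the review.
        alignment = math.lcm(alignment, weight_alignment * tp_size)

    padded = (intermediate_size + alignment - 1) // alignment * alignment
    per_shard = padded // tp_size

    assert per_shard % weight_alignment == 0
    assert per_shard % scaling_vector_size == 0
```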
2195-2202: NVFP4 TRTLLM-Gen w2 block-scale padding is consistent with w3/w1

- Column alignment is computed as `alignment // module.scaling_vector_size`, with `alignment` including `tp_size` via `_get_weight_alignment`, so per-shard columns remain a multiple of `weight_alignment`.
- Row alignment reuses `self.weight_alignment`, so the sharded w2 scales line up with the padded w2 weights.

This keeps the scale layout coherent with the weight layout for GEMM2.
2245-2265: Post-quantization normalization of biases and SwiGLU params is logically consistent

- After `super().load_quant_scales`, you:
  - set `fc31_scale_c = fc2_input_scale * fc31_alpha`, and
  - normalize: `w3_w1_bias` by `fc31_alpha`, `w2_bias` by `fc2_alpha`, `swiglu_beta` and `swiglu_limit` by `fc31_alpha`.
- This matches the comment that kernels expect biases and SwiGLU parameters pre-scaled by the same global factors that apply to activations/weights.

Shape-wise this is safe as long as:

- `w*_bias` have shape `(expert_size_per_partition, *)`, and
- `swiglu_*` are 1D per-expert tensors,

which matches how tests construct them. No bug evident here; just rely on `create_moe` to always define `swiglu_beta` / `swiglu_limit` attributes (set to `None` when unused), which your `if ... is not None` checks assume.
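A compact sketch of the normalization being described, assuming `fc31_alpha`, `fc2_alpha`, and `fc2_input_scale` are tensors already registered on the module with broadcast-compatible shapes (names follow the review, not the PR's exact code, and the `getattr` guard suggested later in this review is folded in):

```python
import torch


@torch.no_grad()
def normalize_after_load_quant_scales(module) -> None:
    # Illustrative only: fold the global dequant scales into the bias and
    # SwiGLU parameters so the kernel can treat them as already scaled.
    module.fc31_scale_c.data.copy_(module.fc2_input_scale.data * module.fc31_alpha.data)
    module.w3_w1_bias.data.div_(module.fc31_alpha.data)
    module.w2_bias.data.div_(module.fc2_alpha.data)
    if getattr(module, "swiglu_beta", None) is not None:
        module.swiglu_beta.data.div_(module.fc31_alpha.data)
    if getattr(module, "swiglu_limit", None) is not None:
        module.swiglu_limit.data.div_(module.fc31_alpha.data)
```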
2539-2556: `MXFP4WeightCutlassFusedMoEMethod`: pre-shard padding matches alignment rationale

- For `w1/w3`, you pad:
  - columns to `weight_alignment // 2` (because two MXFP4s per byte),
  - rows to the `_get_weight_alignment(...)` result to keep per-shard rows a multiple of `weight_alignment`.
- For biases (1D), you pad to the same `alignment` to keep shard shapes consistent.

This implements the "pad before sharding to avoid fractional scaling factors per shard" comment accurately.
2582-2590: `MXFP4WeightCutlassFusedMoEMethod`: w2 padding is consistent with w1/w3

- `shard_w2_weight_dim` uses the packed K dimension (`2 * w2_weight.shape[1]` for 2D weights), and `_get_weight_alignment` is applied similarly.
- Padding uses `alignment // 2` in the packed dimension and `self.weight_alignment` in the other, with biases padded only on the single dimension.

This keeps both GEMM2 weights and biases aligned to the same constraints used for GEMM1.
2609-2615: `MXFP4WeightCutlassFusedMoEMethod`: scale padding mirrors weight padding correctly

- `w1_weight_scale` and `w3_weight_scale` are padded with:
  - columns aligned to `weight_alignment / scaling_vector_size`,
  - rows aligned to the same `_get_weight_alignment(...)` used for weights.
- After sharding and concatenation, you interleave via `block_scale_interleave`, which is consistent with the weight layout.

The same pattern is used for w2 scales below, so scale and weight layouts remain in sync.

A small unit test that checks per-shard scale tensors' last dimensions are multiples of `module.scaling_vector_size` and `weight_alignment` would make regressions here very obvious.
2890-2902: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w1/w3 padding matches TRTLLM-Gen alignment policy

- Uses the same `_get_weight_alignment` routine as Cutlass, but pads columns to `input_hidden_alignment // 2` (512/2 = 256) instead of `weight_alignment // 2`, maximizing kernel choice as per the comment.
- 1D biases are padded and converted to `float32`, matching TRTLLM-Gen's expectation for bias dtype.

This appears consistent with the stated intent for the TRTLLM-Gen MXFP4 path.
2955-2957: `MXFP4WeightTRTLLMGenFusedMoEMethod`: per-rank w2 bias scaling is explicit and correct

- For 1D w2 biases, you pad to `weight_alignment`, cast to `float32`, and divide by `tp_size`, which matches the TP convention that each rank applies its own bias before an all-reduce.

No issue here; the comment is clear enough to avoid misunderstandings about why the division is done.
2996-3004: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w3/w1 scale padding aligns with hidden alignment

- Scales are padded on:
  - columns to `input_hidden_alignment / scaling_vector_size`, and
  - rows to `_get_weight_alignment(...)`,
  mirroring the weight padding and ensuring integer numbers of scale vectors per shard.

This symmetry between weights and scales is exactly what you want for MXFP4 TRTLLM-Gen.
3066-3068: `MXFP4WeightTRTLLMGenFusedMoEMethod`: w2 scale padding mirrors w2 weight alignment

- Column alignment is `alignment / scaling_vector_size`, row alignment is `weight_alignment`, so sharded w2 scales line up with the padded w2 weights and meet both TMA and kernel constraints.

Looks consistent with the rest of the MXFP4 alignment scheme.
2004-2043: NVFP4 TRTLLM-Gen w1/w3 weight padding and sharding logic is sound; dtype asserts are appropriate

The code correctly handles NVFP4 weights as packed `torch.uint8` tensors (two FP4 values per byte). Alignment computation via `_get_weight_alignment()` is applied correctly: 2D packed weights are padded on columns to `input_hidden_alignment // 2` and rows to `alignment`, while 1D biases are padded to `alignment` and cast to `float32`. Sharding via `load_weight_shard(..., TensorParallelMode.COLUMN, device=device)` followed by a write to the `dst_w3_w1_weight` halves is consistent with the packed weight format. The `torch.uint8` dtype assertions for 2D weights are appropriate and necessary, as standard NVFP4 quantization workflows (via TensorRT-LLM's QuantMode and model-optimizer) produce uint8-packed tensors by design.
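To make the "two FP4 values per byte" layout concrete, a small illustration of packing 4-bit codes into `torch.uint8` (this shows the general nibble-packing idea, not the exact code path used by the quantizer):

```python
import torch

# Eight 4-bit codes (values 0..15), packed pairwise into four uint8 bytes:
# the even-indexed code goes into the low nibble, the odd-indexed into the high.
codes = torch.tensor([1, 2, 3, 4, 5, 6, 7, 8], dtype=torch.uint8)
packed = codes[0::2] | (codes[1::2] << 4)          # shape [4], dtype uint8

# Unpacking recovers the original codes, so a [E, N, K // 2] uint8 weight
# tensor really holds K FP4 values per row.
low = packed & 0x0F
high = packed >> 4
unpacked = torch.stack([low, high], dim=-1).flatten()
assert torch.equal(unpacked, codes)
```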
PR_Github #25789 [ run ] triggered by Bot. Commit:
📝 Walkthrough

This PR adds NVFP4 quantization support to the MoE runtime, extending FP4 kernel interfaces with optional GEMM bias and SwiGlu gating parameters. It introduces weight padding and alignment utilities, implements NVFP4-specific model loading for GPT-OSS, and adds corresponding integration and unit tests to validate the new configuration pathways.

Changes

Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User as Caller
participant Loader as load_nvfp4_weights
participant ExpertLoad as Expert Weight Loading
participant BiasLoad as Bias & Scale Loading
participant Dict as Weight Dict Construction
User->>Loader: weights Dict
Loader->>ExpertLoad: Per-expert gate_up_proj, down_proj
ExpertLoad->>ExpertLoad: Deinterleave gate/up weights
ExpertLoad->>BiasLoad: Extract per-expert biases
BiasLoad->>BiasLoad: Convert to FP32, apply scales
BiasLoad->>Dict: Accumulate expert weights & scales
Loader->>Dict: Handle qkv blocks & non-MoE modules
Dict-->>Loader: Per-expert weight structures
    Loader-->>User: Populated model state
```

```mermaid
sequenceDiagram
participant Test as Test Runner
participant MoE as Fused MoE Module
participant Quantize as Quantization Path
participant Padding as Padding & Alignment
participant Kernel as FP4 Kernel
Test->>MoE: forward(hidden_states, router_logits)
MoE->>Quantize: Quantize weights to FP4/NVFP4
Quantize->>Padding: maybe_pad_for_weights(weight_alignment)
Padding->>Padding: Round intermediate_size to alignment
Padding->>Quantize: Padded weight tensor
Quantize->>Kernel: Pass gemm1_bias, gemm1_alpha/beta/clamp_limit, gemm2_bias
Kernel->>Kernel: Apply per-expert SwiGlu gating
Kernel-->>MoE: MoE output
MoE->>Padding: Clip final_hidden_states to hidden_size
    Padding-->>Test: Output with consistent dimensions
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Key areas requiring extra attention:
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
321-331: Constraint indices for `topk_weights` / `topk_ids` in FP4 autotuning are out of sync with the new input layout.

After inserting the new GEMM1/GEMM2 tensors, `topk_weights_for_tuner` and `topk_ids_for_tuner` now sit at indices 16 and 17 in `input_tensors_for_tuner`, but `get_constraint_specs` still uses:

```python
TOPK_WEIGHTS_IDX = 11
TOPK_IDS_IDX = 12
```

With the current layout:

```python
input_tensors_for_tuner = [
    routing_logits_for_tuner,    # 0
    routing_bias,                # 1
    hidden_states,               # 2
    hidden_states_scale,         # 3
    gemm1_weights,               # 4
    gemm1_weights_scale,         # 5
    gemm1_bias,                  # 6
    gemm1_alpha,                 # 7
    gemm1_beta,                  # 8
    gemm1_clamp_limit,           # 9
    gemm2_weights,               # 10
    gemm2_weights_scale,         # 11  currently treated as TOPK_WEIGHTS
    gemm2_bias,                  # 12  currently treated as TOPK_IDS
    output1_scale_scalar,        # 13
    output1_scale_gate_scalar,   # 14
    output2_scale_scalar,        # 15
    topk_weights_for_tuner,      # 16
    topk_ids_for_tuner,          # 17
]
```

the token-count constraint specs are being applied to `gemm2_weights_scale` and `gemm2_bias` instead of to the `topk_*` tensors. This can distort shapes for those weight tensors during tuning and leaves `topk_*` unconstrained (relying only on the pre-hook to track `num_tokens`). You probably want:

```diff
- TOPK_WEIGHTS_IDX = 11
- TOPK_IDS_IDX = 12
+ TOPK_WEIGHTS_IDX = 16
+ TOPK_IDS_IDX = 17
```

to match the updated FP4 input ordering.
Also applies to: 415-434
🧹 Nitpick comments (10)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
3888-3925: GPT-OSS NVFP4 2-GPU test wiring is fine; confirm skip marker semantics and placeholder path before enabling

The new `test_w4_2gpus_nvfp4` mirrors the existing W4 2-GPU tests and correctly drives the GSM8K flow through the NVFP4 checkpoint once enabled. Two minor points before you eventually drop the in-body `pytest.skip`:

- With the class already gated to Blackwell SMs, adding `@pytest.mark.skip_blackwell` means this test will be skipped on the only architecture where NVFP4 is expected to run. Please double-check that this marker's behavior matches your intent.
- Consider switching `"./nvfp4ckpt"` to a `llm_models_root()`-based path (like the other GPT-OSS tests) so it's aligned with CI model layout when the real checkpoint is hooked up.
1163-1199: GPT‑OSS‑style NVFP4 MoE path and SwiGLU epilogue wiring look sound; consider a couple of small robustness and lint fixesThe new GPT‑OSS‑style NVFP4 test plus the
run_moe_fp4_testextensions appear logically consistent:
test_gptoss_style_nvfp4drives the NVFP4 MoE kernel with per‑expert SwiGLUalpha/beta/limitand per‑expert biases, and the reference side gets the unscaled parameters viamoe_args/run_moe_reference_fp4before you rescale them for the quantized kernel call.- Bias and SwiGLU data are shaped, interleaved (
reorder_rows_for_gated_act_gemm), and shuffled (shuffle_matrix_a/shuffle_matrix_sf_a) in the same way as the existing mxe2m1 path later in this file, so layout handling looks consistent.- Relaxed tolerances when
gptoss_style=True(higheratol/rtol, lowerpercent) are reasonable given the added non‑linearity and quantization noise.A few minor suggestions:
run_moe_fp4_testtreatsswiglu_alpha,swiglu_beta, andswiglu_limitas required whenevergptoss_style=True, but the signature allows them to beNone. Adding a small guard, e.g. assertions that these are notNonewhengptoss_styleis enabled, would fail fast if the helper is misused.- Ruff’s RUF013 warning about implicit
Optionalon lines 1271–1273 stems from annotating these asfloatwhile defaulting toNone. If you care about keeping linters quiet, consider changing the annotations tofloat | None/Optional[float]or dropping the type annotation on these kwargs in this test helper. (As per static analysis hints.)- When
gptoss_styleisFalse, all ofgemm1_bias_shuffled,gemm2_bias_shuffled, and the SwiGLU tensors areNoneat thefp4_block_scale_moe_runnercall site. Please double‑check the custom op’s Python binding indeed declares these as optional arguments and that older call sites don’t rely on positional ordering that could be broken by the new parameters.Also applies to: 1263-1273, 1341-1373, 1428-1450, 1463-1517, 1530-1557, 1589-1595, 1614-1627
tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
143-152: `maybe_pad_for_weights` helper looks correct; consider guarding assumptions.

The padding helper correctly handles 1D/2D (and 3D via last two dims) and is reused consistently across NVFP4/MXFP4 paths. To make it more robust long-term, you might:

- Assert `col_alignment > 0` (and `row_alignment > 0` when provided) to avoid accidental modulo-by-zero.
- Optionally assert `weight.dim() <= 3` or document that only 1D–3D tensors are supported, since `F.pad` with a 4-element pad will fail for 4D+ inputs.

A sketch of both guards follows this list.
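A minimal sketch of the guards, assuming the parameter names quoted in this review (`weight`, `col_alignment`, optional `row_alignment`):

```python
from typing import Optional

import torch


def _validate_pad_args(weight: torch.Tensor,
                       col_alignment: int,
                       row_alignment: Optional[int] = None) -> None:
    """Guards suggested above; intended to run at the top of maybe_pad_for_weights."""
    assert col_alignment > 0, "col_alignment must be positive"
    if row_alignment is not None:
        assert row_alignment > 0, "row_alignment must be positive"
    assert weight.dim() <= 3, "only 1D-3D tensors are supported (F.pad uses a 4-element pad)"
```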
1568-1598: NVFP4 base `create_weights` alignment/padding is coherent with vectorization requirements.

The new alignment parameters and padded shapes for `w3_w1_weight`, `w2_weight`, and their block scales are internally consistent and match the later use of `_get_weight_alignment` and packed FP4/fp8 layouts. Two minor nits:

- The comment "Divide by 4 because we use int32 to pack 4 fp8 values" is no longer accurate for the TRTLLM-Gen path, where `block_scales_vec_size` can be `1`; consider rephrasing this to describe the general `block_scales_vec_size` logic instead of hard-coding "4".
- If these padded dimensions are semantically important to downstream code, exposing them as explicit module attributes (e.g., `module.intermediate_size_per_partition_padded`) could make debugging easier, but that's optional.

Also applies to: 1603-1619, 1639-1643
2125-2137: Scale padding and post-load normalization look correct; guard `swiglu_*` attributes for safety.

Padding `w1`/`w3`/`w2` block scales with the same alignment scheme as the weights before sharding, then applying permutation + `block_scale_interleave`, matches the NVFP4 kernel expectations and mirrors the MXFP4 path.

The later normalization:

- `fc31_scale_c = fc2_input_scale * fc31_alpha`
- dividing `w3_w1_bias` and `w2_bias` by `fc31_alpha`/`fc2_alpha`
- scaling `swiglu_beta`/`swiglu_limit` by `1 / fc31_alpha`

is consistent with pushing global scale factors into the parameters so the kernel can assume unscaled biases/activations (sketched below).
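A schematic restatement of that normalization, not the PR's exact code; the attribute names come from the review text and per-expert broadcasting/reshaping details are elided:

```python
import torch


def fold_global_scales(module: torch.nn.Module) -> None:
    """Schematic: push global quantization scales into biases/SwiGLU constants post-load."""
    # Broadcasting/reshaping of the per-expert scale tensors is elided in this sketch.
    module.fc31_scale_c.data.copy_(module.fc2_input_scale.data * module.fc31_alpha.data)
    module.w3_w1_bias.data.div_(module.fc31_alpha.data)
    module.w2_bias.data.div_(module.fc2_alpha.data)
    module.swiglu_beta.data.div_(module.fc31_alpha.data)
    module.swiglu_limit.data.div_(module.fc31_alpha.data)
```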
One robustness suggestion:
`module.swiglu_beta` / `module.swiglu_limit` are accessed directly and only checked for `is not None`, which will raise `AttributeError` if the attributes are absent. To make this method safer for other modules that might reuse `NVFP4TRTLLMGenFusedMoEMethod`, you could use `getattr`:

```python
swiglu_beta = getattr(module, "swiglu_beta", None)
if swiglu_beta is not None:
    swiglu_beta.data.div_(module.fc31_alpha.data)
swiglu_limit = getattr(module, "swiglu_limit", None)
if swiglu_limit is not None:
    swiglu_limit.data.div_(module.fc31_alpha.data)
```

Please double-check that all current users of `NVFP4TRTLLMGenFusedMoEMethod` always define `swiglu_beta` and `swiglu_limit`; if not, the current code will raise at load time.

Also applies to: 2195-2202, 2246-2265
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (1)
464-471: `mSupportedTileN` WAR is global & persistent; consider avoiding in-place mutation.

The hidden-size-specific WAR:

```cpp
if (hiddenSize == 3072 && mSupportedTileN.back() == 256)
{
    mSupportedTileN.pop_back();
}
```

now mutates `mSupportedTileN` on the shared `FP4BlockScaleMoeRunner` instance. Once a 3072-hidden-size model is seen, tileN `256` is dropped for all subsequent calls, including other hidden sizes where `256` might be valid and beneficial.

A less surprising approach would be to keep `mSupportedTileN` immutable and apply the WAR to a local copy, e.g.:

```cpp
auto supportedTileN = mSupportedTileN;
if (hiddenSize == 3072)
{
    supportedTileN.erase(
        std::remove(supportedTileN.begin(), supportedTileN.end(), 256), supportedTileN.end());
}
// use supportedTileN for computeSelectedTileN and iteration
```

This preserves the workaround for the problematic hidden size without globally constraining other configurations.
Please confirm whether the intent is to disable tileN=256 only for GPT‑OSS 3072 or for all models once such a model is encountered in the process.
Also applies to: 536-537
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
503-509: Fake FP4 op has intentionally unused new arguments; consider silencing Ruff ARG001.

The `@fp4_block_scale_moe_runner.register_fake` implementation ignores the new GEMM1/GEMM2 parameters, which is fine for a fake kernel that only needs to synthesize shapes, but Ruff now flags them as unused.

If you want to keep lint clean without changing behavior, you can prefix these parameters with `_` in the fake function's signature, e.g.:

```python
def _(
    routing_logits,
    routing_bias,
    hidden_states,
    hidden_states_scale,
    gemm1_weights,
    gemm1_weights_scale,
    _gemm1_bias,
    _gemm1_alpha,
    _gemm1_beta,
    _gemm1_clamp_limit,
    gemm2_weights,
    gemm2_weights_scale,
    _gemm2_bias,
    ...
):
    ...
```

This keeps the positional API intact while satisfying the linter.
tensorrt_llm/_torch/models/modeling_gpt_oss.py (2)
806-811: Local function `deinterleave` is well-defined but could benefit from a brief docstring.

The function performs gate/up deinterleaving correctly. Consider adding a brief comment explaining the interleaved format being handled, for example:
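A hedged sketch of what such a docstring could look like; the exact signature and layout details belong to the PR's helper, not this comment:

```python
def deinterleave(weight):
    """Split gate/up values that the checkpoint stores interleaved along one dimension.

    Sketch only: the real helper defines the actual axis and chunking used for the
    interleaved gate/up layout.
    """
    ...
```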
785-882: Significant code duplication between `load_nvfp4_weights` and `load_hf_weights`.

The non-MoE weight loading logic (lines 861-882) is nearly identical to `load_hf_weights` (lines 762-783). Consider extracting the common logic into a helper method to reduce duplication and maintenance burden.

Example refactor approach:

```python
def _load_non_moe_weights(self, name, module, module_weights, weights):
    """Common weight loading logic for non-MoE modules."""
    if hasattr(module, "load_weights"):
        if 'qkv' in name:
            q_weight_bias = filter_weights(
                name.replace('qkv_proj', 'q_proj'), weights)
            k_weight_bias = filter_weights(
                name.replace('qkv_proj', 'k_proj'), weights)
            v_weight_bias = filter_weights(
                name.replace('qkv_proj', 'v_proj'), weights)
            module.load_weights(
                weights=[q_weight_bias, k_weight_bias, v_weight_bias])
        else:
            module.load_weights(weights=[module_weights])
    else:
        if 'next_layer_layernorm' in name:
            return
        for n, p in module._parameters.items():
            if p is not None:
                p.data.copy_(module_weights[n][:])
```

tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
554-555: Duplicate padding logic should use a helper function.

The same padding calculation appears in both `_quantize_for_post_quant_comm` (lines 288-290) and `forward_impl` (lines 554-555). Consider extracting it to a helper method for consistency:

```python
def _get_nvfp4_input_pad_size(self, x: torch.Tensor) -> int:
    return self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (8 hunks)
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (6 hunks)
- tensorrt_llm/_torch/models/modeling_gpt_oss.py (3 hunks)
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (7 hunks)
- tensorrt_llm/_torch/modules/fused_moe/quantization.py (23 hunks)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/unittest/_torch/modules/test_fused_moe.py (12 hunks)
- tests/unittest/_torch/thop/serial/test_moe.py (12 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/models/modeling_gpt_oss.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/models/modeling_gpt_oss.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
**/*.{cpp,h,cu}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{cpp,h,cu}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., `} // namespace foo`)
Prefer `const` or `constexpr` variables over `#define` whenever possible, as the latter are not visible to the compiler
A variable that is not modified after its initialization should be declared as `const`
Except `0` (only used in comparison for checking signness/existence/emptiness) and `nullptr`, `true`, `false`, all other literals should only be used for variable initialization and should be replaced with named constants
Use Allman indentation style for braces in C++
Put the semicolon for an empty `for` or `while` loop in a new line
The statement forming the body of a `switch`, `while`, `do .. while` or `for` statement shall be a compound statement (use brace-delimited statements)
`If` and `else` should always be followed by brace-delimited statements, even if empty or a single statement
C++ filenames should use camel case with first letter lowercase (e.g., `thisIsASubDir` and `thisIsAFilename.cpp`)
All filenames involved in compilation of a compilation target must have case-insensitive unique filenames
All types (including class names) should use camel case with uppercase first letter (e.g., `FooBarClass`)
Local variables, methods and namespaces should use camel case with first letter lowercase (e.g., `localFooBar`)
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., `gDontUseGlobalFoos`)
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., `sMutableStaticGlobal`)
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter of the name (e.g., `static std::once_flag sFlag;`)
Public, private and protected class member variables should use camel case prefixed with 'm' (e.g., `mNbFooValues`), though the 'm' pre...
Files:
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
🧠 Learnings (18)
📚 Learning: 2025-09-19T21:28:13.751Z
Learnt from: jhaotingc
Repo: NVIDIA/TensorRT-LLM PR: 7856
File: cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp:159-166
Timestamp: 2025-09-19T21:28:13.751Z
Learning: In TensorRT-LLM blockScaleMoe routing (cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu), the DeepSeek routing method performs reinterpret_cast<float*>(routingLogits) at line 89, which could cause issues if routing_logits are BF16. However, Qwen3-FP8 models use RenormalizeNaive routing method and are not affected by this dtype casting issue.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/quantization.py
- tests/unittest/_torch/modules/test_fused_moe.py
📚 Learning: 2025-08-19T03:35:20.866Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4616-4626
Timestamp: 2025-08-19T03:35:20.866Z
Learning: In the MOE profiler TMA workspace preparation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu), the overlapping of TMA WS regions for NONE and FINALIZE variants is deliberate design to save memory space, as confirmed by djns99. The comment "reuse the same pointers to save space" reflects this intentional behavior.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-10-20T17:07:18.745Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py:98-116
Timestamp: 2025-10-20T17:07:18.745Z
Learning: In NemotronH models (tensorrt_llm/_torch/auto_deploy/models/patches/nemotron_h.py), the gate (self.gate) returns topk_indices and topk_weights that are already in the correct shape to be passed directly to torch_ops.auto_deploy.torch_moe without needing to reshape them when hidden_states is flattened.
Applied to files:
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
📚 Learning: 2025-09-03T13:16:06.824Z
Learnt from: nvpohanh
Repo: NVIDIA/TensorRT-LLM PR: 7478
File: tensorrt_llm/_torch/models/modeling_llama.py:1315-1315
Timestamp: 2025-09-03T13:16:06.824Z
Learning: The Llama4VisionEncoder.load_weights method signature is `def load_weights(self, weights: Dict)` and should not be confused with Llama4ForConditionalGeneration.load_weights which has a different signature including weight_mapper parameter.
Applied to files:
tensorrt_llm/_torch/models/modeling_gpt_oss.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tests/unittest/_torch/thop/serial/test_moe.py
- tests/unittest/_torch/modules/test_fused_moe.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-09-23T15:13:48.819Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/multimem.h:20-30
Timestamp: 2025-09-23T15:13:48.819Z
Learning: TRT-LLM targets modern CUDA toolkits that support FP8 datatypes, so cuda_fp8.h can be included unconditionally without version guards in TRT-LLM code.
Applied to files:
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
📚 Learning: 2025-08-20T07:43:36.447Z
Learnt from: ChristinaZ
Repo: NVIDIA/TensorRT-LLM PR: 7068
File: cpp/tensorrt_llm/kernels/moeTopKFuncs.cuh:169-172
Timestamp: 2025-08-20T07:43:36.447Z
Learning: In TensorRT-LLM MOE kernels, when processing up to 128 experts across 32 threads, each thread handles at most 4 experts (N < 5 constraint), where N represents candidates per thread rather than total system capacity.
Applied to files:
- tests/unittest/_torch/thop/serial/test_moe.py
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-09-17T02:48:52.732Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7781
File: tests/integration/test_lists/waives.txt:313-313
Timestamp: 2025-09-17T02:48:52.732Z
Learning: In TensorRT-LLM, `tests/integration/test_lists/waives.txt` is specifically for waiving/skipping tests, while other test list files like those in `test-db/` and `qa/` directories are for different test execution contexts (pre-merge, post-merge, QA tests). The same test appearing in both waives.txt and execution list files is intentional - the test is part of test suites but will be skipped due to the waiver.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
📚 Learning: 2025-08-26T09:49:04.956Z
Learnt from: pengbowang-nv
Repo: NVIDIA/TensorRT-LLM PR: 7192
File: tests/integration/test_lists/test-db/l0_dgx_b200.yml:56-72
Timestamp: 2025-08-26T09:49:04.956Z
Learning: In TensorRT-LLM test configuration files, the test scheduling system handles wildcard matching with special rules that prevent duplicate test execution even when the same tests appear in multiple yaml files with overlapping GPU wildcards (e.g., "*b200*" and "*gb200*").
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/models/modeling_gpt_oss.py (4)
tensorrt_llm/models/modeling_utils.py (1)
- quant_algo (550-551)
tensorrt_llm/_torch/modules/fused_moe/interface.py (1)
- has_nvfp4 (589-592)
tensorrt_llm/_torch/modules/linear.py (1)
- has_nvfp4 (2008-2011)
tensorrt_llm/_torch/models/modeling_hunyuan_moe.py (1)
- filter_weights (347-353)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (3)
tensorrt_llm/_torch/modules/fused_moe/interface.py (5)
- has_nvfp4 (589-592)
- has_w4a16_mxfp4 (613-616)
- has_w4a8_mxfp4_fp8 (601-604)
- has_w4a8_mxfp4_mxfp8 (607-610)
- has_w4a8_nvfp4_fp8 (595-598)
tensorrt_llm/_torch/utils.py (1)
- shape (131-132)
tensorrt_llm/functional.py (1)
- pad (1327-1398)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (7)
tensorrt_llm/llmapi/llm_args.py (1)
- ep_size (318-322)
tensorrt_llm/_torch/models/modeling_llama.py (1)
- dtype (1098-1099)
tensorrt_llm/_utils.py (2)
- dtype (985-986)
- dtype (993-1003)
tensorrt_llm/llmapi/llm.py (1)
- LLM (1104-1120)
tensorrt_llm/_torch/auto_deploy/llm.py (1)
- LLM (103-152)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
- max_seq_len (636-646)
- max_seq_len (649-653)
tensorrt_llm/evaluate/lm_eval.py (2)
- GSM8K (455-510)
- evaluate (395-430)
🪛 Ruff (0.14.5)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
503-503: Unused function argument: gemm1_bias
(ARG001)
504-504: Unused function argument: gemm1_alpha
(ARG001)
505-505: Unused function argument: gemm1_beta
(ARG001)
506-506: Unused function argument: gemm1_clamp_limit
(ARG001)
507-507: Unused function argument: gemm2_weights
(ARG001)
508-508: Unused function argument: gemm2_weights_scale
(ARG001)
509-509: Unused function argument: gemm2_bias
(ARG001)
tests/unittest/_torch/thop/serial/test_moe.py
1271-1271: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1272-1272: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1273-1273: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (26)
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
264-270: Reference entry for NVFP4 is properly structured and consistent with the file.

The new GPT-OSS/20B-NVFP4 entry follows the established pattern in the file, with appropriate configurations for baseline, NVFP4 quantization, and combined NVFP4 + FP8 KV cache quantization. YAML syntax is correct.
Verify that the accuracy reference values (85.0 for all configurations) match the actual integration test results for the NVFP4 implementation. These reference values will be used by the test framework to validate the new NVFP4 quantization pathway.
tests/unittest/_torch/thop/serial/test_moe.py (1)
550-566: FP4 reference now correctly sees bias and SwiGLU parameters

Forwarding `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` into `moe_args_dequant` ensures the FP4 reference path exercises the same epilogue (bias + SwiGLU gating) as the kernel. The wiring and argument ordering look consistent with the class definition and `run_moe_dequant` usage.

tensorrt_llm/_torch/modules/fused_moe/quantization.py (3)
220-225: Bias shape defaults correctly follow (possibly padded) weight shapes.

Using `w3_w1_weight_shape[:2]` and `w2_weight_shape[:2]` when explicit bias shapes are not provided is the right choice now that the weight dimensions can be padded; it keeps biases compatible with the actual allocated tensors.
2004-2024: NVFP4 TRTLLM-Gen weight/bias padding before sharding is aligned with scaling-factor logic.

Padding `w1`/`w3` (weights and biases) and `w2` (weights and biases) with `maybe_pad_for_weights` using `_get_weight_alignment` before `load_weight_shard` is consistent with the MXFP4 path and avoids fractional scaling vectors per shard. The 1D-bias branches (casting to `float32` and dividing the `w2` bias by `tp_size`) also match the intended row-parallel semantics.

Also applies to: 2068-2084
2543-2556: Consolidated MXFP4 padding via `maybe_pad_for_weights` is clean and behavior-preserving.

Refactoring the MXFP4 Cutlass/TRTLLM-Gen load paths to use `maybe_pad_for_weights` with `_get_weight_alignment` for both weights and scales removes duplicated padding logic and keeps the MXFP4 alignment rules in sync with the NVFP4 implementation. The chosen column/row alignments (accounting for packed MXFP4 elements and TP sharding) look consistent with the comments and previous behavior.

Also applies to: 2584-2590, 2610-2615, 2653-2655, 2835-2837, 2889-2896, 2901-2902, 2948-2950, 2955-2957, 2996-3003, 3067-3069
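For readers unfamiliar with the helper, a minimal reimplementation sketch of the padding behaviour this review describes (an assumption-laden stand-in, not the PR's `maybe_pad_for_weights`; the real helper also handles 1D tensors and takes its alignments from `_get_weight_alignment`):

```python
import torch
import torch.nn.functional as F


def pad_like_maybe_pad_for_weights(weight: torch.Tensor, col_alignment: int,
                                   row_alignment: int = 1) -> torch.Tensor:
    """Zero-pad the last dim (and optionally the second-to-last) up to the given alignments."""
    pad_cols = (-weight.shape[-1]) % col_alignment
    pad_rows = (-weight.shape[-2]) % row_alignment
    # F.pad pads from the last dimension backwards: (left, right, top, bottom).
    return F.pad(weight, (0, pad_cols, 0, pad_rows))
```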
cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp (2)
32-45: New GEMM1/GEMM2 optional parameters are validated and wired correctly.

The extended runner signature and arg wiring for `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias` look consistent:

- Dtype/shape checks match kernel expectations: `[local_num_experts, 2 * intermediate_size]` for `gemm1_bias`, `[local_num_experts]` for the `gemm1_*` vectors, and `[local_num_experts, hidden_size]` for `gemm2_bias`.
- Pointers are stored as `float*` and defaulted to `nullptr` when the optionals are empty, matching the existing pattern for other args.

No functional issues spotted here.
Also applies to: 162-171, 321-351, 362-369
492-505: Python–C++ runner interop for new FP4 parameters is consistent.

The updated `FP4BlockScaleMoeRunner::run` and `FP8FP4BlockScaleMoeRunner::run` signatures, and their calls into `run_fp4_block_scale_moe_runner`, correctly pass the new optional GEMM arguments (or `std::nullopt` where not applicable) in a way that matches the Python bindings.

Also applies to: 525-530, 611-616
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
151-167: End-to-end wiring of new FP4 GEMM bias/alpha/beta/clamp_limit params is consistent.

The FP4 path now cleanly threads `gemm1_bias`, `gemm1_alpha`, `gemm1_beta`, `gemm1_clamp_limit`, and `gemm2_bias`:

- Dataclass `FP4BlockScaleMoEInputs` matches the positional order used in `FP4BlockScaleMoERunner.forward`.
- Both `input_tensors_for_tuner` and the final `input_tensors` lists keep the same ordering, so autotuning and execution see identical argument layouts.
- The updated Python custom op signature aligns with the C++ runner's `run` method.

This looks correct and keeps the Python/C++ interfaces in sync.
Also applies to: 235-247, 355-384, 415-434, 445-464
tensorrt_llm/_torch/models/modeling_gpt_oss.py (5)
612-619: Exclusion modules list looks correct for NVFP4.

The modules excluded from quantization (QKV, attention output, MLP gate, embedding, unembedding) align with typical NVFP4 configurations where attention and embedding layers remain in higher precision.

640-645: NVFP4 loading path correctly prioritized over HF loading.

The branching logic correctly routes to `load_nvfp4_weights` when `is_nvfp4` is true and the model is not an original model format.

813-817: FP32 bias conversion is appropriate for NVFP4 MoE.

The explicit conversion of biases to FP32 aligns with the comment and the NVFP4 MoE requirements.

850-858: Global scale parameters correctly propagated for NVFP4.

The module-level globals for NVFP4 loaders (weight_scale_2, input_scale) are correctly copied from module_weights to moe_weights.

798-800: Assertion may fail unexpectedly if quant_config is None.

The assertion at line 799-800 accesses `module.quant_config.quant_mode.has_nvfp4()` without first checking if `module.quant_config` is not None. If the module doesn't have a quant_config, this will raise an `AttributeError` rather than the intended `AssertionError`.

Consider guarding the assertion:

```diff
     if isinstance(module, MoE):
-        assert getattr(module, "quant_config", None) is not None and \
-            module.quant_config.quant_mode.has_nvfp4()
+        assert getattr(module, "quant_config", None) is not None, \
+            "MoE module must have quant_config for NVFP4 loading"
+        assert module.quant_config.quant_mode.has_nvfp4(), \
+            "MoE module quant_config must have NVFP4 mode"
```

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (5)
210-211: Bias/swiglu assertion correctly extended for NVFP4.

The assertion now properly includes `self.has_nvfp4` alongside other MXFP4 variants that support bias/swiglu parameters.

245-257: Zero bias initialization for FP4 variants without explicit bias is appropriate.

When bias is not provided but the quantization method requires bias parameters, creating zero-filled FP32 bias tensors ensures the kernel interface is satisfied without affecting computation.
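A small illustrative sketch of the described fallback; the expert count and sizes below are made up, and the real shapes come from the module configuration:

```python
import torch

num_experts, intermediate_size, hidden_size = 8, 2880, 2880  # illustrative values only
w3_w1_bias = torch.zeros(num_experts, 2 * intermediate_size, dtype=torch.float32)
w2_bias = torch.zeros(num_experts, hidden_size, dtype=torch.float32)
```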
288-295: Padding calculation and application for NVFP4 quantization looks correct.

The padding ensures input dimensions align with the expected weight dimensions before quantization. The formula `self.w3_w1_weight.shape[-1] * 2 - x.shape[-1]` correctly calculates the required padding based on the packed weight shape (see the arithmetic sketch below).
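Illustrative arithmetic only; the hidden size and padded width below are assumptions, not values from the PR. Since two FP4 values are packed per byte, the weight's last dimension is half of the padded width, so the formula recovers how many input columns need padding:

```python
hidden_size = 2880          # assumed unpadded activation width
packed_last_dim = 1472      # hypothetical padded width / 2 (two FP4 values per byte)
pad = packed_last_dim * 2 - hidden_size
print(pad)                  # -> 64 columns of zero padding before quantization
```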
566-590: New parameters correctly passed to fp4_block_scale_moe_runner.

The addition of `w3_w1_bias`, `swiglu_alpha`/`beta`/`limit`, `w2_bias`, and `intermediate_size_per_partition_padded` aligns with the extended NVFP4 kernel interface.

606-609: Output clipping ensures consistent dimensionality.

The clipping of `final_hidden_states` to `self.hidden_size` correctly handles cases where padding was applied during computation, ensuring the output matches the expected model hidden size.
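A tiny self-contained illustration of the clipping described above, with made-up sizes:

```python
import torch

hidden_size = 2880                          # unpadded model dimension (illustrative)
final_hidden_states = torch.zeros(4, 2944)  # hypothetical padded kernel output
final_hidden_states = final_hidden_states[..., :hidden_size]  # drop padded columns
assert final_hidden_states.shape[-1] == hidden_size
```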
tests/unittest/_torch/modules/test_fused_moe.py (8)

14-14: Import added appropriately for F.silu usage.

The `torch.nn.functional as F` import supports the new activation logic in `RefGatedMLPFusedMoE`.

1348-1358: Function signature refactored with sensible defaults.

The new parameters for `run_fused_moe_nvfp4` provide flexibility for testing different configurations while maintaining backward compatibility through default values.

1407-1419: Bias initialization for GPT-OSS style correctly added.

The FP32 bias tensors are created with appropriate shapes when `gptoss_style=True`, matching the NVFP4 MoE requirements.

1425-1438: Weight quantization updated to use unswizzled scales.

The change from `fp4_quantize` with swizzling to explicitly unswizzled scales (using `False` for is_scale_factor_swizzled) aligns with the NVFP4 loader expectations.
1523-1539: Tolerance values appropriately relaxed for GPT-OSS style.

The 95% match threshold with 0.1 rtol/atol is reasonable for NVFP4 quantized models with custom activation functions, given the accumulation of quantization errors.

1574-1593: Test parameterization covers key GPT-OSS configurations.

The test covers:
- Two hidden/intermediate sizes (512, 2880)
- Two swiglu_alpha values (1, 0.1)
- Two swiglu_beta values (0, 1)
- Two swiglu_limit values (inf, 1)
This provides good coverage of the activation function behavior.
2728-2730: Conditional activation selection is clean and readable.

Using the ternary to select between `custom_swiglu` and `F.silu` based on `swiglu_alpha` presence is clear and maintainable.
2707-2718: I'll help you verify whether the `custom_swiglu` implementation matches the production kernels. Let me search for the production implementation to compare the clamping order.

```bash
#!/bin/bash
# Search for swiglu implementations in C++ kernels
# Looking for swiglu and clamping patterns in production code
find . -type f \( -name "*.cpp" -o -name "*.h" -o -name "*.cu" \) -path "*/kernels/*" | head -20
```

```bash
#!/bin/bash
# Search for swiglu references in the codebase
rg -n "swiglu" --type cpp --type c -i -C 5 2>/dev/null | head -100
```
c4e53d3 to
7de4bdd
Compare
|
/bot run |
|
PR_Github #30374 [ run ] triggered by Bot. Commit: |
|
PR_Github #30374 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30379 [ run ] triggered by Bot. Commit: |
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30380 [ run ] triggered by Bot. Commit: |
|
PR_Github #30380 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30381 [ run ] triggered by Bot. Commit: |
|
PR_Github #30381 [ run ] completed with state
|
|
/bot run --detailed-log --stage-list "B300-PyTorch-1" |
|
PR_Github #30384 [ run ] triggered by Bot. Commit: |
|
PR_Github #30384 [ run ] completed with state |
08d1762 to
1045e6f
Compare
|
/bot run |
|
PR_Github #30389 [ run ] triggered by Bot. Commit: |
|
PR_Github #30389 [ run ] completed with state
|
Signed-off-by: Dongfeng Yu <[email protected]>
Signed-off-by: Dongfeng Yu <[email protected]>
1045e6f to
4082577
Compare
|
/bot run |
|
PR_Github #30417 [ run ] triggered by Bot. Commit: |
Summary by CodeRabbit
Release Notes
New Features
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see
`docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.