[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ #421
base: main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##             main     #421    +/-  ##
==========================================
+ Coverage   73.38%   73.44%   +0.06%
==========================================
  Files         180      180
  Lines       17934    18147     +213
==========================================
+ Hits        13160    13328     +168
- Misses       4774     4819      +45
Signed-off-by: weimingc <[email protected]>
    kv_head_dim = linear_fuse_into.weight.shape[0] // num_kv_heads
    n_rep = pre_quant_scale.numel() // num_kv_heads // kv_head_dim

    # Reshape: (num_kv_heads, n_rep, kv_head_dim)

Review comment: what's n_rep here?
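In grouped-query attention, `n_rep` is the number of query heads that share each KV head, so the per-channel scale can be grouped per KV head. A minimal sketch of that reshape with made-up shapes (the names and the mean-reduction are illustrative assumptions, not necessarily what the PR does):

```python
import torch

# Hypothetical GQA shapes: 8 query heads sharing 2 KV heads -> n_rep = 4.
num_kv_heads, kv_head_dim = 2, 16
pre_quant_scale = torch.rand(8 * kv_head_dim)  # one scale per query-head channel

n_rep = pre_quant_scale.numel() // num_kv_heads // kv_head_dim  # 128 // 2 // 16 = 4

# Reshape to (num_kv_heads, n_rep, kv_head_dim) and average over the n_rep
# query heads that share each KV head, yielding a scale whose length matches
# the KV projection's output dimension.
fused = pre_quant_scale.view(num_kv_heads, n_rep, kv_head_dim).mean(dim=1).reshape(-1)
print(fused.shape)  # torch.Size([32])
```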
    old_pre_quant_scale = module.input_quantizer._pre_quant_scale
    module.weight = nn.Parameter(
        module.weight
        * old_pre_quant_scale.to(

Review comment: do we want to cast to fp32 for this manipulation?
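The fp32 round-trip the reviewer is suggesting would look roughly like this (a sketch with stand-in tensors, not the PR's actual code): multiply in float32 so low-precision rounding happens only once, on the final cast back.

```python
import torch
import torch.nn as nn

# Stand-ins for a bf16 weight and its per-input-channel pre_quant_scale.
weight = torch.randn(8, 4, dtype=torch.bfloat16)
old_pre_quant_scale = torch.rand(4, dtype=torch.bfloat16)

# Do the multiply in fp32, then cast back to the original weight dtype,
# so bf16 rounding error is incurred only once.
fused = (weight.to(torch.float32) * old_pre_quant_scale.to(torch.float32)).to(weight.dtype)
new_weight = nn.Parameter(fused)
```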
@@ -0,0 +1,193 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
        .reshape(-1)
    )


    def _update_pre_quant_scale(module, new_pre_quant_scale):

Review comment: can we merge duplicated code with line 1090?
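A shared helper of the kind the reviewer is asking for might be factored out like this. Everything here is hypothetical: the quantizer attribute access and the old/new ratio fold are assumptions about the surrounding code, shown only to illustrate the deduplication.

```python
import torch
import torch.nn as nn

def _update_pre_quant_scale(module, new_pre_quant_scale):
    """Hypothetical shared helper: swap in a new pre_quant_scale and fold the
    old/new ratio into the weight so the module's output is unchanged:
    (x * new) @ (W * old / new).T == (x * old) @ W.T
    """
    old = module.input_quantizer._pre_quant_scale
    ratio = old.to(torch.float32) / new_pre_quant_scale.to(torch.float32)
    module.weight = nn.Parameter(
        (module.weight.to(torch.float32) * ratio).to(module.weight.dtype)
    )
    module.input_quantizer._pre_quant_scale = new_pre_quant_scale

# Minimal stand-in quantizer to exercise the helper.
class _Quantizer:
    def __init__(self, scale):
        self._pre_quant_scale = scale

lin = nn.Linear(4, 8, bias=False)
lin.input_quantizer = _Quantizer(torch.full((4,), 2.0))
w_before = lin.weight.detach().clone()
_update_pre_quant_scale(lin, torch.ones(4))
```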
What does this PR do?
Type of change: ?
Overview:
This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses the pre_quant_scale in the following two cases:
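One such fusion, folding the pre_quant_scale into the preceding linear layer's output channels so that no explicit activation multiply remains at inference time, can be sketched as follows (a minimal illustration with made-up layer sizes, not the PR's actual implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
up = nn.Linear(4, 6, bias=False)    # preceding layer
down = nn.Linear(6, 3, bias=False)  # quantized layer with a pre_quant_scale
pre_quant_scale = torch.rand(6) + 0.5

x = torch.randn(2, 4)
out_before = down(up(x) * pre_quant_scale)  # explicit activation scaling

# Fuse: scale the preceding layer's output channels (rows of its weight)
# so the explicit multiply on the activation disappears.
up.weight = nn.Parameter(up.weight * pre_quant_scale.unsqueeze(1))
out_after = down(up(x))
```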
Usage
# Add a code snippet demonstrating how to use this

Testing
Unit tests and an e2e test for Qwen3 dense and MoE models.
Before your PR is "Ready for review"
Additional Information