[OMNIML-2932] Fusing pre_quant_scale for NVFP4 AWQ #421
base: main
Conversation
Codecov Report

✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##             main     #421    +/-  ##
==========================================
+ Coverage   73.38%   73.44%   +0.06%
==========================================
  Files         180      180
  Lines       17934    18147     +213
==========================================
+ Hits        13160    13328     +168
- Misses       4774     4819      +45
Signed-off-by: weimingc <[email protected]>
    kv_head_dim = linear_fuse_into.weight.shape[0] // num_kv_heads
    n_rep = pre_quant_scale.numel() // num_kv_heads // kv_head_dim

    # Reshape: (num_kv_heads, n_rep, kv_head_dim)

Review comment: what's n_rep here?
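In grouped-query attention, `n_rep` is the number of query heads that share each KV head, so the per-channel scale can be grouped per KV head. A minimal sketch of that reshape with made-up shapes (the names and the mean-reduction are illustrative assumptions, not necessarily what the PR does):

```python
import torch

# Hypothetical GQA shapes: 8 query heads sharing 2 KV heads -> n_rep = 4.
num_kv_heads, kv_head_dim = 2, 16
pre_quant_scale = torch.rand(8 * kv_head_dim)  # one scale per query-head channel

n_rep = pre_quant_scale.numel() // num_kv_heads // kv_head_dim  # 128 // 2 // 16 = 4

# Reshape to (num_kv_heads, n_rep, kv_head_dim) and average over the n_rep
# query heads that share each KV head, yielding a scale whose length matches
# the KV projection's output dimension.
fused = pre_quant_scale.view(num_kv_heads, n_rep, kv_head_dim).mean(dim=1).reshape(-1)
print(fused.shape)  # torch.Size([32])
```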
    old_pre_quant_scale = module.input_quantizer._pre_quant_scale
    module.weight = nn.Parameter(
        module.weight
        * old_pre_quant_scale.to(

Review comment: do we want to cast to fp32 for this manipulation?
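The fp32 round-trip the reviewer is suggesting would look roughly like this (a sketch with stand-in tensors, not the PR's actual code): multiply in float32 so low-precision rounding happens only once, on the final cast back.

```python
import torch
import torch.nn as nn

# Stand-ins for a bf16 weight and its per-input-channel pre_quant_scale.
weight = torch.randn(8, 4, dtype=torch.bfloat16)
old_pre_quant_scale = torch.rand(4, dtype=torch.bfloat16)

# Do the multiply in fp32, then cast back to the original weight dtype,
# so bf16 rounding error is incurred only once.
fused = (weight.to(torch.float32) * old_pre_quant_scale.to(torch.float32)).to(weight.dtype)
new_weight = nn.Parameter(fused)
```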
@@ -0,0 +1,193 @@
# SPDX-FileCopyrightText: Copyright (c) 2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
        .reshape(-1)
    )


    def _update_pre_quant_scale(module, new_pre_quant_scale):

Review comment: can we merge duplicated code with line 1090?
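A shared helper of the kind the reviewer is asking for might be factored out like this. Everything here is hypothetical: the quantizer attribute access and the old/new ratio fold are assumptions about the surrounding code, shown only to illustrate the deduplication.

```python
import torch
import torch.nn as nn

def _update_pre_quant_scale(module, new_pre_quant_scale):
    """Hypothetical shared helper: swap in a new pre_quant_scale and fold the
    old/new ratio into the weight so the module's output is unchanged:
    (x * new) @ (W * old / new).T == (x * old) @ W.T
    """
    old = module.input_quantizer._pre_quant_scale
    ratio = old.to(torch.float32) / new_pre_quant_scale.to(torch.float32)
    module.weight = nn.Parameter(
        (module.weight.to(torch.float32) * ratio).to(module.weight.dtype)
    )
    module.input_quantizer._pre_quant_scale = new_pre_quant_scale

# Minimal stand-in quantizer to exercise the helper.
class _Quantizer:
    def __init__(self, scale):
        self._pre_quant_scale = scale

lin = nn.Linear(4, 8, bias=False)
lin.input_quantizer = _Quantizer(torch.full((4,), 2.0))
w_before = lin.weight.detach().clone()
_update_pre_quant_scale(lin, torch.ones(4))
```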
What does this PR do?
Type of change: ?
Overview:
This PR and NVIDIA/TensorRT-LLM#8698 enable NVFP4 AWQ deployment for TRT-LLM. Specifically, this PR fuses the pre_quant_scale in the following two cases:
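One such fusion, folding the pre_quant_scale into the preceding linear layer's output channels so that no explicit activation multiply remains at inference time, can be sketched as follows (a minimal illustration with made-up layer sizes, not the PR's actual implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
up = nn.Linear(4, 6, bias=False)    # preceding layer
down = nn.Linear(6, 3, bias=False)  # quantized layer with a pre_quant_scale
pre_quant_scale = torch.rand(6) + 0.5

x = torch.randn(2, 4)
out_before = down(up(x) * pre_quant_scale)  # explicit activation scaling

# Fuse: scale the preceding layer's output channels (rows of its weight)
# so the explicit multiply on the activation disappears.
up.weight = nn.Parameter(up.weight * pre_quant_scale.unsqueeze(1))
out_after = down(up(x))
```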
Usage
# Add a code snippet demonstrating how to use this

Testing
Unit tests and an e2e test for Qwen3 dense and MoE models.
Before your PR is "Ready for review"
Additional Information