Feature/calibrate weights dfs fused modules #2394
GOavi101 wants to merge 5 commits into vllm-project:main
Conversation
- Add calibrate_weights(model) with a single DFS traversal: replaces three separate loops (update_weight_global_scale, update_fused_layer_weight_global_scales, update_weight_zp_scale) for better cache locality and fewer CPU->GPU onloads.
- API: named_modules or (targets, ignore), update_zp_scale, optional progress; DDP-friendly via a named_modules subset (vllm-project#2220).
- Use calibrate_weights in QuantizationModifier, AWQModifier, and GPTQModifier.
- Add fused_modules.py as the central source of truth for vLLM-aligned fused layouts: attention (traditional q/k/v, fused qkv, MLA q_a + kv_a), MLP (gate/up, fused gate_up).
- Add MLA (Multi-head Latent Attention) fused global-scale support: q_a_proj (or q_proj) + kv_a_proj_with_mqa.
- Refactor update_fused_layer_weight_global_scales to use get_fused_*_linears().

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
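The single-traversal idea in the commit message can be sketched as a stack-based DFS with a pre-order hook (global scale for target modules) and a post-order hook (fused scales / zp-scale once children are done). The function and the toy tree below are illustrative, not the PR's exact API:

```python
def calibrate_weights_dfs(root, children, pre_order, post_order, targets):
    """One pass over the module tree: pre-order work on the way down,
    post-order work once a node's children have all been processed."""
    stack = [(root, False)]
    while stack:
        node, children_done = stack.pop()
        if not children_done:
            if node in targets:
                pre_order(node)          # e.g. update_weight_global_scale
            stack.append((node, True))   # revisit after the children
            for child in reversed(children(node)):
                stack.append((child, False))
        else:
            post_order(node)             # e.g. fused global scales, zp/scale

# toy tree standing in for a module hierarchy
tree = {"a": ["b", "c"], "b": [], "c": []}
pre, post = [], []
calibrate_weights_dfs(
    "a", lambda n: tree[n],
    pre_order=pre.append, post_order=post.append,
    targets={"a", "b", "c"},
)
```

A single traversal like this touches each module once, which is what enables the cache-locality and onload savings the commit message claims.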
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes

This pull request refactors the weight calibration mechanism by consolidating separate calibration loops into a single, efficient depth-first-search traversal. The change is designed to improve performance by optimizing cache usage and minimizing CPU-to-GPU data movement during offloading. It also establishes a standardized framework for identifying and managing fused vLLM modules, ensuring uniform handling of global scales across diverse attention and MLP architectures, including the newly integrated Multi-head Latent Attention.
Activity
Code Review
The pull request introduces a significant improvement by consolidating the weight calibration process into a single DFS traversal function, calibrate_weights. This change enhances cache locality and reduces CPU-GPU overhead, especially beneficial for offloading scenarios. The introduction of fused_modules.py centralizes the definitions for various fused attention and MLP modules, which is a good step towards better maintainability and consistency. The modifications to AWQModifier, GPTQModifier, and QuantizationModifier to utilize this new unified calibration function are well-implemented. The handling of MLA fused global-scale support is also a valuable addition. Overall, the changes are well-structured and improve the efficiency and clarity of the calibration workflow.
Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
…/github.com/GOavi101/llm-compressor into feature/calibrate-weights-dfs-fused-modules
kylesayrs left a comment
I think this looks nice; I especially like the fused module utilities. However, there are still a few things to consider.
- Weights are still onloaded twice: once to calculate update_weight_global_scale and a second time to calculate update_weight_zp_scale. Ideally, we'd be able to onload the weight once and see significant performance benefits for NVFP4.

      weight = module.weight  # onload once
      weight_global_scale = calculate_gparam(weight)
      weight_scale, weight_zero_point = calculate_qparams(weight)

- It's still unclear how this algorithm could be used to perform parallel weight calibration. Answering this question is definitely tricky, and we can leave it aside for now, but it is something to consider.
    try:
        import tqdm
    except ImportError:
        tqdm = None


    if show_progress and desc is not None and tqdm is not None and total_targets > 0:
        pbar = tqdm.tqdm(total=total_targets, desc=desc)
    else:
        pbar = None
Assume that tqdm is available
    else:
        named_modules = list(named_modules)

    target_set = {id(m) for _, m in named_modules}
A set of modules is equivalent to a set of ids of modules (more or less). You don't need the id call.
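The reviewer's point rests on the fact that Python objects hash by identity unless they override `__hash__` (torch.nn.Module does not), so a set of modules and a set of their ids give the same membership answers. A minimal torch-free illustration:

```python
class Module:
    """Stand-in for torch.nn.Module, which also hashes by object identity."""

a, b = Module(), Module()

id_set = {id(m) for m in (a, b)}   # what the quoted code builds
obj_set = {m for m in (a, b)}      # equivalent, without the id() call

# membership agrees for every module, so the id() indirection is redundant
same_membership = all((id(m) in id_set) == (m in obj_set) for m in (a, b))
```

The object set is also safer: an id can be reused after an object is garbage collected, while a set of modules keeps its members alive.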
    # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)
    if id(module) in target_set:
        update_weight_global_scale(module)
    stack.append((module, True))
Please test this traversal with models which have shared modules. You may have to memoize modules to avoid traversing twice.
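The shared-module concern, sketched: models with tied submodules expose the same object under multiple parents, so a naive traversal calibrates it twice. Memoizing by identity (as the reviewer suggests) visits each module once. This is a hypothetical sketch, not the PR's code:

```python
class Node:
    """Toy module with children; a real model would be an nn.Module tree."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

shared = Node("shared")                       # tied submodule, two parents
root = Node("root", [Node("a", [shared]), Node("b", [shared])])

visited = []
def traverse_once(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:                      # memoize: skip shared modules
        return
    seen.add(id(node))
    visited.append(node.name)
    for child in node.children:
        traverse_once(child, seen)

traverse_once(root)
```

Without the `seen` check, "shared" would be calibrated twice, doubling the onload cost for tied weights and risking double-applied scales.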
        "calibrate_weights requires either named_modules or both "
        "targets and ignore"
    )
    from compressed_tensors.utils import match_named_modules
Please put imports at the top
    if targets is None or ignore is None:
        raise ValueError(
            "calibrate_weights requires either named_modules or both "
            "targets and ignore"
        )
Can you avoid this check by using default tuple values?
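One way to read the suggestion: default `targets` and `ignore` to empty tuples, so "no filter given" is a valid no-match instead of an error path. The signature and toy model below are hypothetical, not the PR's final API:

```python
def calibrate_weights(model, named_modules=None, targets=(), ignore=()):
    """Empty-tuple defaults remove the explicit None/ValueError check:
    with no targets, the filter simply matches nothing."""
    if named_modules is None:
        named_modules = [
            (name, m) for name, m in model
            if name in targets and name not in ignore
        ]
    return list(named_modules)

# toy "model": (name, module) pairs standing in for model.named_modules()
model = [("q_proj", "modA"), ("k_proj", "modB"), ("lm_head", "modC")]
selected = calibrate_weights(
    model, targets={"q_proj", "k_proj", "lm_head"}, ignore={"lm_head"}
)
```

Whether a silent empty result is preferable to a loud ValueError is a design choice; the tuple defaults at least make the "nothing specified" case well-defined.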
    module, children_done = stack.pop()

    if not children_done:
        # Pre-order: global scale for target modules (FP4 / TENSOR_GROUP)
Really great job with defining pre-order and post-order functions.
    call_observer(module=module, base_name="weight")


def calibrate_weights(
I think the pre-order and post-order framing is very elegant, but it may get in the way of sharing weight offloading between both calculate_gparam and calculate_qparam.
The pre/post structure was getting in the way of a single onload. I've added _update_weight_calibration_once(module, update_zp_scale), which onloads module.weight once and calls call_observer(..., value=value, should_calculate_gparam=..., should_calculate_qparams=...) so both calculations use the same tensor. The sequential DFS now uses this in pre-order for target modules and no longer calls update_weight_zp_scale in post-order for them, so we get one onload per module for NVFP4.
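The single-onload idea can be sketched as follows. The function names and the max/min stand-ins are illustrative; the real code routes both calculations through call_observer on the same tensor:

```python
onloads = {"count": 0}

def onload_weight(module):
    """Stand-in for moving an offloaded weight onto the device."""
    onloads["count"] += 1
    return module["weight"]

def update_weight_calibration_once(module, update_zp_scale=True):
    # One onload feeds both the global-scale and the qparam calculations
    weight = onload_weight(module)
    gparam = max(abs(w) for w in weight)              # stand-in for calculate_gparam
    qparams = (min(weight), max(weight)) if update_zp_scale else None
    return gparam, qparams

module = {"weight": [0.5, -1.0, 2.0]}
gparam, qparams = update_weight_calibration_once(module)
```

The point of the counter: under the old structure the same weight would be onloaded once per calculation; here both results come from a single transfer.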
Force-pushed fae990b to 6e7a4f5
…, add parallel path

- Move fused_modules.py to modeling/ and update imports
- In update_fused_layer_weight_global_scales, rescale weight_scale s' = s*g'/g when applying the fused global scale so q is unchanged
- Add calibrate_weights(..., parallel=True, max_workers=N) for two-phase parallel weight calibration

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
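The rescaling step in the commit message can be checked numerically. Under the convention the commit message implies (effective dequantization scale s/g, so quantized values stay fixed when s/g does), setting s' = s·g'/g leaves that ratio invariant when the fused global scale g' replaces a module's own g. A hedged numeric sketch:

```python
def apply_fused_global_scale(s, g, g_prime):
    """s' = s * g' / g keeps the effective scale s/g unchanged when the
    global scale moves from g to the fused g' (per the commit message)."""
    return s * g_prime / g

s, g, g_prime = 0.25, 8.0, 4.0
s_prime = apply_fused_global_scale(s, g, g_prime)
```

This is why fusing global scales across q/k/v (or gate/up) linears does not change their quantized weights: only the factorization between per-tensor scale and global scale moves.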
Force-pushed 6e7a4f5 to b3d2de9
This pull request has merge conflicts that must be resolved before it can be merged.
ok, taking a look
    def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
        """Set weight_global_scale to g'; rescale weight_scale so q = x/(s*g) unchanged."""
        old_g = lin.weight_global_scale.data
        update_parameter_data(lin, g_prime, "weight_global_scale")
this way is deprecated
Suggested change:
- update_parameter_data(lin, g_prime, "weight_global_scale")
+ update_offload_parameter(lin, "weight_global_scale", g_prime)
you have it correct below
    update_fused_layer_weight_global_scales(module)
    if update_zp_scale and module in target_set and module not in seen_post:
        seen_post.add(module)
        if pbar is not None:
I think this progress bar update should be unindented, or else it only updates when zp_scale is hit.
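The fix the reviewer is pointing at, sketched with stand-in names: the pbar tick belongs at the post-order level so it fires for every module, not only when the zp/scale branch runs:

```python
def postorder_step(module, target_set, update_zp_scale, seen_post, events, pbar):
    events.append(("fused", module))              # always runs
    if update_zp_scale and module in target_set and module not in seen_post:
        seen_post.add(module)
        events.append(("zp_scale", module))
    if pbar is not None:                          # unindented: ticks per module,
        pbar.append(module)                       # not only when zp_scale runs

pbar, events, seen = [], [], set()
for m in ("q_proj", "k_proj"):
    postorder_step(m, {"q_proj", "k_proj"}, update_zp_scale=False,
                   seen_post=seen, events=events, pbar=pbar)
```

With the update nested under the zp_scale branch, this run (update_zp_scale=False) would leave the bar frozen at zero even though both modules were processed.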
    item_weight_fn=lambda m: m.weight.numel(); see GPTQ DDP #2333 which uses
    hessian shape for the same idea).

    Benchmark: See tests/benchmark_calibrate_weights.py for onload count and
    update_offload_parameter(submodule.up_proj, "weight_global_scale", global_scale)


    del global_scale


def _apply_fused_global_scale(lin: Linear, g_prime: torch.Tensor) -> None:
Probably needs unit tests, for this and the other new functions that are going to be in the hot path.
    (order-independent, parallelizable). Phase 2 applies fused global scales
    and rescales the per-tensor scale s' = s * (g' / g).

    DDP: Works with distributed setups. Pass named_modules as this rank's
I'm not seeing any of this functionality
Summary

- Unified weight calibration (calibrate_weights(model)) for better cache locality and fewer CPU→GPU onloads when using offloading.
- Central fused-module definitions (fused_modules.py): traditional attention (q/k/v), fused QKV, MLA (q_a + kv_a_proj_with_mqa), and MLP (gate/up, fused gate_up).

Changes

- calibrate_weights(model, named_modules=..., update_zp_scale=..., desc=..., show_progress=...) in calibration.py: single stack-based DFS; pre-order update_weight_global_scale for targets, post-order update_fused_layer_weight_global_scales and update_weight_zp_scale for targets. The API supports DDP via a named_modules subset ([Performance Refactor] Extend modifiers to support weight-parallel optimization - QuantizationModifier #2220).
- fused_modules.py: get_fused_attention_linears(), get_fused_mlp_linears(), plus is_fused_attention_module / is_fused_mlp_module. Defines traditional, fused QKV, MLA, and gate/up layouts.
- helpers.py: update_fused_layer_weight_global_scales() refactored to use get_fused_*_linears(); MLA handled in the same path as traditional attention.
- QuantizationModifier, AWQModifier, and GPTQModifier now call calibrate_weights() instead of the three loops.
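A hedged sketch of the layout lookup that fused_modules.py centralizes. The attribute names follow vLLM conventions from the PR description; the layout table, the matching-by-hasattr strategy, and the list return type are illustrative, and the real helpers return the Linear submodules themselves:

```python
# Candidate attention layouts, most specific first (names follow vLLM):
ATTENTION_LAYOUTS = (
    ("q_a_proj", "kv_a_proj_with_mqa"),    # MLA with q LoRA
    ("q_proj", "kv_a_proj_with_mqa"),      # MLA without q LoRA
    ("qkv_proj",),                         # fused QKV
    ("q_proj", "k_proj", "v_proj"),        # traditional attention
)

def get_fused_attention_linears(module):
    """Return the attention linears for the first layout the module matches."""
    for layout in ATTENTION_LAYOUTS:
        if all(hasattr(module, name) for name in layout):
            return [getattr(module, name) for name in layout]
    return []

class Attn:
    pass

attn = Attn()
attn.qkv_proj = "fused-qkv-linear"
linears = get_fused_attention_linears(attn)
```

Ordering the table most-specific-first matters: an MLA module that also has q_proj must match the MLA layout before falling through to the traditional q/k/v one.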