Add e2e tests for non-uniform quantization examples#2321
dsikka merged 3 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this label is required to trigger the testing suite; please add it only once the PR is code-complete and local testing has been performed.
Summary of Changes

Hello @saurabhaloneai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces comprehensive end-to-end tests to validate advanced non-uniform quantization strategies within the vLLM framework. It ensures the correct application and functionality of mixed-precision quantization, specifically for NVFP4+FP8 and GPTQ+AWQ configurations, thereby enhancing the robustness and reliability of the quantization pipeline for large language models.
Code Review
This pull request adds two new end-to-end test cases for non-uniform quantization, which is a great addition for improving test coverage. My review focuses on the new YAML configuration files. I've identified a few areas where the configurations can be improved for robustness and performance. Specifically, some of the regular expressions for targeting layers are a bit too broad and could be made more specific. Additionally, one of the GPTQ configurations could be adjusted for better accuracy.
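The regex-tightening point from this review can be illustrated with plain Python `re`, independent of llm-compressor's actual matching code. The module names below are hypothetical, chosen only to show how an unanchored, unescaped pattern over-matches:

```python
import re

# Hypothetical module names for illustration only (not from the real model).
modules = [
    "model.layers.0.mlp.down_proj",
    "model.layers.0.mlp.down_proj_extra",  # hypothetical look-alike sibling
]

loose = r".*down_proj"       # unanchored, unescaped: over-matches
strict = r".*\.down_proj$"   # escaped dot + $ anchor: exact suffix match

loose_matches = [m for m in modules if re.search(loose, m)]
strict_matches = [m for m in modules if re.search(strict, m)]

print(loose_matches)   # the loose pattern matches both names
print(strict_matches)  # the strict pattern matches only the intended layer
```

This is the same reasoning behind the follow-up commit that escapes literal dots and adds `$` anchors to the recipe targets.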
51a7099 to 4baf690 (compare)
HDCharles
left a comment
Looks good to me. Note: you can request reviews from contributors once your PR is ready.
dsikka
left a comment
Thanks! Do you have any sample checkpoints from the recipes / configs?
Signed-off-by: saurabhaloneai <saurabhaloney85@gmail.com>
- Escape literal dots and add $ anchors to prevent unintended matches
- Consolidate attention/gate/up proj patterns for readability

Co-authored-by: gemini-code-assist[bot]
Signed-off-by: saurabhaloneai <saurabhaloney85@gmail.com>
75fab6f to 7962e7b (compare)
@dsikka here are the checkpoints (I just uploaded them to my personal HF account): NVFP4+FP8: https://huggingface.co/zenzen9/TinyLlama-1.1B-nvfp4-fp8-mixed
dsikka
left a comment
This is perfect - thank you for sharing the checkpoints as well! Great work
SUMMARY:

Adds two new e2e test cases for non-uniform quantization examples.

- NVFP4+FP8 Mixed: Uses NVFP4 for attention/gate/up layers, FP8 for down_proj
- GPTQ+AWQ Mixed: Uses AWQ W4A16 for attention, GPTQ W8A8 for MLP

Closes vllm-project#2315
Depends on vllm-project#2317

Files added:

- tests/e2e/vLLM/configs/nvfp4_fp8_mixed.yaml
- tests/e2e/vLLM/configs/multiple_modifiers_gptq_awq.yaml
- tests/e2e/vLLM/recipes/non_uniform/recipe_nvfp4_fp8_mixed.yaml
- tests/e2e/vLLM/recipes/non_uniform/recipe_gptq_awq.yaml

TEST PLAN:

Tested locally on GPU (RTX 6000 Pro Blackwell):

1. NVFP4+FP8 test:

```bash
CADENCE=nightly TEST_DATA_FILE=tests/e2e/vLLM/configs/nvfp4_fp8_mixed.yaml pytest tests/e2e/vLLM/test_vllm.py -v
```

Result: Compression passed (154 layers)

2. GPTQ+AWQ test:

```bash
CADENCE=nightly TEST_DATA_FILE=tests/e2e/vLLM/configs/multiple_modifiers_gptq_awq.yaml pytest tests/e2e/vLLM/test_vllm.py -v
```

Result: Compression passed for both modifiers

---------

Signed-off-by: saurabhaloneai <saurabhaloney85@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
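For readers unfamiliar with non-uniform recipes: they combine multiple quantization modifiers, each scoped to a disjoint set of layer patterns. The sketch below is illustrative only — the field values and groupings are assumptions for the GPTQ+AWQ case, not the contents of the committed recipe files:

```yaml
# Illustrative multi-modifier recipe sketch (not the committed file).
# AWQ W4A16 on attention projections, GPTQ W8A8 on MLP projections.
quant_stage:
  quant_modifiers:
    AWQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["re:.*self_attn\\.(q|k|v|o)_proj$"]
          weights: {num_bits: 4, type: int, symmetric: true, strategy: group, group_size: 128}
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["re:.*mlp\\.(gate|up|down)_proj$"]
          weights: {num_bits: 8, type: int, symmetric: true, strategy: channel}
```

Note the anchored, dot-escaped `re:` target patterns, matching the regex tightening applied during review.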