Add e2e tests for non-uniform quantization examples#2321

Merged
dsikka merged 3 commits into vllm-project:main from saurabhaloneai:add-non-uniform-e2e-tests
Feb 3, 2026
Conversation

@saurabhaloneai
Contributor

SUMMARY:

Adds two new e2e test cases for non-uniform quantization examples.

  • NVFP4+FP8 Mixed: Uses NVFP4 for attention/gate/up layers, FP8 for down_proj
  • GPTQ+AWQ Mixed: Uses AWQ W4A16 for attention, GPTQ W8A8 for MLP

Closes #2315
Depends on #2317

Files added:

  • tests/e2e/vLLM/configs/nvfp4_fp8_mixed.yaml
  • tests/e2e/vLLM/configs/multiple_modifiers_gptq_awq.yaml
  • tests/e2e/vLLM/recipes/non_uniform/recipe_nvfp4_fp8_mixed.yaml
  • tests/e2e/vLLM/recipes/non_uniform/recipe_gptq_awq.yaml
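
The layer-to-scheme mapping above can be expressed in a recipe roughly like the following sketch (group names, target regexes, and scheme parameters here are illustrative, not the exact contents of the added files):

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ['lm_head']
      config_groups:
        # NVFP4 for attention / gate / up projections (illustrative targets)
        group_nvfp4:
          targets: ['re:.*self_attn\..*_proj$', 're:.*mlp\.gate_proj$', 're:.*mlp\.up_proj$']
          weights: {num_bits: 4, type: float, group_size: 16}
        # FP8 for the down projection only
        group_fp8:
          targets: ['re:.*mlp\.down_proj$']
          weights: {num_bits: 8, type: float, strategy: channel}
```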

TEST PLAN:

Tested locally on GPU (RTX 6000 Pro Blackwell):

  1. NVFP4+FP8 test:

```bash
CADENCE=nightly TEST_DATA_FILE=tests/e2e/vLLM/configs/nvfp4_fp8_mixed.yaml pytest tests/e2e/vLLM/test_vllm.py -v
```

Result: Compression passed (154 layers)

  2. GPTQ+AWQ test:

```bash
CADENCE=nightly TEST_DATA_FILE=tests/e2e/vLLM/configs/multiple_modifiers_gptq_awq.yaml pytest tests/e2e/vLLM/test_vllm.py -v
```

Result: Compression passed for both modifiers

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: the ready label is required to run the full testing suite; please add it only once the PR is code complete and local testing has been performed.

@gemini-code-assist
Contributor

Summary of Changes

Hello @saurabhaloneai, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly!

This pull request introduces comprehensive end-to-end tests to validate advanced non-uniform quantization strategies within the vLLM framework. It ensures the correct application and functionality of mixed precision quantization, specifically for NVFP4+FP8 and GPTQ+AWQ configurations, thereby enhancing the robustness and reliability of the quantization pipeline for large language models.

Highlights

  • New E2E Test Cases: Two new end-to-end test cases have been added to validate non-uniform quantization examples within the vLLM framework.
  • NVFP4+FP8 Mixed Quantization: One test case specifically targets a mixed NVFP4+FP8 quantization scheme, applying NVFP4 to attention, gate, and up layers, and FP8 to the down_proj layer.
  • GPTQ+AWQ Mixed Quantization: Another test case validates a mixed GPTQ+AWQ quantization, utilizing AWQ W4A16 for attention layers and GPTQ W8A8 for MLP layers.
  • Configuration and Recipe Files: New YAML configuration and recipe files were introduced to define the parameters and layer targets for these specific non-uniform quantization strategies.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds two new end-to-end test cases for non-uniform quantization, which is a great addition for improving test coverage. My review focuses on the new YAML configuration files. I've identified a few areas where the configurations can be improved for robustness and performance. Specifically, some of the regular expressions for targeting layers are a bit too broad and could be made more specific. Additionally, one of the GPTQ configurations could be adjusted for better accuracy.

@saurabhaloneai saurabhaloneai force-pushed the add-non-uniform-e2e-tests branch from 51a7099 to 4baf690 Compare February 1, 2026 08:40
Collaborator

@HDCharles HDCharles left a comment


Looks good to me. Note: you can request reviews from contributors once your PR is ready.

@HDCharles added the ready, gptq, and awq labels Feb 2, 2026
Collaborator

@dsikka dsikka left a comment


Thanks! Do you have any sample checkpoints from the recipes / configs?

Signed-off-by: saurabhaloneai <saurabhaloney85@gmail.com>
- Escape literal dots and add $ anchors to prevent unintended matches
- Consolidate attention/gate/up proj patterns for readability

Co-authored-by: gemini-code-assist[bot]
Signed-off-by: saurabhaloneai <saurabhaloney85@gmail.com>
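
The regex tightening described in the commit above can be illustrated with a small Python check (the module names below are hypothetical examples, not taken from the test files):

```python
import re

# Hypothetical module names a target regex might scan.
names = [
    "model.layers.0.mlp.down_proj",        # intended target
    "model.layers.0.mlp.down_proj_extra",  # should NOT match
]

loose = re.compile(r".*down_proj")       # unescaped, unanchored
strict = re.compile(r".*\.down_proj$")   # literal dot escaped, $-anchored

print([n for n in names if loose.match(n)])   # matches both names
print([n for n in names if strict.match(n)])  # matches only the first
```

The `$` anchor prevents prefix matches against longer names, and escaping the dot keeps it from matching arbitrary characters.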
@saurabhaloneai saurabhaloneai force-pushed the add-non-uniform-e2e-tests branch from 75fab6f to 7962e7b Compare February 3, 2026 03:51
@saurabhaloneai
Contributor Author

saurabhaloneai commented Feb 3, 2026

@dsikka here are the checkpoints (just uploaded to my personal HF account):

NVFP4+FP8: https://huggingface.co/zenzen9/TinyLlama-1.1B-nvfp4-fp8-mixed
GPTQ+AWQ: https://huggingface.co/zenzen9/TinyLlama-1.1B-gptq-awq-mixed

Collaborator

@dsikka dsikka left a comment


This is perfect - thank you for sharing the checkpoints as well! Great work

@dsikka dsikka merged commit ea3bfa4 into vllm-project:main Feb 3, 2026
10 of 11 checks passed
cajeonrh pushed a commit to cajeonrh/llm-compressor that referenced this pull request Feb 10, 2026

Labels

awq: For any issue / PR related to AWQ support
gptq: For any PR / issue related to GPTQ support
ready: When a PR is ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add non-uniform test cases to the e2e tests

3 participants