
revert awq dtype change for vllm inference limitation#1613

Merged
WeiweiZhang1 merged 5 commits into main from revert_awq_dtype_change_for_vllm_infer_limitation
Mar 26, 2026

Conversation

Contributor

@WeiweiZhang1 WeiweiZhang1 commented Mar 25, 2026

Description

The performance of vLLM AWQ inference varies across devices: on the A100, the CUDA AWQ kernels are still restricted to the float16 data type, so this PR reverts the data-type change to ensure robust inference.
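The revert described above amounts to pinning the serialized dtype to float16 whenever the export targets AWQ, instead of propagating the model's own dtype (e.g. bfloat16). A minimal illustrative sketch of that selection rule follows; the helper name and string-based dtypes are hypothetical, not AutoRound's actual API:

```python
def pick_awq_export_dtype(model_dtype: str, packing_format: str) -> str:
    """Illustrative helper (not AutoRound's real function): choose the
    dtype metadata to write at export time.

    vLLM's CUDA AWQ kernels still assume float16 on devices such as the
    A100, so AWQ exports pin the serialized dtype to float16 regardless
    of the dtype the model was quantized in.
    """
    if "awq" in packing_format.lower():
        # The revert: always FP16 for AWQ to match vLLM's kernel limits.
        return "float16"
    # Other packing formats keep the model's own dtype.
    return model_dtype


print(pick_awq_export_dtype("bfloat16", "auto_awq"))  # float16
print(pick_awq_export_dtype("bfloat16", "gptq"))      # bfloat16
```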

Type of Change

  • Bug fix
  • New feature
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Other (please specify):

Related Issues

Fixes or relates to #

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.

Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
Copilot AI review requested due to automatic review settings March 25, 2026 07:33
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to improve robustness of vLLM AWQ inference across different CUDA devices by ensuring the exported model’s dtype metadata aligns with vLLM’s AWQ kernel limitations.

Changes:

  • Force torch.float16 dtype metadata during AWQ export to improve vLLM compatibility.
  • Extend AutoRound export dtype selection to prefer FP16 when the packing format is AWQ.
  • Update the vLLM AWQ integration test to pass an explicit dtype argument to LLM(...).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Files changed:

  • test/test_cuda/integrations/test_vllm.py: Adjusts vLLM initialization for the AWQ integration test.
  • auto_round/export/export_to_awq/export.py: Forces AWQ exports to write FP16 dtype metadata via save_model(..., dtype=...).
  • auto_round/export/export_to_autoround/export.py: Selects FP16 dtype metadata when the packing format indicates AWQ.
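The test-side change listed above passes an explicit dtype to LLM(...) so vLLM does not infer a dtype from the checkpoint that the AWQ kernels cannot run. A hedged sketch of what that initialization might look like; the helper and model path are illustrative, and the real test's exact arguments may differ:

```python
def build_llm_kwargs(model_path: str) -> dict:
    """Illustrative: keyword arguments for vllm.LLM(...) in the AWQ test.

    dtype="float16" makes the engine dtype explicit rather than letting
    vLLM derive it from the checkpoint's saved dtype metadata.
    """
    return {
        "model": model_path,          # path to the AWQ-exported model
        "dtype": "float16",           # match the CUDA AWQ kernel limits
        "quantization": "awq",        # select vLLM's AWQ weight loader
    }


kwargs = build_llm_kwargs("/path/to/awq_model")
# In the real integration test this would be: llm = vllm.LLM(**kwargs)
print(kwargs["dtype"])  # float16
```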

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xin3he
Contributor

xin3he commented Mar 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@chensuyue chensuyue added this to the 0.12.0 milestone Mar 25, 2026
Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
@xin3he
Contributor

xin3he commented Mar 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: WeiweiZhang1 <weiwei1.zhang@intel.com>
@xin3he
Contributor

xin3he commented Mar 25, 2026

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@XuehaoSun
Contributor

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@WeiweiZhang1 WeiweiZhang1 merged commit 72d5c4d into main Mar 26, 2026
40 checks passed
@WeiweiZhang1 WeiweiZhang1 deleted the revert_awq_dtype_change_for_vllm_infer_limitation branch March 26, 2026 05:19


6 participants