Conversation


@mxinO mxinO commented Oct 30, 2025

What does this PR do?

Type of change: Bug fix

Overview:
Fix and improve vLLM PTQ.

  1. Add Ray support, so calibration can run on multiple nodes.
  2. Fix a MoE typo and improve weight folding for large MoE layers.
  3. Add the SharedFusedMoE layer.
  4. Support vLLM > 0.11 (not yet released).
  5. Add an OS environment variable to specify quant configs (see the sketch below).
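
For illustration, a minimal sketch of how a quant config could be selected via an environment variable. The variable name `MODELOPT_QUANT_CFG` and the helper below are hypothetical, not the names used in this PR:

```python
import os

import modelopt.torch.quantization as mtq

# Hypothetical env var name; the PR defines its own.
QUANT_CFG_ENV = "MODELOPT_QUANT_CFG"

def resolve_quant_cfg(default: str = "FP8_DEFAULT_CFG"):
    """Look up an mtq config by name (e.g. FP8_DEFAULT_CFG, NVFP4_DEFAULT_CFG)."""
    cfg_name = os.environ.get(QUANT_CFG_ENV, default)
    if not hasattr(mtq, cfg_name):
        raise ValueError(f"Unknown quantization config: {cfg_name}")
    return getattr(mtq, cfg_name)
```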

Usage

Testing

Tested with the latest vLLM.

Additional Information

vLLM > 0.11.0 changed the low-level API significantly. Some of these changes should be removed once vLLM <= 0.11.0 is no longer supported.
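
In practice this means version-gated code paths, roughly like the following sketch (not the PR's actual code):

```python
import vllm
from packaging.version import Version

if Version(vllm.__version__) > Version("0.11.0"):
    # New low-level API introduced after 0.11.0.
    ...
else:
    # Legacy path; remove once vLLM <= 0.11.0 is no longer supported.
    ...
```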

@mxinO mxinO self-assigned this Oct 30, 2025

copy-pr-bot bot commented Oct 30, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mxinO mxinO changed the title from [Draft] Fix/Improve vllm PTQ to Fix/Improve vllm PTQ, and support latest vllm on Nov 4, 2025
@mxinO mxinO changed the title from Fix/Improve vllm PTQ, and support latest vllm to Fix/Improve vllm PTQ and Support multi-node with ray on Nov 4, 2025
@mxinO mxinO marked this pull request as ready for review November 4, 2025 06:07
@mxinO mxinO requested review from a team as code owners November 4, 2025 06:07
Contributor

@mxinO does this maintain support for non-Ray + vLLM?

Comment on lines +186 to +206
model.load_state_dict(current_state_dict)
torch.distributed.barrier()

if amax_file_path is None:
    # Sync amax across TP can be done here if needed
    pass
    # for name, buffer in model.named_buffers():
    #     if name.endswith("_amax"):
    #         print("syncing amax across TP for", name)
    #         torch.distributed.all_reduce(
    #             buffer, op=torch.distributed.ReduceOp.MAX, group=get_tp_group().device_group
    #         )
    # torch.distributed.barrier()

if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    mtq.print_quant_summary(model)

mtq.fold_weight(model)
for name, module in model.named_modules():
    if name.endswith("weight_quantizer"):
        assert not module.is_enabled, f"quantizer {name} is still enabled"
Contributor

Do we need to do this under disable_compilation context?
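
For reference, a minimal sketch of what that would look like, assuming `disable_compilation(model)` is the context manager already used earlier in this script (whether folding actually needs it is the open question):

```python
with disable_compilation(model):
    mtq.fold_weight(model)
    for name, module in model.named_modules():
        if name.endswith("weight_quantizer"):
            assert not module.is_enabled, f"quantizer {name} is still enabled"
```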
