Commit 961404d
[dist][moe] fix add moe_context for big models (vllm-project#2405)
Summary:
For large models like Qwen/Qwen3-VL-235B-A22B-Instruct, adding the MoE calibration context can take a different amount of time on each thread; for big enough models this difference can exceed the NCCL timeout.

Fix: add a sync point at each module, since we are already rate-limited to the slowest thread. At some point this should be changed to add the MoE calibration context in parallel and broadcast the updated modules.
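The fix amounts to a per-module barrier. A minimal sketch of the idea, assuming `torch.distributed`; both `add_moe_context` and `apply_moe_contexts` are hypothetical names for illustration, not the actual llm-compressor API:

```python
import torch.distributed as dist


def add_moe_context(module):
    # Hypothetical stand-in for the real per-module calibration patch;
    # here it just marks the module as patched.
    module.moe_ctx_added = True


def apply_moe_contexts(modules):
    for module in modules:
        add_moe_context(module)
        if dist.is_initialized():
            # Sync point: every rank waits here, so no rank runs so far
            # ahead that a later collective hits the NCCL timeout. We are
            # rate-limited to the slowest rank either way.
            dist.barrier()
```

The barrier after each module trades a little latency for bounded skew between ranks, which is what keeps the collectives inside the timeout window.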
TEST PLAN:
Tested end-to-end with the script below.
<details>
```
###############################################################################
# This script quantizes Qwen3-VL-235B-MoE with GPTQ + INT4 using DDP.
# Run it with:
#   torchrun --nproc_per_node=8 qwen3_vl_235b_moe_gptq_int4_ddp_example.py
# or change --nproc_per_node to your desired configuration.
# NOTE: Currently uses data-free GPTQ, as only data-free quantization is
# supported for Qwen3-VL-MoE.
###############################################################################
import torch
from compressed_tensors.offload import init_dist, load_offloaded_model
from transformers import AutoProcessor, Qwen3VLMoeForConditionalGeneration

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Qwen/Qwen3-VL-235B-A22B-Instruct"

###### DDP MODEL LOAD CHANGE #####
init_dist()
with load_offloaded_model():
    model = Qwen3VLMoeForConditionalGeneration.from_pretrained(
        MODEL_ID, dtype="auto", device_map="auto_offload"
    )
##################################

processor = AutoProcessor.from_pretrained(MODEL_ID)

# Recipe: GPTQ + INT4 (data-free)
# NOTE: only data-free quantization is currently supported for Qwen3-VL-MoE
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=[
        "re:.*lm_head",
        "re:visual.*",
        "re:model.visual.*",
        "re:.*mlp.gate$",
    ],
)

# Apply quantization (no dataset needed for data-free GPTQ)
oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = (
    MODEL_ID.rstrip("/").split("/")[-1]
    + "-GPTQ-W4A16-G128-DDP"
    + str(torch.distributed.get_world_size())
)
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```
</details>
Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
1 file changed: +4 −0