Has anyone performed inference on a GLM-5 model quantized with AWQ+INT4 using VLLM?
The command is as follows:
vllm serve /path_to/glm5_bf16-W4A16-SYM-AWQ-cuda-compressed-tensors \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --host 172.16.20.29 \
  --port 8009 \
  --no-enable-prefix-caching \
  --max_num_seqs 32
I ran into the following problem:
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] WorkerProc failed to start.
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] Traceback (most recent call last):
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/v1/executor/multiproc_executor.py", line 821, in worker_main
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] worker = WorkerProc(*args, **kwargs)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] return func(*args, **kwargs)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/v1/executor/multiproc_executor.py", line 619, in init
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] self.worker.load_model()
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/v1/worker/gpu_worker.py", line 335, in load_model
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] self.model_runner.load_model(load_dummy_weights=dummy_weights)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] return func(*args, **kwargs)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/v1/worker/gpu_model_runner.py", line 4508, in load_model
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] self.model = model_loader.load_model(
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] return func(*args, **kwargs)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/model_executor/model_loader/base_loader.py", line 62, in load_model
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] self.load_weights(model, model_config)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] return func(*args, **kwargs)
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/model_executor/model_loader/default_loader.py", line 311, in load_weights
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] loaded_weights = model.load_weights(self.get_all_weights(model_config, model))
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] File "/mnt/nvme/liaotj/zhanghn/vllm/vllm/model_executor/models/deepseek_v2.py", line 1624, in load_weights
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] param = params_dict[name]
(Worker_TP1_EP1 pid=1874769) ERROR 03-24 09:46:59 [multiproc_executor.py:852] KeyError: 'model.layers.0.self_attn.indexer.weights_proj.weight_packed'
[rank0]:[W324 09:46:59.786534733 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
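The KeyError suggests a name mismatch: the AWQ/compressed-tensors checkpoint stores the indexer projection as `...weights_proj.weight_packed`, but the model's `load_weights` in `deepseek_v2.py` indexes `params_dict` directly and only registers the unpacked parameter name, so the packed tensor has nowhere to land. One possible reading is that the indexer layers were quantized during export even though the loader has no mapping for their packed form (if so, excluding those layers from quantization when producing the checkpoint may avoid the error). The following is a minimal sketch, using illustrative names, of how to check checkpoint tensor names against the model's registered parameters before assignment instead of indexing directly:

```python
# Minimal sketch (illustrative names, not vLLM's actual API): partition
# checkpoint tensor names into those the model registers and those it does
# not, rather than raising KeyError on the first unknown name.

def partition_checkpoint_names(params_dict, checkpoint_names):
    """Split checkpoint tensor names into known and unknown parameter names."""
    known = [n for n in checkpoint_names if n in params_dict]
    unknown = [n for n in checkpoint_names if n not in params_dict]
    return known, unknown

# Hypothetical example mirroring the traceback: the quantizer emitted a
# packed weight for the attention indexer, but the model only registers
# the plain (unpacked) name.
params = {"model.layers.0.self_attn.indexer.weights_proj.weight": None}
ckpt = ["model.layers.0.self_attn.indexer.weights_proj.weight_packed"]

known, unknown = partition_checkpoint_names(params, ckpt)
print(unknown)  # the name that raises KeyError in load_weights
```

Logging the `unknown` list before assignment would show whether only the indexer projections are affected or whether other packed tensors also lack a mapping.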