Is there any way to infer an AWQ Marlin model? #26
Description
First, thanks for this awesome work on the Marlin kernel. Currently I can't find a way to run inference on an awq_marlin model, and I need help.
Quantization
I quantized Qwen2-72B with:
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
I found that model.layers.0.self_attn.q_proj.qzeros does not exist, which differs from checkpoints produced with the other versions.
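For reference, the full quantization run looked roughly like this. This is a sketch based on the AutoAWQ examples doc (linked below); the model and output paths are placeholders, and running it requires AutoAWQ plus enough GPU memory for Qwen2-72B:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-72B-Instruct"   # placeholder: the base model I quantized
quant_path = "qwen2-72b-awq-marlin"      # placeholder: output directory

# "version": "Marlin" tells AutoAWQ to emit Marlin-format tensors.
# Note that Marlin requires symmetric quantization (zero_point=False),
# which is presumably why the saved checkpoint has no qzeros tensors.
quant_config = {
    "zero_point": False,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "Marlin",
}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```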
Inference with the AWQ demo
With vLLM
vllm-project/vllm#6612
I built vLLM from the current main source and got the error below. With debugpy I traced it to the layer model.layers.0.self_attn.q_proj.qzeros:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[rank0]: return _run_code(code, main_globals, None,
[rank0]: File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 87, in _run_code
[rank0]: exec(code, run_globals)
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]: run_server(args)
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]: if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]: self.engine = self._init_engine(*args, **kwargs)
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]: return engine_class(*args, **kwargs)
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/executor_base.py", line 47, in __init__
[rank0]: self._init_executor()
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/worker.py", line 139, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/model_runner.py", line 681, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/loader.py", line 278, in load_model
[rank0]: model.load_weights(
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/models/qwen2.py", line 392, in load_weights
[rank0]: weight_loader(param, loaded_weight)
[rank0]: File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/layers/linear.py", line 758, in weight_loader
[rank0]: loaded_weight = loaded_weight.narrow(input_dim, start_idx,
[rank0]: RuntimeError: start (0) + length (29568) exceeds dimension size (1848).
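If I'm reading the traceback right (this interpretation is my own assumption, not confirmed by the vLLM authors), the numbers line up with a packing mismatch: 29568 is Qwen2-72B's intermediate_size, while 1848 is that dimension packed by a factor of 16, which matches Marlin's qweight layout. vLLM's generic AWQ weight loader then tries to narrow() by the unpacked length on the packed tensor:

```python
# Hypothetical illustration of the shape mismatch in the traceback above.
intermediate_size = 29568   # Qwen2-72B's intermediate_size
packed_dim = 1848           # the dimension size the loader actually found

# Marlin stores 4-bit weights with the input dimension packed;
# the factor of 16 here is my reading of the Marlin layout.
pack_factor = intermediate_size // packed_dim
print(pack_factor)  # 16
assert packed_dim * pack_factor == intermediate_size

# The loader calls loaded_weight.narrow(input_dim, 0, 29568) on a
# tensor whose dimension is only 1848, hence:
# RuntimeError: start (0) + length (29568) exceeds dimension size (1848).
```

So the checkpoint seems fine; the loader just doesn't know this tensor is Marlin-packed.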
Trying the official demo
https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#transformers
On the first run I had to modify the code, because there is no qzeros layer:
# awq/utils/fused_utils.py:155
del layer.qweight, layer.scales
if hasattr(layer, "qzeros"):
    del layer.qzeros

Next, I got this error:
AssertionError: Marlin kernels are not installed. Please install AWQ compatible Marlin kernels from AutoAWQ_kernels.
I cannot import marlin_cuda; this repo does not contain that file.
But I found a corresponding file in AutoGPTQ:
https://github.com/AutoGPTQ/AutoGPTQ/blob/main/autogptq_extension/marlin/marlin_cuda.cpp
In any case, I'd like to know a way to run this model.
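For what it's worth, the assertion message points at AutoAWQ_kernels. Installing that package may resolve the missing marlin_cuda import; the exact package name is an assumption on my part, so check the repo's README first:

```shell
# Try the prebuilt wheel (package name assumed from the repo name):
pip install autoawq-kernels

# Or build from source (requires a CUDA toolchain matching your torch build):
git clone https://github.com/casper-hansen/AutoAWQ_kernels
cd AutoAWQ_kernels
pip install -e .
```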