This repository was archived by the owner on May 11, 2025. It is now read-only.

Is there any way to infer an AWQ Marlin model? #26

@DeJoker

Description


First, thanks for this awesome work on the Marlin kernel. Currently I can't find a way to run inference on an awq_marlin model, and I need help.

Quantization

I quantized Qwen2-72B with:
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }
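For context, a minimal stdlib-only sketch of which per-linear-layer tensors each format is expected to carry (the tensor names follow common AWQ checkpoint layouts; this is my assumption, not verified against the AutoAWQ source):

```python
# Hypothetical sketch: with version="Marlin" and zero_point=False the
# quantization is symmetric, so no qzeros tensor is saved -- which is
# why model.layers.0.self_attn.q_proj.qzeros is missing below.
def expected_tensors(zero_point: bool) -> set:
    tensors = {"qweight", "scales"}
    if zero_point:
        tensors.add("qzeros")  # asymmetric formats also store zero points
    return tensors

marlin_cfg = {"zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin"}
gemm_cfg = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

print(expected_tensors(marlin_cfg["zero_point"]))  # no qzeros
print(expected_tensors(gemm_cfg["zero_point"]))    # includes qzeros
```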

I found that model.layers.0.self_attn.q_proj.qzeros does not exist, unlike checkpoints produced with the other quantization versions.

Inference attempts

With vLLM

vllm-project/vllm#6612
I built vLLM from the current main branch. I got an error; with debugpy I traced it to the layer model.layers.0.self_attn.q_proj.qzeros:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 194, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/work/miniconda3/envs/vllm/lib/python3.8/runpy.py", line 87, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 317, in <module>
[rank0]:     run_server(args)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/entrypoints/openai/api_server.py", line 231, in run_server
[rank0]:     if llm_engine is not None else AsyncLLMEngine.from_engine_args(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 466, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 380, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/async_llm_engine.py", line 547, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/worker.py", line 139, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/worker/model_runner.py", line 681, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/model_loader/loader.py", line 278, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/models/qwen2.py", line 392, in load_weights
[rank0]:     weight_loader(param, loaded_weight)
[rank0]:   File "/aigc-nas02/workspace/online/llm_infer/vllm/vllm/model_executor/layers/linear.py", line 758, in weight_loader
[rank0]:     loaded_weight = loaded_weight.narrow(input_dim, start_idx,
[rank0]: RuntimeError: start (0) + length (29568) exceeds dimension size (1848).
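The two sizes in the error look consistent with a packed-layout mismatch rather than a corrupt file: 1848 divides 29568 exactly 16 times, a factor that would match a 16-wide packed/tiled Marlin weight layout (my reading of the numbers, not confirmed from the vLLM source). A quick check:

```python
# vLLM tried to narrow the loaded tensor to length 29568 along one dim,
# but the Marlin-format checkpoint tensor only had 1848 elements there.
expected = 29568  # dimension size vLLM expects for the unpacked layout
actual = 1848     # dimension size found in the checkpoint
ratio = expected // actual
print(expected % actual == 0, ratio)  # exact factor of 16 (packing hypothesis)
```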

With the official AWQ demo

https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md#transformers

On the first run I had to modify the code, because there is no qzeros layer:

# awq/utils/fused_utils.py:155
        del (layer.qweight, layer.scales)
        if hasattr(layer, "qzeros"):
            del layer.qzeros
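The guard above can be generalized to all quantization buffers; a self-contained sketch of the same idea (the attribute names come from the snippet, but the loop itself is my suggestion, not AutoAWQ code):

```python
class _Layer:  # stand-in for an AWQ linear layer saved without qzeros
    pass

layer = _Layer()
layer.qweight, layer.scales = object(), object()

# Delete quantization buffers only when present: Marlin-format
# checkpoints (zero_point=False) carry no qzeros attribute.
for name in ("qweight", "scales", "qzeros"):
    if hasattr(layer, name):
        delattr(layer, name)

print(hasattr(layer, "qweight"), hasattr(layer, "qzeros"))
```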

Next, I got this error:

AssertionError: Marlin kernels are not installed. Please install AWQ compatible Marlin kernels from AutoAWQ_kernels.

I cannot import marlin_cuda; this repo does not contain that file.
However, I found it in AutoGPTQ:
https://github.com/AutoGPTQ/AutoGPTQ/blob/main/autogptq_extension/marlin/marlin_cuda.cpp

In any case, I'd like to know a way to run this model.
