
Has anyone successfully reproduced the throughput result? #82


Description

@HsChen-sys

Using the model relaxml/Llama-2-7b-E8PRVQ-4Bit:
On an A6000 I only get ~82 toks/s, which doesn't match the 95 toks/s reported in the paper.
On a 6000 Ada I get ~109 toks/s, while the paper reports 140 toks/s.

command: python eval/eval_speed.py --hf_path relaxml/Llama-2-7b-E8PRVQ-4Bit

I also tried python interactive_gen.py --hf_path relaxml/Llama-2-7b-chat-E8PRVQ-4Bit, but its throughput is surprisingly low, only about 5.77 toks/s.
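
For reference, here is a minimal sketch of how I measure steady-state decode throughput outside the repo's scripts. It uses a plain FP16 Hugging Face Llama as a stand-in (the model id below is an assumption, not the QuIP# checkpoint, which I expect needs quip-sharp's own loading code), and it runs a separate warmup generate() call before the timed one so one-time costs don't drag the number down:

```python
# Minimal throughput sketch, assuming a plain Hugging Face causal LM.
# "meta-llama/Llama-2-7b-hf" is a stand-in, not the E8PRVQ checkpoint.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda").eval()

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
gen_kwargs = dict(
    max_new_tokens=256,
    min_new_tokens=256,       # force a fixed number of decode steps
    do_sample=False,
    pad_token_id=tok.eos_token_id,
)

# Warmup: the first generate() call pays one-time costs
# (kernel launches, cache allocation) that would skew the timing.
with torch.inference_mode():
    model.generate(**inputs, **gen_kwargs)

torch.cuda.synchronize()
start = time.time()
with torch.inference_mode():
    out = model.generate(**inputs, **gen_kwargs)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} toks/s")
```

If interactive_gen.py includes that first, un-warmed-up call (or the prompt prefill) in its reported rate, that might partly explain the low interactive number, but I'm not sure.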

The third-party QuIP-for-all repository contains some bugs, so I couldn't run it.
