Using model relaxml/Llama-2-7b-E8PRVQ-4Bit
On an A6000, I only got ~82 toks/s, which doesn't match the 95 toks/s reported in the paper.
On a 6000 Ada, I got ~109 toks/s, while the paper reports 140 toks/s.
command: python eval/eval_speed.py --hf_path relaxml/Llama-2-7b-E8PRVQ-4Bit
I also tried python interactive_gen.py --hf_path relaxml/Llama-2-7b-chat-E8PRVQ-4Bit
but its throughput was strangely slow, only ~5.77 toks/s.
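For reference, this is roughly how I sanity-check throughput numbers outside the repo's scripts. It's a minimal sketch: `fake_generate` is a stand-in for the real decode loop (e.g. `model.generate`), and the timing logic is just wall-clock tokens per second.

```python
import time

def measure_throughput(generate_fn, num_tokens):
    """Time a token-generation callable and return tokens/second."""
    start = time.perf_counter()
    generate_fn(num_tokens)
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed

# Stand-in for the real decode loop; sleeps ~1 ms per "token".
def fake_generate(n):
    for _ in range(n):
        time.sleep(0.001)

tps = measure_throughput(fake_generate, 100)
print(f"{tps:.1f} toks/s")
```

With the real model, `fake_generate` would be replaced by a call that decodes a fixed number of new tokens, so the number is comparable to what eval_speed.py reports.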
The third-party QuIP-for-all repository contains some bugs, so I couldn't run it.