Instructions for Reproducing Table 1 of the HIGGS Paper #31

@badeok0716

Dear Han Guo,

I have read the paper "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", which uses the FLUTE kernel as its backbone for efficient inference of non-uniformly quantized LLMs.

I would like to reproduce the “FLUTE” results reported in Table 1 for Llama‑3.1‑8B on an RTX 4090 (shown below for convenience):

| | 2 bits | 3 bits | 4 bits |
|---|---|---|---|
| bs = 1 | 173 | 150 | 139 |
| bs = 4 | 687 | 592 | 548 |
| bs = 16 | 2432 | 2122 | 1979 |

Could you please share the exact steps needed to reproduce these numbers? Any scripts or configuration files you can provide would be greatly appreciated.

Thank you for your time and for the excellent work.
