Dear Han Guo,
I have read the paper "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", which uses the FLUTE kernel as the backbone for efficient inference of non-uniformly quantized LLMs.
I would like to reproduce the “FLUTE” results reported in Table 1 for Llama‑3.1‑8B on an RTX 4090 (shown below for convenience):
| batch size | 2 bits | 3 bits | 4 bits |
|---|---|---|---|
| bs = 1 | 173 | 150 | 139 |
| bs = 4 | 687 | 592 | 548 |
| bs = 16 | 2432 | 2122 | 1979 |
Could you please share the exact steps needed to reproduce these numbers? Any scripts or configuration files you can provide would be greatly appreciated.
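In case it helps pin down the methodology, here is the minimal timing harness I would otherwise use to measure decode throughput in tokens/s. The model path, prompt, and generation length are placeholders, and it loads the checkpoint with plain `transformers` rather than any FLUTE-specific loader, so please correct me if Table 1 was measured differently (e.g. via vLLM or a dedicated benchmark script):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/flute-quantized-llama-3.1-8b"  # placeholder checkpoint
PROMPT = "The quick brown fox"
MAX_NEW_TOKENS = 256  # placeholder decode length

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)
model.eval()

for batch_size in (1, 4, 16):
    inputs = tokenizer([PROMPT] * batch_size, return_tensors="pt").to("cuda")

    # Warm-up run so kernel compilation/caching is excluded from timing.
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=8, do_sample=False)

    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            min_new_tokens=MAX_NEW_TOKENS,  # force a fixed decode length
            do_sample=False,
        )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = batch_size * MAX_NEW_TOKENS
    print(f"bs = {batch_size}: {new_tokens / elapsed:.1f} tokens/s")
```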
Thank you for your time and for the excellent work.