Dear Han Guo,
I have read the paper "Pushing the Limits of Large Language Model Quantization via the Linearity Theorem", which uses the FLUTE kernel as the backbone for efficient inference of non-uniformly quantized LLMs.
I would like to reproduce the “FLUTE” results reported in Table 1 for Llama‑3.1‑8B on an RTX 4090 (shown below for convenience):
| batch size | 2 bits | 3 bits | 4 bits |
|---|---|---|---|
| bs = 1 | 173 | 150 | 139 |
| bs = 4 | 687 | 592 | 548 |
| bs = 16 | 2432 | 2122 | 1979 |
Could you please share the exact steps needed to reproduce these numbers? Any scripts or configuration files you can provide would be greatly appreciated.
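In case it helps pin down the methodology, here is the minimal timing harness I would otherwise use to measure decode throughput in tokens/s. The model path, prompt, and generation length are placeholders, and it loads the checkpoint with plain `transformers` rather than any FLUTE-specific loader, so please correct me if Table 1 was measured differently (e.g. via vLLM or a dedicated benchmark script):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "path/to/flute-quantized-llama-3.1-8b"  # placeholder checkpoint
PROMPT = "The quick brown fox"
MAX_NEW_TOKENS = 256  # placeholder decode length

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda"
)
model.eval()

for batch_size in (1, 4, 16):
    inputs = tokenizer([PROMPT] * batch_size, return_tensors="pt").to("cuda")

    # Warm-up run so kernel compilation/caching is excluded from timing.
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=8, do_sample=False)

    torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            min_new_tokens=MAX_NEW_TOKENS,  # force a fixed decode length
            do_sample=False,
        )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = batch_size * MAX_NEW_TOKENS
    print(f"bs = {batch_size}: {new_tokens / elapsed:.1f} tokens/s")
```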
Thank you for your time and for the excellent work.