An automated per-model strategy for distributing the bits would be great to have. I am not sure what the best way to achieve it is. At some point I was thinking about a tool that compares the activations per layer and applies some optimization strategy to improve the distribution of bits (#2783). We now have the tools (e.g. …)
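To make the idea concrete, here is a minimal sketch of what such a tool could do, in Python. Nothing below is existing llama.cpp code: `greedy_bit_allocation` and the `evaluate_ppl` callback are hypothetical names, and `evaluate_ppl` stands in for "quantize with this per-layer plan and measure perplexity".

```python
# Hypothetical sketch, not llama.cpp API: greedily spend a budget of
# high-bit layers on whichever upgrades reduce perplexity the most.
from typing import Callable, Dict

def greedy_bit_allocation(
    n_layers: int,
    low_bits: int,                 # e.g. 4 (Q4_K-ish)
    high_bits: int,                # e.g. 6 (Q6_K-ish)
    budget: int,                   # how many layers may use high_bits
    evaluate_ppl: Callable[[Dict[int, int]], float],
) -> Dict[int, int]:
    """Start with every layer at low_bits, then repeatedly upgrade the
    single layer whose upgrade lowers perplexity the most, until the
    budget of high-bit layers is spent or no upgrade helps."""
    plan = {i: low_bits for i in range(n_layers)}
    for _ in range(budget):
        best_layer, best_ppl = None, evaluate_ppl(plan)
        for i in range(n_layers):
            if plan[i] == high_bits:
                continue
            trial = dict(plan)
            trial[i] = high_bits
            ppl = evaluate_ppl(trial)
            if ppl < best_ppl:
                best_layer, best_ppl = i, ppl
        if best_layer is None:     # no remaining upgrade improves ppl
            break
        plan[best_layer] = high_bits
    return plan
```

In practice every `evaluate_ppl` call would mean re-quantizing and running a full perplexity pass, so a real tool would likely substitute a cheaper proxy, such as per-layer activation error against the fp16 model.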
---
I wonder whether the `use_more_bit` strategy of assigning extra bits to selected layers, inspired by Llama-v1, is also the better policy for Llama-v3. For Llama-3-8B, my experiment compares three different plans, with the token embedding at Q4_K and the LM head at Q6_K.

Observations:
1. The first few layers are important for generation quality.
2. The last few layers, and the selection of a few layers at a stride, may not be as important for generation quality as the `use_more_bit` strategy originally assumed (a sketch of the heuristic follows below).

Maybe we need more mixed-precision insights for different LLM models rather than only `use_more_bit`, or some toolkits for quantization sensitivity analysis.
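For context, the heuristic being questioned: if I remember the llama.cpp source correctly, `use_more_bits` selects the first eighth of layers, the last eighth, and every third layer in between. A rough Python paraphrase (the original is C++, and this is my recollection rather than a copy):

```python
def use_more_bits(i_layer: int, n_layers: int) -> bool:
    """Paraphrase of the llama.cpp layer-selection heuristic: extra bits
    for the first 1/8 of layers, the last 1/8, and every third layer in
    between (the "selection of a few layers in a jump")."""
    return (
        i_layer < n_layers // 8
        or i_layer >= 7 * n_layers // 8
        or (i_layer - n_layers // 8) % 3 == 2
    )

# For a 32-layer model such as Llama-3-8B, this upgrades layers
# 0-3, 28-31, and 6, 9, 12, ..., 27 in between.
print([i for i in range(32) if use_more_bits(i, 32)])
```

Observations 1 and 2 suggest that only the first clause clearly pays for itself on Llama-3-8B, which would support replacing this fixed pattern with the per-model sensitivity analysis discussed above.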