convert-hf : support direct Q8_0 conversion #7234
Conversation
I didn't notice these on my first pass.
Hi, this PR breaks model conversion on my system (I was using numpy 1.22). However, I am hoping that it is possible to allow it to work with numpy 1.22 as it did before this commit, as a fallback? A lot of toolchains might still be on slightly older versions of numpy, and forcing the use of the newest version may not be ideal.
```python
if tensor_dtype == np.uint8:
    block_size, type_size = GGML_QUANT_SIZES[raw_dtype]
    if tensor_shape[-1] % type_size != 0:
        raise ValueError(f"Quantized tensor row size ({tensor_shape[-1]}) is not a multiple of {dtype.name} type size ({type_size})")
    tensor_shape = tuple(tensor_shape[:-1]) + (tensor_shape[-1] // type_size * block_size,)
```
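As I read the direct-conversion path, the quantized data reaches the writer as raw `np.uint8` bytes, and this code maps the byte-level row length back to a logical weight count for the GGUF tensor metadata. A small numeric sketch, assuming Q8_0's sizes from `GGML_QUANT_SIZES` (the row and column counts below are made up for illustration):

```python
# Sketch of the shape mapping above, assuming Q8_0: per GGML_QUANT_SIZES,
# block_size == 32 weights packed into type_size == 34 bytes (2-byte f16 scale + 32 int8).
block_size, type_size = 32, 34

n_per_row = 4096                                      # logical row length, in weights
bytes_per_row = n_per_row // block_size * type_size   # 128 blocks * 34 bytes = 4352

# Quantized data arrives as raw bytes, e.g. shape (11008, 4352) with dtype np.uint8,
# and the last dimension is mapped back to a weight count for the GGUF metadata:
tensor_shape = (11008, bytes_per_row)
logical_shape = tuple(tensor_shape[:-1]) + (tensor_shape[-1] // type_size * block_size,)
assert logical_shape == (11008, 4096)
```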
This has broken copying of tensors on i-quants (and probably several others as well); using `./gguf-new-metadata.py foo.IQ4_NL.gguf bar.gguf` you now get

```
ValueError: Quantized tensor row size (4096) is not a multiple of IQ4_NL type size (18)
```
The issue seems to be that the `type_size` is off by 2; however, I don't see why the tensor should be reshaped in this scenario, so this should probably be re-evaluated.
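For reference, a small illustration of the numbers in that traceback, using the block and type sizes defined for `IQ4_NL` in `GGML_QUANT_SIZES`; it suggests the check is being applied to a logical row length where a byte count was expected:

```python
# IQ4_NL packs 32 weights into 18 bytes (16 bytes of 4-bit values + a 2-byte f16 scale),
# which may be where the "off by 2" relative to 16 comes from.
block_size, type_size = 32, 18

row_size = 4096                              # logical row length of the copied tensor
print(row_size % type_size)                  # 10 -> not a multiple of 18, hence the ValueError
print(row_size // block_size * type_size)    # 2304 -> the byte-level row length, which is divisible by 18
```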
Thanks for finding this!
I think it also breaks copying of all other quantized tensors in `gguf-new-metadata`.
Sorry about that.
I think I found a way to fix this while also simplifying what happens to the shape in the round-trip between `GGUFReader` and `GGUFWriter`. See #7483
This adds `Q8_0` conversion to `convert-hf-to-gguf.py`, and it results in EXACTLY the same files as if converted with `./quantize` from an `f32` model.

Note that this was NOT the case for `convert.py`, because it rounds to nearest even and divides by the scale, while the reference implementation in `ggml-quants.c` rounds away from zero and multiplies by the inverse of the scale.
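To make the rounding difference concrete, here is a rough Python sketch, not the actual `gguf-py/gguf/quants.py` code, of Q8_0 quantization that follows the reference behaviour described above: the per-block scale is `amax / 127`, values are multiplied by the inverse scale, and ties are rounded away from zero like C's `roundf`.

```python
import numpy as np

# Rough sketch of reference-style Q8_0 quantization (not the gguf-py code).
# Assumes the input length is a multiple of 32 (QK8_0).
def quantize_q8_0_ref_like(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    blocks = x.astype(np.float32).reshape(-1, 32)
    d = np.abs(blocks).max(axis=-1, keepdims=True) / 127    # per-block scale (stored as f16)
    with np.errstate(divide="ignore"):
        inv_d = np.where(d == 0, 0.0, 1.0 / d)              # inverse scale; 0 for all-zero blocks
    scaled = blocks * inv_d
    q = np.trunc(scaled + np.copysign(0.5, scaled))         # round half away from zero
    return d.astype(np.float16), q.astype(np.int8)

# For contrast, the convert.py behaviour described above is essentially
# np.round(blocks / d): round half to even, dividing by the scale.
```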
Summary of changes:

- Add `self.gguf_writer.add_file_type(self.ftype)` for `StableLMModel`, `InternLM2Model`, `PlamoModel`, `QwenModel`, `BaichuanModel`, and `XverseModel`.
- Add `gguf-py/gguf/quants.py` to put the `Q8_0` implementation there, and also move the `bf16` conversion in there too.
  - Faster `bf16` conversion, from 40-60 MB/s on my machine to 104 MB/s (see the sketch after this list).
- Make `GGUFWriter` support arbitrary quants with `np.uint8` dtype when a `raw_dtype` is specified.
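For the `bf16` point above, here is a hedged sketch of the kind of vectorized, bit-level f32 to bf16 conversion involved; it is my own illustration and not necessarily identical to what `gguf-py/gguf/quants.py` does.

```python
import numpy as np

# Hedged sketch of vectorized f32 -> bf16 conversion: operate on the raw bits,
# force NaNs to a quiet NaN, and round to nearest even before keeping the top 16 bits.
def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    n = x.astype(np.float32).view(np.uint32)
    is_nan = (n & 0x7fffffff) > 0x7f800000
    quiet_nan = (n & 0xffff0000) | (64 << 16)        # keep sign/exponent, set the quiet bit
    rounded = n + (0x7fff + ((n >> 16) & 1))         # round to nearest even on the kept bits
    return (np.where(is_nan, quiet_nan, rounded) >> 16).astype(np.uint16)
```

Doing the rounding with whole-array integer operations, rather than per-value work, is what makes throughput in the range quoted above plausible without leaving NumPy.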
TODO:

- Rename `Model.extra_f16_tensors` to `Model.extra_quantized_tensors`?
- Change `Q8_0` in `convert.py` to round in the same way as the reference implementation?

Testing
To be sure this Python implementation of `Q8_0` really is working in the exact same way as the reference implementation from `ggml-quants.c`, I'm testing conversion and quantization of a bunch of different model architectures. I recently got a big external hard drive, which makes storing the output of these tests much easier.

I'm doing pretty much the same thing for every model architecture tested below (using the outfile `{ftype}` templating introduced in #7158). I'd say there is some suspense when the checksums begin to appear. Will they match?
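The exact commands aren't reproduced here, but the check being described amounts to converting to `f32`, quantizing that to `Q8_0` with `./quantize`, converting directly to `Q8_0` with `convert-hf-to-gguf.py`, and comparing checksums. A hypothetical sketch of such a per-model check (file names and the model path are made up; this is not the author's actual script):

```python
import hashlib
import subprocess

model_dir = "models/some-hf-model"  # placeholder for the downloaded HF model directory

def sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Path A: convert to f32, then quantize with the reference C implementation.
subprocess.run(["python3", "convert-hf-to-gguf.py", model_dir,
                "--outtype", "f32", "--outfile", "model-f32.gguf"], check=True)
subprocess.run(["./quantize", "model-f32.gguf", "model-q8_0-ref.gguf", "q8_0"], check=True)

# Path B: convert directly to Q8_0 with the new Python implementation.
subprocess.run(["python3", "convert-hf-to-gguf.py", model_dir,
                "--outtype", "q8_0", "--outfile", "model-q8_0-direct.gguf"], check=True)

# The claim being tested: both paths produce byte-identical files.
print(sha256("model-q8_0-ref.gguf"))
print(sha256("model-q8_0-direct.gguf"))
```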
- (`torch.float32`) https://huggingface.co/BAAI/bge-small-en-v1.5
- (`torch.bfloat16`) https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
- (`torch.float32`) https://huggingface.co/jtatman/TinyMistral-248m-v2.5-4x-Moe
- (`torch.bfloat16`) https://huggingface.co/smallcloudai/Refact-1_6B-fim
- (`torch.float16`) https://huggingface.co/pansophic/rocket-3B
  - It's in `f16`, but I'm quantizing from `f32` anyway for consistency with the other tests.
  - AhA! Looking at the diff of the `gguf-dump` of each model shows this is a metadata problem! The `ftype` was missing!
  - After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!
- (`torch.float16`)
  - A problem with `torch.float16` models? Bloom is in `float16`, yet it still worked for it.
  - AHA, the problem is that the `ftype` is not put in the model!!!
  - After adding the `ftype` in 2b1e5ea: CHECKSUMS MATCH!!!!
- (`torch.float32`) https://huggingface.co/jondurbin/bagel-dpo-2.8b-v0.2
- (`torch.float16`) https://huggingface.co/bigscience/bloom-560m
- (`torch.bfloat16`) https://huggingface.co/Qwen/Qwen-1_8B-Chat
  - As with `StableLMModel`, the `ftype` is missing.
  - After adding the `ftype` in 2b1e5ea: checksums match!
- (`torch.bfloat16`) https://huggingface.co/internlm/internlm2-chat-1_8b
  - The `ftype` was missing, but I noticed it before first converting. After adding the `ftype` in 2b1e5ea: checksums match!