Experimental GGUF-2-PTE Converter #13266
Open
dillondesilva wants to merge 2 commits into pytorch:main from dillondesilva:dillon-gguf2pte-experiments
+44 −0
Changes from all commits (2 commits)
@@ -0,0 +1,38 @@
'''
Example to convert .gguf files into .pte format.

1. Load our model using transformers/gguf
2. Torch export
3. Executorch lowering and export to .pte
'''
from transformers import AutoTokenizer, AutoModelForCausalLM
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from torch.export import export
import torch

model_id = "bartowski/SmolLM2-135M-Instruct-GGUF"  # Here we would have our HF model in GGUF form we wish to convert
filename = "SmolLM2-135M-Instruct-Q8_0.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
print(f"Model weights dtype: {model.dtype}")
model.eval()

# Generate some sample input for our torch export
sample_inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt")
print(sample_inputs)
print(sample_inputs["input_ids"].shape)
print(sample_inputs["attention_mask"].shape)

sample_inputs = (sample_inputs["input_ids"], sample_inputs["attention_mask"])

# Torch export followed by ET lowering and export
exported_program = export(model, sample_inputs)
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("model.pte", "wb") as file:
    file.write(executorch_program.buffer)
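
As a quick sanity check of the exported artifact, the resulting model.pte can be loaded back and run with ExecuTorch's Python runtime. The sketch below is illustrative rather than part of this PR: it assumes the executorch.runtime.Runtime API available in recent ExecuTorch releases (verify against your installed version) and reuses the same GGUF checkpoint and prompt as the script above.

# Sketch: load model.pte back and run one forward pass to confirm the export is usable.
# Assumes executorch.runtime.Runtime from recent ExecuTorch releases.
from executorch.runtime import Runtime
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "bartowski/SmolLM2-135M-Instruct-GGUF",
    gguf_file="SmolLM2-135M-Instruct-Q8_0.gguf",
)
inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt")

runtime = Runtime.get()
program = runtime.load_program("model.pte")
method = program.load_method("forward")
outputs = method.execute([inputs["input_ids"], inputs["attention_mask"]])
print(outputs[0].shape)  # logits for the sample prompt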
@@ -0,0 +1,6 @@
accelerate
gguf
setuptools
transformers
executorch
torch
@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?
If so, I'm not sure this is really a converter in the sense that it doesn't preserve the quantization from GGUF.
But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.
We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):
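(What follows is not that snippet but a rough, unverified sketch of the same idea: unpacking Q4_K super-blocks, which per the ggml layout hold an fp16 d, an fp16 dmin, 12 bytes of packed 6-bit scales/mins, and 128 bytes of 4-bit quants per 256 values, into int data plus per-group float scales and mins. The names unpack_q4_k and _scale_min are placeholders, and numpy is used only for readability.)

import numpy as np

QK_K = 256                               # values per Q4_K super-block
GROUP = 32                               # values per sub-block (one scale/min pair each)
BLOCK_BYTES = 2 + 2 + 12 + QK_K // 2     # fp16 d, fp16 dmin, packed scales/mins, 4-bit quants

def _scale_min(j, s):
    # Unpack the j-th 6-bit (scale, min) pair from the 12 packed scale bytes.
    if j < 4:
        return s[j] & 63, s[j + 4] & 63
    return ((s[j + 4] & 0x0F) | ((s[j - 4] >> 6) << 4),
            (s[j + 4] >> 4) | ((s[j] >> 6) << 4))

def unpack_q4_k(raw: bytes, n_elements: int):
    # Split raw Q4_K data (n_elements must be a multiple of 256) into
    # int_data, per-group scales, and per-group float mins (the "zeros").
    # Dequantized value = scales[g] * int_data[g, i] - mins[g].
    n_groups = n_elements // GROUP
    int_data = np.empty((n_groups, GROUP), dtype=np.uint8)
    scales = np.empty(n_groups, dtype=np.float32)
    mins = np.empty(n_groups, dtype=np.float32)
    for b in range(n_elements // QK_K):
        blk = raw[b * BLOCK_BYTES:(b + 1) * BLOCK_BYTES]
        d = float(np.frombuffer(blk[0:2], dtype=np.float16)[0])
        dmin = float(np.frombuffer(blk[2:4], dtype=np.float16)[0])
        packed = blk[4:16]
        qs = np.frombuffer(blk[16:], dtype=np.uint8)
        for p in range(4):                        # each 32 bytes of qs hold two groups
            q = qs[32 * p:32 * (p + 1)]
            for k, nibbles in ((0, q & 0x0F), (1, q >> 4)):
                g = b * 8 + 2 * p + k             # global group index
                sc, m = _scale_min(2 * p + k, packed)
                int_data[g] = nibbles
                scales[g] = d * sc
                mins[g] = dmin * m
    return int_data, scales, mins

A float32 reference dequantization for checking against gguf/llama.cpp would then be (scales[:, None] * int_data - mins[:, None]).reshape(-1).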
Now we don't currently have any quantized kernels that will handle floating point zeros (in XNNPACK or elsewhere), but I could quickly put up a patch to support that for our lowbit kernels in a day or two.
Thanks for the example; the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.
I was imagining we could export a PTE file without weights and plug in GGUF weights at runtime, but that also requires some more work on export/runtime before it's possible.
Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (I also found this in the docs).
As you've mentioned, it would be great to have some sort of conversion module we route the model through once the GGUF has been loaded by HF.
What would be the best path forward for development? Do we want an RFC/some abstractions in this PR we can use to capture this process + any additional steps (e.g. dtype conversion)?