Experimental GGUF-2-PTE Converter #13266

Open · wants to merge 2 commits into base: main
38 changes: 38 additions & 0 deletions examples/converters/gguf2pte.py
@@ -0,0 +1,38 @@
'''
Example to convert .gguf files into .pte format.

1. Load our model using transformers/gguf
2. Torch export
3. Executorch lowering and export to .pte
'''
from transformers import AutoTokenizer, AutoModelForCausalLM
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from torch.export import export
import torch

model_id = "bartowski/SmolLM2-135M-Instruct-GGUF"  # HF repo hosting the GGUF model we wish to convert
filename = "SmolLM2-135M-Instruct-Q8_0.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename)
Contributor:
@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?

If so, I'm not sure this is really a converter, since it doesn't preserve the quantization from GGUF.

But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.

We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):

# pip install gguf numpy
import numpy as np
import gguf

# ---- helpers ----
def _fp16le_to_f32(buf_mv):
    return np.frombuffer(buf_mv, dtype="<f2", count=1).astype(np.float32)[0]

def _unpack_q4k_scale_min_codes(bytes12: memoryview):
    """Return two arrays (8,) of 6-bit integers for sub-block scales and mins."""
    b = np.frombuffer(bytes12, dtype=np.uint8)
    # Layout per llama.cpp wiki ("Tensor Encoding Schemes"):
    #  0: EEAAAAAA   1: FFBBBBBB   2: GGCCCCCC   3: HHDDDDDD
    #  4: eeaaaaaa   5: ffbbbbbb   6: ggcccccc   7: hhdddddd
    #  8: eeeeEEEE   9: ffffFFFF  10: ggggGGGG  11: hhhhHHHH
    # Low 6 bits of bytes 0-3 (scales) and 4-7 (mins) hold sub-blocks 0-3 directly;
    # sub-blocks 4-7 take their low 4 bits from bytes 8-11 and their top 2 bits from
    # the high bits of bytes 0-3 / 4-7 (see get_scale_min_k4 in llama.cpp).
    S0_3 = b[0:4] & 0x3F
    S4_7 = (b[8:12] & 0x0F) | ((b[0:4] >> 6) << 4)

    M0_3 = b[4:8] & 0x3F
    M4_7 = (b[8:12] >> 4) | ((b[4:8] >> 6) << 4)

    S = np.concatenate([S0_3, S4_7]).astype(np.float32)  # (8,)
    M = np.concatenate([M0_3, M4_7]).astype(np.float32)  # (8,)
    return S, M

def extract_q4k(gguf_path: str, tensor_name: str):
    """
    Returns:
      q_codes  : (n_super, 256) uint8  -- 4-bit codes per superblock (values 0..15)
      scales   : (n_super, 8)  float32 -- per-subblock scale (real units)
      mins     : (n_super, 8)  float32 -- per-subblock min/offset (real units)
      d, dmin  : (n_super,)    float32 -- super-scales used to decode the 6-bit fields
    Notes:
      - Each superblock covers 256 weights = 8 sub-blocks * 32 each.
      - Reconstruct weights for sub-block j:  w = scales[i,j] * q - mins[i,j]
      - Zero-point (affine form): z = mins / scales  (can be fractional)
    """
    r = gguf.GGUFReader(gguf_path)
    # GGUFReader exposes a flat list of tensors; look the requested one up by name.
    t = next(t for t in r.tensors if t.name == tensor_name)
    raw = memoryview(t.data.reshape(-1))  # raw uint8 bytes of the packed Q4_K blocks

    # Superblock layout (Q4_K):
    # [d fp16][dmin fp16][12B packed S/M codes][128B 4-bit codes]
    stride = 2 + 2 + 12 + 128  # 144 bytes
    n_super = len(raw) // stride
    assert len(raw) % stride == 0, "Unexpected Q4_K tensor byte length"

    d     = np.empty(n_super, dtype=np.float32)
    dmin  = np.empty(n_super, dtype=np.float32)
    S_all = np.empty((n_super, 8), dtype=np.float32)
    M_all = np.empty((n_super, 8), dtype=np.float32)
    Q_all = np.empty((n_super, 256), dtype=np.uint8)

    off = 0
    for i in range(n_super):
        # two fp16 super-scales
        d[i]    = _fp16le_to_f32(raw[off:off+2]); off += 2
        dmin[i] = _fp16le_to_f32(raw[off:off+2]); off += 2

        # packed 6-bit sub-scales / sub-mins
        s12 = raw[off:off+12]; off += 12
        S6, M6 = _unpack_q4k_scale_min_codes(s12)

        # realize to real units
        S_all[i, :] = d[i]    * S6
        M_all[i, :] = dmin[i] * M6

        # 128 bytes => 256 4-bit codes
        codes_b = np.frombuffer(raw[off:off+128], dtype=np.uint8).reshape(4, 32); off += 128
        q_low   = (codes_b & 0x0F).astype(np.uint8)
        q_high  = (codes_b >> 4).astype(np.uint8)
        # llama.cpp order: within each 32-byte chunk the 32 low nibbles come first,
        # then the 32 high nibbles, i.e. sub-blocks (low0, high0, low1, high1, ...).
        Q_all[i] = np.stack([q_low, q_high], axis=1).reshape(256)

    return Q_all, S_all, M_all, d, dmin

# ---- Example usage ----
# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# # Dequantize one superblock 'i', sub-block j (32 weights):
# i, j = 0, 3
# w_block = s[i, j] * q[i, j*32:(j+1)*32].astype(np.float32) - m[i, j]
# # Optional affine form zero-point:
# z_block = m[i, j] / s[i, j]

Now, we don't currently have any quantized kernels that handle floating-point zero-points (in XNNPACK or elsewhere), but I could quickly put up a patch to support that in our lowbit kernels in a day or two.
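
For concreteness, here is a rough sketch of the repacking step that would feed such a kernel, built on top of the (equally hypothetical) extract_q4k above: it rearranges the per-sub-block codes, scales, and mins into the int_data / scales / zero_points layout mentioned earlier. The helper name and the group-of-32 layout are assumptions for illustration, not an existing ExecuTorch API, and mapping the flattened groups back onto the original 2-D weight shape is left out.

import numpy as np
import torch

def q4k_to_affine_tensors(q, s, m):
    """Repack extract_q4k() output into per-group affine quantization tensors (sketch only).

    Each 32-weight sub-block becomes one quantization group:
        w = s * q - m  ==  s * (q - z), with a (possibly fractional) zero-point z = m / s.
    """
    n_groups = q.shape[0] * 8  # 8 sub-blocks per 256-weight superblock
    int_data = torch.from_numpy(q.reshape(n_groups, 32).astype(np.int8))  # codes in [0, 15]
    s_flat = s.reshape(-1).astype(np.float32)
    m_flat = m.reshape(-1).astype(np.float32)
    # Floating-point zero-points, as discussed above; groups with a zero scale keep z = 0.
    z_flat = np.zeros_like(s_flat)
    np.divide(m_flat, s_flat, out=z_flat, where=s_flat != 0.0)
    return int_data, torch.from_numpy(s_flat), torch.from_numpy(z_flat)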

Contributor @lucylq (Aug 11, 2025):

Thanks for the example; the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.

I was imagining we could export a PTE file without weights, and plug in gguf weights at runtime, but that also requires some more work on export/runtime before it's possible.
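
As a very rough illustration of that idea (this is not an existing export/runtime feature), the GGUF side could start from a name-indexed view of the file's tensors that a future runtime hook could consume; gguf_weight_index here is a hypothetical helper:

import gguf

def gguf_weight_index(gguf_path: str):
    """Map GGUF tensor names to (ggml tensor type, shape, raw data) for later weight plugging."""
    reader = gguf.GGUFReader(gguf_path)
    return {
        t.name: (t.tensor_type, tuple(int(d) for d in t.shape), t.data)
        for t in reader.tensors
    }

# e.g. index = gguf_weight_index("SmolLM2-135M-Instruct-Q8_0.gguf")
# GGUF tensor names would still need remapping onto the exported program's parameter names.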

Contributor Author:

Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (also found this in the docs).

As you've mentioned, it would be great to have some sort of conversion module that we route the model through once the GGUF has been loaded by HF.

What would be the best path forward for development? Do we want an RFC, or some abstractions in this PR, to capture this process plus any additional steps (e.g. dtype conversion)?

[Attached screenshot, 2025-08-12 8:31 PM]
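
For anyone reproducing this, a quick sanity check (a small sketch reusing the model object from the script above):

# Confirm that transformers dequantized the GGUF weights on load.
param_dtypes = {p.dtype for p in model.parameters()}
print(param_dtypes)  # expected: {torch.float32} for this Q8_0 GGUF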

print(f"Model weights dtype: {model.dtype}")
model.eval()

# Generate some sample input for our torch export
sample_inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt")
print(sample_inputs)
print(sample_inputs["input_ids"].shape)
print(sample_inputs["attention_mask"].shape)

sample_inputs = (sample_inputs["input_ids"], sample_inputs["attention_mask"],)

# Torch export followed by ET lowering and export
exported_program = export(model, sample_inputs)
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("model.pte", "wb") as file:
    file.write(executorch_program.buffer)
6 changes: 6 additions & 0 deletions examples/converters/requirements.txt
@@ -0,0 +1,6 @@
accelerate
gguf
setuptools
transformers
executorch
torch