38 changes: 38 additions & 0 deletions examples/converters/gguf2pte.py
@@ -0,0 +1,38 @@
'''
Example to convert .gguf files into .pte format.

1. Load our model using transformers/gguf
2. Torch export
3. Executorch lowering and export to .pte
'''
from transformers import AutoTokenizer, AutoModelForCausalLM
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from torch.export import export
import torch

model_id = "bartowski/SmolLM2-135M-Instruct-GGUF" # The HF model in GGUF form that we wish to convert
filename = "SmolLM2-135M-Instruct-Q8_0.gguf"

torch_dtype = torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=filename)
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=filename, torch_dtype=torch_dtype)
Contributor
@dillondesilva what dtype are the weights after loading a GGUF model? Are they dequantized to FP32?

If so, I'm not sure this is really a converter in the sense that it doesn't preserve the quantization from GGUF.

But it is a good start, especially for getting the model structure. We just need to parse the GGUF weights and convert them to int_data/scales/zeros so we can reroute to a kernel. We did have a rudimentary converter for GGUF in torchchat that supported Q4_0 and Q6_K, but this is no longer a popular format.

We could probably start by trying to support Q4_K_M, which requires support for Q4_K and Q6_K. Here is a vibe-coded version of this for Q4_K (so no guarantee that it's correct, but it looks reasonable):

# pip install gguf numpy
import numpy as np
import gguf

# ---- helpers ----
def _fp16le_to_f32(buf_mv):
    return np.frombuffer(buf_mv, dtype="<f2", count=1).astype(np.float32)[0]

def _unpack_q4k_scale_min_codes(bytes12: memoryview):
    """Return two arrays (8,) of 6-bit integers for sub-block scales and mins."""
    b = np.frombuffer(bytes12, dtype=np.uint8)
    # Layout per llama.cpp wiki ("Tensor Encoding Schemes"):
    #  0: EEAAAAAA   1: FFBBBBBB   2: GGCCCCCC   3: HHDDDDDD
    #  4: eeaaaaaa   5: ffbbbbbb   6: ggcccccc   7: hhdddddd
    #  8: eeeeEEEE   9: ffffFFFF  10: ggggGGGG  11: hhhhHHHH
    # Matches llama.cpp's get_scale_min_k4: sub-blocks 0-3 use the low 6 bits of
    # bytes 0-3 (scales) and 4-7 (mins); sub-blocks 4-7 take their low 4 bits from
    # the nibbles of bytes 8-11 and their top 2 bits from bytes 0-3 / 4-7.
    S0_3 = b[0:4] & 0x3F
    S4_7 = (b[8:12] & 0x0F) | ((b[0:4] >> 6) << 4)

    M0_3 = b[4:8] & 0x3F
    M4_7 = (b[8:12] >> 4) | ((b[4:8] >> 6) << 4)

    S = np.concatenate([S0_3, S4_7]).astype(np.float32)  # (8,)
    M = np.concatenate([M0_3, M4_7]).astype(np.float32)  # (8,)
    return S, M

def extract_q4k(gguf_path: str, tensor_name: str):
    """
    Returns:
      q_codes  : (n_super, 256) uint8  -- 4-bit codes per superblock (values 0..15)
      scales   : (n_super, 8)  float32 -- per-subblock scale (real units)
      mins     : (n_super, 8)  float32 -- per-subblock min/offset (real units)
      d, dmin  : (n_super,)    float32 -- super-scales used to decode the 6-bit fields
    Notes:
      - Each superblock covers 256 weights = 8 sub-blocks * 32 each.
      - Reconstruct weights for sub-block j:  w = scales[i,j] * q - mins[i,j]
      - Zero-point (affine form): z = mins / scales  (can be fractional)
    """
    r = gguf.GGUFReader(gguf_path)
    # GGUFReader exposes tensors as a flat list; look the target up by name.
    t = next(t for t in r.tensors if t.name == tensor_name)
    raw = memoryview(t.data)

    # Superblock layout (Q4_K):
    # [d fp16][dmin fp16][12B packed S/M codes][128B 4-bit codes]
    stride = 2 + 2 + 12 + 128  # 144 bytes
    n_super = len(raw) // stride
    assert len(raw) % stride == 0, "Unexpected Q4_K tensor byte length"

    d     = np.empty(n_super, dtype=np.float32)
    dmin  = np.empty(n_super, dtype=np.float32)
    S_all = np.empty((n_super, 8), dtype=np.float32)
    M_all = np.empty((n_super, 8), dtype=np.float32)
    Q_all = np.empty((n_super, 256), dtype=np.uint8)

    off = 0
    for i in range(n_super):
        # two fp16 super-scales
        d[i]    = _fp16le_to_f32(raw[off:off+2]); off += 2
        dmin[i] = _fp16le_to_f32(raw[off:off+2]); off += 2

        # packed 6-bit sub-scales / sub-mins
        s12 = raw[off:off+12]; off += 12
        S6, M6 = _unpack_q4k_scale_min_codes(s12)

        # realize to real units
        S_all[i, :] = d[i]    * S6
        M_all[i, :] = dmin[i] * M6

        # 128 bytes => 256 4-bit codes. ggml stores them as 4 chunks of 32 bytes:
        # the low nibbles of a chunk form one 32-value sub-block and the high
        # nibbles form the next, so sub-blocks come out contiguous.
        codes_b = np.frombuffer(raw[off:off+128], dtype=np.uint8).reshape(4, 32); off += 128
        q_low   = codes_b & 0x0F    # sub-blocks 0, 2, 4, 6
        q_high  = codes_b >> 4      # sub-blocks 1, 3, 5, 7
        Q_all[i] = np.stack([q_low, q_high], axis=1).reshape(256)

    return Q_all, S_all, M_all, d, dmin

# ---- Example usage ----
# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# # Dequantize one superblock 'i', sub-block j (32 weights):
# i, j = 0, 3
# w_block = s[i, j] * q[i, j*32:(j+1)*32].astype(np.float32) - m[i, j]
# # Optional affine form zero-point:
# z_block = m[i, j] / s[i, j]

Now we don't currently have any quantized kernels that will handle floating point zeros (in XNNPACK or elsewhere), but I could quickly put up a patch to support that for our lowbit kernels in a day or two.
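
As a small illustrative sketch (the helper name q4k_to_affine is hypothetical), this is how the (scale, min) pairs returned by extract_q4k above would map onto the affine (scale, zero_point) form that typical int4 kernels expect, and why those zero-points come out fractional:

# Hypothetical helper, reusing the outputs of extract_q4k() above.
import numpy as np

def q4k_to_affine(scales, mins):
    """Q4_K dequantizes as w = scale * q - min, while affine int4 kernels
    compute w = scale * (q - z), so z = min / scale -- generally fractional."""
    zeros = mins / scales  # assumes non-zero scales; guard against 0 in real code
    return scales, zeros

# q, s, m, d, dmin = extract_q4k("model.gguf", "model.layers.0.self_attn.q_proj.weight")
# s_affine, z = q4k_to_affine(s, m)
# q32   = q.reshape(*s.shape, 32).astype(np.float32)        # (n_super, 8, 32)
# exact = s[..., None] * q32 - m[..., None]                  # reference dequant
# int_z = s[..., None] * (q32 - np.round(z)[..., None])      # integer-zero approximation
# The gap between exact and int_z is the error introduced by forcing integer zero-points.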

Contributor @lucylq Aug 11, 2025
Thanks for the example, the flow looks quite clean. Agree with @metascroy that we may need some custom weight conversion.

I was imagining we could export a PTE file without weights, and plug in gguf weights at runtime, but that also requires some more work on export/runtime before it's possible.

Contributor Author
Good catch about the weights being dequantized. I pushed a quick update, and it does seem that the GGUF weights are dequantized to FP32 (also found this in the docs).

As you've mentioned, it would be great to have some sort of conversion module to route the model through once the GGUF has been loaded by HF.

What would be the best path forward for development? Do we want an RFC/some abstractions in this PR we can use to capture this process + any additional steps (e.g. dtype conversion)?

[Attached screenshot, 2025-08-12 at 8:31 pm]

Contributor
(I mentioned some of this on Discord; commenting here for more legibility.)

I think we will want to preserve the quantized weights. IMO, ideally we would have the following flow:

  1. Have transformers ingest GGUF and produce model class with hyperparameters configured appropriately.
  2. Export configured model and delegate to ExecuTorch. Produced PTE does not contain weights; instead, it references the original GGUF as though it was a PTD.

If we convert the weights or copy them into a PTE, we are IMO likely to end up with one or more of: slow inference (from losing the quantization), double (or more) the peak disk space requirement, and/or double (or more) the peak RAM requirement.

Regarding quantization, I think it would be easiest to decouple the rest of the work that needs to be done from quantization by starting with only LLAMA_FTYPE_ALL_F32 and fast-following with LLAMA_FTYPE_MOSTLY_F16 and LLAMA_FTYPE_MOSTLY_BF16. Then, adding Q4_K_M support can be done separately; I would propose to do that with a torchao-style model rewrite that replaces the matmuls with custom ops, which we then preserve during the actual delegation & lowering process. (If I understand correctly, this is more or less how lowbit lowering works today.)
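
A rough sketch of what that torchao-style rewrite could look like; GGUFQ4KLinear, swap_linears_with_q4k, and the per-row packing layout are hypothetical, and forward() dequantizes in plain PyTorch only so the sketch runs, whereas the real version would call a custom op that is preserved through export:

import torch
import torch.nn as nn

class GGUFQ4KLinear(nn.Module):
    """Placeholder for a Linear that keeps Q4_K-packed weights.
    Assumes per-output-row packing: q_codes (out, n_super, 256),
    scales/mins (out, n_super, 8), with in_features = n_super * 256."""
    def __init__(self, q_codes, scales, mins, bias=None):
        super().__init__()
        self.register_buffer("q_codes", q_codes)
        self.register_buffer("scales", scales)
        self.register_buffer("mins", mins)
        self.bias = bias

    def forward(self, x):
        # Real version: return torch.ops.<namespace>.q4k_linear(x, ...) as a
        # preserved custom op; here we dequantize (w = scale * q - min) and matmul.
        q = self.q_codes.reshape(*self.scales.shape, 32).to(torch.float32)
        w = self.scales.unsqueeze(-1) * q - self.mins.unsqueeze(-1)
        w = w.reshape(self.q_codes.shape[0], -1)
        return nn.functional.linear(x, w, self.bias)

def swap_linears_with_q4k(model, gguf_weights):
    """Module-swap pass: replace nn.Linear children whose fully-qualified name
    appears in gguf_weights (name -> (q_codes, scales, mins) tensors)."""
    for parent_name, parent in model.named_modules():
        for child_name, child in parent.named_children():
            fqn = f"{parent_name}.{child_name}" if parent_name else child_name
            if isinstance(child, nn.Linear) and fqn in gguf_weights:
                q, s, m = gguf_weights[fqn]
                setattr(parent, child_name, GGUFQ4KLinear(q, s, m, child.bias))
    return model

The swap would run after transformers configures the model (step 1 above) and before torch.export, so the custom op is what ends up delegated.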

model.eval()

# Generate some sample input for our torch export
sample_inputs = tokenizer("Plants create energy through a process known as", return_tensors="pt",)
print(sample_inputs)
print(sample_inputs["input_ids"].shape)
print(sample_inputs["attention_mask"].shape)

sample_inputs = (sample_inputs["input_ids"], sample_inputs["attention_mask"],)

# Torch export followed by ET lowering and export
exported_program = export(model, sample_inputs)
executorch_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
).to_executorch()

with open("model.pte", "wb") as file:
    file.write(executorch_program.buffer)
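
As a sanity check of the produced model.pte (a sketch, assuming the ExecuTorch pybindings are installed; the module path can differ between releases):

# Sketch: load and run model.pte with the ExecuTorch Python bindings.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

et_module = _load_for_executorch("model.pte")
outputs = et_module.forward([sample_inputs[0], sample_inputs[1]])  # (input_ids, attention_mask)
print(outputs[0].shape)  # logits for the sample prompt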
6 changes: 6 additions & 0 deletions examples/converters/requirements.txt
@@ -0,0 +1,6 @@
accelerate
gguf
setuptools
transformers
executorch
torch