
Conversation

@gabe-l-hart
Collaborator

Description

This PR adds GGUF conversion support for https://huggingface.co/ibm-granite/granite-docling-258M. Once converted, the model runs to completion; however, the results are quite bad (see Outstanding Questions below).

Partially addresses #16110

Testing

# Convert language model
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/

# Convert MMProj
python convert_hf_to_gguf.py ~/models/ibm-granite/granite-docling-258M/ --mmproj

# Run a sample
./bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf --image ~/Pictures/sample-doc.png --mmproj ~/models/ibm-granite/mmproj-granite-docling-258M -p "<__media__>Convert this page to markdown." --verbose -ngl 99

Outstanding Questions

With this PR, the model does convert and all of the math runs correctly, but in comparing with the results from transformers (implemented here), the output is significantly worse, to the point of being unusable. Digging through, it seems to largely boil down to the implementation of clip_image_preprocess compared to Idefics3ImageProcessor.preprocess. In the transformers implementation, do_resize and do_image_splitting default to True, resulting in a grid of sub-images with appropriate image boundary tokens post-tokenization (the input to the LLM). The corresponding output from mtmd_tokenize simply pads the image to square and resizes it to the configured image_size, resulting in a single image in the input token sequence.
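
For reference, here's a small sketch of the difference using the transformers API only (not mtmd); the exact patch count depends on the input image, so the shapes in the comments are just indicative:

from transformers import AutoProcessor
from transformers.image_utils import load_image

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
image = load_image("sample-doc.png")

# With splitting enabled (the transformers default), pixel_values holds a grid of
# sub-images plus one global image; with it disabled there is only the single
# resized image, which is roughly what mtmd_tokenize produces today.
split = processor.image_processor.preprocess([image], do_image_splitting=True, return_tensors="pt")
single = processor.image_processor.preprocess([image], do_image_splitting=False, return_tensors="pt")
print(split["pixel_values"].shape)   # (1, N + 1, 3, H, W) for N sub-images
print(single["pixel_values"].shape)  # (1, 1, 3, H, W)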

This likely relates to some of the follow-on discussion in #13050 (starting with #13050 (comment)) since GraniteDocling is very similar to SmolVLM. I'll dig further on that issue, but my current thinking is that the best course of action will be to merge this as-is, then add a follow-on PR that supports patch-based preprocessing for idefics3.

@gabe-l-hart
Collaborator Author

cc @ryan-mangeno since I know you were looking at this

also cc @ngxson since you're the source-of-truth for all-things mtmd

@github-actions github-actions bot added the python python script changes label Sep 19, 2025
@gabe-l-hart gabe-l-hart requested review from CISC and ngxson September 19, 2025 17:16
@CISC
Collaborator

CISC commented Sep 19, 2025

You cannot use the refact pre-tokenizer: Granite Docling sets clean_up_tokenization_spaces to false, while refact has it set to true (the default), and refact also has individual_digits enabled.

So, you should use the gpt2 regex and set clean_spaces = false.
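
A quick way to sanity-check that setting from the HF config (just an illustrative snippet, not part of the fix):

from transformers import AutoTokenizer

# clean_up_tokenization_spaces is read from tokenizer_config.json; for
# granite-docling it should come back False, unlike refact's True
tok = AutoTokenizer.from_pretrained("ibm-granite/granite-docling-258M")
print(tok.clean_up_tokenization_spaces)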

@ryan-mangeno
Contributor

I've tried some testing on it using the F16 model, and I have a question: I hit this assert

Assertion failed: (!isnan(sumf) && !isinf(sumf)), function ggml_vec_dot_f16, file vec.cpp, line 328.

when I set -ngl lower than 26; with anything higher or equal, I get seemingly endless responses and buffer allocation fails. Should we be trying to get it to work with -ngl 1?

@gabe-l-hart
Collaborator Author

Good catch @CISC. I'll make that fix.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>
@gabe-l-hart gabe-l-hart force-pushed the gabe-l-hart/GraniteDocling branch from cbc4bc2 to 64e10f5 Compare September 19, 2025 19:19
@gabe-l-hart
Collaborator Author

Ok, I've made the pre-tokenizer fix using trillion, which maps to the gpt-2 pre-tokenizer here and sets clean_spaces = false here.

Unfortunately, the generation results are still just as bad.

@CISC
Collaborator

CISC commented Sep 19, 2025

Ok, I've made the pre-tokenizer fix using trillion, which maps to the gpt-2 pre-tokenizer here and sets clean_spaces = false here.

It's better if you give it a new name (like granite-docling), move it from pre_computed_hashes to models, and just add it here:

tokenizer_pre == "trillion") {
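
For the convert-script side, a rough sketch of what that could look like in convert_hf_to_gguf_update.py (the TOKENIZER_TYPE stand-in and the exact entry shape are assumptions modeled on the existing entries, not a verified diff):

from enum import IntEnum, auto

# Stand-in for the TOKENIZER_TYPE enum defined in convert_hf_to_gguf_update.py
class TOKENIZER_TYPE(IntEnum):
    SPM = auto()
    BPE = auto()
    WPM = auto()
    UGM = auto()

# Hypothetical new entry in the `models` list, giving the tokenizer its own
# pre-tokenizer name instead of reusing "trillion"
models = [
    {"name": "granite-docling", "tokt": TOKENIZER_TYPE.BPE,
     "repo": "https://huggingface.co/ibm-granite/granite-docling-258M"},
]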

@gabe-l-hart
Collaborator Author

I've now confirmed with some pretty janky hacking that with proper preprocessing the results look much better. Here's what I did:

Extract patches from transformers preprocessing

from transformers import AutoProcessor
from transformers.image_utils import load_image
import torch
from transformers.image_transforms import to_pil_image

model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
image = load_image("sample.png")

# pixel_values comes back as (batch, num_patches, channels, height, width)
res = processor.image_processor.preprocess([image], return_row_col_info=True, return_tensors="pt")
px = res["pixel_values"]
n_imgs = px.shape[1]
for i in range(n_imgs):
    # Pull out one (C, H, W) patch, clamp the normalized values at zero so
    # to_pil_image doesn't choke on negatives, and save each patch as its own PNG
    patch_img = px[0, i, :, :].reshape(px.shape[2], px.shape[3], px.shape[4])
    pil_patch_img = to_pil_image(torch.max(patch_img, torch.zeros_like(patch_img)))
    pil_patch_img.save(f"patch_{i}.png")

Manually create prompt for patches

There are 13 patches: 12 sub-images arranged in 3 rows of 4 columns, plus a single global image resized to 512 x 512 (a helper for building this prompt is sketched after the example).

"<row_1_col_1><__media__><row_1_col_2><__media__><row_1_col_3><__media__><row_1_col_4><__media__>
<row_2_col_1><__media__><row_2_col_2><__media__><row_2_col_3><__media__><row_2_col_4><__media__>
<row_3_col_1><__media__><row_3_col_2><__media__><row_3_col_3><__media__><row_3_col_4><__media__>

<global-img><__media__>Convert this page to markdown.<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>"
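
For anyone reproducing this, here's a small helper (hypothetical, not part of the PR) that builds the same prompt from the row/column counts that return_row_col_info=True reports:

def build_patch_prompt(n_rows: int, n_cols: int, question: str) -> str:
    # One <row_R_col_C><__media__> pair per patch, row by row, then the global image
    rows = []
    for r in range(1, n_rows + 1):
        rows.append("".join(f"<row_{r}_col_{c}><__media__>" for c in range(1, n_cols + 1)))
    grid = "\n".join(rows)
    return (
        f"{grid}\n\n<global-img><__media__>{question}<|end_of_text|>\n"
        "<|start_of_role|>assistant<|end_of_role|>"
    )

print(build_patch_prompt(3, 4, "Convert this page to markdown."))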

Call llama-mtmd-cli

./bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf --mmproj ~/models/ibm-granite/mmproj-granite-docling-258M -p "<row_1_col_1><__media__><row_1_col_2><__media__><row_1_col_3><__media__><row_1_col_4><__media__>
<row_2_col_1><__media__><row_2_col_2><__media__><row_2_col_3><__media__><row_2_col_4><__media__>
<row_3_col_1><__media__><row_3_col_2><__media__><row_3_col_3><__media__><row_3_col_4><__media__>

<global-img><__media__>Convert this page to markdown." --verbose -ngl 99 --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_0.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_1.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_10.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_11.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_12.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_2.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_3.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_4.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_5.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_6.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_7.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_8.png --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_9.png

@gabe-l-hart
Collaborator Author

It's better if you give it a new name (like granite-docling), move it from pre_computed_hashes to models, and just add it here:

That's totally fair. I was trying to see if I could get away without changing the C++ side to avoid needing to bump versions in downstream projects (ollama et al), but given the performance without changes to the preprocessing, I think that's a somewhat moot point.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>
Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <[email protected]>
@CISC
Collaborator

CISC commented Sep 19, 2025

You will hopefully try to improve it in this PR?

@gabe-l-hart
Collaborator Author

Yeah, I think it probably makes sense to do it in one PR, since the results currently are garbage and it might require some additional GGUF KVs to indicate patching.

@CISC
Collaborator

CISC commented Sep 19, 2025

Yeah, I think it probably makes sense to do it in one PR, since the results currently are garbage and it might require some additional GGUF KVs to indicate patching.

Great, just re-request review when you're ready. :)

@gabe-l-hart gabe-l-hart marked this pull request as draft September 19, 2025 21:03
@vonjackustc

It's better if you give it a new name (like granite-docling), move it from pre_computed_hashes to models, and just add it here:

That's totally fair. I was trying to see if I could get away without changing the C++ side to avoid needing to bump versions in downstream projects (ollama et al), but given the performance without changes to the preprocessing, I think that's a somewhat moot point.

Can you change the Jinja template to support patches?

@gabe-l-hart
Copy link
Collaborator Author

Since this was based on a branch in the repo before the refactor of contributor roles, I can't actually update this branch. I've opened a new PR instead with the overhaul to the preprocessing included: #16206

@CISC would you mind killing this branch in the core repo since I can't?

@CISC CISC deleted the gabe-l-hart/GraniteDocling branch September 23, 2025 20:58