Model: GraniteDocling #16112
Conversation
cc @ryan-mangeno since I know you were looking at this. Also cc @ngxson since you're the source of truth for all things
You cannot use the So, you should use the
I have done some testing on it using the F16 model. Question: I hit this assert when I set `-ngl` lower than 26; with anything higher or equal, I get seemingly infinite responses and buffer allocation fails. Should we be trying to get it to work with `-ngl 1`?
Good catch @CISC. I'll make that fix.
Branch: gabe-l-hart/GraniteDocling Signed-off-by: Gabe Goodhart <[email protected]>
It's better if you give it a new name (like line 1961 in 6d75883).
I've now confirmed with some pretty janky hacking that with proper preprocessing the results look much better. Here's what I did:

Extract patches:

```python
from transformers import AutoProcessor
from transformers.image_utils import load_image
import torch
from transformers.image_transforms import to_pil_image

model_path = "/Users/ghart/models/ibm-granite/granite-docling-258M"
processor = AutoProcessor.from_pretrained(model_path)
image = load_image("sample.png")
res = processor.image_processor.preprocess([image], return_row_col_info=True, return_tensors="pt")
px = res['pixel_values']
n_imgs = px.shape[1]
for i in range(n_imgs):
    patch_img = px[0, i, :, :].reshape(px.shape[2], px.shape[3], px.shape[4])
    pil_patch_img = to_pil_image(torch.max(patch_img, torch.zeros_like(patch_img)))
    pil_patch_img.save(f"patch_{i}.png")
```

Manually create the prompt for the patches. There are 13 patches, arranged in 3 rows of 4 columns, plus a single global image reshaped to 512 x 512.

Call `llama-mtmd-cli`:

```shell
./bin/llama-mtmd-cli -m ~/models/ibm-granite/granite-docling-258M/granite-docling-258M-F16.gguf \
  --mmproj ~/models/ibm-granite/mmproj-granite-docling-258M \
  -p "<row_1_col_1><__media__><row_1_col_2><__media__><row_1_col_3><__media__><row_1_col_4><__media__>
<row_2_col_1><__media__><row_2_col_2><__media__><row_2_col_3><__media__><row_2_col_4><__media__>
<row_3_col_1><__media__><row_3_col_2><__media__><row_3_col_3><__media__><row_3_col_4><__media__>
<global-img><__media__>Convert this page to markdown." \
  --verbose -ngl 99 \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_0.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_1.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_10.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_11.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_12.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_2.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_3.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_4.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_5.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_6.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_7.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_8.png \
  --image /Users/ghart/models/ibm-granite/granite-docling-258M/patch_9.png
```
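For reference, the hand-written prompt above can also be generated programmatically. This is a sketch, not code from the PR; the helper name `build_patch_prompt` is hypothetical, while `<__media__>` (the mtmd image placeholder) and the `<row_R_col_C>`/`<global-img>` boundary tokens are taken from the prompt shown above.

```python
def build_patch_prompt(rows: int, cols: int, instruction: str) -> str:
    """Build an mtmd prompt with row/column boundary tokens for each
    patch, one row of patches per line, followed by the global image
    and the text instruction."""
    lines = []
    for r in range(1, rows + 1):
        # One <__media__> placeholder per patch in this row
        lines.append("".join(
            f"<row_{r}_col_{c}><__media__>" for c in range(1, cols + 1)
        ))
    lines.append(f"<global-img><__media__>{instruction}")
    return "\n".join(lines)

# 3 rows x 4 cols + global image = 13 <__media__> placeholders,
# matching the 13 --image arguments passed to llama-mtmd-cli above
print(build_patch_prompt(3, 4, "Convert this page to markdown."))
```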
That's totally fair. I was trying to see if I could get away without changing the
You will hopefully try to improve it in this PR?
Yeah, I think it probably makes sense to do it in one PR, since the results are currently garbage and it might require some additional GGUF KVs to indicate patching.
Great, just re-request review when you're ready. :)

Can you change the jinja template to support patches?
Description
This PR adds GGUF conversion support for https://huggingface.co/ibm-granite/granite-docling-258M. Once converted, the model runs to completion; however, the results are quite bad (see Outstanding Questions below).
Partially addresses #16110
Testing
Outstanding Questions
With this PR, the model does convert and the all of the math runs correctly, but in comparing with the results from
transformers(implemented here), the output is significantly worse to the point of being unusable. In digging through, it seems that it largely boils down to the implementation of clip_image_preprocess compared to Idefics3ImageProcessor.preprocess. In thetransformersimplementation,do_resizeanddo_image_splittingdefault toTrue, resulting in a grid of sub-images with appropriate image boundary tokens post-tokenization (the input to the LLM). The corresponding output frommtmd_tokenizesimply pads the image to square and resizes to the configuredimage_size, resulting in a single image in the input token sequence.This likely relates to some of the follow-on discussion in #13050 (starting with #13050 (comment)) since
GraniteDoclingis very similar toSmolVLM. I'll dig further on that issue, but my current thinking is that the best course of action will be to merge this as-is, then add a follow-on PR that supports patch-based preprocessing foridefics3.
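To illustrate what patch-based splitting produces, here is a minimal sketch of the grid arithmetic, assuming a 512-pixel tile size as in the experiment above. The helper name `split_grid` is hypothetical and the real `Idefics3ImageProcessor` logic involves additional resizing steps before tiling; this only shows where the row/column counts come from.

```python
import math

def split_grid(width: int, height: int, tile: int = 512):
    """Return (rows, cols, total images) for tiling an image into
    tile x tile sub-images plus one global thumbnail.

    Hypothetical sketch of idefics3-style splitting: the real
    preprocessor also resizes the image before computing the grid.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    # +1 accounts for the single global image appended after the grid
    return rows, cols, rows * cols + 1

# A 2048 x 1536 page tiles into 3 rows x 4 cols + 1 global = 13 images,
# matching the 13 patches used in the manual experiment above
print(split_grid(2048, 1536))  # (3, 4, 13)
```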