Feature Request: Idefics3 Architecture Support
Summary
Requesting native Unsloth support for the Idefics3 architecture, which would enable optimized fine-tuning of models like IBM Granite Docling VLM (258M params) — a high-performing document understanding model.
Why This Matters
Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.
Architecture Analysis
Granite Docling VLM (and all Idefics3 models) uses `Idefics3ForConditionalGeneration`. Its components map closely to pieces Unsloth already supports:
| Component | Idefics3 / Granite Docling | Unsloth Status |
|---|---|---|
| Vision Encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language Model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model Class | `Idefics3ForConditionalGeneration` | Not registered |
| Config type | `model_type = "idefics3"` with `vision_config` + `text_config` | Would be detected as VLM, but lacks patches |
The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
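The pixel shuffle step is easy to reason about: it trades spatial resolution for channel depth, cutting the number of vision tokens by `scale**2` before a linear projection into the language model's hidden size. A minimal NumPy sketch, simplified from the transformers implementation (the exact permutation order there differs slightly):

```python
import numpy as np

def pixel_shuffle(patches: np.ndarray, scale: int = 2) -> np.ndarray:
    """Fold each `scale x scale` neighbourhood of vision patches into one
    token, multiplying the channel dim by scale**2. For scale=2 this is
    the 4x spatial compression used by the Idefics3 connector."""
    seq, dim = patches.shape
    side = int(seq ** 0.5)  # patches come from a square grid
    x = patches.reshape(side, side, dim)
    x = x.reshape(side, side // scale, dim * scale)                   # merge columns
    x = x.transpose(1, 0, 2)
    x = x.reshape(side // scale, side // scale, dim * scale * scale)  # merge rows
    return x.reshape(-1, dim * scale * scale)

# e.g. 1024 SigLIP patches of dim 768 -> 256 tokens of dim 3072
tokens = pixel_shuffle(np.zeros((1024, 768)), scale=2)
print(tokens.shape)  # (256, 3072)
```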
Desired Scope
Full FastVisionModel support including:
- SFT via `SFTTrainer`
- DPO via `DPOTrainer`
- GRPO / GSPO via `GRPOTrainer`
- LoRA with selective layer training (`finetune_vision_layers`, `finetune_language_layers`, etc.)
- 4-bit quantization via `load_in_4bit`
- Unsloth gradient checkpointing (`use_gradient_checkpointing="unsloth"`)
- Fast inference via vLLM integration (`fast_inference=True`)
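For the SFT path specifically, vision fine-tuning pipelines typically consume conversation records in the multimodal chat schema sketched below. This is a hedged illustration; the field names follow the common transformers/TRL convention, and the image path and texts are placeholders:

```python
def to_conversation(image, question, answer):
    """Wrap one document-VQA example in the multimodal message schema
    that vision SFT pipelines typically expect. All arguments here are
    placeholder values for illustration."""
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": answer},
            ]},
        ]
    }

sample = to_conversation("page_001.png", "Convert this page to markdown.", "# Quarterly Report")
```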
Ideal Usage
```python
from unsloth import FastVisionModel

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```

Implementation Suggestions
Based on our analysis of the Unsloth codebase, here's what we believe is needed:
1. Registry entry — new `_idefics.py`:

```python
class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]
```

2. Architecture patches — new `idefics.py` in `models/`:
- Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
- Optional vision encoder patches (SigLIP attention)
- Pixel Shuffle connector handling
3. Support list updates:
- Add `"idefics3"` to `SUPPORTED_ARCHITECTURES` in `_utils.py`
- Add to `VLLM_SUPPORTED_VLM` in `vision.py`
4. Chat template — add Idefics3 template to `chat_templates.py`
5. LoRA target modules:
```python
# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"image_connector.{proj_in,proj_out}", "image_connector.simple_mlp.{fc1,fc2}"
```

Architectural Similarity to Existing Models
| Feature | Idefics3 | Similar Supported Model |
|---|---|---|
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |
Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.
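On the chat template point above: Idefics3-family models render turns as `User: ...<end_of_utterance>` lines rather than Llama 3 header tokens. A hedged sketch of that rendering, for illustration only; the token names are taken from the HuggingFaceM4 release and should be checked against the actual tokenizer config:

```python
def render_idefics3(messages, add_generation_prompt=True):
    """Approximate the Idefics3 chat format: each turn becomes
    'User: ...<end_of_utterance>' / 'Assistant: ...<end_of_utterance>',
    with <image> marking where vision tokens are spliced in.
    Illustrative sketch only, not the actual Jinja template."""
    parts = []
    for m in messages:
        text = "".join(
            "<image>" if c["type"] == "image" else c["text"]
            for c in m["content"]
        )
        parts.append(f"{m['role'].capitalize()}: {text}<end_of_utterance>\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)

prompt = render_idefics3([
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ]},
])
```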
Models That Would Benefit
- `ibm-granite/granite-docling-258M` (document understanding)
- `HuggingFaceM4/Idefics3-8B-Llama3` (general VLM)
- Any future Idefics3-based models
References
- Granite Docling VLM model card
- Idefics3 paper: Building and Better Understanding VLMs
- HuggingFace Idefics3 docs
Happy to help with implementation or testing if useful!