[Feature Request] Add Idefics3 architecture support (Granite Docling VLM) #4079

@gaztrabisme

Description

Feature Request: Idefics3 Architecture Support

Summary

Requesting native Unsloth support for the Idefics3 architecture, which would enable optimized fine-tuning of models like IBM Granite Docling VLM (258M params) — a high-performing document understanding model.

Why This Matters

Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.

Architecture Analysis

Granite Docling VLM (and all Idefics3 models) uses Idefics3ForConditionalGeneration. Its components map closely to things Unsloth already supports:

| Component | Idefics3 / Granite Docling | Unsloth Status |
| --- | --- | --- |
| Vision encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model class | Idefics3ForConditionalGeneration | Not registered |
| Config type | model_type = "idefics3" with vision_config + text_config | Would be detected as VLM, but lacks patches |

The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
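For reference, the pixel shuffle step itself is just a tensor rearrangement that trades spatial tokens for channel width (token count ÷ 4, embedding dim × 4 at scale factor 2), followed by a linear projection into the text hidden size. A minimal sketch of the rearrangement, using NumPy and illustrative shapes (the real connector operates on SigLIP patch embeddings inside the model):

```python
import numpy as np

def pixel_shuffle(x, scale_factor=2):
    """Compress a (batch, seq, dim) grid of vision tokens by scale_factor**2,
    widening the embedding dim by the same factor."""
    bsz, seq, dim = x.shape
    h = w = int(seq ** 0.5)  # assumes a square patch grid
    x = x.reshape(bsz, h, w, dim)
    # fold `scale_factor` adjacent columns into the channel dim
    x = x.reshape(bsz, h, w // scale_factor, dim * scale_factor)
    x = x.transpose(0, 2, 1, 3)
    # fold `scale_factor` adjacent rows into the channel dim
    x = x.reshape(bsz, w // scale_factor, h // scale_factor, dim * scale_factor ** 2)
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(bsz, seq // scale_factor ** 2, dim * scale_factor ** 2)

# e.g. 1024 patch tokens (32x32 grid) -> 256 tokens with 4x wider features
tokens = np.zeros((1, 1024, 768))
out = pixel_shuffle(tokens)
print(out.shape)  # (1, 256, 3072)
```

No information is discarded here (the total element count is unchanged); only the projection that follows is learned, which is why the connector should be cheap to support.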

Desired Scope

Full FastVisionModel support including:

  • SFT via SFTTrainer
  • DPO via DPOTrainer
  • GRPO / GSPO via GRPOTrainer
  • LoRA with selective layer training (finetune_vision_layers, finetune_language_layers, etc.)
  • 4-bit quantization via load_in_4bit
  • Unsloth gradient checkpointing (use_gradient_checkpointing="unsloth")
  • Fast inference via vLLM integration (fast_inference=True)

Ideal Usage

```python
from unsloth import FastVisionModel
from trl import SFTTrainer

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```

Implementation Suggestions

Based on our analysis of the Unsloth codebase, here's what we believe is needed:

1. Registry entry — new _idefics.py:

```python
class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]
```

2. Architecture patches — new idefics.py in models:

  • Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
  • Optional vision encoder patches (SigLIP attention)
  • Pixel Shuffle connector handling

3. Support list updates:

  • Add "idefics3" to SUPPORTED_ARCHITECTURES in _utils.py
  • Add to VLLM_SUPPORTED_VLM in vision.py

4. Chat template — add Idefics3 template to chat_templates.py

5. LoRA target modules:

```python
# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"image_connector.{proj_in,proj_out}", "image_connector.simple_mlp.{fc1,fc2}"
```
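As a usage note: PEFT treats a string-valued `target_modules` as a full-match regex, so the vision-encoder glob patterns above could be collapsed into a single expression. A sketch with hypothetical module names mirroring those patterns (the actual names depend on how the Idefics3 modules end up wired in Unsloth):

```python
import re

# Regex equivalent of the vision-encoder patterns above
vision_targets = re.compile(
    r"vision_model\.encoder\.layers\.\d+\."
    r"(?:self_attn\.(?:q|k|v|out)_proj|mlp\.fc[12])"
)

names = [
    "vision_model.encoder.layers.0.self_attn.q_proj",  # vision attention
    "vision_model.encoder.layers.11.mlp.fc2",          # vision MLP
    "model.layers.3.self_attn.q_proj",                 # language side
]
matches = [bool(vision_targets.fullmatch(n)) for n in names]
print(matches)  # [True, True, False]
```

This is also how `finetune_vision_layers` / `finetune_language_layers` could be implemented: build the regex from the selected component groups and pass it through as `target_modules`.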

Architectural Similarity to Existing Models

| Feature | Idefics3 | Similar Supported Model |
| --- | --- | --- |
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |

Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.

Models That Would Benefit

  • ibm-granite/granite-docling-258M (document understanding)
  • HuggingFaceM4/Idefics3-8B-Llama3 (general VLM)
  • Any future Idefics3-based models

Happy to help with implementation or testing if useful!
