[Feature Request] Add Idefics3 architecture support (Granite Docling VLM) #4079

@gaztrabisme

Description

Feature Request: Idefics3 Architecture Support

Summary

Requesting native Unsloth support for the Idefics3 architecture, which would enable optimized fine-tuning of models like IBM Granite Docling VLM (258M params) — a high-performing document understanding model.

Why This Matters

Granite Docling VLM achieves 87.7 on DocVQA with only 258M parameters (vs. the original Idefics3-8B at 74.0). It's Apache 2.0 licensed and increasingly used for document conversion (PDFs, scans, slides → structured output). Unsloth support would make fine-tuning this model significantly faster and more memory-efficient, opening it up to consumer GPUs.

Architecture Analysis

Granite Docling VLM (and all Idefics3 models) uses Idefics3ForConditionalGeneration. Its components map closely to things Unsloth already supports:

| Component | Idefics3 / Granite Docling | Unsloth Status |
| --- | --- | --- |
| Vision encoder | SigLIP2-base-patch16-512 | SigLIP supported in other VLMs (LLaVA, etc.) |
| Language model | Granite 165M (Llama 3-based) | Llama fully supported |
| Connector | Pixel Shuffle projector (4x spatial compression) | Not yet in Unsloth |
| Model class | Idefics3ForConditionalGeneration | Not registered |
| Config type | model_type = "idefics3" with vision_config + text_config | Would be detected as VLM, but lacks patches |

The language model backbone is Llama-based, and the vision encoder is SigLIP — both already have Unsloth optimizations in other model families. The primary new component is the Pixel Shuffle connector that bridges vision→language.
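For reference, the pixel shuffle step itself is just a tensor rearrangement that trades spatial tokens for channel width (token count ÷ 4, embedding dim × 4 at scale factor 2), followed by a linear projection into the text hidden size. A minimal sketch of the rearrangement, using NumPy and illustrative shapes (the real connector operates on SigLIP patch embeddings inside the model):

```python
import numpy as np

def pixel_shuffle(x, scale_factor=2):
    """Compress a (batch, seq, dim) grid of vision tokens by scale_factor**2,
    widening the embedding dim by the same factor."""
    bsz, seq, dim = x.shape
    h = w = int(seq ** 0.5)  # assumes a square patch grid
    x = x.reshape(bsz, h, w, dim)
    # fold `scale_factor` adjacent columns into the channel dim
    x = x.reshape(bsz, h, w // scale_factor, dim * scale_factor)
    x = x.transpose(0, 2, 1, 3)
    # fold `scale_factor` adjacent rows into the channel dim
    x = x.reshape(bsz, w // scale_factor, h // scale_factor, dim * scale_factor ** 2)
    x = x.transpose(0, 2, 1, 3)
    return x.reshape(bsz, seq // scale_factor ** 2, dim * scale_factor ** 2)

# e.g. 1024 patch tokens (32x32 grid) -> 256 tokens with 4x wider features
tokens = np.zeros((1, 1024, 768))
out = pixel_shuffle(tokens)
print(out.shape)  # (1, 256, 3072)
```

No information is discarded here (the total element count is unchanged); only the projection that follows is learned, which is why the connector should be cheap to support.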

Desired Scope

Full FastVisionModel support including:

  • SFT via SFTTrainer
  • DPO via DPOTrainer
  • GRPO / GSPO via GRPOTrainer
  • LoRA with selective layer training (finetune_vision_layers, finetune_language_layers, etc.)
  • 4-bit quantization via load_in_4bit
  • Unsloth gradient checkpointing (use_gradient_checkpointing="unsloth")
  • Fast inference via vLLM integration (fast_inference=True)

Ideal Usage

```python
from unsloth import FastVisionModel
from trl import SFTTrainer

# Load Granite Docling VLM with Unsloth optimizations
model, tokenizer = FastVisionModel.from_pretrained(
    model_name="ibm-granite/granite-docling-258M",
    max_seq_length=2048,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Apply LoRA
model = FastVisionModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules="all-linear",
    finetune_vision_layers=False,
    finetune_language_layers=True,
)

# Train with any TRL trainer (SFT, DPO, GRPO)
trainer = SFTTrainer(model=model, tokenizer=tokenizer, ...)
trainer.train()
```

Implementation Suggestions

Based on our analysis of the Unsloth codebase, here's what we believe is needed:

1. Registry entry — new _idefics.py:

```python
class Idefics3VLMeta(ModelMeta):
    is_multimodal = True
    model_type = "idefics3"
    architectures = ["Idefics3ForConditionalGeneration"]
```

2. Architecture patches — new idefics.py in models:

  • Attention optimizations for the Llama-based text model (can likely reuse existing Llama patches)
  • Optional vision encoder patches (SigLIP attention)
  • Pixel Shuffle connector handling

3. Support list updates:

  • Add "idefics3" to SUPPORTED_ARCHITECTURES in _utils.py
  • Add to VLLM_SUPPORTED_VLM in vision.py

4. Chat template — add Idefics3 template to chat_templates.py

5. LoRA target modules:

```python
# Language model (Llama-based)
"q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"

# Vision encoder (SigLIP)
"vision_model.encoder.layers.*.self_attn.{q,k,v,out}_proj"
"vision_model.encoder.layers.*.mlp.{fc1,fc2}"

# Connector
"image_connector.{proj_in,proj_out}", "image_connector.simple_mlp.{fc1,fc2}"
```
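As a usage note: PEFT treats a string-valued `target_modules` as a full-match regex, so the vision-encoder glob patterns above could be collapsed into a single expression. A sketch with hypothetical module names mirroring those patterns (the actual names depend on how the Idefics3 modules end up wired in Unsloth):

```python
import re

# Regex equivalent of the vision-encoder patterns above
vision_targets = re.compile(
    r"vision_model\.encoder\.layers\.\d+\."
    r"(?:self_attn\.(?:q|k|v|out)_proj|mlp\.fc[12])"
)

names = [
    "vision_model.encoder.layers.0.self_attn.q_proj",  # vision attention
    "vision_model.encoder.layers.11.mlp.fc2",          # vision MLP
    "model.layers.3.self_attn.q_proj",                 # language side
]
matches = [bool(vision_targets.fullmatch(n)) for n in names]
print(matches)  # [True, True, False]
```

This is also how `finetune_vision_layers` / `finetune_language_layers` could be implemented: build the regex from the selected component groups and pass it through as `target_modules`.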

Architectural Similarity to Existing Models

| Feature | Idefics3 | Similar Supported Model |
| --- | --- | --- |
| Language backbone | Llama 3-based | Llama 3.2 Vision |
| Vision encoder | SigLIP | LLaVA |
| Attention type | Standard multi-head | LLaVA / Llama |
| Connector type | Pixel Shuffle | Unique (but simple linear projections) |

Given the overlap, we estimate this could leverage much of the existing Llama + SigLIP optimization code.

Models That Would Benefit

  • ibm-granite/granite-docling-258M (document understanding)
  • HuggingFaceM4/Idefics3-8B-Llama3 (general VLM)
  • Any future Idefics3-based models

Happy to help with implementation or testing if useful!
