v0.7.0: Vision-Language Model (VLM) Support

@RobotSail RobotSail released this 06 Mar 21:45
3c1e9d7

Summary: This release adds first-class support for training Vision-Language Models (VLMs) in text-only mode. Users can now fine-tune VLMs such as Qwen3.5-VL, Qwen3-VL, and Gemma3n: the trainer automatically detects the VLM architecture and either extracts the CausalLM text backbone or loads the full VLM directly. Both SFT and OSFT training modes are supported.

Highlights

  • 🔭 VLM Auto-Detection & Extraction: Automatically detects VLM models and either extracts the CausalLM text backbone (e.g. Mistral3 → Ministral3) or loads the full VLM directly for models with no standalone CausalLM class (e.g. Qwen3-VL)
  • 🧠 New vlm_utils Module: Dedicated utilities for VLM detection, backbone extraction, M-RoPE handling, timm vision tower detection, and SDPA fallback
  • 📦 Expanded Model Support: Added Qwen3.5, Qwen3-VL, Gemma3n, and Ministral3 to supported fine-tuning targets
  • 🔧 Improved Attention Handling: Automatic SDPA fallback for M-RoPE models and per-component attention config for timm vision towers
  • 🐛 Mamba Kernel Compatibility: Locally installed mamba_ssm/causal_conv1d packages are now preferred over Hub kernels for GraniteMoeHybrid, preventing PyTorch/CUDA ABI mismatches

New Features

Vision-Language Model Support

Add support for Qwen3.5 VL model by @RobotSail in #70

  • New vlm_utils.py module with helpers for the full VLM lifecycle:
    • is_vlm_with_causal_lm() — detect VLMs wrapping a CausalLM text backbone
    • is_vlm_for_direct_loading() — detect VLMs with no standalone CausalLM class
    • extract_causal_lm_from_vlm() — load a VLM and extract the text backbone as a standalone CausalLM
    • load_vlm_for_text_training() — load a VLM directly for text-only forward passes
    • has_mrope() / needs_sdpa() / has_timm_vision_tower() — attention implementation selection
  • Multi-step model class resolution in get_model_class_from_config() with fallback to text_config and ImageTextToText mapping
  • FSDP2 wrapping extended to handle model.model.language_model.layers path for direct VLMs
  • Activation checkpointing skipped for direct VLMs (M-RoPE layers cause tensor count mismatches during reentrant recomputation)
  • Per-component attention config: timm vision towers use eager while text models use FA2/SDPA
  • VLM-aware config access via _get_text_config() for vocab_size, pad_token_id, bos_token_id, eos_token_id
  • OSFT path updated to detect VLMs and extract text backbones before factorization
  • Added _can_set_experts_implementation guard on OSFT model class for non-MoE base classes
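The detection helpers above can be pictured with a small sketch. This is illustrative only: the function names mirror vlm_utils, but the bodies (including the stand-in CAUSAL_LM_MODEL_TYPES set, which the real code would derive from transformers' auto-model mappings) are assumptions, not the actual mini-trainer implementation.

```python
# Hypothetical sketch of VLM detection: a VLM wraps its text backbone in a
# nested text_config; whether that backbone has a standalone CausalLM class
# decides between extraction and direct loading.

# Illustrative stand-in for transformers' CausalLM auto-mapping keys
# (the real check would consult MODEL_FOR_CAUSAL_LM_MAPPING_NAMES).
CAUSAL_LM_MODEL_TYPES = {"llama", "mistral", "qwen2", "gemma3n_text"}


def is_vlm_with_causal_lm(config) -> bool:
    """True when the VLM's text backbone maps to a standalone CausalLM class."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return False  # plain text model, not a VLM wrapper
    return text_config.model_type in CAUSAL_LM_MODEL_TYPES


def is_vlm_for_direct_loading(config) -> bool:
    """True for VLMs (e.g. Qwen3-VL) whose backbone has no CausalLM class."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return False
    return text_config.model_type not in CAUSAL_LM_MODEL_TYPES
```

A model whose config passes the first check would go down the backbone-extraction path; one passing the second would be loaded whole for text-only forward passes.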

Expanded Model Support

  • Qwen3.5: New OSFT config with linear_attn.out_proj pattern support
  • Qwen3-VL: Direct VLM loading (no CausalLM class available)
  • Gemma3n: Dual-registered VLM loaded as CausalLM
  • Ministral3: Added to supported architectures list
  • Updated MODEL_NAME_MAPPINGS with qwen3_5/qwen3.5 entries (ordered before generic qwen for correct substring matching)

GPT-OSS Improvements

  • Flash-attn3 now checks for Hopper (SM 9.0+) GPU compute capability and falls back to eager attention on older hardware
  • Mamba kernel patching broadened to catch AttributeError in addition to ImportError
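The Hopper capability gate might look like the following minimal sketch, assuming PyTorch; the function names and exact fallback policy are assumptions, not mini-trainer's actual selection logic.

```python
# Hedged sketch: flash-attn3 requires Hopper-class GPUs, i.e. CUDA compute
# capability (major, minor) with major >= 9; otherwise fall back to eager.
def supports_flash_attn3(capability=None) -> bool:
    """capability: (major, minor) tuple; queried from torch when None."""
    if capability is None:
        import torch  # deferred import; only needed when querying the GPU
        if not torch.cuda.is_available():
            return False
        capability = torch.cuda.get_device_capability()
    return capability[0] >= 9


def pick_attn_implementation(capability=None) -> str:
    # Pre-Hopper hardware (e.g. A100 at SM 8.0) gets eager attention.
    return "flash_attention_3" if supports_flash_attn3(capability) else "eager"
```

Taking the capability as an argument keeps the policy testable without a GPU; in practice it would be read from `torch.cuda.get_device_capability()`.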

Upgrade Notes

  • No breaking API changes
  • VLM support is automatic — pass a VLM model path and mini-trainer handles detection and loading
  • Models using M-RoPE (e.g. Qwen3-VL) will automatically use SDPA instead of Flash Attention 2
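The automatic M-RoPE fallback can be sketched as below. The helper names echo vlm_utils' has_mrope()/needs_sdpa(), but the bodies are illustrative assumptions; the "mrope" rope_scaling type is how Qwen-style VL configs advertise multimodal RoPE.

```python
# Hedged sketch of the attention fallback: M-RoPE models declare
# rope_scaling with type/rope_type "mrope", which is incompatible with
# Flash Attention 2 here, so SDPA is substituted automatically.
def has_mrope(config) -> bool:
    """Detect M-RoPE via the config's rope_scaling dict."""
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return rope_type == "mrope"


def select_attn_implementation(config, requested: str = "flash_attention_2") -> str:
    # Silently downgrade FA2 to SDPA for M-RoPE models; honor other requests.
    if requested == "flash_attention_2" and has_mrope(config):
        return "sdpa"
    return requested
```

An explicitly requested implementation such as eager is left untouched; only the FA2-on-M-RoPE combination is rewritten.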

Installation

Through Pip:

uv pip install rhai-innovation-mini-trainer && uv pip install rhai-innovation-mini-trainer[cuda] --no-build-isolation

Locally:

uv pip install . && uv pip install .[cuda] --no-build-isolation

Full Changelog: v0.6.1...v0.7.0