v0.7.0: Vision-Language Model (VLM) Support

@RobotSail RobotSail released this 06 Mar 21:45
3c1e9d7

Summary: This release adds first-class support for training Vision-Language Models (VLMs) in text-only mode. Users can now fine-tune VLMs such as Qwen3.5-VL, Qwen3-VL, and Gemma3n: the trainer automatically detects the VLM architecture and either extracts the CausalLM text backbone or loads the full VLM directly. Both SFT and OSFT training modes are supported.

Highlights

  • 🔭 VLM Auto-Detection & Extraction: Automatically detects VLM models and either extracts the CausalLM text backbone (e.g. Mistral3 → Ministral3) or loads the full VLM directly for models with no standalone CausalLM class (e.g. Qwen3-VL)
  • 🧠 New vlm_utils Module: Dedicated utilities for VLM detection, backbone extraction, M-RoPE handling, timm vision tower detection, and SDPA fallback
  • 📦 Expanded Model Support: Added Qwen3.5, Qwen3-VL, Gemma3n, and Ministral3 to supported fine-tuning targets
  • 🔧 Improved Attention Handling: Automatic SDPA fallback for M-RoPE models and per-component attention config for timm vision towers
  • 🐛 Mamba Kernel Compatibility: Locally installed mamba_ssm/causal_conv1d packages are now preferred over Hub kernels for GraniteMoeHybrid, preventing PyTorch/CUDA ABI mismatches

New Features

Vision-Language Model Support

Add support for Qwen3.5 VL model by @RobotSail in #70

  • New vlm_utils.py module with helpers for the full VLM lifecycle:
    • is_vlm_with_causal_lm() — detect VLMs wrapping a CausalLM text backbone
    • is_vlm_for_direct_loading() — detect VLMs with no standalone CausalLM class
    • extract_causal_lm_from_vlm() — load a VLM and extract the text backbone as a standalone CausalLM
    • load_vlm_for_text_training() — load a VLM directly for text-only forward passes
    • has_mrope() / needs_sdpa() / has_timm_vision_tower() — attention implementation selection
  • Multi-step model class resolution in get_model_class_from_config() with fallback to text_config and ImageTextToText mapping
  • FSDP2 wrapping extended to handle model.model.language_model.layers path for direct VLMs
  • Activation checkpointing skipped for direct VLMs (M-RoPE layers cause tensor count mismatches during reentrant recomputation)
  • Per-component attention config: timm vision towers use eager while text models use FA2/SDPA
  • VLM-aware config access via _get_text_config() for vocab_size, pad_token_id, bos_token_id, eos_token_id
  • OSFT path updated to detect VLMs and extract text backbones before factorization
  • Added _can_set_experts_implementation guard on OSFT model class for non-MoE base classes
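The detection helpers above can be pictured with a small sketch. This is illustrative only: the function names mirror vlm_utils, but the bodies (including the stand-in CAUSAL_LM_MODEL_TYPES set, which the real code would derive from transformers' auto-model mappings) are assumptions, not the actual mini-trainer implementation.

```python
# Hypothetical sketch of VLM detection: a VLM wraps its text backbone in a
# nested text_config; whether that backbone has a standalone CausalLM class
# decides between extraction and direct loading.

# Illustrative stand-in for transformers' CausalLM auto-mapping keys
# (the real check would consult MODEL_FOR_CAUSAL_LM_MAPPING_NAMES).
CAUSAL_LM_MODEL_TYPES = {"llama", "mistral", "qwen2", "gemma3n_text"}


def is_vlm_with_causal_lm(config) -> bool:
    """True when the VLM's text backbone maps to a standalone CausalLM class."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return False  # plain text model, not a VLM wrapper
    return text_config.model_type in CAUSAL_LM_MODEL_TYPES


def is_vlm_for_direct_loading(config) -> bool:
    """True for VLMs (e.g. Qwen3-VL) whose backbone has no CausalLM class."""
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        return False
    return text_config.model_type not in CAUSAL_LM_MODEL_TYPES
```

A model whose config passes the first check would go down the backbone-extraction path; one passing the second would be loaded whole for text-only forward passes.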

Expanded Model Support

  • Qwen3.5: New OSFT config with linear_attn.out_proj pattern support
  • Qwen3-VL: Direct VLM loading (no CausalLM class available)
  • Gemma3n: Dual-registered VLM loaded as CausalLM
  • Ministral3: Added to supported architectures list
  • Updated MODEL_NAME_MAPPINGS with qwen3_5/qwen3.5 entries (ordered before generic qwen for correct substring matching)

GPT-OSS Improvements

  • Flash-attn3 now checks for Hopper (SM 9.0+) GPU compute capability and falls back to eager attention on older hardware
  • Mamba kernel patching broadened to catch AttributeError in addition to ImportError
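The Hopper capability gate might look like the following minimal sketch, assuming PyTorch; the function names and exact fallback policy are assumptions, not mini-trainer's actual selection logic.

```python
# Hedged sketch: flash-attn3 requires Hopper-class GPUs, i.e. CUDA compute
# capability (major, minor) with major >= 9; otherwise fall back to eager.
def supports_flash_attn3(capability=None) -> bool:
    """capability: (major, minor) tuple; queried from torch when None."""
    if capability is None:
        import torch  # deferred import; only needed when querying the GPU
        if not torch.cuda.is_available():
            return False
        capability = torch.cuda.get_device_capability()
    return capability[0] >= 9


def pick_attn_implementation(capability=None) -> str:
    # Pre-Hopper hardware (e.g. A100 at SM 8.0) gets eager attention.
    return "flash_attention_3" if supports_flash_attn3(capability) else "eager"
```

Taking the capability as an argument keeps the policy testable without a GPU; in practice it would be read from `torch.cuda.get_device_capability()`.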

Upgrade Notes

  • No breaking API changes
  • VLM support is automatic — pass a VLM model path and mini-trainer handles detection and loading
  • Models using M-RoPE (e.g. Qwen3-VL) will automatically use SDPA instead of Flash Attention 2
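The automatic M-RoPE fallback can be sketched as below. The helper names echo vlm_utils' has_mrope()/needs_sdpa(), but the bodies are illustrative assumptions; the "mrope" rope_scaling type is how Qwen-style VL configs advertise multimodal RoPE.

```python
# Hedged sketch of the attention fallback: M-RoPE models declare
# rope_scaling with type/rope_type "mrope", which is incompatible with
# Flash Attention 2 here, so SDPA is substituted automatically.
def has_mrope(config) -> bool:
    """Detect M-RoPE via the config's rope_scaling dict."""
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return rope_type == "mrope"


def select_attn_implementation(config, requested: str = "flash_attention_2") -> str:
    # Silently downgrade FA2 to SDPA for M-RoPE models; honor other requests.
    if requested == "flash_attention_2" and has_mrope(config):
        return "sdpa"
    return requested
```

An explicitly requested implementation such as eager is left untouched; only the FA2-on-M-RoPE combination is rewritten.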

Installation

Through Pip:

uv pip install rhai-innovation-mini-trainer && uv pip install rhai-innovation-mini-trainer[cuda] --no-build-isolation

Locally:

uv pip install . && uv pip install .[cuda] --no-build-isolation

Full Changelog: v0.6.1...v0.7.0