Summary: This release adds first-class support for training Vision-Language Models (VLMs) in text-only mode. Users can now fine-tune VLMs such as Qwen3.5-VL, Qwen3-VL, and Gemma3n: mini-trainer automatically detects the VLM architecture and either extracts the CausalLM text backbone or loads the full VLM directly. Both SFT and OSFT training modes are supported.
Highlights
- 🔭 VLM Auto-Detection & Extraction: Automatically detects VLM models, extracts CausalLM text backbones (e.g. Mistral3 → Ministral3), or loads directly for models with no standalone CausalLM class (e.g. Qwen3-VL)
- 🧠 New `vlm_utils` Module: Dedicated utilities for VLM detection, backbone extraction, M-RoPE handling, timm vision tower detection, and SDPA fallback
- ⚡ Expanded Model Support: Added Qwen3.5, Qwen3-VL, Gemma3n, and Ministral3 to supported fine-tuning targets
- 🔧 Improved Attention Handling: Automatic SDPA fallback for M-RoPE models and per-component attention config for timm vision towers
- 🐛 Mamba Kernel Compatibility: Local `mamba_ssm`/`causal_conv1d` packages are preferred over Hub kernels for GraniteMoeHybrid to prevent PyTorch/CUDA ABI mismatches
New Features
Vision-Language Model Support
- Add support for Qwen3.5 VL model by @RobotSail in #70
- New `vlm_utils.py` module with helpers for the full VLM lifecycle:
  - `is_vlm_with_causal_lm()` — detect VLMs wrapping a CausalLM text backbone
  - `is_vlm_for_direct_loading()` — detect VLMs with no standalone CausalLM class
  - `extract_causal_lm_from_vlm()` — load a VLM and extract the text backbone as a standalone CausalLM
  - `load_vlm_for_text_training()` — load a VLM directly for text-only forward passes
  - `has_mrope()` / `needs_sdpa()` / `has_timm_vision_tower()` — attention implementation selection
- Multi-step model class resolution in `get_model_class_from_config()` with fallback to `text_config` and the `ImageTextToText` mapping
- FSDP2 wrapping extended to handle the `model.model.language_model.layers` path for direct VLMs
- Activation checkpointing skipped for direct VLMs (M-RoPE layers cause tensor count mismatches during reentrant recomputation)
- Per-component attention config: timm vision towers use `eager` while text models use FA2/SDPA
- VLM-aware config access via `_get_text_config()` for `vocab_size`, `pad_token_id`, `bos_token_id`, `eos_token_id`
- OSFT path updated to detect VLMs and extract text backbones before factorization
- Added a `_can_set_experts_implementation` guard on the OSFT model class for non-MoE base classes
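The two detection paths above can be sketched roughly as follows. This is an illustrative sketch only: the architecture sets and function bodies are assumptions for demonstration, not mini-trainer's actual implementation.

```python
# Hypothetical sketch of the VLM detection split described above.
# Architecture names here are illustrative examples, not an exhaustive list.

# VLMs whose text backbone also exists as a standalone CausalLM class,
# so the backbone can be extracted and trained on its own.
VLM_WITH_CAUSAL_LM = {"Mistral3ForConditionalGeneration"}

# VLMs with no standalone CausalLM class; the whole model is loaded and
# text-only forward passes run through the VLM itself.
VLM_DIRECT_LOAD = {"Qwen3VLForConditionalGeneration"}


def is_vlm_with_causal_lm(architecture: str) -> bool:
    """True if the VLM wraps a text backbone with a standalone CausalLM class."""
    return architecture in VLM_WITH_CAUSAL_LM


def is_vlm_for_direct_loading(architecture: str) -> bool:
    """True if the VLM must be loaded directly (no CausalLM class available)."""
    return architecture in VLM_DIRECT_LOAD
```

A caller would branch on these checks: extract the backbone when one exists, otherwise load the full VLM and route text batches through it.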
Expanded Model Support
- Qwen3.5: New OSFT config with `linear_attn.out_proj` pattern support
- Qwen3-VL: Direct VLM loading (no CausalLM class available)
- Gemma3n: Dual-registered VLM loaded as CausalLM
- Ministral3: Added to supported architectures list
- Updated `MODEL_NAME_MAPPINGS` with `qwen3_5` / `qwen3.5` entries (ordered before the generic `qwen` entry for correct substring matching)
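The ordering constraint matters because a first-match substring scan would otherwise resolve a Qwen3.5 path to the generic Qwen entry. A minimal sketch (the mapping structure and `resolve_family` helper are hypothetical, for illustration only):

```python
# Hypothetical first-match substring mapping; specific entries precede
# generic ones so "qwen3_5" wins before the bare "qwen" fallback.
MODEL_NAME_MAPPINGS = [
    ("qwen3_5", "Qwen3.5"),
    ("qwen3.5", "Qwen3.5"),
    ("qwen", "Qwen"),  # generic entry must come last
]


def resolve_family(model_path: str) -> str:
    """Return the first mapping whose key appears in the model path."""
    lowered = model_path.lower()
    for substring, family in MODEL_NAME_MAPPINGS:
        if substring in lowered:
            return family
    return "unknown"
```

With the generic `qwen` entry first, every Qwen3.5 checkpoint path would match it before the specific entries were ever reached.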
GPT-OSS Improvements
- Flash-attn3 now checks for Hopper (SM 9.0+) GPU capability; falls back to `eager` on older hardware
- Mamba kernel patching broadened to catch `AttributeError` in addition to `ImportError`
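The capability gate above amounts to a comparison against compute capability 9.0. A minimal sketch, with the function names and structure assumed for illustration (in practice the version tuple would come from `torch.cuda.get_device_capability()`):

```python
# Hypothetical sketch of the Hopper capability check for flash-attn3.

def supports_flash_attn3(major: int, minor: int) -> bool:
    """Flash-attn3 kernels require Hopper-class hardware (SM 9.0 or newer)."""
    return (major, minor) >= (9, 0)


def pick_attn_implementation(major: int, minor: int) -> str:
    # Fall back to eager attention on pre-Hopper GPUs, as the release
    # notes describe.
    return "flash_attention_3" if supports_flash_attn3(major, minor) else "eager"
```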
Upgrade Notes
- No breaking API changes
- VLM support is automatic — pass a VLM model path and mini-trainer handles detection and loading
- Models using M-RoPE (e.g. Qwen3-VL) will automatically use SDPA instead of Flash Attention 2
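The M-RoPE detection behind this fallback can be sketched as a config check. The field shape below (`rope_scaling` containing an `mrope_section` key) follows Qwen-style VLM configs, but the function bodies are illustrative assumptions, not mini-trainer's actual code:

```python
from types import SimpleNamespace


def has_mrope(config) -> bool:
    """Heuristic: M-RoPE is signalled by an mrope_section entry in
    rope_scaling (as in Qwen-style VLM configs)."""
    rope_scaling = getattr(config, "rope_scaling", None) or {}
    return "mrope_section" in rope_scaling


def needs_sdpa(config) -> bool:
    # Flash Attention 2 does not handle M-RoPE position layouts here,
    # so M-RoPE models fall back to PyTorch SDPA.
    return has_mrope(config)


# Illustrative configs (field values are made up for the example).
qwen_vl_like = SimpleNamespace(rope_scaling={"mrope_section": [16, 24, 24]})
plain_llm = SimpleNamespace(rope_scaling=None)
```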
Contributors
Installation
Through Pip:
```shell
uv pip install rhai-innovation-mini-trainer && uv pip install rhai-innovation-mini-trainer[cuda] --no-build-isolation
```
Locally:
```shell
uv pip install . && uv pip install .[cuda] --no-build-isolation
```
Full Changelog: v0.6.1...v0.7.0