This directory contains comprehensive documentation for state-of-the-art multimodal (vision-language) models that combine visual and textual understanding.
Multimodal models bridge the gap between computer vision and natural language processing, enabling AI systems to understand and reason about both visual and textual information. These models have revolutionized tasks like image captioning, visual question answering, document understanding, and embodied AI.
- LLaVA-RLHF - Large Language and Vision Assistant with RLHF alignment
- PaLM-E - Embodied multimodal language model for robotics
- HiViLT - Hierarchical Vision-Language Transformer
- LLaVA-NeXT - Advanced LLaVA with dynamic resolution support
- Qwen2-VL - Vision-language model with Multimodal RoPE
- Molmo - Fully open vision-language model from AI2
- Phi-3-Vision - Lightweight multimodal model with 128K context
- BiomedCLIP - Biomedical vision-language model
Documentation for NVIDIA's NVLM will be added once implementation is available.
The core challenge in multimodal models is aligning representations from different modalities:
- Contrastive Learning: CLIP-style approaches that learn joint embeddings
- Cross-Modal Attention: Attention mechanisms that fuse visual and text features
- Projection Layers: Mapping visual features to language model space
- Instruction Tuning & Preference Alignment: Supervised instruction following, then alignment with human preferences via RLHF/DPO
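To make the first of these concrete, here is a minimal NumPy sketch of the CLIP-style symmetric contrastive (InfoNCE) objective: matched image/text pairs sit on the diagonal of a similarity matrix, and cross-entropy is applied in both directions. The temperature value and batch shapes are illustrative, not any model's actual hyperparameters.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (B, D) arrays; matched pairs share a row index.
    """
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))         # diagonal entries are the positives

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()             # NLL of the diagonal

    # average of image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

Perfectly matched embeddings drive the loss toward zero, while unrelated pairs leave it near log(batch size), which is what makes the objective a useful alignment signal.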
Common architectural patterns across multimodal models:
- Dual Encoder: Separate vision and language encoders (CLIP, BiomedCLIP)
- Fusion-based: Cross-modal fusion layers (PaLM-E, HiViLT)
- LLM-centric: Visual features projected into LLM space (LLaVA family, Molmo)
- Efficient Design: Lightweight models for edge deployment (Phi-3-Vision)
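As a concrete illustration of the LLM-centric pattern, the sketch below projects frozen vision-encoder patch features into a language model's embedding space with a small MLP and prepends them to the text token embeddings. All dimensions, the two-layer projector, and the ReLU are illustrative stand-ins (LLaVA-style models typically use a GELU MLP); this is the general shape of the pattern, not any listed model's shipped architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D_VIS, D_LLM, N_PATCH, N_TXT = 1024, 4096, 16, 8

# Hypothetical 2-layer MLP projector; sizes and init are illustrative.
W1 = rng.normal(scale=0.02, size=(D_VIS, D_LLM))
W2 = rng.normal(scale=0.02, size=(D_LLM, D_LLM))

def project_visual(patch_feats):
    """Map vision-encoder patch features into the LLM's embedding space."""
    h = np.maximum(patch_feats @ W1, 0.0)   # GELU in practice; ReLU for brevity
    return h @ W2                           # (N_PATCH, D_LLM)

vision_feats = rng.normal(size=(N_PATCH, D_VIS))   # from a frozen vision encoder
text_embeds = rng.normal(size=(N_TXT, D_LLM))      # from the LLM's embedding table

# The projected visual tokens are simply prepended to the text sequence,
# so the LLM consumes them like ordinary tokens.
llm_input = np.concatenate([project_visual(vision_feats), text_embeds], axis=0)
```

The appeal of this pattern is that only the projector (and optionally the LLM) needs training, while the vision encoder stays frozen.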
- Visual Question Answering (VQA)
- Image Captioning
- Visual Reasoning
- Document Understanding
- Video Understanding
- Robotics: Embodied AI and manipulation (PaLM-E)
- Biomedical: Medical image understanding (BiomedCLIP)
- Research: Open science and reproducibility (Molmo)
All models are implemented in Nexus/nexus/models/multimodal/:
- Modular design following NexusModule base class
- Support for ConfigValidatorMixin and FeatureBankMixin
- Efficient implementations with attention to memory usage
- Integration with training infrastructure
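To make the module conventions above concrete, here is a toy sketch of the config-validation pattern. `NexusModule` and `ConfigValidatorMixin` are real names in this repo, but their actual signatures may differ; the classes below are hypothetical stand-ins written from scratch, not the repo's API.

```python
# Hypothetical sketch: illustrates the config-validated module pattern only.
class ConfigValidatorMixin:
    """Mixin that checks required keys before a module is built."""
    REQUIRED_KEYS: tuple = ()

    def validate_config(self, config: dict) -> dict:
        missing = [k for k in self.REQUIRED_KEYS if k not in config]
        if missing:
            raise ValueError(f"missing config keys: {missing}")
        return config

class ToyVisionLanguageModel(ConfigValidatorMixin):
    """Stand-in for a NexusModule subclass; names are illustrative."""
    REQUIRED_KEYS = ("vision_dim", "llm_dim")

    def __init__(self, config: dict):
        self.config = self.validate_config(config)
```

Validating configuration up front keeps misconfigured models from failing deep inside training, which is the motivation for baking the check into a shared mixin.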
For each model, the documentation follows a consistent structure:
- Overview & Motivation - Why this model exists
- Theoretical Background - Key concepts and innovations
- Mathematical Formulation - Rigorous definitions
- Architecture - High-level design and diagrams
- Implementation Details - Code walkthrough
- Optimization Tricks - Training and inference optimizations
- Experiments & Results - Performance benchmarks
- Common Pitfalls - What to avoid
- References - Papers and resources
- Start with LLaVA-NeXT for modern vision-LLM architecture
- Read BiomedCLIP for contrastive learning fundamentals
- Explore Molmo for a fully open ecosystem
- Phi-3-Vision for efficient deployment
- Qwen2-VL for advanced position encoding (M-RoPE) and other novel architectural components
- LLaVA-RLHF for alignment techniques
- PaLM-E for embodied AI
- HiViLT for hierarchical fusion
- Dynamic Resolution: Moving beyond fixed image sizes (LLaVA-NeXT, Qwen2-VL)
- Long Context: Supporting 128K+ tokens for document understanding (Phi-3-Vision)
- Open Science: Fully open models including data and training recipes (Molmo)
- Efficient Architectures: Smaller models with strong performance (Phi-3-Vision)
- Domain Specialization: Vertical-specific models (BiomedCLIP for medicine)
- Embodied AI: Integration with robotics and physical world (PaLM-E)
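The dynamic-resolution trend can be illustrated with a small token-count calculation: round each image side up to a patch-grid multiple, split into patches, then merge spatial blocks into visual tokens. The patch size of 14 and the 2x2 merge mirror what Qwen2-VL describes, but the ceiling-based rounding policy here is a simplification of real preprocessing pipelines.

```python
def patch_grid(height, width, patch=14, merge=2):
    """Visual-token count for an arbitrary-resolution image (simplified).

    Each side is rounded up to a multiple of (patch * merge), the image is
    split into patch x patch cells, and every merge x merge block of cells
    becomes one visual token.
    """
    step = patch * merge
    h = -(-height // step) * step   # ceil to a multiple of step
    w = -(-width // step) * step
    grid_h, grid_w = h // patch, w // patch
    return (grid_h // merge) * (grid_w // merge)
```

Letting the token count scale with resolution, instead of resizing every image to a fixed square, is what preserves detail in high-resolution and oddly shaped inputs.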
| Model | Parameters | Context Length | Key Strength | Use Case |
|---|---|---|---|---|
| LLaVA-RLHF | 7B-13B | 4K | RLHF alignment | General VQA |
| PaLM-E | 562B | 2K | Embodied AI | Robotics |
| HiViLT | Variable | 1K | Hierarchical fusion | Multi-granular tasks |
| LLaVA-NeXT | 7B-34B | 4K | Dynamic resolution | High-res images |
| Qwen2-VL | 7B-72B | 32K | M-RoPE, Any resolution | Long documents |
| Molmo | 7B-72B | 2K | Fully open | Research |
| Phi-3-Vision | 4.2B | 128K | Efficiency | Edge deployment |
| BiomedCLIP | 340M | 77 tokens | Medical domain | Healthcare |
Multimodal models leverage specialized training components:
- EnhancedSFTLoss: Supervised fine-tuning with quality assessment
- FeatureBankMixin: Experience replay for stable training
- HallucinationReducer: Reducing visual hallucinations
- RAGModule: Retrieval-augmented generation for grounding
- Unified Architectures: Single model for images, videos, audio
- Efficient Training: Reducing computational requirements
- Better Alignment: Improving vision-language grounding
- Multilinguality: Beyond English-centric models
- 3D Understanding: Spatial reasoning and NeRF integration
- Tool Use: Integrating with external tools and APIs
When adding new multimodal models:
- Follow the established documentation template
- Include mathematical formulations
- Provide code walkthrough with references to implementation
- Document optimization tricks and common pitfalls
- Add benchmark results and comparisons
See individual model documentation for specific papers and resources. Key foundational works:
- CLIP (Radford et al., 2021)
- Flamingo (Alayrac et al., 2022)
- BLIP-2 (Li et al., 2023)
- LLaVA (Liu et al., 2023)