This directory contains comprehensive documentation for vision backbone architectures, including Vision Transformers and their variants, classic convolutional networks, and vision-language models.
- ViT - Vision Transformer: The original pure transformer for image classification
- Swin Transformer - Hierarchical vision transformer with shifted windows
- HiViT - Hierarchical Vision Transformer with multi-scale processing
- ResNet/VGG - Classic convolutional architectures
- EfficientNet - Compound scaling and mobile-optimized CNNs
- DINOv2 - Self-supervised visual features without labels
- SigLIP - Sigmoid loss for language-image pre-training
- EVA-02 - Masked image modeling with CLIP-derived targets
- InternVL - Large-scale vision-language foundation model
| Model | Type | Key Feature | Best For |
|---|---|---|---|
| ViT | Pure Transformer | Patch-based attention | Large-scale pre-training |
| Swin | Hierarchical ViT | Shifted windows | Dense prediction tasks |
| HiViT | Multi-scale ViT | Hierarchical features | Multi-resolution analysis |
| ResNet | CNN | Residual connections | Baseline comparisons |
| EfficientNet | CNN | Compound scaling | Mobile/edge deployment |
| DINOv2 | Self-supervised ViT | No labels needed | Transfer learning |
| SigLIP | Vision-Language | Sigmoid loss | Image-text matching |
| EVA-02 | Vision-Language | MIM + CLIP | Multimodal tasks |
| InternVL | Vision-Language | Cross-modal fusion | VQA, captioning |
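To make the "patch-based attention" entry for ViT concrete, the sketch below splits an image into non-overlapping patches and linearly projects them into tokens, which is the first step shared by every ViT-style model in the table. Shapes and the random projection are illustrative only, not tied to any particular checkpoint.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an (N, patch*patch*C) array where N = (H//patch) * (W//patch).
    """
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image must be divisible by patch size"
    # (H/p, p, W/p, p, C) -> (H/p, W/p, p, p, C) -> (N, p*p*C)
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, patch * patch * C)

# A ViT-B/16 on a 224x224 RGB image yields 14*14 = 196 patch tokens; each
# raw patch vector (16*16*3 = 768 values) is mapped to the model width by a
# learned projection, for which a random matrix stands in here.
rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))
tokens = patchify(img, 16)                 # (196, 768) raw patch vectors
W_proj = rng.standard_normal((768, 768))   # stand-in for the learned projection
embeddings = tokens @ W_proj               # (196, 768) patch embeddings
print(tokens.shape, embeddings.shape)
```

These embeddings, plus a class token and position embeddings, are what the transformer's attention layers then operate on.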
- **Image classification**
  - Small datasets: fine-tune a pretrained DINOv2 backbone
  - Large datasets: ViT-B/16 or Swin-B
  - Mobile/edge: EfficientNet-B0 to B3
- **Dense prediction (detection, segmentation)**
  - Best accuracy: Swin-L or EVA-02-L
  - Balanced: Swin-B or HiViT-B
  - Fast inference: ResNet-50 + FPN
- **Feature extraction / transfer learning**
  - General vision: DINOv2-g or EVA-02-L
  - Vision-language: SigLIP-L or InternVL
  - Few-shot: DINOv2 with linear probing
- **Multimodal tasks**
  - Image-text retrieval: SigLIP-L
  - Visual question answering: InternVL
  - Zero-shot classification: EVA-02-CLIP
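The few-shot recommendation above (frozen features plus a linear classifier) can be sketched as follows. Random vectors stand in for DINOv2 embeddings, which in practice would come from a frozen pretrained encoder; the probe here is the simplest linear classifier, nearest class mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen backbone features: in practice, run each image through
# a frozen DINOv2 encoder and keep its pooled embedding.
n_classes, shots, dim = 5, 10, 384            # 384 ~ DINOv2 ViT-S width
class_centers = rng.standard_normal((n_classes, dim))
X = np.concatenate([c + 0.5 * rng.standard_normal((shots, dim)) for c in class_centers])
y = np.repeat(np.arange(n_classes), shots)

# Linear probe via nearest class mean: only these means are "learned";
# the backbone producing X stays frozen throughout.
means = np.stack([X[y == k].mean(axis=0) for k in range(n_classes)])
pred = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
print(f"train accuracy: {(pred == y).mean():.2f}")
```

Because no backbone gradients are needed, this works with as few as a handful of labeled images per class.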
All models follow the Nexus framework patterns:
- Inherit from NexusModule
- Use unified configuration dictionaries
- Support feature extraction and fine-tuning
- Include weight initialization utilities
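As an illustrative sketch of the patterns above: the base class here is a minimal stand-in and the config keys are hypothetical, so consult each model's documentation for the actual interfaces. The example shows a unified config dictionary, feature-extraction support, and a weight-initialization utility in one small module.

```python
import torch
import torch.nn as nn

class NexusModule(nn.Module):
    """Hypothetical stand-in for the framework base class."""
    def __init__(self, config: dict):
        super().__init__()
        self.config = config

class TinyBackbone(NexusModule):
    """Minimal model following the listed conventions."""

    def __init__(self, config: dict):
        super().__init__(config)
        dim = config["hidden_dim"]               # config keys are illustrative
        self.stem = nn.Linear(config["in_dim"], dim)
        self.head = nn.Linear(dim, config["num_classes"])
        self.apply(self._init_weights)           # weight initialization utility

    @staticmethod
    def _init_weights(m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=0.02)
            nn.init.zeros_(m.bias)

    def forward(self, x, return_features: bool = False):
        # return_features=True supports feature extraction; False gives logits
        feats = torch.relu(self.stem(x))
        return feats if return_features else self.head(feats)

model = TinyBackbone({"in_dim": 768, "hidden_dim": 256, "num_classes": 10})
features = model(torch.randn(4, 768), return_features=True)   # (4, 256)
logits = model(torch.randn(4, 768))                           # (4, 10)
print(features.shape, logits.shape)
```

The same construction pattern applies whether the module is used as a frozen feature extractor or fine-tuned end to end.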
See individual model documentation for detailed references and paper citations.