|
## Introduction

**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** with substantially **lower cost** through training on **native resolution** images.

#### **Superior Performance**
- The model leads on multiple multimodal benchmarks and generally surpasses Qwen2.5-VL.
- Training on native-resolution images significantly improves its visual understanding.

#### **High-Quality Data at Scale**
- The pretraining corpus comprises large-scale, concept-balanced, diverse, and high-quality captions curated with strict filtering and quality control.
- The instruction-tuning dataset is comprehensive and covers a wide range of tasks.

#### **Ultra-Efficient Training Framework**
- The end-to-end training cost is about $16,000 on A100 GPUs at roughly $0.60 per GPU-hour.
- The system is built on Megatron-LM with support for MoE, FP8, and long-sequence parallelism, and the codebase is optimized for cost-effective scaling.

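The stated budget implies a concrete compute envelope. A minimal sketch of that arithmetic (the 128-GPU cluster size is an illustrative assumption, not a figure from the project):

```python
# Back-of-the-envelope check of the reported training budget.
total_budget_usd = 16_000      # reported end-to-end training cost
rate_usd_per_gpu_hour = 0.60   # reported A100 rental rate

gpu_hours = total_budget_usd / rate_usd_per_gpu_hour
print(f"{gpu_hours:,.0f} A100 GPU-hours")  # ≈ 26,667

# Hypothetical cluster size, for a sense of wall-clock time:
num_gpus = 128
wall_clock_days = gpu_hours / num_gpus / 24
print(f"~{wall_clock_days:.1f} days on {num_gpus} GPUs")
```

At the quoted rate the budget buys roughly 26,667 GPU-hours; how that translates to wall-clock time depends entirely on the cluster size and parallelization efficiency.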
#### **Fully Open Framework**
- The project releases high-quality pretraining and SFT datasets along with the complete training framework, configurations, and recipes.
- It also provides detailed training logs and metrics to enable reproducibility and community adoption.

## Models
|