---

## Contents
<!-- TOC (no expansion for Quick Start Guide / Fully Reproducing Guide) -->
- [Introduction](#introduction)
- [Models](#models)
- [Datasets](#datasets)
- [Results](#evaluation-results)
- [Quick Start with HuggingFace](#quick-start-with-huggingface)
- [Evaluation](#evaluation)
- [Quick Start For Training](#quick-start-guide)
- [Fully Reproducing Guide](#fully-reproducing-guide)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)

## Introduction

**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.

- **Superior Performance**
  A family of fully open-source large multimodal models demonstrating **superior performance** across multiple multimodal benchmarks, **outperforming Qwen2.5-VL** in most evaluation tasks.

- **High-Quality Data at Scale**
  Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
  - Concept-balanced, highly diverse, high-quality caption data
  - Comprehensive instruction fine-tuning data covering a wide range of tasks

- **Ultra-Efficient Training Framework**
  A complete end-to-end training framework designed for maximum efficiency:
  - $16,000 total budget for full model training on A100 GPUs (at $0.6 per GPU-hour)
  - 45% HFU (Hardware FLOPs Utilization) at 8K context length
  - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
  - Optimized codebase for cost-effective scaling

- **Fully Open Framework** for community access and reproducibility:
  - High-quality pre-training & SFT data
  - Complete training framework & code
  - Training recipes & configurations
  - Comprehensive training logs & metrics

## Models

| Model                    | #Vision Param | #Language Param | #Total Param | HuggingFace Link |
|--------------------------|---------------|-----------------|--------------|------------------|
| LLaVA-OV-1.5-4B-Instruct | 0.3B          | 4.4B            | 4.7B         | coming soon      |
| LLaVA-OV-1.5-8B-Instruct | 0.3B          | 8.2B            | 8.5B         | [🤗 link](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) |

## Datasets

![Dataset Visualization](asset/dataset.png)
<p align="center">(a) Vocabulary coverage proportion in the LLaVA-OneVision-1.5-Mid-Training dataset before and after concept balancing. (b) Distribution of data sources within the LLaVA-OneVision-1.5-Mid-Training dataset. (c) Distribution of data sources within the LLaVA-OneVision-1.5-Instruct dataset.</p>

| Dataset                        | HuggingFace Link | Status     |
|--------------------------------|------------------|------------|
| LLaVA-OV-1.5-Mid-Training-85M  | [🤗 link](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
| LLaVA-OV-1.5-Instruct          | [🤗 link](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |

## Evaluation Results

All evaluations were conducted using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval); an example invocation is sketched after the results below.

| Benchmark                          | **LLaVA-OV-1.5-8B** | **Qwen2.5 VL 7B** | **LLaVA-OV-1.5-4B** | **Qwen2.5 VL 3B** |
|:----------------------------------|:---------------:|:-------------:|:---------------:|:-------------:|
| MMMU (Validation)                  | **55.44** | 51.33 | **51.44** | 46.44 |
| MMMU-Pro (Standard)                | **37.40** | 36.30 | **33.24** | 31.10 |
| MMMU-Pro (Vision)                  | 25.15 | **32.83** | **23.53** | 21.27 |
| MMBench (English; Test)            | **84.14** | 83.40 | **82.29** | 77.97 |
| MMBench (Chinese; Test)            | 81.00 | **81.61** | **76.73** | 74.55 |
| MME-RealWorld (English)            | **62.31** | 57.33 | **57.16** | 51.60 |
| MME-RealWorld (Chinese)            | **56.11** | 51.50 | 21.38 | **45.38** |
| AI2D (With Mask)                   | **84.16** | 82.58 | **84.62** | 78.56 |
| AI2D (Without Mask)                | **94.11** | 93.36 | **92.84** | 90.74 |
| CV-Bench                           | **80.82** | 79.95 | **74.00** | 71.53 |
| VL-RewardBench                     | 45.90 | **49.65** | **45.90** | 42.06 |
| V*                                 | **78.01** | 76.96 | 66.49 | **69.63** |
| PixmoCount                         | 62.19 | **63.33** | **59.17** | 50.85 |
| CountBench                         | **88.19** | 86.35 | **77.80** | 72.51 |
| ChartQA                            | **86.48** | 84.08 | **85.11** | 83.36 |
| CharXiv (Direct Questions)         | **74.10** | 69.80 | **70.70** | 58.20 |
| DocVQA (Test)                      | **95.00** | 94.93 | **93.48** | 92.67 |
| InfoVQA (Test)                     | 78.42 | **81.67** | **75.27** | 75.63 |
| WeMath                             | **33.62** | 33.33 | **28.00** | 18.38 |
| MathVista (Mini)                   | **69.57** | 68.60 | **67.36** | 60.23 |
| MathVision                         | **25.56** | 22.37 | **22.76** | 21.25 |
| MMStar                             | **67.72** | 62.54 | **64.22** | 55.86 |
| SEED-Bench (Image)                 | 77.32 | **77.53** | **76.74** | 74.81 |
| ScienceQA                          | **94.98** | 88.75 | **92.05** | 83.33 |
| SEED-Bench 2-Plus                  | 69.21 | **70.93** | **68.42** | 68.64 |
| OCRBench                           | 82.90 | **84.20** | 77.80 | **79.20** |
| RealWorldQA                        | 68.10 | **68.50** | **64.05** | 60.00 |
| 84 | + |
| 85 | +<p align="center"><b></b> Performance comparison across vision-language models on various benchmarks grouped by task |
| 86 | +type. All scores are reported as accuracy percentages unless otherwise specified.</p> |
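
As a rough illustration of how a score like the ones above can be reproduced, a minimal lmms-eval invocation is sketched below. The `--model` loader name and task identifier are assumptions for illustration only; consult the lmms-eval documentation and the Evaluation section for the configuration registered for LLaVA-OneVision-1.5.

```bash
# Hedged sketch: evaluate the released 8B Instruct checkpoint on MMMU (validation).
# The loader name ("llava_onevision") and task id ("mmmu_val") are assumptions;
# replace them with the identifiers documented for LLaVA-OneVision-1.5.
python -m lmms_eval \
    --model llava_onevision \
    --model_args pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct \
    --tasks mmmu_val \
    --batch_size 1 \
    --log_samples \
    --output_path ./logs/
```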

## Quick Start with HuggingFace

[…]

Thanks so much to all of our amazing contributors!

<a href="https://github.com/RobitYadda">
  <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="100;" alt="RobitYadda"/>
  <br />
  <sub><b>zizhenyan</b></sub>
</a>
</td>
</tr>