
Commit c38af3a ("updated"), 1 parent: e2792e5

File tree: 3 files changed, +46 −59 lines

README.md (46 additions, 59 deletions)

---

## Contents
<!-- TOC (no expansion for Quick Start Guide / Fully Reproducing Guide) -->
- [Introduction](#introduction)
- [Models](#models)
- [Datasets](#datasets)
- [Results](#evaluation-results)
- [Quick Start with HuggingFace](#quick-start-with-huggingface)
- [Evaluation](#evaluation)
- [Quick Start for Training](#quick-start-guide)
- [Fully Reproducing Guide](#fully-reproducing-guide)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)

## Introduction
**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.

- **Superior Performance**
  A family of fully open-source large multimodal models demonstrating **superior performance** across multiple multimodal benchmarks, **outperforming Qwen2.5-VL** in most evaluation tasks.

- **High-Quality Data at Scale**
  Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
  - Concept-balanced, highly diverse, high-quality caption data
  - Comprehensive instruction fine-tuning data covering a wide range of tasks

- **Ultra-Efficient Training Framework**
  Complete end-to-end training framework designed for maximum efficiency (see the worked cost example after this list):
  - $16,000 total budget for full model training on A100 GPUs ($0.6 per GPU-hour)
  - 45% HFU (hardware FLOPs utilization) at 8K context length
  - Built on **MegatronLM** with support for **MoE**, **FP8**, and **long sequence parallelization**
  - Optimized codebase for cost-effective scaling
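
For intuition, the quoted budget can be sanity-checked with simple arithmetic. The snippet below is an illustrative back-of-the-envelope calculation, not a published training log; the 256-GPU cluster size is an assumption introduced here.

```python
# Back-of-the-envelope check of the quoted training budget.
# Assumption (not from the README): a 32-node x 8-GPU A100 cluster.
TOTAL_BUDGET_USD = 16_000
PRICE_PER_GPU_HOUR = 0.60
NUM_GPUS = 32 * 8  # 256 GPUs, assumed cluster size

gpu_hours = TOTAL_BUDGET_USD / PRICE_PER_GPU_HOUR  # ~26,667 A100 GPU-hours
wall_clock_days = gpu_hours / NUM_GPUS / 24        # ~4.3 days at full utilization

print(f"{gpu_hours:,.0f} GPU-hours ≈ {wall_clock_days:.1f} days on {NUM_GPUS} GPUs")
```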

- **Fully Open Framework** for community access and reproducibility:
  - High-quality pre-training & SFT data
  - Complete training framework & code
  - Training recipes & configurations
  - Comprehensive training logs & metrics

## Models

| Model | #Vision Param | #Language Param | #Total Param | HuggingFace Link |
|--------------------------|---------------|-----------------|--------------|------------------|
| LLaVA-OV-1.5-4B-Instruct | 0.3B | 4.4B | 4.7B | (coming soon) |
| LLaVA-OV-1.5-8B-Instruct | 0.3B | 8.2B | 8.5B | [🤗 link](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) |

## Datasets

![Dataset Visualization](asset/dataset.jpg)
<p align="center"><b>(a)</b> Vocabulary coverage proportion in the LLaVA-OneVision-1.5-Mid-Training dataset before and after concept balancing. <b>(b)</b> Distribution of data sources within the LLaVA-OneVision-1.5-Mid-Training dataset. <b>(c)</b> Distribution of data sources within the LLaVA-OneVision-1.5-Instruct dataset.</p>

| Description | Link | Status |
|---------------------------------------------------------|------|-----------|
| Mid-training data for LLaVA-OneVision-1.5 (85M samples) | [🤗 Download](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
| SFT data for LLaVA-OneVision-1.5 | [🤗 Download](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |
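
To stream samples without downloading the full corpus, a minimal sketch with the Hugging Face `datasets` library is shown below. The `train` split name and the column layout are assumptions; check the dataset cards once the uploads finish.

```python
# Minimal sketch: stream the mid-training corpus with Hugging Face `datasets`.
# Assumption (not confirmed by the README): the repo exposes a "train" split.
from datasets import load_dataset

ds = load_dataset(
    "lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M",
    split="train",
    streaming=True,  # avoid materializing ~85M samples on disk
)

for sample in ds.take(3):
    print(sample.keys())  # inspect the schema before writing a converter
```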

## Evaluation Results

All evaluations were conducted using `lmms_eval`.
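
As a rough guide to reproducing a row of the table below, the snippet shows one way to drive the `lmms_eval` harness from Python. The `--model` adapter name is an assumption and may differ in the released evaluation configs; the remaining flags follow the standard `lmms_eval` command line.

```python
# Illustrative only: launch the lmms-eval CLI via subprocess.
# "llava_onevision" is an assumed adapter name for this model family;
# consult the lmms-eval docs for the exact identifier.
import subprocess

subprocess.run(
    [
        "python", "-m", "lmms_eval",
        "--model", "llava_onevision",  # assumed adapter name
        "--model_args", "pretrained=lmms-lab/LLaVA-OneVision-1.5-8B-Instruct",
        "--tasks", "mmmu_val,mmstar,mathvista_testmini",
        "--batch_size", "1",
        "--output_path", "./logs",
    ],
    check=True,
)
```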

| Benchmark | **LLaVA-OV-1.5-8B** | **Qwen2.5-VL-7B** | **LLaVA-OV-1.5-4B** | **Qwen2.5-VL-3B** |
|:----------------------------------|:---------------:|:-------------:|:---------------:|:-------------:|
| MMMU (Validation) | **55.44** | 51.33 | **51.44** | 46.44 |
| MMMU-Pro (Standard) | **37.40** | 36.30 | **33.24** | 31.10 |
| MMMU-Pro (Vision) | 25.15 | **32.83** | **23.53** | 21.27 |
| MMBench (English; Test) | **84.14** | 83.40 | **82.29** | 77.97 |
| MMBench (Chinese; Test) | 81.00 | **81.61** | **76.73** | 74.55 |
| MME-RealWorld (English) | **62.31** | 57.33 | **57.16** | 51.60 |
| MME-RealWorld (Chinese) | **56.11** | 51.50 | 21.38 | **45.38** |
| AI2D (With Mask) | **84.16** | 82.58 | **84.62** | 78.56 |
| AI2D (Without Mask) | **94.11** | 93.36 | **92.84** | 90.74 |
| CV-Bench | **80.82** | 79.95 | **74.00** | 71.53 |
| VL-RewardBench | 45.90 | **49.65** | **45.90** | 42.06 |
| V* | **78.01** | 76.96 | 66.49 | **69.63** |
| PixmoCount | 62.19 | **63.33** | **59.17** | 50.85 |
| CountBench | **88.19** | 86.35 | **77.80** | 72.51 |
| ChartQA | **86.48** | 84.08 | **85.11** | 83.36 |
| CharXiv (Direct Questions) | **74.10** | 69.80 | **70.70** | 58.20 |
| DocVQA (Test) | **95.00** | 94.93 | **93.48** | 92.67 |
| InfoVQA (Test) | 78.42 | **81.67** | **75.27** | 75.63 |
| WeMath | **33.62** | 33.33 | **28.00** | 18.38 |
| MathVista (Mini) | **69.57** | 68.60 | **67.36** | 60.23 |
| MathVision | **25.56** | 22.37 | **22.76** | 21.25 |
| MMStar | **67.72** | 62.54 | **64.22** | 55.86 |
| SEED-Bench (Image) | 77.32 | **77.53** | **76.74** | 74.81 |
| ScienceQA | **94.98** | 88.75 | **92.05** | 83.33 |
| SEED-Bench 2-Plus | 69.21 | **70.93** | **68.42** | 68.64 |
| OCRBench | 82.90 | **84.20** | 77.80 | **79.20** |
| RealWorldQA | 68.10 | **68.50** | **64.05** | 60.00 |

![](asset/performance.png)
<p align="center">Performance comparison across vision-language models on various benchmarks, grouped by task type. All scores are reported as accuracy percentages unless otherwise specified.</p>

## Quick Start with HuggingFace
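
As a quick illustration, here is a minimal, hedged sketch of loading the released 8B checkpoint with 🤗 Transformers. The `AutoModelForCausalLM`/`AutoProcessor` pattern, the `trust_remote_code` usage, and the chat-template call are assumptions based on common LMM releases, not the repository's official snippet; prefer the example on the model card.

```python
# Minimal sketch, not the repository's official quick start.
# Assumptions: the checkpoint ships custom modeling code (trust_remote_code=True)
# and exposes an AutoProcessor with a chat template; verify on the model card.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any test image
image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in one sentence."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```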

[…]

Thanks so much to all of our amazing contributors!

<a href="https://github.com/RobitYadda">
  <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="100" alt="RobitYadda"/>
  <br />
  <sub><b>zizhenyan</b></sub>
</a>
</td>
</tr>

asset/dataset.jpg: −45.2 KB
asset/performance.png: +409 KB
