Commit 9df9d16 ("updated")
1 parent e2792e5

3 files changed: +46 −31 lines

README.md

Lines changed: 46 additions & 31 deletions
@@ -14,61 +14,76 @@
---

## Contents
<!-- TOC (no expansion for Quick Start Guide / Fully Reproducing Guide) -->
- [Introduction](#introduction)
- [Models](#models)
- [Datasets](#datasets)
- [Results](#evaluation-results)
- [Quick Start with HuggingFace](#quick-start-with-huggingface)
- [Evaluation](#evaluation)
- [Quick Start For Training](#quick-start-guide)
- [Fully Reproducing Guide](#fully-reproducing-guide)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
## Introduction

**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.

- **Superior Performance**
  A family of fully open-source large multimodal models demonstrating:
  - Superior performance across multiple multimodal benchmarks
  - Outperforming **Qwen2.5-VL** in most evaluation tasks

- **High-Quality Data at Scale**
  Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
  - Concept-balanced, highly diverse, high-quality caption data
  - Comprehensive instruction fine-tuning data covering a wide range of tasks

- **Ultra-Efficient Training Framework**
  Complete end-to-end training framework designed for maximum efficiency:
  - $16,000 total budget for full model training on A100 GPUs ($0.60 per GPU-hour)
  - 45% HFU (hardware FLOPs utilization) at 8K context length
  - Built on **Megatron-LM** with support for **MoE**, **FP8**, and **long sequence parallelization**
  - Optimized codebase for cost-effective scaling


- **Fully Open Framework** for community access and reproducibility:
  - High-quality pre-training & SFT data
  - Complete training framework & code
  - Training recipes & configurations
  - Comprehensive training logs & metrics

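As a sanity check on the quoted cost, the budget arithmetic above can be sketched as follows (the 128-GPU cluster in the comment is a hypothetical illustration, not a figure from this README):

```python
# Convert the stated $16,000 budget at $0.60 per A100 GPU-hour into GPU-hours.
def affordable_gpu_hours(budget_usd: float, rate_usd_per_gpu_hour: float) -> float:
    """GPU-hours purchasable under a fixed budget."""
    return budget_usd / rate_usd_per_gpu_hour

hours = affordable_gpu_hours(16_000, 0.60)  # ~26,667 A100 GPU-hours
# On a hypothetical 128-GPU cluster, that is roughly 8.7 days of wall-clock time.
wall_clock_days = hours / 128 / 24
```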
## Models

| Model | HuggingFace Link |
|---------------------------|------------------|
| LLaVA-OV-1.5-4B-Instruct | (coming soon) |
| LLaVA-OV-1.5-8B-Instruct | [🤗](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) |

## Datasets

![Dataset Visualization](asset/dataset.jpg)

| Description | Link | Status |
|-------------------------|------|------------|
| OV-1.5-Mid-Training-85M | [🤗](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
| OV-1.5-Instruct | [🤗](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |
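The released data can be pulled with the Hugging Face `datasets` library; a minimal sketch, assuming streaming mode to avoid downloading the full corpus up front (the hub ids are copied verbatim from the table above):

```python
# Hub ids copied verbatim from the table above (including the "Insturct" spelling).
DATASET_IDS = {
    "mid_training": "lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M",
    "instruct": "lmms-lab/LLaVA-OneVision-1.5-Insturct-Data",
}

def stream_split(name: str, split: str = "train"):
    """Stream one of the released datasets without a full download."""
    # Imported lazily so the id table above is usable without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASET_IDS[name], split=split, streaming=True)

# Example (requires network access and `pip install datasets`):
# sample = next(iter(stream_split("instruct")))
```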

## Evaluation Results

All evaluations were conducted using lmms_eval.

![Performance comparison across vision-language models on various benchmarks, grouped by task type. All scores are reported as accuracy percentages unless otherwise specified.](asset/performance.png)

<!-- | | **LLaVA-OV-1.5-8B** | **Qwen2.5 VL 7B** | **LLaVA-OV-1.5-4B** | **Qwen2.5 VL 3B** |
|:----------------------------------|:---------------:|:-------------:|:---------------:|:-------------:|
| MMMU (Validation) | **55.44** | 51.33 | **51.44** | 46.44 |
| MMMU-Pro (Standard) | **37.40** | 36.30 | **33.24** | 31.10 |
@@ -96,7 +111,7 @@ All evaluations were conducted using lmms_eval.
| ScienceQA | **94.98** | 88.75 | **92.05** | 83.33 |
| SEED-Bench 2-Plus | 69.21 | **70.93** | **68.42** | 68.64 |
| OCRBench | 82.90 | **84.20** | 77.80 | **79.20** |
| RealWorldQA | 68.10 | **68.50** | **64.05** | 60.00 | -->

## Quick Start with HuggingFace
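A minimal inference sketch, assuming the checkpoint loads through the transformers `Auto*` classes with `trust_remote_code=True` and accepts OpenAI-style multimodal chat messages; the exact classes and message format are assumptions to verify against the model card:

```python
MODEL_ID = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

def build_messages(image_path: str, question: str) -> list:
    """Single-turn multimodal chat message (structure is an assumption)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]

def generate(image_path: str, question: str, max_new_tokens: int = 128) -> str:
    """Run one round of image question answering (requires GPU and network)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(image_path, question),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Example:
# print(generate("example.jpg", "Describe this image."))
```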
@@ -360,7 +375,7 @@ Thanks so much to all of our amazing contributors!
<a href="https://github.com/RobitYadda">
  <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="100;" alt="RobitYadda"/>
  <br />
  <sub><b>zizhenyan</b></sub>
</a>
</td>
</tr>

asset/dataset.jpg: −29.8 KB

asset/performance.png: 409 KB

0 commit comments