Commit 9df9d16 ("updated")
1 parent e2792e5

3 files changed: +46 −31 lines

README.md

Lines changed: 46 additions & 31 deletions
@@ -14,61 +14,76 @@
---

## Contents
<!-- TOC (no expansion for Quick Start Guide / Fully Reproducing Guide) -->
- [Introduction](#introduction)
- [Models](#models)
- [Datasets](#datasets)
- [Results](#evaluation-results)
- [Quick Start with HuggingFace](#quick-start-with-huggingface)
- [Evaluation](#evaluation)
- [Quick Start For Training](#quick-start-guide)
- [Fully Reproducing Guide](#fully-reproducing-guide)
- [Citation](#citation)
- [Acknowledgement](#acknowledgement)
## Introduction

**LLaVA-OneVision-1.5** introduces a novel family of **fully open-source** Large Multimodal Models (LMMs) that achieves **state-of-the-art performance** at substantially **lower cost** through training on **native-resolution** images.

- **Superior Performance**
  A family of fully open-source large multimodal models demonstrating:
  - Superior performance across multiple multimodal benchmarks
  - Outperforming **Qwen2.5-VL** in most evaluation tasks

- **High-Quality Data at Scale**
  Meticulously curated **pre-training and SFT data** with rigorous filtering and quality control, achieving **superior data efficiency** with only **64B tokens**.
  - Concept-balanced, highly diverse, high-quality caption data
  - Comprehensive instruction fine-tuning data covering a wide range of tasks

- **Ultra-Efficient Training Framework**
  Complete end-to-end training framework designed for maximum efficiency:
  - $16,000 total budget for full model training on A100 GPUs ($0.60 per GPU-hour)
  - 45% HFU (hardware FLOPs utilization) at 8K context length
  - Built on **Megatron-LM** with support for **MoE**, **FP8**, and **long sequence parallelization**
  - Optimized codebase for cost-effective scaling


- **Fully Open Framework** for community access and reproducibility:
  - High-quality pre-training & SFT data
  - Complete training framework & code
  - Training recipes & configurations
  - Comprehensive training logs & metrics

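As a sanity check on the quoted cost, the budget arithmetic above can be sketched as follows (the 128-GPU cluster in the comment is a hypothetical illustration, not a figure from this README):

```python
# Convert the stated $16,000 budget at $0.60 per A100 GPU-hour into GPU-hours.
def affordable_gpu_hours(budget_usd: float, rate_usd_per_gpu_hour: float) -> float:
    """GPU-hours purchasable under a fixed budget."""
    return budget_usd / rate_usd_per_gpu_hour

hours = affordable_gpu_hours(16_000, 0.60)  # ~26,667 A100 GPU-hours
# On a hypothetical 128-GPU cluster, that is roughly 8.7 days of wall-clock time.
wall_clock_days = hours / 128 / 24
```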
## Models

| Model | HuggingFace Link |
|---------------------------|------------------|
| LLaVA-OV-1.5-4B-Instruct | (coming soon) |
| LLaVA-OV-1.5-8B-Instruct | [🤗](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct) |

## Datasets

![Dataset Visualization](asset/dataset.jpg)

| Description | Link | Status |
|-------------------------|------|------------|
| OV-1.5-Mid-Training-85M | [🤗](https://huggingface.co/datasets/lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M) | Uploading… |
| OV-1.5-Instruct | [🤗](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-1.5-Insturct-Data) | Uploading… |
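The released data can be pulled with the Hugging Face `datasets` library; a minimal sketch, assuming streaming mode to avoid downloading the full corpus up front (the hub ids are copied verbatim from the table above):

```python
# Hub ids copied verbatim from the table above (including the "Insturct" spelling).
DATASET_IDS = {
    "mid_training": "lmms-lab/LLaVA-One-Vision-1.5-Mid-Training-85M",
    "instruct": "lmms-lab/LLaVA-OneVision-1.5-Insturct-Data",
}

def stream_split(name: str, split: str = "train"):
    """Stream one of the released datasets without a full download."""
    # Imported lazily so the id table above is usable without `datasets` installed.
    from datasets import load_dataset
    return load_dataset(DATASET_IDS[name], split=split, streaming=True)

# Example (requires network access and `pip install datasets`):
# sample = next(iter(stream_split("instruct")))
```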

## Evaluation Results

All evaluations were conducted using lmms_eval.

![Performance comparison across vision-language models on various benchmarks, grouped by task type. All scores are reported as accuracy percentages unless otherwise specified.](asset/performance.png)

<!-- | | **LLaVA-OV-1.5-8B** | **Qwen2.5 VL 7B** | **LLaVA-OV-1.5-4B** | **Qwen2.5 VL 3B** |
|:----------------------------------|:---------------:|:-------------:|:---------------:|:-------------:|
| MMMU (Validation) | **55.44** | 51.33 | **51.44** | 46.44 |
| MMMU-Pro (Standard) | **37.40** | 36.30 | **33.24** | 31.10 |
@@ -96,7 +111,7 @@ All evaluations were conducted using lmms_eval.
| ScienceQA | **94.98** | 88.75 | **92.05** | 83.33 |
| SEED-Bench 2-Plus | 69.21 | **70.93** | **68.42** | 68.64 |
| OCRBench | 82.90 | **84.20** | 77.80 | **79.20** |
| RealWorldQA | 68.10 | **68.50** | **64.05** | 60.00 | -->

## Quick Start with HuggingFace
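A minimal inference sketch, assuming the checkpoint loads through the transformers `Auto*` classes with `trust_remote_code=True` and accepts OpenAI-style multimodal chat messages; the exact classes and message format are assumptions to verify against the model card:

```python
MODEL_ID = "lmms-lab/LLaVA-OneVision-1.5-8B-Instruct"

def build_messages(image_path: str, question: str) -> list:
    """Single-turn multimodal chat message (structure is an assumption)."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]

def generate(image_path: str, question: str, max_new_tokens: int = 128) -> str:
    """Run one round of image question answering (requires GPU and network)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, device_map="auto"
    )
    inputs = processor.apply_chat_template(
        build_messages(image_path, question),
        add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return processor.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Example:
# print(generate("example.jpg", "Describe this image."))
```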
@@ -360,7 +375,7 @@ Thanks so much to all of our amazing contributors!
<a href="https://github.com/RobitYadda">
  <img src="https://avatars.githubusercontent.com/u/6811311?v=4" width="100;" alt="RobitYadda"/>
  <br />
  <sub><b>zizhenyan</b></sub>
</a>
</td>
</tr>

asset/dataset.jpg: −29.8 KB

asset/performance.png: 409 KB

0 commit comments