[![Arxiv](https://img.shields.io/badge/arXiv-2407.17331-red)](https://arxiv.org/abs/2407.17331) [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-label-cluster-discrimination-for-visual/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=multi-label-cluster-discrimination-for-visual)

### Performance

#### A. MLLMs Evaluation Results

To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and the LLaVA-NeXT-Data for structured fine-tuning. The evaluation results confirm that the MLCD model performs exceptionally well across multiple benchmarks, underscoring its effectiveness in MLLMs.

| Vision Tower                                                                                      | RoPE2D | ChartQA   | DocVQA    | InfoVQA   | OCRBench   | MMMU      |
| :------------------------------------------------------------------------------------------------ | :----: | :-------- | :-------- | :-------- | :--------- | :-------- |
| CLIP (ViT-L-14-336px)                                                                              | ×      | 66.52     | 75.21     | 38.88     | 525.00     | 44.20     |
| SigLIP (ViT-SO400M-384px)                                                                          | ×      | 69.28     | 76.71     | 41.38     | 554.00     | 46.78     |
| DFN5B (ViT-H-14-378px)                                                                             | ×      | 64.36     | 70.87     | 38.59     | 473.00     | **48.00** |
| **[HF: MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)**    | ×      | 67.84     | 76.46     | 43.48     | 531.00     | 44.30     |
| **[HF: MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)**  | √      | 71.07     | 79.63     | 44.38     | 572.00     | 46.78     |
| **[HF: MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)**  | √      | **73.80** | **83.34** | **46.59** | **582.00** | 46.00     |

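The RoPE2D column indicates models trained with 2D rotary position embeddings, which rotate query/key features according to each patch's (row, column) grid position rather than adding learned position vectors. As a minimal sketch of the general idea (not this repository's actual implementation), one common formulation applies 1D RoPE with the row index to one half of the channels and with the column index to the other half:

```python
import numpy as np

def rope_1d(x, pos, theta=10000.0):
    """Apply 1D rotary embedding to features x (..., d) at integer positions pos."""
    d = x.shape[-1]
    assert d % 2 == 0
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    angles = pos[..., None] * freqs                   # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]               # paired channels
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin              # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by row index, the other half by column index."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], rows),
                           rope_1d(x[..., half:], cols)], axis=-1)

# Patch grid of a 336px / patch-14 model: 24 x 24 = 576 patches.
grid = 24
rows, cols = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
rows, cols = rows.ravel(), cols.ravel()
q = np.random.default_rng(0).standard_normal((grid * grid, 64))
q_rot = rope_2d(q, rows, cols)

# Rotations preserve per-token norms, so attention magnitudes are unchanged.
assert np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1))
```

Because positions enter only as rotation angles, the same weights can in principle be applied at resolutions other than the training grid, which is relevant for the 336px and 448px bigG variants above.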
#### B. Linear Probe Evaluation Results

The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top, assessing how well the model's representations generalize to different tasks. The tables below report linear probe results for MLCD, including a comparison against CLIP on the ViT_L_14_336px architecture across various datasets.

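The procedure can be sketched as follows: extract features once with the frozen encoder, then fit only a linear head on those features. In the toy example below, a fixed random projection stands in for the frozen vision encoder and a closed-form ridge classifier stands in for the linear head (both are illustrative placeholders, not the actual MLCD evaluation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained encoder: a fixed, never-updated projection.
W_frozen = rng.standard_normal((2, 16))
def encode(images):                      # images: (n, 2) toy "pixels"
    return np.tanh(images @ W_frozen)    # (n, 16) frozen features

# Toy two-class dataset.
n = 200
X = np.vstack([rng.normal(-1.0, 0.3, (n, 2)), rng.normal(1.0, 0.3, (n, 2))])
y = np.array([0] * n + [1] * n)

# Extract features with the frozen encoder (no gradients flow into W_frozen).
F = encode(X)

# Linear head via ridge regression on one-hot targets (closed form).
Y = np.eye(2)[y]
lam = 1e-3
head = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)

pred = (F @ head).argmax(axis=1)
accuracy = (pred == y).mean()
```

Since the encoder never changes, the probe accuracy measures how linearly separable the classes already are in the encoder's feature space, which is what the numbers below compare.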
The results of the ImageNet linear probe are as follows:

| Model Name             | ImageNet Linear Probe | Hugging Face                                                                                |
| :--------------------- | :-------------------: | :------------------------------------------------------------------------------------------ |
| MLCD-ViT-B-32-224px    | 79.1                  | [HF: MLCD-ViT-B-32-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224)    |
| MLCD-ViT-L-14-336px    | 86.3                  | [HF: MLCD-ViT-L-14-336px](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)   |
| MLCD-ViT-bigG-14-224px | 87.1                  | [HF: MLCD-ViT-bigG-14-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-224) |

| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :--------------------------- | :-------------------- | :-------------------- |
| Food101                      | **96.21**             | 95.90                 |
| Caltech-101                  | **97.92**             | 96.00                 |
| Flowers102                   | **99.58**             | 99.20                 |
| ImageNet                     | **86.10**             | 85.40                 |


### Convert PyTorch Checkpoints to Hugging Face Format

```shell
python convert_vit_bigG_14_rope2d_to_hf.py \
    --pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
    --checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
    --image_size 336
```
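Conversion scripts of this kind typically remap each parameter name from the training code's naming scheme to the Hugging Face module layout before saving with the target config. The key-remapping step can be sketched as below; the rename rules and keys are made-up examples in the spirit of such scripts, not this script's real mapping table:

```python
import re

# Hypothetical old-name -> new-name rules (illustrative only).
RENAME_RULES = [
    (r"^visual\.conv1\.(.*)", r"vision_model.embeddings.patch_embedding.\1"),
    (r"^visual\.transformer\.resblocks\.(\d+)\.(.*)", r"vision_model.encoder.layers.\1.\2"),
    (r"^visual\.ln_post\.(.*)", r"vision_model.post_layernorm.\1"),
]

def convert_key(old_key):
    """Return the first matching rule's rewrite, or the key unchanged."""
    for pattern, repl in RENAME_RULES:
        if re.match(pattern, old_key):
            return re.sub(pattern, repl, old_key)
    return old_key

def convert_state_dict(state_dict):
    return {convert_key(k): v for k, v in state_dict.items()}

# Toy state dict standing in for the loaded .pt checkpoint.
old = {
    "visual.conv1.weight": "...",
    "visual.transformer.resblocks.0.attn.in_proj_weight": "...",
    "visual.ln_post.bias": "...",
}
new = convert_state_dict(old)
```

After remapping, such scripts usually instantiate the Hugging Face model from a config (here presumably including the 336px image size), load the renamed weights, and call `save_pretrained` on the dump folder.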