Commit 103a108

Update README.md
1 parent 6fd3814 commit 103a108

File tree

1 file changed: +31 −43 lines


mlcd/README.md

Lines changed: 31 additions & 43 deletions
```diff
@@ -1,61 +1,38 @@
-### Performance
-
-The results of the ImageNet linear probe are as follows:
-
-| Model Name             | ImageNet Linear Probe | Hugging Face |
-| :--------------------- | :-------------------: | :----------- |
-| MLCD-ViT-B-32-224px    | 79.1 | [HF:MLCD-ViT-B-32-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |
-| MLCD-ViT-L-14-336px    | 86.3 | [HF:MLCD-ViT-L-14-336px](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
-| MLCD-ViT-bigG-14-224px | 87.1 | [HF:MLCD-ViT-bigG-14-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-224) |
-
-### convert pytorch2huggingface
-
-```python3
-python convert_vit_bigG_14_rope2d_to_hf.py \
-  --pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
-  --checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
-  --image_size 336
-```
-
 [![Arxiv](https://img.shields.io/badge/arXiv-2407.17331-red)](https://arxiv.org/abs/2407.17331) [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-label-cluster-discrimination-for-visual/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=multi-label-cluster-discrimination-for-visual)
 
-### Evaluation
+### Performance
 
 #### A. MLLMs Evaluation Results
 To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model. We paired this with the Qwen2.5-7B language model. For reproducibility, we utilized the LLaVA-Pretrain dataset for pre-training and the LLaVA-NeXT-Data for structured fine-tuning. The evaluation results confirm that the MLCD model performs exceptionally well across multiple benchmarks, underscoring its effectiveness in MLLMs.
 
-| Vision Tower    | [MLCD (ViT_L_14_336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) | CLIP (ViT_L_14_336px) |
-| :-------------- | :--------------------- | :-------------------- |
-| LLM             | Qwen2.5-7B | Qwen2.5-7B |
-| AI2D            | **76.98** | 73.15 |
-| GQA             | **64.17** | 63.31 |
-| ScienceQA-Img   | **78.09** | 76.35 |
-| InfoVQA-Val     | **43.48** | 38.88 |
-| MMBenchCN-Dev   | **74.83** | 72.51 |
-| MMBenchEN-Dev   | **76.37** | 74.57 |
-| SeedBench       | **68.20** | 66.80 |
-| SeedBench-Img   | **73.75** | 72.72 |
-| MMStar          | **50.98** | 48.98 |
-| MMMU            | **44.30** | 44.20 |
-| POPE            | 88.69 | **88.83** |
-| ChartQA         | **67.84** | 66.52 |
-| DocVQA-Val      | **76.46** | 75.21 |
-| TextVQA-Val     | 61.69 | **62.47** |
-| OCRBench        | **531** | 525 |
-| MME(cognition)  | **432** | 384 |
-| MME(perception) | **1598** | 1512 |
-
+| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
+| :----------- | :----: | :------ | :----- | :------ | :------- | :--- |
+| CLIP (ViT-L-14-336px)     | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
+| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
+| DFN5B (ViT-H-14-378px)    | × | 64.36 | 70.87 | 38.59 | 473.00 | **48.00** |
+| **[HF:MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)** | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
+| **[HF:MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)** | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
+| **[HF:MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)** | ✓ | **73.80** | **83.34** | **46.59** | **582.00** | 46.00 |
 
 
 #### B. Linear Probe Evaluation Results
 This table presents the results of linear probe evaluations comparing CLIP and MLCD models on the ViT_L_14_336px architecture across various datasets. The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top to assess how well the model's representations generalize to different tasks.
 
+The results of the ImageNet linear probe are as follows:
+
+| Model Name             | ImageNet Linear Probe | Hugging Face |
+| :--------------------- | :-------------------: | :----------- |
+| MLCD-ViT-B-32-224px    | 79.1 | [HF:MLCD-ViT-B-32-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224) |
+| MLCD-ViT-L-14-336px    | 86.3 | [HF:MLCD-ViT-L-14-336px](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336) |
+| MLCD-ViT-bigG-14-224px | 87.1 | [HF:MLCD-ViT-bigG-14-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-224) |
+
 | Dataset | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
 | :------ | :-------------------- | :-------------------- |
 | Food101 | **96.21** | 95.90 |
@@ -70,3 +47,14 @@ This table presents the results of linear probe evaluations comparing CLIP and M
 | Caltech-101 | **97.92** | 96.00 |
 | Flowers102  | **99.58** | 99.20 |
 | ImageNet    | **86.10** | 85.40 |
+
+### convert pytorch2huggingface
+
+```python3
+python convert_vit_bigG_14_rope2d_to_hf.py \
+  --pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
+  --checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
+  --image_size 336
+```
```
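The linear probe protocol mentioned in the README (freeze the pre-trained encoder, train only a linear classifier on its features) can be sketched as follows. This is a minimal illustration with synthetic features standing in for the frozen backbone's outputs, not the MLCD evaluation code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen-backbone features: in a real probe these would come
# from the pre-trained vision encoder (e.g. an MLCD ViT) with weights frozen.
n, d = 600, 64
labels = rng.integers(0, 3, size=n)
centers = rng.normal(0.0, 5.0, size=(3, d))            # well-separated class centers
feats = centers[labels] + rng.normal(0.0, 1.0, size=(n, d))

# The probe itself: only this linear classifier is trained;
# the backbone that produced `feats` is never updated.
train, test = slice(0, 500), slice(500, None)
probe = LogisticRegression(max_iter=1000).fit(feats[train], labels[train])
acc = probe.score(feats[test], labels[test])
print(f"linear probe accuracy: {acc:.3f}")
```

Because the classifier is linear, the score directly measures how linearly separable the frozen representations are, which is what the ImageNet numbers in the tables report.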

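The RoPE2D column and the `convert_vit_bigG_14_rope2d_to_hf.py` script refer to 2D rotary position embeddings over the patch grid. The sketch below is my own illustration of the general idea (rotate feature pairs by angles derived from the patch's row and column), with an assumed parameterization; it is not the repository's implementation:

```python
import numpy as np

def rope_2d(x, row, col, base=10000.0):
    """Apply a 2D rotary position embedding to one feature vector.

    x: (d,) with d divisible by 4. Half of the rotation pairs encode the
    row coordinate, the other half the column coordinate (an assumption
    for illustration; real implementations vary in how they split dims).
    """
    d = x.shape[0]
    half = d // 2
    # per-axis frequencies, as in standard 1D RoPE
    freqs = base ** (-np.arange(0, half, 2) / half)     # (d//4,)
    angles = np.concatenate([row * freqs, col * freqs])  # one angle per pair
    x1, x2 = x[0::2], x[1::2]                            # interleaved pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

q = np.arange(8, dtype=float)
print(rope_2d(q, row=3, col=5))
```

Applied to both queries and keys per attention head, such a rotation makes their dot products depend only on relative patch offsets, and it preserves vector norms since each pair undergoes a pure rotation.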
0 commit comments