[![Arxiv](https://img.shields.io/badge/arXiv-2407.17331-red)](https://arxiv.org/abs/2407.17331) [![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Model-yellow)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)
[![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/multi-label-cluster-discrimination-for-visual/self-supervised-image-classification-on)](https://paperswithcode.com/sota/self-supervised-image-classification-on?p=multi-label-cluster-discrimination-for-visual)

### Performance

#### A. MLLMs Evaluation Results

To evaluate MLCD’s performance within multimodal large language models (MLLMs), we replaced the CLIP model in LLaVA-NeXT with the MLCD model and paired it with the Qwen2.5-7B language model. For reproducibility, we used the LLaVA-Pretrain dataset for pre-training and the LLaVA-NeXT-Data for structured fine-tuning. The evaluation results confirm that the MLCD model performs exceptionally well across multiple benchmarks, underscoring its effectiveness in MLLMs.

| Vision Tower                                                                                      | RoPE2D | ChartQA   | DocVQA    | InfoVQA   | OCRBench   | MMMU      |
| :------------------------------------------------------------------------------------------------ | :----: | :-------- | :-------- | :-------- | :--------- | :-------- |
| CLIP (ViT-L-14-336px)                                                                              | ×      | 66.52     | 75.21     | 38.88     | 525.00     | 44.20     |
| SigLIP (ViT-SO400M-384px)                                                                          | ×      | 69.28     | 76.71     | 41.38     | 554.00     | 46.78     |
| DFN5B (ViT-H-14-378px)                                                                             | ×      | 64.36     | 70.87     | 38.59     | 473.00     | **48.00** |
| **[HF: MLCD (ViT-L-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)**    | ×      | 67.84     | 76.46     | 43.48     | 531.00     | 44.30     |
| **[HF: MLCD (ViT-bigG-14-336px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-336)**  | √      | 71.07     | 79.63     | 44.38     | 572.00     | 46.78     |
| **[HF: MLCD (ViT-bigG-14-448px)](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-448)**  | √      | **73.80** | **83.34** | **46.59** | **582.00** | 46.00     |

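The RoPE2D column indicates models trained with 2D rotary position embeddings, which rotate query/key features according to each patch's (row, column) grid position rather than adding learned position vectors. As a minimal sketch of the general idea (not this repository's actual implementation), one common formulation applies 1D RoPE with the row index to one half of the channels and with the column index to the other half:

```python
import numpy as np

def rope_1d(x, pos, theta=10000.0):
    """Apply 1D rotary embedding to features x (..., d) at integer positions pos."""
    d = x.shape[-1]
    assert d % 2 == 0
    freqs = 1.0 / theta ** (np.arange(0, d, 2) / d)   # (d/2,) rotation frequencies
    angles = pos[..., None] * freqs                   # (..., d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]               # paired channels
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin              # rotate each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by row index, the other half by column index."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[..., :half], rows),
                           rope_1d(x[..., half:], cols)], axis=-1)

# Patch grid of a 336px / patch-14 model: 24 x 24 = 576 patches.
grid = 24
rows, cols = np.meshgrid(np.arange(grid), np.arange(grid), indexing="ij")
rows, cols = rows.ravel(), cols.ravel()
q = np.random.default_rng(0).standard_normal((grid * grid, 64))
q_rot = rope_2d(q, rows, cols)

# Rotations preserve per-token norms, so attention magnitudes are unchanged.
assert np.allclose(np.linalg.norm(q_rot, axis=-1), np.linalg.norm(q, axis=-1))
```

Because positions enter only as rotation angles, the same weights can in principle be applied at resolutions other than the training grid, which is relevant for the 336px and 448px bigG variants above.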
#### B. Linear Probe Evaluation Results

The linear probe test freezes the pre-trained model's weights and trains a linear classifier on top, assessing how well the model's representations generalize to different tasks. The tables below report linear probe results for MLCD, including a comparison against CLIP on the ViT_L_14_336px architecture across various datasets.

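The procedure can be sketched as follows: extract features once with the frozen encoder, then fit only a linear head on those features. In the toy example below, a fixed random projection stands in for the frozen vision encoder and a closed-form ridge classifier stands in for the linear head (both are illustrative placeholders, not the actual MLCD evaluation pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen pre-trained encoder: a fixed, never-updated projection.
W_frozen = rng.standard_normal((2, 16))
def encode(images):                      # images: (n, 2) toy "pixels"
    return np.tanh(images @ W_frozen)    # (n, 16) frozen features

# Toy two-class dataset.
n = 200
X = np.vstack([rng.normal(-1.0, 0.3, (n, 2)), rng.normal(1.0, 0.3, (n, 2))])
y = np.array([0] * n + [1] * n)

# Extract features with the frozen encoder (no gradients flow into W_frozen).
F = encode(X)

# Linear head via ridge regression on one-hot targets (closed form).
Y = np.eye(2)[y]
lam = 1e-3
head = np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)

pred = (F @ head).argmax(axis=1)
accuracy = (pred == y).mean()
```

Since the encoder never changes, the probe accuracy measures how linearly separable the classes already are in the encoder's feature space, which is what the numbers below compare.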
The results of the ImageNet linear probe are as follows:

| Model Name             | ImageNet Linear Probe | Hugging Face                                                                                |
| :--------------------- | :-------------------: | :------------------------------------------------------------------------------------------ |
| MLCD-ViT-B-32-224px    | 79.1                  | [HF: MLCD-ViT-B-32-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-base-patch32-224)    |
| MLCD-ViT-L-14-336px    | 86.3                  | [HF: MLCD-ViT-L-14-336px](https://huggingface.co/DeepGlint-AI/mlcd-vit-large-patch14-336)   |
| MLCD-ViT-bigG-14-224px | 87.1                  | [HF: MLCD-ViT-bigG-14-224px](https://huggingface.co/DeepGlint-AI/mlcd-vit-bigG-patch14-224) |

| Dataset                      | MLCD (ViT_L_14_336px) | CLIP (ViT_L_14_336px) |
| :--------------------------- | :-------------------- | :-------------------- |
| Food101                      | **96.21**             | 95.90                 |
| Caltech-101                  | **97.92**             | 96.00                 |
| Flowers102                   | **99.58**             | 99.20                 |
| ImageNet                     | **86.10**             | 85.40                 |


### Convert PyTorch Checkpoints to Hugging Face Format

```shell
python convert_vit_bigG_14_rope2d_to_hf.py \
    --pytorch_dump_folder_path mlcd-vit-bigG-patch14-336 \
    --checkpoint_path MLCD_ViT_bigG_14_336px_pytorch.pt \
    --image_size 336
```
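Conversion scripts of this kind typically remap each parameter name from the training code's naming scheme to the Hugging Face module layout before saving with the target config. The key-remapping step can be sketched as below; the rename rules and keys are made-up examples in the spirit of such scripts, not this script's real mapping table:

```python
import re

# Hypothetical old-name -> new-name rules (illustrative only).
RENAME_RULES = [
    (r"^visual\.conv1\.(.*)", r"vision_model.embeddings.patch_embedding.\1"),
    (r"^visual\.transformer\.resblocks\.(\d+)\.(.*)", r"vision_model.encoder.layers.\1.\2"),
    (r"^visual\.ln_post\.(.*)", r"vision_model.post_layernorm.\1"),
]

def convert_key(old_key):
    """Return the first matching rule's rewrite, or the key unchanged."""
    for pattern, repl in RENAME_RULES:
        if re.match(pattern, old_key):
            return re.sub(pattern, repl, old_key)
    return old_key

def convert_state_dict(state_dict):
    return {convert_key(k): v for k, v in state_dict.items()}

# Toy state dict standing in for the loaded .pt checkpoint.
old = {
    "visual.conv1.weight": "...",
    "visual.transformer.resblocks.0.attn.in_proj_weight": "...",
    "visual.ln_post.bias": "...",
}
new = convert_state_dict(old)
```

After remapping, such scripts usually instantiate the Hugging Face model from a config (here presumably including the 336px image size), load the renamed weights, and call `save_pretrained` on the dump folder.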