<h1 class="title is-1">OpenVision</h1>
<h2 class="subtitle is-4">Fully-Open, Cost-Effective Vision Encoders for Multimodal Learning</h2>
<p style="font-size: 1.15em; font-weight: 500; margin-top: 1.2em;">
  <a href="https://xhl-video.github.io/xianhangli/" target="_blank" style="color: white; text-decoration: underline;">Xianhang Li</a><sup>*</sup> ·
  <a href="https://yanqing0327.github.io/Yanqing.github.io/" target="_blank" style="color: white; text-decoration: underline;">Yanqing Liu</a><sup>*</sup> ·
  <a href="https://www.haqtu.me/" target="_blank" style="color: white; text-decoration: underline;">Haoqin Tu</a> ·
  <a href="https://scholar.google.com/citations?user=G8NZJLIAAAAJ&amp;hl=en" target="_blank" style="color: white; text-decoration: underline;">Hongru Zhu</a> ·
  <a href="https://cihangxie.github.io/" target="_blank" style="color: white; text-decoration: underline;">Cihang Xie</a>
</p>
<p style="font-size: 1.05em; color: rgba(255, 255, 255, 0.9);">
  University of California, Santa Cruz

<!-- Model Link -->
<span class="link-block">
  <a href="https://huggingface.co/collections/UCSC-VLAA/openvision-681a4c27ee1f66411b4ae919"
     class="external-link button is-normal is-rounded is-dark">
    <span class="icon">
      <img src="./resources/gr.svg" alt="HF" style="width: 1.2em; height: 1.2em;" />
    <h2 class="title is-3">Key Contributions</h2>
    <div class="content">
      <ul>
        <li><strong>Fully Open Vision Encoders:</strong> Datasets, training recipes, and model checkpoints are entirely public, fostering reproducibility and transparency in multimodal research.</li>
        <li><strong>Wide Range of Model Scales:</strong> A comprehensive family of vision encoders from Tiny (5.9M) to Huge (632.1M) parameters, supporting deployment from edge devices to high-capacity servers.</li>
        <li><strong>Superior Multimodal Performance:</strong> Matches or surpasses proprietary vision encoders (e.g., OpenAI-CLIP, SigLIP) across popular multimodal benchmarks (e.g., LLaVA-1.5, Open-LLaVA-Next).</li>
        <li><strong>Efficient Progressive Resolution Training:</strong> Achieves significant efficiency improvements (2×–3× faster) over proprietary counterparts through a progressive, multi-stage resolution training strategy.</li>
        <li><strong>Flexible Patch-Size Configuration:</strong> Supports adaptive encoding (8×8 or 16×16 patches), allowing either detailed visual understanding or efficient processing based on practical needs.</li>
      </ul>
    </div>
  </div>
</section>
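The patch-size choice above trades detail for compute: a ViT's visual-token count grows quadratically as patches shrink, since a square image of side R with patch size P yields (R/P)² tokens. A quick illustrative sketch of that arithmetic (not code from the OpenVision repository):

```python
def num_patch_tokens(resolution: int, patch_size: int) -> int:
    """Visual tokens a ViT emits for a square image (excluding any CLS token)."""
    assert resolution % patch_size == 0, "resolution must be divisible by patch size"
    side = resolution // patch_size
    return side * side

# At 224x224, halving the patch size from 16 to 8 quadruples the token count,
# buying finer detail at roughly 4x the encoding cost:
print(num_patch_tokens(224, 16))  # 196
print(num_patch_tokens(224, 8))   # 784
```

This is why the 8×8-patch configurations suit detail-heavy tasks, while 16×16 keeps encoding cheap.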

<!-- Detailed Comparisons Section -->
<section class="section has-background-light">
  <div class="container has-text-centered">
    <h2 class="title is-3">Detailed Comparisons and Efficiency</h2>

    <div class="content" style="margin-bottom: 2em;">
      <h3 class="title is-4">OpenVision vs. Proprietary Encoders</h3>
      <img src="resources/openvision_teaser_v1.3.png" alt="OpenVision vs Proprietary Encoders"
           style="max-width: 60%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
      <p style="margin-top: 0.8em;">
        OpenVision encoders match or outperform proprietary models like OpenAI's CLIP and Google's SigLIP across multimodal tasks.
      </p>
    </div>

    <div class="content" style="margin-bottom: 2em;">
      <h3 class="title is-4">Performance under LLaVA-1.5 Framework</h3>
      <img src="resources/performance_normalized.png" alt="LLaVA-1.5 Performance Comparison"
           style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
      <p style="margin-top: 0.8em;">
        OpenVision demonstrates strong performance improvements over existing CLIP models under the LLaVA-1.5 multimodal framework.
      </p>
    </div>

    <div class="content" style="margin-bottom: 2em;">
      <h3 class="title is-4">Performance under Open-LLaVA-Next Framework</h3>
      <img src="resources/performance_normalized_2.png" alt="Open-LLaVA-Next Performance Comparison"
           style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
      <p style="margin-top: 0.8em;">
        Under Open-LLaVA-Next, OpenVision maintains its competitive edge, excelling particularly in document-heavy multimodal tasks.
      </p>
    </div>

    <!-- Efficiency Comparison -->
    <div class="content">
      <h3 class="title is-4">Efficiency Comparison</h3>
      <img src="resources/efficiency_llava1.5_vs_next_resolution_size.png" alt="Efficiency Comparison"
           style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
      <p style="margin-top: 0.8em;">
        OpenVision achieves superior multimodal performance with significantly reduced training time compared to proprietary alternatives.
      </p>
    </div>
  </div>
</section>
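The progressive strategy keeps most optimization steps at low resolution, where the encoder sees far fewer tokens per image. A back-of-the-envelope sketch, assuming the 84→224→336 stage schedule used for the larger patch-14 models and counting only per-image token load (real speedups also depend on attention's quadratic cost and on batch sizes):

```python
def tokens(resolution: int, patch: int) -> int:
    # (R // P)^2 patch tokens per square image
    return (resolution // patch) ** 2

stages = [84, 224, 336]              # progressive resolutions, patch size 14
counts = [tokens(r, 14) for r in stages]
print(counts)                        # [36, 256, 576]

# The 84-px warm-up stage processes ~6% of the tokens of full-resolution training:
print(counts[0] / counts[-1])        # 0.0625
```

Spending most steps at the cheap early stages is consistent with the reported 2×–3× training speedup over proprietary counterparts.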

<!-- Model Zoo (ImageNet-1K) -->
<section class="section">
  <div class="container">
    <h2 class="title is-3 has-text-centered">Model Zoo (ImageNet-1K)</h2>
    <p class="has-text-centered" style="max-width: 700px; margin: 0 auto 1.5em auto; font-size: 1.05em;">
      We report ImageNet-1K Top-1 accuracy across OpenVision variants. All models are available in both JAX and PyTorch formats.
    </p>
    <div class="table-container" style="overflow-x: auto; box-shadow: 0 2px 8px rgba(0,0,0,0.08); border-radius: 10px;">
      <table class="table is-bordered is-fullwidth has-text-centered" style="min-width: 900px;">
        <thead class="has-background-light">
          <tr>
            <th>Model</th><th>Size</th><th>Patch</th><th>Resolution</th><th>Top-1</th><th>Link</th><th>JAX</th><th>PyTorch</th>
          </tr>
        </thead>
        <tbody>
          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>160</td><td>46.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>224</td><td>49.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>384</td><td>51.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>160</td><td>51.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>224</td><td>53.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>384</td><td>53.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>

          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>160</td><td>63.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>224</td><td>65.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>384</td><td>67.1%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>160</td><td>67.3%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>224</td><td>68.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>384</td><td>68.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>

          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>160</td><td>72.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>224</td><td>73.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>384</td><td>74.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>160</td><td>74.8%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-160" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>224</td><td>75.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>384</td><td>75.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>

          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>84</td><td>74.7%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-84" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>224</td><td>78.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>336</td><td>78.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-336" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>84</td><td>76.2%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-84" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>224</td><td>79.7%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>384</td><td>79.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-384" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Huge</td><td>632M</td><td>14</td><td>84</td><td>77.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-huge-patch14-84" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
          <tr><td>ViT-Huge</td><td>632M</td><td>14</td><td>224</td><td>80.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-huge-patch14-224" target="_blank">HF</a></td><td>✓</td><td>✓</td></tr>
        </tbody>
      </table>
    </div>
  </div>
</section>

<!-- Footer -->
<footer class="footer">
  <div class="content has-text-centered">
    <p>
      Built by the UCSC-VLAA team.
      <a href="https://xhl-video.github.io/xianhangli/" target="_blank">Xianhang Li</a>,
      <a href="https://yanqing0327.github.io/Yanqing.github.io/" target="_blank">Yanqing Liu</a>,
      <a href="https://www.haqtu.me/" target="_blank">Haoqin Tu</a>,
      <a href="https://scholar.google.com/citations?user=G8NZJLIAAAAJ&amp;hl=en" target="_blank">Hongru Zhu</a>,
      <a href="https://cihangxie.github.io/" target="_blank">Cihang Xie</a>.
    </p>
  </div>
</footer>

</body>
</html>