
Commit 1f57b1f
update project page
1 parent b315ab7, commit 1f57b1f

12 files changed: +72 -45 lines

docs/.!43731!.DS_Store
-1.02 KB (binary file not shown)

docs/index.html

Lines changed: 72 additions & 45 deletions
@@ -18,7 +18,11 @@
 <h1 class="title is-1">OpenVision</h1>
 <h2 class="subtitle is-4">Fully-Open, Cost-Effective Vision Encoders for Multimodal Learning</h2>
 <p style="font-size: 1.15em; font-weight: 500; margin-top: 1.2em;">
-  Xianhang Li · Yanqing Liu · Haoqin Tu · Hongru Zhu · Cihang Xie
+  <a href="https://xhl-video.github.io/xianhangli/" target="_blank" style="color: white; text-decoration: underline;">Xianhang Li</a><sup>*</sup> ·
+  <a href="https://yanqing0327.github.io/Yanqing.github.io/" target="_blank" style="color: white; text-decoration: underline;">Yanqing Liu</a><sup>*</sup> ·
+  <a href="https://www.haqtu.me/" target="_blank" style="color: white; text-decoration: underline;">Haoqin Tu</a> ·
+  <a href="https://scholar.google.com/citations?user=G8NZJLIAAAAJ&hl=en" target="_blank" style="color: white; text-decoration: underline;">Hongru Zhu</a> ·
+  <a href="https://cihangxie.github.io/" target="_blank" style="color: white; text-decoration: underline;">Cihang Xie</a>
 </p>
 <p style="font-size: 1.05em; color: rgba(255, 255, 255, 0.9);">
   University of California, Santa Cruz
@@ -50,7 +54,7 @@ <h2 class="subtitle is-4">Fully-Open, Cost-Effective Vision Encoders for Multimo
 
 <!-- Model Link -->
 <span class="link-block">
-  <a href="https://huggingface.co/datasets/UCSC-VLAA/OpenVision"
+  <a href="https://huggingface.co/collections/UCSC-VLAA/openvision-681a4c27ee1f66411b4ae919"
     class="external-link button is-normal is-rounded is-dark">
     <span class="icon">
       <img src="./resources/gr.svg" alt="HF" style="width: 1.2em; height: 1.2em;" />
@@ -79,16 +83,17 @@ <h2 class="title is-3 has-text-centered">Abstract</h2>
 <h2 class="title is-3">Key Contributions</h2>
 <div class="content">
   <ul>
-    <li>Completely open training recipe: datasets, codebase, and checkpoints are public.</li>
-    <li>Family of ViT encoders spanning Tiny (5.9M) to Huge (632.1M) parameters.</li>
-    <li>Superior multimodal performance under LLaVA-1.5 and Open-LLaVA-Next benchmarks.</li>
-    <li>Efficient progressive resolution training: 2×–3× faster than proprietary counterparts.</li>
-    <li>Supports flexible patch sizes (8×8, 16×16) for detailed or efficient encoding.</li>
+    <li><strong>Fully Open Vision Encoders:</strong> Datasets, training recipes, and model checkpoints are entirely public, fostering reproducibility and transparency in multimodal research.</li>
+    <li><strong>Wide Range of Model Scales:</strong> A comprehensive family of vision encoders from Tiny (5.9M) to Huge (632.1M) parameters, supporting deployment from edge devices to high-capacity servers.</li>
+    <li><strong>Superior Multimodal Performance:</strong> Matches or surpasses proprietary vision encoders (e.g., OpenAI-CLIP, SigLIP) across popular multimodal benchmarks (e.g., LLaVA-1.5, Open-LLaVA-Next).</li>
+    <li><strong>Efficient Progressive Resolution Training:</strong> Demonstrates significant efficiency improvements (2×–3× faster) compared to proprietary counterparts through a progressive, multi-stage resolution training strategy.</li>
+    <li><strong>Flexible Patch-Size Configuration:</strong> Supports adaptive encoding (8×8, 16×16 patches), allowing detailed visual understanding or efficient processing based on practical needs.</li>
   </ul>
 </div>
 </div>
 </section>
 
+
 <!-- Detailed Comparisons Section -->
 <section class="section has-background-light">
 <div class="container has-text-centered">
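As an aside on the flexible patch-size bullet added above: a ViT's token count grows quadratically as the patch shrinks, which is exactly the detail-versus-efficiency trade-off the page describes. A minimal sketch of that arithmetic (plain illustration, not code from this commit):

```python
def vit_tokens(resolution: int, patch: int) -> int:
    """Number of patch tokens a ViT produces for a square image."""
    side = resolution // patch
    return side * side

# At 224x224, halving the patch size quadruples the token count.
print(vit_tokens(224, 16))  # 196
print(vit_tokens(224, 8))   # 784
```

This is why the 8×8-patch checkpoints capture finer detail but cost roughly four times as much compute per image as their 16×16 counterparts.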
@@ -98,7 +103,7 @@ <h2 class="title is-3">Detailed Comparisons and Efficiency</h2>
 <div class="content" style="margin-bottom: 2em;">
   <h3 class="title is-4">OpenVision vs. Proprietary Encoders</h3>
   <img src="resources/openvision_teaser_v1.3.png" alt="OpenVision vs Proprietary Encoders"
-    style="max-width: 65%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
+    style="max-width: 60%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
   <p style="margin-top: 0.8em;">
     OpenVision encoders match or outperform proprietary models like OpenAI's CLIP and Google's SigLIP across multimodal tasks.
   </p>
@@ -108,7 +113,7 @@ <h3 class="title is-4">OpenVision vs. Proprietary Encoders</h3>
 <div class="content" style="margin-bottom: 2em;">
   <h3 class="title is-4">Performance under LLaVA-1.5 Framework</h3>
   <img src="resources/performance_normalized.png" alt="LLaVA-1.5 Performance Comparison"
-    style="max-width: 70%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
+    style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
   <p style="margin-top: 0.8em;">
     OpenVision demonstrates strong performance improvements over existing CLIP models under the LLaVA-1.5 multimodal framework.
   </p>
@@ -118,7 +123,7 @@ <h3 class="title is-4">Performance under LLaVA-1.5 Framework</h3>
 <div class="content" style="margin-bottom: 2em;">
   <h3 class="title is-4">Performance under Open-LLaVA-Next Framework</h3>
   <img src="resources/performance_normalized_2.png" alt="Open-LLaVA-Next Performance Comparison"
-    style="max-width: 70%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
+    style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
   <p style="margin-top: 0.8em;">
     Under Open-LLaVA-Next, OpenVision maintains its competitive edge, excelling particularly in document-heavy multimodal tasks.
   </p>
@@ -127,8 +132,8 @@ <h3 class="title is-4">Performance under Open-LLaVA-Next Framework</h3>
 <!-- Efficiency Comparison -->
 <div class="content">
   <h3 class="title is-4">Efficiency Comparison</h3>
-  <img src="resources/efficiency_llava1.5_vs_next.png" alt="Efficiency Comparison"
-    style="max-width: 70%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
+  <img src="resources/efficiency_llava1.5_vs_next_resolution_size.png" alt="Efficiency Comparison"
+    style="max-width: 50%; height: auto; border: 1px solid #ccc; box-shadow: 0 2px 6px rgba(0,0,0,0.1);">
   <p style="margin-top: 0.8em;">
     OpenVision achieves superior multimodal performance with significantly reduced training time compared to proprietary alternatives.
   </p>
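The 2×–3× training-speed claim above is plausible from back-of-envelope math. Assuming, purely for illustration, equal sample counts per stage on the 84→224→336 schedule (patch 14, from the old Model Variants table removed below) and compute roughly proportional to token count (this ignores attention's quadratic term, so it is only a rough estimate):

```python
def tokens(res: int, patch: int = 14) -> int:
    """Patch tokens for a square image at the given resolution."""
    return (res // patch) ** 2

stages = [84, 224, 336]  # progressive resolution schedule
# Illustrative assumption: equal samples per stage, cost ~ token count.
progressive = sum(tokens(r) for r in stages) / len(stages)
fixed = tokens(stages[-1])  # training everything at the final resolution
print(f"~{fixed / progressive:.1f}x fewer tokens processed")  # ~2.0x
```

Under these assumptions the progressive schedule processes about half the tokens of fixed full-resolution training, consistent with the lower end of the claimed speedup.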
@@ -138,50 +143,72 @@ <h3 class="title is-4">Efficiency Comparison</h3>
 </section>
 
 
-<!-- Model Variants -->
-<section class="section has-background-light">
-  <div class="container">
-    <h2 class="title is-3">Model Variants</h2>
-    <table class="table is-fullwidth is-striped">
-      <thead>
-        <tr>
-          <th>Variant</th><th># Params</th><th>Patch Size</th><th>Resolution Stages</th>
-        </tr>
-      </thead>
-      <tbody>
-        <tr><td>Tiny (Ti)</td><td>5.9M</td><td>8×8 / 16×16</td><td>160→224→384</td></tr>
-        <tr><td>Small (S)</td><td>22.4M</td><td>8×8 / 16×16</td><td>160→224→384</td></tr>
-        <tr><td>Base (B)</td><td>87.4M</td><td>8×8 / 16×16</td><td>84→224→336/384</td></tr>
-        <tr><td>Large (L)</td><td>303.7M</td><td>14×14</td><td>84→224→336/384</td></tr>
-        <tr><td>SoViT-400M</td><td>400M</td><td>14×14</td><td>84→224→384</td></tr>
-        <tr><td>Huge (H)</td><td>632.1M</td><td>14×14</td><td>84→224→336</td></tr>
-      </tbody>
-    </table>
-  </div>
-</section>
-
-<!-- Get Started -->
+<!-- Model Zoo (ImageNet-1K) -->
 <section class="section">
   <div class="container">
-    <h2 class="title is-3">Get Started</h2>
-    <div class="content">
-      <p>Install and load a pre-trained model with Hugging Face:</p>
-      <pre><code>pip install transformers timm
-
-from transformers import CLIPProcessor, CLIPModel
-model = CLIPModel.from_pretrained('UCSC-VLAA/OpenVision-B-16')
-processor = CLIPProcessor.from_pretrained('UCSC-VLAA/OpenVision-B-16')</code></pre>
+    <h2 class="title is-3 has-text-centered">Model Zoo (ImageNet-1K)</h2>
+    <p class="has-text-centered" style="max-width: 700px; margin: 0 auto 1.5em auto; font-size: 1.05em;">
+      We report ImageNet-1K Top-1 accuracy across OpenVision variants. All models are available in both JAX and PyTorch formats.
+    </p>
+    <div class="table-container" style="overflow-x: auto; box-shadow: 0 2px 8px rgba(0,0,0,0.08); border-radius: 10px;">
+      <table class="table is-bordered is-fullwidth has-text-centered" style="min-width: 900px;">
+        <thead class="has-background-light">
+          <tr>
+            <th>Model</th><th>Size</th><th>Patch</th><th>Resolution</th><th>Top-1</th><th>Link</th><th>JAX</th><th>PyTorch</th>
+          </tr>
+        </thead>
+        <tbody>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>160</td><td>46.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>224</td><td>49.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>16</td><td>384</td><td>51.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch16-384" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>160</td><td>51.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>224</td><td>53.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Tiny</td><td>5M</td><td>8</td><td>384</td><td>53.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-tiny-patch8-384" target="_blank">HF</a></td><td></td><td></td></tr>
+
+          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>160</td><td>63.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>224</td><td>65.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Small</td><td>22M</td><td>16</td><td>384</td><td>67.1%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch16-384" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>160</td><td>67.3%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>224</td><td>68.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Small</td><td>22M</td><td>8</td><td>384</td><td>68.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-small-patch8-384" target="_blank">HF</a></td><td></td><td></td></tr>
+
+          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>160</td><td>72.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>224</td><td>73.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Base</td><td>86M</td><td>16</td><td>384</td><td>74.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch16-384" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>160</td><td>74.8%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-160" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>224</td><td>75.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Base</td><td>86M</td><td>8</td><td>384</td><td>75.6%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-base-patch8-384" target="_blank">HF</a></td><td></td><td></td></tr>
+
+          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>84</td><td>74.7%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-84" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>224</td><td>78.5%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Large</td><td>307M</td><td>14</td><td>336</td><td>78.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-large-patch14-336" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>84</td><td>76.2%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-84" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>224</td><td>79.7%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-224" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>SoViT-400M</td><td>400M</td><td>14</td><td>384</td><td>79.9%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-so400m-patch14-384" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Huge</td><td>632M</td><td>14</td><td>84</td><td>77.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-huge-patch14-84" target="_blank">HF</a></td><td></td><td></td></tr>
+          <tr><td>ViT-Huge</td><td>632M</td><td>14</td><td>224</td><td>80.4%</td><td><a href="https://huggingface.co/UCSC-VLAA/openvision-vit-huge-patch14-224" target="_blank">HF</a></td><td></td><td></td></tr>
+        </tbody>
+      </table>
     </div>
   </div>
 </section>
 
+
+
 <!-- Footer -->
 <footer class="footer">
   <div class="content has-text-centered">
-    <p>Built by the UCSC-VLAA team. <a href="https://xhl-video.github.io/xianhangli/">Xianhang Li</a>,
-      <a href="https://yanqing0327.github.io/Yanqing.github.io/">Yanqing Liu</a>, Haoqin Tu, Hongru Zhu, Cihang Xie.</p>
+    <p>
+      Built by the UCSC-VLAA team.
+      <a href="https://xhl-video.github.io/xianhangli/" target="_blank">Xianhang Li</a>,
+      <a href="https://yanqing0327.github.io/Yanqing.github.io/" target="_blank">Yanqing Liu</a>,
+      <a href="https://www.haqtu.me/" target="_blank">Haoqin Tu</a>,
+      <a href="https://scholar.google.com/citations?user=G8NZJLIAAAAJ&hl=en" target="_blank">Hongru Zhu</a>,
+      <a href="https://cihangxie.github.io/" target="_blank">Cihang Xie</a>.
+    </p>
   </div>
 </footer>
 
+
 </body>
 </html>
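The model links this commit adds all follow one naming scheme. A tiny helper (an illustrative sketch inferred from the URLs in the diff above, not an official API) reconstructs a Hub repo id from a variant's size, patch, and resolution:

```python
def openvision_repo(size: str, patch: int, resolution: int) -> str:
    """Build a repo id following the pattern seen in the Model Zoo links,
    e.g. size='base', patch=16, resolution=224. 'so400m' and 'huge' fit too."""
    return f"UCSC-VLAA/openvision-vit-{size}-patch{patch}-{resolution}"

print(openvision_repo("base", 16, 224))
# UCSC-VLAA/openvision-vit-base-patch16-224
```

Note the commit also swaps the header link from the old datasets URL to the new collection page, which groups these repos in one place.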

docs/resources/.!43732!icon.png

Whitespace-only changes.

docs/resources/.!43733!.DS_Store

Whitespace-only changes.

docs/resources/.!43734!tw.png

Whitespace-only changes.

docs/resources/.!43736!performance_normalized.png

Whitespace-only changes.

docs/resources/.!43737!efficiency_llava1.5_vs_next.png

Whitespace-only changes.

docs/resources/.!43739!openvision_teaser_v1.3.png

Whitespace-only changes.

docs/resources/.!43740!performance_normalized_2.png

Whitespace-only changes.

docs/resources/.!43742!hf_dataset.jpg

Whitespace-only changes.

0 commit comments
