<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->

# MobileViT

<div style="float: right;">
    <div class="flex flex-wrap space-x-2">
        <img alt="PyTorch" src="https://img.shields.io/badge/PyTorch-DE3412?style=flat&logo=pytorch&logoColor=white">
        <img alt="TensorFlow" src="https://img.shields.io/badge/TensorFlow-FF6F00?style=flat&logo=tensorflow&logoColor=white">
    </div>
</div>

[MobileViT](https://huggingface.co/papers/2110.02178) is a lightweight, general-purpose vision transformer for mobile devices, proposed by Sachin Mehta and Mohammad Rastegari. It combines the efficiency and spatial inductive biases of CNNs with the global context modeling of transformers by treating transformers as convolutions, enabling global information processing without the heavy computational cost of standard ViTs. On ImageNet-1k, MobileViT reaches 78.4% top-1 accuracy with about 6 million parameters, outperforming MobileNetv3 (CNN-based) and DeiT (ViT-based) at a similar parameter count.

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/MobileViT.png"/>
</div>

You can find all the original MobileViT checkpoints under the [Apple](https://huggingface.co/apple/models?search=mobilevit) organization.

> [!TIP]
> This model was contributed by [matthijs](https://huggingface.co/Matthijs), and the TensorFlow version was contributed by [sayakpaul](https://huggingface.co/sayakpaul). The original code and weights can be found at [apple/ml-cvnets](https://github.com/apple/ml-cvnets).
>
> Click on the MobileViT models in the right sidebar for more examples of how to apply MobileViT to different vision tasks.

The example below demonstrates how to classify an image with [`Pipeline`] and the [`AutoModel`] class.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
import torch
from transformers import pipeline

classifier = pipeline(
    task="image-classification",
    model="apple/mobilevit-small",
    torch_dtype=torch.float16,
    device=0,
)

preds = classifier("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg")
print(f"Prediction: {preds}")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
import requests
from PIL import Image
from transformers import AutoImageProcessor, MobileViTForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

image_processor = AutoImageProcessor.from_pretrained("apple/mobilevit-small", use_fast=True)
model = MobileViTForImageClassification.from_pretrained("apple/mobilevit-small").to(device)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()

predicted_class_label = model.config.id2label[predicted_class_id]
print(f"The predicted class label is: {predicted_class_label}")
```

</hfoption>
</hfoptions>
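
MobileViT also provides semantic segmentation checkpoints that pair the backbone with a DeepLabV3 head. The snippet below is a minimal sketch, assuming the `apple/deeplabv3-mobilevit-small` checkpoint and [`MobileViTForSemanticSegmentation`]; the logits come out at a lower resolution than the input image and can be upsampled or post-processed as needed.

```python
# Minimal semantic segmentation sketch (assumes the apple/deeplabv3-mobilevit-small checkpoint).
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, MobileViTForSemanticSegmentation

image_processor = AutoImageProcessor.from_pretrained("apple/deeplabv3-mobilevit-small")
model = MobileViTForSemanticSegmentation.from_pretrained("apple/deeplabv3-mobilevit-small")

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = image_processor(image, return_tensors="pt")

with torch.no_grad():
    # logits shape: (batch_size, num_labels, height, width) at a reduced resolution
    logits = model(**inputs).logits

# Assign each pixel the PASCAL VOC class with the highest score.
segmentation_map = logits.argmax(dim=1)[0]
print(segmentation_map.shape)
```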

## Notes

- MobileViT is more like a CNN than a Transformer model. It does not operate on sequence data but on batches of images, and the backbone outputs a feature map rather than token embeddings. Follow [this tutorial](https://keras.io/examples/vision/mobilevit) for a lightweight introduction.
- Use [`MobileViTImageProcessor`] to preprocess images. If you do your own preprocessing, note that the pretrained checkpoints expect images in **BGR** pixel order (not RGB).
- The image classification checkpoints are pretrained on [ImageNet-1k](https://huggingface.co/datasets/imagenet-1k) (also referred to as ILSVRC 2012, a collection of 1.3 million images and 1,000 classes).
- The segmentation models use a [DeepLabV3](https://huggingface.co/papers/1706.05587) head and are pretrained on [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/), as shown in the segmentation example above.
- The TensorFlow versions of MobileViT are fully compatible with [TensorFlow Lite](https://www.tensorflow.org/lite), making them a good fit for edge and mobile deployment. You can convert a checkpoint as shown below.
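
Use the following code to convert a MobileViT checkpoint (image classification or semantic segmentation) to a TensorFlow Lite model.

```py
from transformers import TFMobileViTForImageClassification
import tensorflow as tf

model_ckpt = "apple/mobilevit-xx-small"
model = TFMobileViTForImageClassification.from_pretrained(model_ckpt)

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
tflite_filename = model_ckpt.split("/")[-1] + ".tflite"
with open(tflite_filename, "wb") as f:
    f.write(tflite_model)
```

The resulting model is only about a MB in size, making it a good fit for mobile applications where resources and network bandwidth can be constrained.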

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with MobileViT.

- [`MobileViTForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb).
- Refer to the [image classification](../tasks/image_classification) and [semantic segmentation](../tasks/semantic_segmentation) task guides.

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## MobileViTConfig

[[autodoc]] MobileViTConfig
    - call

</tf>
</frameworkcontent>