
Commit b749e1f

update
1 parent 8ef1a71 commit b749e1f


43 files changed: +9317 / −3615 lines

README.md

Lines changed: 14 additions & 11 deletions
@@ -15,7 +15,7 @@ Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, and Guiguang Ding\
<summary>
<font size="+1">Abstract</font>
</summary>
- Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with nearly 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M3, obtains 81.4\% accuracy with only 1.3ms latency.
+ Recently, lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This ends up with a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. On ImageNet, RepViT achieves over 80\% top-1 accuracy with 1ms latency on an iPhone 12, which is the first time for a lightweight model, to the best of our knowledge. Our largest model, RepViT-M2.3, obtains 83.7\% accuracy with only 2.3ms latency.
</details>

<br/>
@@ -29,18 +29,21 @@ Recently, lightweight Vision Transformers (ViTs) demonstrate superior performanc

### Models

- | Model | Top-1 (300)| #params | MACs | Latency | Ckpt | Core ML | Log |
+ | Model | Top-1 (300 / 450)| #params | MACs | Latency | Ckpt | Core ML | Log |
|:---------------|:----:|:---:|:--:|:--:|:--:|:--:|:--:|
- | RepViT-M1 | 78.5 | 5.1M | 0.8G | 0.9ms | [M1](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m1_distill_300.pth) | [M1](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m1_224.mlmodel) | [M1](./logs/repvit_m1_train.log) |
- | RepViT-M2 | 80.6 | 8.2M | 1.3G | 1.1ms | [M2](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m2_distill_300.pth) | [M2](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m2_224.mlmodel) | [M2](./logs/repvit_m2_train.log) |
- | RepViT-M3 | 81.4 | 10.1M | 1.9G | 1.3ms | [M3](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m3_distill_300.pth) | [M3](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m3_224.mlmodel) | [M3](./logs/repvit_m3_train.log) |
+ | RepViT-M0.9 | 78.7 / 79.1 | 5.1M | 0.8G | 0.9ms | [M0.9-300e]() / [M0.9-450e]() | [M0.9-300e]() / [M0.9-450e]() | [M0.9-300e](./logs/repvit_m0_9_distill_300e.txt) / [M0.9-450e](./logs/repvit_m0_9_distill_450e.txt) |
+ | RepViT-M1.0 | 80.0 / 80.3 | 6.8M | 1.1G | 1.0ms | [M1.0-300e]() / [M1.0-450e]() | [M1.0-300e]() / [M1.0-450e]() | [M1.0-300e](./logs/repvit_m1_0_distill_300e.txt) / [M1.0-450e](./logs/repvit_m1_0_distill_450e.txt) |
+ | RepViT-M1.1 | 80.7 / 81.1 | 8.2M | 1.3G | 1.1ms | [M1.1-300e]() / [M1.1-450e]() | [M1.1-300e]() / [M1.1-450e]() | [M1.1-300e](./logs/repvit_m1_1_distill_300e.txt) / [M1.1-450e](./logs/repvit_m1_1_distill_450e.txt) |
+ | RepViT-M1.5 | 82.3 / 82.5 | 14.0M | 2.3G | 1.5ms | [M1.5-300e]() / [M1.5-450e]() | [M1.5-300e]() / [M1.5-450e]() | [M1.5-300e](./logs/repvit_m1_5_distill_300e.txt) / [M1.5-450e](./logs/repvit_m1_5_distill_450e.txt) |
+ | RepViT-M2.3 | 83.3 / 83.7 | 22.9M | 4.5G | 2.3ms | [M2.3-300e]() / [M2.3-450e]() | [M2.3-300e]() / [M2.3-450e]() | [M2.3-300e](./logs/repvit_m2_3_distill_300e.txt) / [M2.3-450e](./logs/repvit_m2_3_distill_450e.txt) |
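As a quick sanity check on the #params column, each variant can be instantiated and its parameters counted. A minimal sketch, assuming the repvit_* names below are registered with timm once the repository's model definitions are imported; the exact name for the M1.0 variant ('repvit_m1_0') is an assumption based on the naming pattern in this commit:

```python
from timm.models import create_model

import model  # assumed: importing the repo's model definitions registers the repvit_* variants with timm

# Variant names mirror the table above; 'repvit_m1_0' in particular is an assumed registration name.
for name in ['repvit_m0_9', 'repvit_m1_0', 'repvit_m1_1', 'repvit_m1_5', 'repvit_m2_3']:
    net = create_model(name)
    n_params = sum(p.numel() for p in net.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")  # compare against the #params column
```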

Tips: Convert a training-time RepViT into the inference-time structure
```
from timm.models import create_model
import utils
- model = create_model('repvit_m1')
+ model = create_model('repvit_m0_9')
utils.replace_batchnorm(model)
```
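For context, a minimal end-to-end sketch of the conversion followed by a dummy forward pass; how the repvit_* models get registered with timm, and the 1000-class output, are assumptions:

```python
import torch
from timm.models import create_model

import utils  # repository utility that provides replace_batchnorm
import model  # assumed: importing this registers the repvit_* architectures with timm

net = create_model('repvit_m0_9')
utils.replace_batchnorm(net)  # fold BatchNorm / re-parameterized branches into plain convolutions
net.eval()

with torch.no_grad():
    logits = net(torch.randn(1, 3, 224, 224))  # dummy 224x224 RGB input
print(logits.shape)  # expected: torch.Size([1, 1000]) for ImageNet-1K
```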

@@ -49,15 +52,15 @@ utils.replace_batchnorm(model)
The latency reported in RepViT for iPhone 12 (iOS 16) uses the benchmark tool from [XCode 14](https://developer.apple.com/videos/play/wwdc2022/10027/).
For example, here is a latency measurement of RepViT-M1:

- ![](./figures/repvit_m1_latency.png)
+ ![](./figures/repvit_m0_9_latency.png)

Tips: export the model to Core ML model
```
- python export_coreml.py --model repvit_m1 --ckpt pretrain/repvit_m1_distill_300.pth
+ python export_coreml.py --model repvit_m0_9 --ckpt pretrain/repvit_m0_9_distill_300e.pth
```
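The contents of export_coreml.py are not part of this diff; as a rough idea, a Core ML export of the fused model typically looks like the coremltools sketch below. The input name, input shape, and output filename are assumptions:

```python
import torch
import coremltools as ct
from timm.models import create_model

import utils  # provides replace_batchnorm
import model  # assumed: registers the repvit_* architectures with timm

net = create_model('repvit_m0_9')
utils.replace_batchnorm(net)
net.eval()

example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(net, example)  # Core ML conversion expects a traced/scripted module

# Convert to the neural-network backend so the result can be saved as a .mlmodel file.
mlmodel = ct.convert(traced,
                     inputs=[ct.TensorType(name="image", shape=example.shape)],
                     convert_to="neuralnetwork")
mlmodel.save("repvit_m0_9_224.mlmodel")
```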
Tips: measure the throughput on GPU
```
- python speed_gpu.py --model repvit_m1
+ python speed_gpu.py --model repvit_m0_9
```
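speed_gpu.py itself is not shown in this commit; a rough GPU throughput measurement along these lines would work, with batch size, warm-up, and iteration counts chosen arbitrarily here:

```python
import time
import torch
from timm.models import create_model

import utils  # provides replace_batchnorm
import model  # assumed: registers the repvit_* architectures with timm

net = create_model('repvit_m0_9')
utils.replace_batchnorm(net)
net = net.cuda().eval()

batch = torch.randn(256, 3, 224, 224, device='cuda')
with torch.no_grad():
    for _ in range(10):  # warm-up iterations
        net(batch)
    torch.cuda.synchronize()
    start = time.time()
    iters = 30
    for _ in range(iters):
        net(batch)
    torch.cuda.synchronize()
elapsed = time.time() - start
print(f"throughput: {iters * batch.shape[0] / elapsed:.1f} images/s")
```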


@@ -83,14 +86,14 @@ Download and extract ImageNet train and val images from http://image-net.org/. T
To train RepViT-M1 on an 8-GPU machine:

```
- python -m torch.distributed.launch --nproc_per_node=8 --master_port 12346 --use_env main.py --model repvit_m1 --data-path ~/imagenet --dist-eval
+ python -m torch.distributed.launch --nproc_per_node=8 --master_port 12346 --use_env main.py --model repvit_m0_9 --data-path ~/imagenet --dist-eval
```
Tips: specify your data path and model name!

### Testing
For example, to test RepViT-M1:
```
- python main.py --eval --model repvit_m3 --resume pretrain/repvit_m3_distill_300.pth --data-path ~/imagenet
+ python main.py --eval --model repvit_m0_9 --resume pretrain/repvit_m0_9_distill_300e.pth --data-path ~/imagenet
```

## Downstream Tasks

detection/README.md

Lines changed: 5 additions & 4 deletions
@@ -5,8 +5,9 @@ Detection and instance segmentation on MS COCO 2017 is implemented based on [MMD
## Models
| Model | $AP^b$ | $AP_{50}^b$ | $AP_{75}^b$ | $AP^m$ | $AP_{50}^m$ | $AP_{75}^m$ | Latency | Ckpt | Log |
|:---------------|:----:|:---:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
- | RepViT-M2 | 39.8 | 61.9 | 43.5 | 37.2 | 58.8 | 40.1 | 4.9ms | [M2](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m2_coco.pth) | [M2](./logs/repvit_m2_coco.json) |
- | RepViT-M3 | 41.1 | 63.1 | 45.0 | 38.3 | 60.4 | 41.0 | 5.9ms | [M3](https://github.com/jameslahm/RepViT/releases/download/v1.0/repvit_m3_coco.pth) | [M3](./logs/repvit_m3_coco.json) |
+ | RepViT-M1_1 | 39.8 | 61.9 | 43.5 | 37.2 | 58.8 | 40.1 | 4.9ms | [M1_1]() | [M1_1](./logs/repvit_m1_1_coco.json) |
+ | RepViT-M1_5 | 41.6 | 63.2 | 45.3 | 38.6 | 60.5 | 41.5 | 6.4ms | [M1_5]() | [M1_5](./logs/repvit_m1_5_coco.json) |
+ | RepViT-M2_3 | 44.6 | 66.1 | 48.8 | 40.8 | 63.6 | 43.9 | 9.9ms | [M2_3]() | [M2_3](./logs/repvit_m2_3_coco.json) |

## Installation

@@ -43,15 +44,15 @@ We provide a multi-GPU testing script, specify config file, checkpoint, and numb
For example, to test RepViT-M1 on COCO 2017 on an 8-GPU machine,

```
- ./dist_test.sh configs/mask_rcnn_repvit_m2_fpn_1x_coco.py path/to/repvit_m2_coco.pth 8 --eval bbox segm
+ ./dist_test.sh configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py path/to/repvit_m1_1_coco.pth 8 --eval bbox segm
```

## Training
Download ImageNet-1K pretrained weights into `./pretrain`

We provide PyTorch distributed data parallel (DDP) training script `dist_train.sh`, for example, to train RepViT-M1 on an 8-GPU machine:
```
- ./dist_train.sh configs/mask_rcnn_repvit_m2_fpn_1x_coco.py 8
+ ./dist_train.sh configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py 8
```
Tips: specify configs and #GPUs!

detection/configs/mask_rcnn_repvit_m2_fpn_1x_coco.py renamed to detection/configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py

Lines changed: 2 additions & 2 deletions
@@ -7,10 +7,10 @@
# optimizer
model = dict(
    backbone=dict(
-       type='repvit_m2',
+       type='repvit_m1_1',
        init_cfg=dict(
            type='Pretrained',
-           checkpoint='pretrain/repvit_m2_distill_300.pth',
+           checkpoint='pretrain/repvit_m1_1_distill_300e.pth',
        ),
        out_indices = [2,6,20,24]
    ),

detection/configs/mask_rcnn_repvit_m3_fpn_1x_coco.py renamed to detection/configs/mask_rcnn_repvit_m1_5_fpn_1x_coco.py

Lines changed: 3 additions & 3 deletions
@@ -7,12 +7,12 @@
# optimizer
model = dict(
    backbone=dict(
-       type='repvit_m3',
+       type='repvit_m1_5',
        init_cfg=dict(
            type='Pretrained',
-           checkpoint='pretrain/repvit_m3_distill_300.pth',
+           checkpoint='pretrain/repvit_m1_5_distill_300e.pth',
        ),
-       out_indices=[4,10,30,34]
+       out_indices=[4, 10, 36, 42]
    ),
    neck=dict(
        type='FPN',

detection/configs/mask_rcnn_repvit_m2_3_fpn_1x_coco.py (new file)

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+ _base_ = [
+     '_base_/models/mask_rcnn_r50_fpn.py',
+     '_base_/datasets/coco_instance.py',
+     '_base_/schedules/schedule_1x.py',
+     '_base_/default_runtime.py'
+ ]
+ # optimizer
+ model = dict(
+     backbone=dict(
+         type='repvit_m2_3',
+         init_cfg=dict(
+             type='Pretrained',
+             checkpoint='pretrain/repvit_m2_3_distill_450e.pth',
+         ),
+         out_indices=[6, 14, 50, 54]
+     ),
+     neck=dict(
+         type='FPN',
+         in_channels=[80, 160, 320, 640],
+         out_channels=256,
+         num_outs=5))
+ # optimizer
+ optimizer = dict(_delete_=True, type='AdamW', lr=0.0002, weight_decay=0.05)  # 0.0001
+ optimizer_config = dict(grad_clip=None)

detection/eval.sh

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
- PORT=12345 ./dist_test.sh configs/mask_rcnn_repvit_m3_fpn_1x_coco.py det_pretrain/repvit_m3_coco.pth 8 --eval bbox segm
+ PORT=12345 ./dist_test.sh configs/mask_rcnn_repvit_m1_1_fpn_1x_coco.py det_pretrain/repvit_m1_1_coco.pth 8 --eval bbox segm
