
Commit fe52bbc

Author: Anna Grebneva

Added t2t-vit-14 model (#2967)
1 parent 97d751e commit fe52bbc

File tree

10 files changed: +242 lines, -3 lines

demos/classification_demo/python/README.md

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ omz_converter --list models.lst
 * squeezenet1.0
 * squeezenet1.1
 * swin-tiny-patch4-window7-224
+* t2t-vit-14
 * vgg16
 * vgg19

demos/classification_demo/python/models.lst

Lines changed: 1 addition & 0 deletions
@@ -46,5 +46,6 @@ shufflenet-v2-x1.0
 squeezenet1.0
 squeezenet1.1
 swin-tiny-patch4-window7-224
+t2t-vit-14
 vgg16
 vgg19

models/public/device_support.md

Lines changed: 1 addition & 0 deletions
@@ -119,6 +119,7 @@
 | ssd_mobilenet_v1_fpn_coco | YES | YES | YES |
 | ssdlite_mobilenet_v2 | YES | YES | YES |
 | swin-tiny-patch4-window7-224 | YES | YES | YES |
+| t2t-vit-14 | YES | | |
 | text-recognition-resnet-fc | YES | YES | YES |
 | ultra-lightweight-face-detection-rfb-320 | YES | YES | YES |
 | ultra-lightweight-face-detection-slim-320 | YES | YES | YES |

models/public/index.md

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ You can download models and convert them into Inference Engine format (\*.xml +
 | SqueezeNet v1.0 | Caffe\* | [squeezenet1.0](./squeezenet1.0/README.md) | 57.684%/80.38% | 1.737 | 1.248 |
 | SqueezeNet v1.1 | Caffe\* | [squeezenet1.1](./squeezenet1.1/README.md) | 58.382%/81% | 0.785 | 1.236 |
 | Swin Transformer Tiny, window size=7 | PyTorch\* | [swin-tiny-patch4-window7-224](./swin-tiny-patch4-window7-224/README.md) | 81.38%/95.51% | 9.0280 | 28.8173 |
+| T2T-ViT, transformer layers number=14 | PyTorch\* | [t2t-vit-14](./t2t-vit-14/README.md) | 81.44%/95.66% | 9.5451 | 21.5498 |
 | VGG 16 | Caffe\* | [vgg16](./vgg16/README.md) | 70.968%/89.878% | 30.974 | 138.358 |
 | VGG 19 | Caffe\* | [vgg19](./vgg19/README.md) | 71.062%/89.832% | 39.3 | 143.667 |

models/public/t2t-vit-14/README.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
# t2t-vit-14

## Use Case and High-Level Description

The `t2t-vit-14` model is a variant of the Tokens-To-Token Vision Transformer (T2T-ViT), pre-trained on the ImageNet dataset for the image classification task. T2T-ViT progressively tokenizes the image into tokens and has an efficient backbone. It consists of two main components: 1) a layer-wise "Tokens-to-Token module" that models the local structure information of the image and progressively reduces the token length; 2) an efficient "T2T-ViT backbone" that draws the global attention relation on tokens from the T2T module. The model has 14 transformer layers in the T2T-ViT backbone with a hidden dimension of 384.

More details are provided in the [paper](https://arxiv.org/abs/2101.11986) and [repository](https://github.com/yitu-opensource/T2T-ViT).
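
As a rough illustration only (an assumption about usage, not part of the published files), the reference implementation that the conversion pipeline below relies on can be instantiated and run as follows, provided the T2T-ViT sources fetched by `model.yml` and the `timm` package are importable:

```
import torch
from models.t2t_vit import t2t_vit_14  # factory from the original repository

model = t2t_vit_14()  # 14 transformer layers, hidden dimension 384
model.eval()

dummy = torch.zeros(1, 3, 224, 224)  # B, C, H, W
with torch.no_grad():
    logits = model(dummy)  # shape: 1, 1000
```
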
## Specification

| Metric                          | Value          |
|---------------------------------|----------------|
| Type                            | Classification |
| GFlops                          | 9.5451         |
| MParams                         | 21.5498        |
| Source framework                | PyTorch\*      |

## Accuracy

| Metric | Value  |
| ------ | ------ |
| Top 1  | 81.44% |
| Top 5  | 95.66% |

## Input

### Original Model

Image, name: `image`, shape: `1, 3, 224, 224`, format: `B, C, H, W`, where:

- `B` - batch size
- `C` - number of channels
- `H` - image height
- `W` - image width

Expected color order: `RGB`.
Mean values - [123.675, 116.28, 103.53], scale values - [58.395, 57.12, 57.375].
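
A minimal sketch (assuming NumPy-based preprocessing outside the OpenVINO toolchain) of how these mean and scale values are applied to an `RGB` image before feeding the original model:

```
import numpy as np

mean = np.array([123.675, 116.28, 103.53], dtype=np.float32)
scale = np.array([58.395, 57.12, 57.375], dtype=np.float32)

rgb = np.zeros((224, 224, 3), dtype=np.float32)  # stand-in for an RGB image in H, W, C layout
normalized = (rgb - mean) / scale                # per-channel normalization
blob = normalized.transpose(2, 0, 1)[None]       # to B, C, H, W = 1, 3, 224, 224
```
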

### Converted Model

Image, name: `image`, shape: `1, 3, 224, 224`, format: `B, C, H, W`, where:

- `B` - batch size
- `C` - number of channels
- `H` - image height
- `W` - image width

Expected color order: `BGR`.
49+
50+
## Output
51+
52+
### Original Model
53+
54+
Object classifier according to ImageNet classes, name: `probs`, shape: `1, 1000`, output data format is `B, C`, where:
55+
56+
- `B` - batch size
57+
- `C` - vector of probabilities for all dataset classes in logits format
58+
59+
### Converted Model
60+
61+
Object classifier according to ImageNet classes, name: `probs`, shape: `1, 1000`, output data format is `B, C`, where:
62+
63+
- `B` - batch size
64+
- `C` - vector of probabilities for all dataset classes in logits format
65+
66+
## Download a Model and Convert it into Inference Engine Format

You can download models and, if necessary, convert them into Inference Engine format using the [Model Downloader and other automation tools](../../../tools/model_tools/README.md) as shown in the examples below.

An example of using the Model Downloader:
```
omz_downloader --name <model_name>
```

An example of using the Model Converter:
```
omz_converter --name <model_name>
```
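
For this particular model, substituting the name gives `omz_downloader --name t2t-vit-14` and `omz_converter --name t2t-vit-14`.
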
79+
80+
## Legal Information
81+
82+
The original model is distributed under the following
83+
[license](https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/main/LICENSE):
84+
85+
```
86+
The Clear BSD License
87+
88+
Copyright (c) [2012]-[2021] Shanghai Yitu Technology Co., Ltd.
89+
All rights reserved.
90+
91+
Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met:
92+
93+
* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
94+
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
95+
* Neither the name of Shanghai Yitu Technology Co., Ltd. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
96+
97+
NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY SHANGHAI YITU TECHNOLOGY CO., LTD. AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SHANGHAI YITU TECHNOLOGY CO., LTD. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
98+
```
models/public/t2t-vit-14/accuracy-check.yml

Lines changed: 31 additions & 0 deletions

@@ -0,0 +1,31 @@
models:
  - name: t2t-vit-14

    launchers:
      - framework: dlsdk
        adapter: classification

    datasets:
      - name: imagenet_1000_classes
        reader: pillow_imread

        preprocessing:
          - type: resize
            size: 256
            aspect_ratio_scale: greater
            use_pillow: True
            interpolation: BICUBIC
          - type: crop
            size: 224
            use_pillow: True
          - type: rgb_to_bgr

        metrics:
          - name: accuracy@top1
            type: accuracy
            top_k: 1
            reference: 0.8144
          - name: accuracy@top5
            type: accuracy
            top_k: 5
            reference: 0.9566
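
A rough Pillow equivalent of the preprocessing steps above (an illustrative sketch, assuming the conventional shorter-side-to-256 interpretation of the aspect-ratio-preserving resize):

```
from PIL import Image

def preprocess(path):
    # Resize keeping aspect ratio with bicubic interpolation, then center-crop to 224.
    image = Image.open(path).convert('RGB')
    width, height = image.size
    ratio = 256 / min(width, height)
    image = image.resize((round(width * ratio), round(height * ratio)), Image.BICUBIC)

    left = (image.width - 224) // 2
    top = (image.height - 224) // 2
    image = image.crop((left, top, left + 224, top + 224))

    return image  # the rgb_to_bgr step reorders channels afterwards
```
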

models/public/t2t-vit-14/model.py

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Copyright (c) 2022 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from torch import load
from models.t2t_vit import t2t_vit_14


def create_model(weights):
    model = t2t_vit_14()

    checkpoint = load(weights, map_location='cpu')['state_dict_ema']
    model.load_state_dict(checkpoint)

    return model
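
For orientation (an illustrative sketch, not part of the file): `pytorch_to_onnx.py` imports the module named by `--import-module` and calls the function named by `--model-name` with the `--model-param` values from `model.yml`, so the call amounts to roughly the following; the checkpoint presumably stores EMA weights under the `state_dict_ema` key.

```
from model import create_model

# The weights path is supplied via --model-param=weights=... in model.yml.
net = create_model(weights='81.5_T2T_ViT_14.pth')
net.eval()
```
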

models/public/t2t-vit-14/model.yml

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# Copyright (c) 2022 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

description: >-
  The "t2t-vit-14" model is a variant of the Tokens-To-Token Vision Transformer (T2T-ViT),
  pre-trained on the ImageNet dataset for the image classification task. T2T-ViT progressively
  tokenizes the image into tokens and has an efficient backbone. It consists of two main
  components: 1) a layer-wise "Tokens-to-Token module" that models the local structure
  information of the image and progressively reduces the token length; 2) an efficient
  "T2T-ViT backbone" that draws the global attention relation on tokens from the T2T module.
  The model has 14 transformer layers in the T2T-ViT backbone with a hidden dimension of 384.

  More details are provided in the paper <https://arxiv.org/abs/2101.11986> and repository
  <https://github.com/yitu-opensource/T2T-ViT>.
task_type: classification
files:
  - name: timm-0.5.4-py3-none-any.whl
    size: 431537
    checksum: e8f1967a8e2029fe21a43875132b4b123227b718abc35725d7f2b9fd0ef2062884ac3dd558570b51a780aad89bc375d6
    source: https://files.pythonhosted.org/packages/49/65/a83208746dc9c0d70feff7874b49780ff110810feb528df4b0ecadcbee60/timm-0.5.4-py3-none-any.whl
  - name: models/t2t_vit.py
    size: 12499
    checksum: caca4a1eced1616d403a3e582ebb30f79c9ab08be1b8789f4ddc5c34c55dccd0c95557062017e9794ff4e9b3bf017a63
    source: https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/2c25580c6b6968bba043be3c8f5d581428c6716d/models/t2t_vit.py
  - name: models/token_transformer.py
    size: 2326
    checksum: a0e33192063f3ada47a504f5381c7f4c196ddae531b0514096ea06ee29bfa94a9716ffc8410fdd58bca07c732c0b4ef9
    source: https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/2c25580c6b6968bba043be3c8f5d581428c6716d/models/token_transformer.py
  - name: models/token_performer.py
    size: 2370
    checksum: 77fd18c505f7780dd4f5a1149a208ef4b56b486bb9d89d05132321b32969ae79194ce5f49c8d194969176c1e9a4bedf6
    source: https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/2c25580c6b6968bba043be3c8f5d581428c6716d/models/token_performer.py
  - name: models/transformer_block.py
    size: 3286
    checksum: e0ae48049b38c15664267d1e6a469f2427159d938feb9d438c37f6b922b65a959fe5172b8dee099fd3ae25a63c52489f
    source: https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/2c25580c6b6968bba043be3c8f5d581428c6716d/models/transformer_block.py
  - name: 81.5_T2T_ViT_14.pth
    size: 86212823
    checksum: 98c8ae06a1e9997f3e83ad2dfbac49985bf7bd7723eb64e4b8462338163be02657998f0bd9f8997367417003caeaf7a1
    source: https://github.com/yitu-opensource/T2T-ViT/releases/download/main/81.5_T2T_ViT_14.pth.tar
postprocessing:
  - $type: unpack_archive
    format: zip
    file: timm-0.5.4-py3-none-any.whl
conversion_to_onnx_args:
  - --model-path=$dl_dir
  - --model-path=$config_dir
  - --model-name=create_model
  - --import-module=model
  - --model-param=weights=r"$dl_dir/81.5_T2T_ViT_14.pth"
  - --input-shape=1,3,224,224
  - --input-names=image
  - --output-names=probs
  - --output-file=$conv_dir/t2t-vit-14.onnx
  - --opset_version=12
model_optimizer_args:
  - --input_shape=[1,3,224,224]
  - --input=image
  - --input_model=$conv_dir/t2t-vit-14.onnx
  - --mean_values=image[123.675,116.28,103.53]
  - --scale_values=image[58.395,57.12,57.375]
  - --reverse_input_channels
  - --output=probs
framework: pytorch
license: https://raw.githubusercontent.com/yitu-opensource/T2T-ViT/main/LICENSE
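
Taken together, the `conversion_to_onnx_args` above amount to roughly the following export call performed by `pytorch_to_onnx.py` (an illustrative sketch under those assumptions, not a verbatim excerpt of the script):

```
import torch
from model import create_model  # model.py from this commit

net = create_model(weights='81.5_T2T_ViT_14.pth')  # downloaded checkpoint
dummy = torch.zeros(1, 3, 224, 224)                # --input-shape=1,3,224,224

torch.onnx.export(
    net, dummy, 't2t-vit-14.onnx',
    input_names=['image'],    # --input-names=image
    output_names=['probs'],   # --output-names=probs
    opset_version=12,         # --opset_version=12
)
```
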
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
../../../models/public/t2t-vit-14/accuracy-check.yml

tools/model_tools/src/openvino/model_zoo/internal_scripts/pytorch_to_onnx.py

Lines changed: 6 additions & 3 deletions
@@ -82,6 +82,7 @@ def parse_args():
                         help='Data type for inputs')
     parser.add_argument('--conversion-param', type=model_parameter, default=[], action='append',
                         help='Additional parameter for export')
+    parser.add_argument('--opset_version', type=int, default=11, help='The ONNX opset version')
     return parser.parse_args()


@@ -118,7 +119,8 @@ def load_model(model_name, weights, model_paths, module_name, model_params):


 @torch.no_grad()
-def convert_to_onnx(model, input_shapes, output_file, input_names, output_names, inputs_dtype, conversion_params):
+def convert_to_onnx(model, input_shapes, output_file, input_names, output_names, inputs_dtype, conversion_params,
+                    opset_version):
     """Convert PyTorch model to ONNX and check the resulting onnx model"""

     output_file.parent.mkdir(parents=True, exist_ok=True)
@@ -127,7 +129,7 @@ def convert_to_onnx(model, input_shapes, output_file, input_names, output_names,
         torch.zeros(input_shape, dtype=INPUT_DTYPE_TO_TORCH[inputs_dtype])
         for input_shape in input_shapes)
     model(*dummy_inputs)
-    torch.onnx.export(model, dummy_inputs, str(output_file), verbose=False, opset_version=11,
+    torch.onnx.export(model, dummy_inputs, str(output_file), verbose=False, opset_version=opset_version,
                       input_names=input_names.split(','), output_names=output_names.split(','), **conversion_params)

     model = onnx.load(str(output_file))
@@ -144,7 +146,8 @@ def main():
     model = load_model(args.model_name, args.weights,
                        args.model_paths, args.import_module, dict(args.model_param))

-    convert_to_onnx(model, args.input_shapes, args.output_file, args.input_names, args.output_names, args.inputs_dtype, dict(args.conversion_param))
+    convert_to_onnx(model, args.input_shapes, args.output_file, args.input_names, args.output_names, args.inputs_dtype,
+                    dict(args.conversion_param), args.opset_version)


 if __name__ == '__main__':
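
With this change, a model configuration can request a newer ONNX opset through its `conversion_to_onnx_args` (as `t2t-vit-14` does with `--opset_version=12`), while configurations that do not pass the flag keep the previous behavior of opset 11.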
