
Commit 8df4270: init (0 parents)


117 files changed: 16,299 additions, 0 deletions


.gitignore

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
wandb
coreml
pretrain
**/__pycache__
pretrain
ignore
*.zip
checkpoints
trt

README.md

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# [RepViT: Revisiting Mobile CNN From ViT Perspective](https://arxiv.org/abs/2307.09283)

Official PyTorch implementation of **RepViT**, from the following paper:

[RepViT: Revisiting Mobile CNN From ViT Perspective](https://arxiv.org/abs/2307.09283).\
Ao Wang, Hui Chen, Zijia Lin, Hengjun Pu, and Guiguang Ding\
[[`arXiv`](https://arxiv.org/abs/2307.09283)]

<p align="center">
  <img src="figures/latency.png" width=70%> <br>
  Models are trained on ImageNet-1K and deployed on an iPhone 12 with Core ML Tools to measure latency.
</p>

<details>
  <summary>
  <font size="+1">Abstract</font>
  </summary>
Recently, lightweight Vision Transformers (ViTs) have demonstrated superior performance and lower latency than lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. This improvement is usually attributed to the multi-head self-attention module, which enables the model to learn global representations. However, the architectural disparities between lightweight ViTs and lightweight CNNs have not been adequately examined. In this study, we revisit the efficient design of lightweight CNNs and emphasize their potential for mobile devices. We incrementally enhance the mobile-friendliness of a standard lightweight CNN, specifically MobileNetV3, by integrating the efficient architectural choices of lightweight ViTs. This yields a new family of pure lightweight CNNs, namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency across various vision tasks. On ImageNet, RepViT achieves over 80% top-1 accuracy with nearly 1ms latency on an iPhone 12, which, to the best of our knowledge, is a first for a lightweight model. Our largest model, RepViT-M3, obtains 81.4% accuracy with only 1.3ms latency.
</details>

<br>

## Classification on ImageNet-1K

### Models

| Model | Top-1 (300 epochs) | #Params | MACs | Latency | Ckpt | Core ML | Log |
|:---------------|:----:|:---:|:--:|:--:|:--:|:--:|:--:|
| RepViT-M1 | 78.5 | 5.1M | 0.8G | 0.9ms | [M1](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m1_distill_300.pth) | [M1](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m1_224.mlmodel) | [M1](./logs/repvit_m1_train.log) |
| RepViT-M2 | 80.6 | 8.8M | 1.4G | 1.1ms | [M2](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m2_distill_300.pth) | [M2](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m2_224.mlmodel) | [M2](./logs/repvit_m2_train.log) |
| RepViT-M3 | 81.4 | 10.1M | 1.9G | 1.3ms | [M3](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m3_distill_300.pth) | [M3](https://github.com/jameslahm/RepViT/releases/download/untagged-75eb9e1fea235b938f50/repvit_m3_224.mlmodel) | [M3](./logs/repvit_m3_train.log) |

Tip: convert a training-time RepViT model into the inference-time structure:
```python
from timm.models import create_model
import utils

model = create_model('repvit_m1')
utils.replace_batchnorm(model)
```
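
A minimal sanity check after the snippet above, as a sketch: it assumes the fused model still accepts a standard 224×224 ImageNet input and produces 1000 class logits, and that the repository's model definitions have been imported so that `repvit_m1` is registered with timm.
```python
import torch
from timm.models import create_model
import utils

model = create_model('repvit_m1')
utils.replace_batchnorm(model)  # fuse BatchNorm layers for inference, as in the snippet above
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))  # dummy ImageNet-sized input (assumed 224x224)
print(out.shape)  # expected: torch.Size([1, 1000]) for ImageNet-1K classification
```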

## Latency Measurement

The latency reported for RepViT on an iPhone 12 (iOS 16) is measured with the benchmark tool from [Xcode 14](https://developer.apple.com/videos/play/wwdc2022/10027/).
For example, here is a latency measurement of RepViT-M1:

![](./figures/repvit_m1_latency.png)

Tip: export the model to a Core ML model:
```
python export_coreml.py --model repvit_m1 --ckpt pretrain/repvit_m1_distill_300.pth
```
Tip: measure the throughput on a GPU:
```
python speed_gpu.py --model repvit_m1
```


## ImageNet

### Prerequisites
A `conda` virtual environment is recommended.
```
conda create -n repvit python=3.8
pip install -r requirements.txt
```

### Data preparation

Download and extract the ImageNet train and val images from http://image-net.org/. The training and validation data are expected to be in the `train` and `val` folders, respectively:
```
|-- /path/to/imagenet/
    |-- train
    |-- val
```

### Training
To train RepViT-M1 on an 8-GPU machine:

```
python -m torch.distributed.launch --nproc_per_node=8 --master_port 12346 --use_env main.py --model repvit_m1 --data-path ~/imagenet --dist-eval
```
Tip: remember to specify your own data path and model name.

### Testing
For example, to test RepViT-M3:
```
python main.py --eval --model repvit_m3 --resume pretrain/repvit_m3_distill_300.pth --data-path ~/imagenet
```

## Downstream Tasks
[Object Detection and Instance Segmentation](detection/README.md)<br>
[Semantic Segmentation](segmentation/README.md)

## Acknowledgement

The classification (ImageNet) code base is partly built on [LeViT](https://github.com/facebookresearch/LeViT), [PoolFormer](https://github.com/sail-sg/poolformer), and [EfficientFormer](https://github.com/snap-research/EfficientFormer).

The detection and segmentation pipelines are from [MMCV](https://github.com/open-mmlab/mmcv) ([MMDetection](https://github.com/open-mmlab/mmdetection) and [MMSegmentation](https://github.com/open-mmlab/mmsegmentation)).

Thanks for the great implementations!

## Citation

If our code or models help your work, please cite our paper:
```BibTeX

```

data/__init__.py

Whitespace-only changes.

data/datasets.py

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
1+
'''
2+
Build trainining/testing datasets
3+
'''
4+
import os
5+
import json
6+
7+
from torchvision import datasets, transforms
8+
from torchvision.datasets.folder import ImageFolder, default_loader
9+
import torch
10+
11+
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD
12+
from timm.data import create_transform
13+
14+
try:
15+
from timm.data import TimmDatasetTar
16+
except ImportError:
17+
# for higher version of timm
18+
from timm.data import ImageDataset as TimmDatasetTar
19+
20+
class INatDataset(ImageFolder):
21+
def __init__(self, root, train=True, year=2018, transform=None, target_transform=None,
22+
category='name', loader=default_loader):
23+
self.transform = transform
24+
self.loader = loader
25+
self.target_transform = target_transform
26+
self.year = year
27+
# assert category in ['kingdom','phylum','class','order','supercategory','family','genus','name']
28+
path_json = os.path.join(
29+
root, f'{"train" if train else "val"}{year}.json')
30+
with open(path_json) as json_file:
31+
data = json.load(json_file)
32+
33+
with open(os.path.join(root, 'categories.json')) as json_file:
34+
data_catg = json.load(json_file)
35+
36+
path_json_for_targeter = os.path.join(root, f"train{year}.json")
37+
38+
with open(path_json_for_targeter) as json_file:
39+
data_for_targeter = json.load(json_file)
40+
41+
targeter = {}
42+
indexer = 0
43+
for elem in data_for_targeter['annotations']:
44+
king = []
45+
king.append(data_catg[int(elem['category_id'])][category])
46+
if king[0] not in targeter.keys():
47+
targeter[king[0]] = indexer
48+
indexer += 1
49+
self.nb_classes = len(targeter)
50+
51+
self.samples = []
52+
for elem in data['images']:
53+
cut = elem['file_name'].split('/')
54+
target_current = int(cut[2])
55+
path_current = os.path.join(root, cut[0], cut[2], cut[3])
56+
57+
categors = data_catg[target_current]
58+
target_current_true = targeter[categors[category]]
59+
self.samples.append((path_current, target_current_true))
60+
61+
# __getitem__ and __len__ inherited from ImageFolder
62+
63+
64+
def build_dataset(is_train, args):
65+
transform = build_transform(is_train, args)
66+
67+
if args.data_set == 'CIFAR':
68+
dataset = datasets.CIFAR100(
69+
args.data_path, train=is_train, transform=transform)
70+
nb_classes = 100
71+
elif args.data_set == 'IMNET':
72+
prefix = 'train' if is_train else 'val'
73+
data_dir = os.path.join(args.data_path, f'{prefix}.tar')
74+
if os.path.exists(data_dir):
75+
dataset = TimmDatasetTar(data_dir, transform=transform)
76+
else:
77+
root = os.path.join(args.data_path, 'train' if is_train else 'val')
78+
dataset = datasets.ImageFolder(root, transform=transform)
79+
nb_classes = 1000
80+
elif args.data_set == 'IMNETEE':
81+
root = os.path.join(args.data_path, 'train' if is_train else 'val')
82+
dataset = datasets.ImageFolder(root, transform=transform)
83+
nb_classes = 10
84+
elif args.data_set == 'FLOWERS':
85+
root = os.path.join(args.data_path, 'train' if is_train else 'test')
86+
dataset = datasets.ImageFolder(root, transform=transform)
87+
if is_train:
88+
dataset = torch.utils.data.ConcatDataset(
89+
[dataset for _ in range(100)])
90+
nb_classes = 102
91+
elif args.data_set == 'INAT':
92+
dataset = INatDataset(args.data_path, train=is_train, year=2018,
93+
category=args.inat_category, transform=transform)
94+
nb_classes = dataset.nb_classes
95+
elif args.data_set == 'INAT19':
96+
dataset = INatDataset(args.data_path, train=is_train, year=2019,
97+
category=args.inat_category, transform=transform)
98+
nb_classes = dataset.nb_classes
99+
return dataset, nb_classes
100+
101+
102+
def build_transform(is_train, args):
103+
resize_im = args.input_size > 32
104+
if is_train:
105+
# this should always dispatch to transforms_imagenet_train
106+
transform = create_transform(
107+
input_size=args.input_size,
108+
is_training=True,
109+
color_jitter=args.color_jitter,
110+
auto_augment=args.aa,
111+
interpolation=args.train_interpolation,
112+
re_prob=args.reprob,
113+
re_mode=args.remode,
114+
re_count=args.recount,
115+
)
116+
if not resize_im:
117+
# replace RandomResizedCropAndInterpolation with
118+
# RandomCrop
119+
transform.transforms[0] = transforms.RandomCrop(
120+
args.input_size, padding=4)
121+
return transform
122+
123+
t = []
124+
if args.finetune:
125+
t.append(
126+
transforms.Resize((args.input_size, args.input_size),
127+
interpolation=3)
128+
)
129+
else:
130+
if resize_im:
131+
size = int((256 / 224) * args.input_size)
132+
t.append(
133+
# to maintain same ratio w.r.t. 224 images
134+
transforms.Resize(size, interpolation=3),
135+
)
136+
t.append(transforms.CenterCrop(args.input_size))
137+
138+
t.append(transforms.ToTensor())
139+
t.append(transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD))
140+
return transforms.Compose(t)
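
A minimal sketch of how `build_dataset` above might be driven outside of `main.py`. The `Namespace` fields and their values (input size, augmentation settings, paths) are illustrative assumptions, not the repository's defaults; `main.py` builds an equivalent namespace via argparse.

```python
from argparse import Namespace

from torch.utils.data import DataLoader

from data.datasets import build_dataset  # assumes the repository root is on PYTHONPATH

# Hypothetical argument namespace covering the fields build_transform/build_dataset read.
args = Namespace(
    data_set='IMNET', data_path='/path/to/imagenet', input_size=224,
    color_jitter=0.4, aa='rand-m9-mstd0.5-inc1', train_interpolation='bicubic',
    reprob=0.25, remode='pixel', recount=1, finetune='',
)

dataset_train, nb_classes = build_dataset(is_train=True, args=args)
dataset_val, _ = build_dataset(is_train=False, args=args)
print(nb_classes)  # 1000 for the 'IMNET' setting

loader = DataLoader(dataset_train, batch_size=64, shuffle=True, num_workers=8)
```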

data/samplers.py

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
'''
2+
Build samplers for data loading
3+
'''
4+
import torch
5+
import torch.distributed as dist
6+
import math
7+
8+
9+
class RASampler(torch.utils.data.Sampler):
10+
"""Sampler that restricts data loading to a subset of the dataset for distributed,
11+
with repeated augmentation.
12+
It ensures that different each augmented version of a sample will be visible to a
13+
different process (GPU)
14+
Heavily based on torch.utils.data.DistributedSampler
15+
"""
16+
17+
def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True):
18+
if num_replicas is None:
19+
if not dist.is_available():
20+
raise RuntimeError(
21+
"Requires distributed package to be available")
22+
num_replicas = dist.get_world_size()
23+
if rank is None:
24+
if not dist.is_available():
25+
raise RuntimeError(
26+
"Requires distributed package to be available")
27+
rank = dist.get_rank()
28+
self.dataset = dataset
29+
self.num_replicas = num_replicas
30+
self.rank = rank
31+
self.epoch = 0
32+
self.num_samples = int(
33+
math.ceil(len(self.dataset) * 3.0 / self.num_replicas))
34+
self.total_size = self.num_samples * self.num_replicas
35+
# self.num_selected_samples = int(math.ceil(len(self.dataset) / self.num_replicas))
36+
self.num_selected_samples = int(math.floor(
37+
len(self.dataset) // 256 * 256 / self.num_replicas))
38+
self.shuffle = shuffle
39+
40+
def __iter__(self):
41+
# deterministically shuffle based on epoch
42+
g = torch.Generator()
43+
g.manual_seed(self.epoch)
44+
if self.shuffle:
45+
indices = torch.randperm(len(self.dataset), generator=g).tolist()
46+
else:
47+
indices = list(range(len(self.dataset)))
48+
49+
# add extra samples to make it evenly divisible
50+
indices = [ele for ele in indices for i in range(3)]
51+
indices += indices[:(self.total_size - len(indices))]
52+
assert len(indices) == self.total_size
53+
54+
# subsample
55+
indices = indices[self.rank:self.total_size:self.num_replicas]
56+
assert len(indices) == self.num_samples
57+
58+
return iter(indices[:self.num_selected_samples])
59+
60+
def __len__(self):
61+
return self.num_selected_samples
62+
63+
def set_epoch(self, epoch):
64+
self.epoch = epoch
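
A sketch of how `RASampler` might be plugged into a distributed training loop. The helper function name, batch size, and worker count are illustrative assumptions; `main.py` wires this up from its own arguments, and the process group is assumed to be initialized already (e.g. via `torch.distributed.launch`).

```python
import torch
from torch.utils.data import DataLoader

from data.samplers import RASampler  # assumes the repository root is on PYTHONPATH


def build_train_loader(dataset_train, batch_size=256, num_workers=8):
    # One RASampler per process; each GPU sees a different slice of the repeated indices.
    num_tasks = torch.distributed.get_world_size()
    global_rank = torch.distributed.get_rank()
    sampler = RASampler(dataset_train, num_replicas=num_tasks, rank=global_rank, shuffle=True)
    loader = DataLoader(
        dataset_train, sampler=sampler, batch_size=batch_size,
        num_workers=num_workers, pin_memory=True, drop_last=True,
    )
    return loader, sampler


# Per-epoch usage: reseed the sampler so each epoch gets a different shuffle.
# loader, sampler = build_train_loader(dataset_train)
# for epoch in range(num_epochs):
#     sampler.set_epoch(epoch)
#     for images, targets in loader:
#         ...
```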
