Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

# Reference
- 2021-03 **[Swin-Transformer]** Swin Transformer: Hierarchical Vision Transformer using Shifted Windows [[Paper](https://arxiv.org/abs/2103.14030)] [[Code](https://github.com/microsoft/Swin-Transformer)]
- [Illustrated Swin Transformer (in Chinese)](https://zhuanlan.zhihu.com/p/367111046)


# Brief
- Swin - `Shifted window` 
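  - Attention is computed only among the **patches inside each window**, which **reduces computation and introduces locality**
  - The windows are shifted to increase information exchange between windows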
- Tasks: `Classification/Detection/Semantic Segmentation`

## Motivation
- Background / existing problems
- What was tried / what effect each attempt had
- Application areas

Model | ViT/MSA | Swin Transformer/W-MSA
-- | -- | --
Arch | ![image](https://user-images.githubusercontent.com/2216970/115521374-ad99f980-a2bd-11eb-81ea-49a51b74c5bd.png) | ![image](https://user-images.githubusercontent.com/2216970/115521525-d3270300-a2bd-11eb-8381-5012d9036a3c.png) 
Complexity | ![image](https://user-images.githubusercontent.com/2216970/115521039-54ca6100-a2bd-11eb-9d38-c562c57af64a.png) | ![image](https://user-images.githubusercontent.com/2216970/115521067-5b58d880-a2bd-11eb-87fb-1752605e2f02.png)
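
For a quick numeric feel of the complexity comparison, the sketch below evaluates the paper's two estimates, `4hwC^2 + 2(hw)^2 C` for global MSA and `4hwC^2 + 2M^2 hwC` for W-MSA (softmax cost omitted). The function names and the example sizes (`56x56` tokens, `C=96`, `M=7`) are chosen here for illustration only.

```python
def msa_flops(h, w, C):
    """Global MSA estimate: quadratic in the number of patches h*w."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C


def wmsa_flops(h, w, C, M):
    """Window MSA estimate: linear in h*w for a fixed window size M."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C


# Illustrative sizes: 224x224 image, 4x4 patches -> 56x56 tokens.
h = w = 56
C, M = 96, 7
global_cost = msa_flops(h, w, C)
window_cost = wmsa_flops(h, w, C, M)
print(f"MSA:   {global_cost:,}")
print(f"W-MSA: {window_cost:,}")
print(f"ratio: {global_cost / window_cost:.1f}x")
```

The two estimates differ only in the second term, so the gap grows with the number of tokens `h*w` while the window size `M` stays fixed.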



## Arch
- Patch partition: `HxWx3` image ==> `H/4 x W/4` patches, each `4x4x3`
- Linear Embedding (default `C=96`) ==> `B` x `(H/4 x W/4) tokens` x `C`
- Swin Transformer blocks 
  - depths = [2, 2, 6, 2]
  - num_heads=[3, 6, 12, 24]
- Patch Merging `2x2 patches`
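
The shapes implied by the list above can be traced with a small helper. This is only a sketch: `swin_stage_shapes` is a hypothetical name, the `224x224` input is taken from the W-MSA notes further down, and it assumes a `2x2` patch merging step (halving the resolution, doubling the channels) sits in front of stages 2 to 4.

```python
def swin_stage_shapes(img_size=224, patch_size=4, embed_dim=96,
                      depths=(2, 2, 6, 2), num_heads=(3, 6, 12, 24)):
    """Return token grid, channel width and head size for each stage."""
    side = img_size // patch_size  # tokens per side after patch partition
    dim = embed_dim
    stages = []
    for i, (depth, heads) in enumerate(zip(depths, num_heads)):
        if i > 0:
            # Assumed: patch merging before stages 2-4 halves H and W, doubles C.
            side //= 2
            dim *= 2
        stages.append({
            "stage": i + 1,
            "tokens": (side, side),
            "dim": dim,
            "depth": depth,
            "heads": heads,
            "head_dim": dim // heads,
        })
    return stages


stages = swin_stage_shapes()
for s in stages:
    print(s)
```

With these defaults every stage ends up with the same per-head width, because the head count doubles together with the channel dimension.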


![image](https://user-images.githubusercontent.com/2216970/113246258-3dfea300-92eb-11eb-987b-520a5db1b948.png)
-- |
![image](https://user-images.githubusercontent.com/2216970/115517634-ec2db500-a2b9-11eb-8f61-41fe0e02bc2a.png)


### Swin Transformer Blocks
- `Stage1/2/3/4` contain `2/2/6/2` `Transformer Blocks` respectively
- Every two `Transformer Blocks` form a pair
  - The first TB uses `W-MSA`
  - The second TB uses `SW-MSA`
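
The alternation described above can be written as a rule on the block index. A minimal sketch, assuming a window size of 7 and a shift of `window_size // 2` for the shifted blocks; `block_attention_types` is a hypothetical helper, not a function from the official repository.

```python
def block_attention_types(depth, window_size=7):
    """List (attention type, shift size) for each block in one stage."""
    plan = []
    for i in range(depth):
        # Even-indexed blocks use regular windows, odd-indexed ones shift them.
        shift = 0 if i % 2 == 0 else window_size // 2
        plan.append(("W-MSA" if shift == 0 else "SW-MSA", shift))
    return plan


plan = block_attention_types(depth=6)
for index, (kind, shift) in enumerate(plan):
    print(index, kind, shift)
```

Because every stage depth in `[2, 2, 6, 2]` is even, each stage ends on a shifted block.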

![image](https://user-images.githubusercontent.com/2216970/115821721-e1982a80-a435-11eb-9eea-b9345f69f690.png) | ![image](https://user-images.githubusercontent.com/2216970/115821732-e826a200-a435-11eb-93b1-b3eb809684c2.png)
-- | --

### W-MSA & SW-MSA
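- `MSA` ==> `W-MSA` ==> `SW-MSA`
- **W-MSA**
  - `image size = 224x224`/ `patch size = 4x4` / `window size = 7x7`/`shift size = 0`
  - number of windows = `8x8` (`224/4 = 56` tokens per side, `56/7 = 8` windows per side)
- **SW-MSA**
  - `shift size = window_size//2 = 3` (along both `H` and `W`)
  - number of windows = `(8+1)x(8+1)` once the partition is shifted
- cyclic shift
  - Rolls the feature map so that the window count stays `8x8`; `shift window attention` is then computed with a mask
  - Masked positions are set to `-100`, so they are effectively ignored after softmax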
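
The window partition, cyclic shift and mask can be sketched in NumPy as follows. This is an illustration under assumptions rather than the official implementation: the function names are chosen here, the region labelling follows the common "three bands per axis" scheme, and the mask is meant to be added to the attention logits before softmax.

```python
import numpy as np


def window_partition(x, window_size):
    """Split (B, H, W, C) into (B * num_windows, window_size, window_size, C)."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)


def shifted_window_mask(H, W, window_size, shift_size, neg=-100.0):
    """Additive mask of shape (num_windows, M*M, M*M) for the rolled feature map."""
    img_mask = np.zeros((1, H, W, 1))
    bands = (slice(0, -window_size),
             slice(-window_size, -shift_size),
             slice(-shift_size, None))
    region = 0
    for hs in bands:
        for ws in bands:
            img_mask[:, hs, ws, :] = region  # label each contiguous region
            region += 1
    ids = window_partition(img_mask, window_size).reshape(-1, window_size * window_size)
    # Token pairs from different regions must not attend to each other.
    diff = ids[:, None, :] - ids[:, :, None]
    return np.where(diff != 0, neg, 0.0)


H = W = 56                      # 224 / 4 tokens per side
M, shift = 7, 7 // 2
x = np.arange(H * W, dtype=np.float64).reshape(1, H, W, 1)

windows = window_partition(x, M)                       # W-MSA: 8 x 8 windows
shifted = np.roll(x, shift=(-shift, -shift), axis=(1, 2))  # cyclic shift
mask = shifted_window_mask(H, W, M, shift)
masked_windows = int((mask != 0).any(axis=(1, 2)).sum())

# Effect of the -100 offset on a softmax over three logits.
logits = np.array([1.0, 1.0, 1.0 - 100.0])
weights = np.exp(logits - logits.max())
weights /= weights.sum()

print(windows.shape, mask.shape, masked_windows, weights)
```

Only the windows in the last row and last column of the rolled map mix tokens from different regions, so the mask is all zeros for every other window.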

![image](https://user-images.githubusercontent.com/2216970/113250145-8a011600-92f2-11eb-8681-4bf7c7c1c02c.png) | ![image](https://user-images.githubusercontent.com/2216970/115518589-ebe1e980-a2ba-11eb-9a37-beec62d9e786.png)
-- | --

## Patch Merging
- `B H W C` ==> `B H/2 W/2 4*C`
  - x0 - `B H/2 W/2 C` - even rows of `H`, even columns of `W`
  - x1 - `B H/2 W/2 C` - odd rows of `H`, even columns of `W`
  - x2 - `B H/2 W/2 C` - even rows of `H`, odd columns of `W`
  - x3 - `B H/2 W/2 C` - odd rows of `H`, odd columns of `W`
  - concat [x0, x1, x2, x3]
- `B H/2 W/2 4*C` ==> `B H/2 W/2 2*C`
  - nn.Linear(4 * dim, 2 * dim)
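
The two steps above can be sketched in NumPy. The random `weight` matrix is a stand-in for the `nn.Linear(4 * dim, 2 * dim)` projection (bias and any normalization are left out), and `patch_merging` is a name chosen for this illustration.

```python
import numpy as np


def patch_merging(x, weight):
    """Merge each 2x2 neighbourhood: (B, H, W, C) -> (B, H/2, W/2, 2C)."""
    B, H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0, "H and W must be even"
    x0 = x[:, 0::2, 0::2, :]  # even rows, even columns
    x1 = x[:, 1::2, 0::2, :]  # odd rows, even columns
    x2 = x[:, 0::2, 1::2, :]  # even rows, odd columns
    x3 = x[:, 1::2, 1::2, :]  # odd rows, odd columns
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (B, H/2, W/2, 4C)
    return merged @ weight                               # (B, H/2, W/2, 2C)


rng = np.random.default_rng(0)
dim = 96
x = rng.standard_normal((2, 56, 56, dim))
weight = rng.standard_normal((4 * dim, 2 * dim)) * 0.02  # placeholder projection
out = patch_merging(x, weight)
print(out.shape)
```

The strided slices pick one pixel from every `2x2` block, so concatenating the four slices keeps all values while halving the spatial resolution.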

![image](https://user-images.githubusercontent.com/2216970/115827715-45732100-a43f-11eb-9b05-673bad20960c.png)


# Evaluation
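- **No speed improvement?** `At the same model scale the speed does not improve, but a smaller model can reach higher accuracy`
  - Comparing `ViT-B 384` & `Swin-B 384`: [Param/FLOPS/Throughput] are all about the same
  - Comparing `ViT-B 384` & `Swin-T 224`: `Swin-T 224` reaches higher accuracy with a smaller image size and faster speed
- **Slight accuracy improvement**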


Num | Evaluation
-- | --
1 Classification | ![image](https://user-images.githubusercontent.com/2216970/113248742-1bbb5400-92f0-11eb-968c-54bd0e1517e6.png)
2 Detection | ![image](https://user-images.githubusercontent.com/2216970/115519020-66ab0480-a2bb-11eb-948d-bbc365e3ccc6.png)
3 Segmentation | ![image](https://user-images.githubusercontent.com/2216970/115519063-6f033f80-a2bb-11eb-83ff-ffa4f362f421.png)
4 Ablation | ![image](https://user-images.githubusercontent.com/2216970/115519131-7c202e80-a2bb-11eb-932c-96744ca1d6aa.png)
5 Speed| ![image](https://user-images.githubusercontent.com/2216970/115519193-8c380e00-a2bb-11eb-868b-c122306050fd.png)
6 | ![image](https://user-images.githubusercontent.com/2216970/115519217-93f7b280-a2bb-11eb-88ff-7510aa06eac1.png)


# Tricks

