Thanks for sharing your work; your code is very elegant and has inspired me a lot.
Here is a question about the implementation of Efficient Self-Attention.
It seems you use a mean operation to downsample k and v,
while the official implementation uses a learnable mapping (a strided convolution followed by a linear projection) to downsample k and v.
May I ask whether this difference matters significantly in your experiments?
In your code:

```python
k, v = map(lambda t: reduce(t, 'b c (h r1) (w r2) -> b c h w', 'mean', r1 = r, r2 = r), (k, v))
```

The original implementation uses:
```python
self.kv = nn.Linear(dim, dim * 2, bias=qkv_bias)
self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
self.norm = nn.LayerNorm(dim)

x_ = x.permute(0, 2, 1).reshape(B, C, H, W)
x_ = self.sr(x_).reshape(B, C, -1).permute(0, 2, 1)
x_ = self.norm(x_)
kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
k, v = kv[0], kv[1]
```
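To make the comparison concrete, here is a minimal standalone sketch (the shapes and the name `r` are just assumptions for illustration, not code taken from either repository) that runs both reductions on a dummy feature map:

```python
# Sketch only: compares a parameter-free mean-pool reduction of k/v with a
# learnable strided-conv reduction in the spirit of the official code.
import torch
import torch.nn as nn
from einops import reduce

B, C, H, W, r = 1, 32, 32, 32, 4          # r plays the role of sr_ratio (assumed values)
x = torch.randn(B, C, H, W)

# Variant 1: parameter-free mean pooling over each r x r patch,
# as in the einops one-liner quoted above.
kv_mean = reduce(x, 'b c (h r1) (w r2) -> b c h w', 'mean', r1=r, r2=r)

# Variant 2: learnable spatial reduction, in the spirit of the official code:
# a strided conv, flatten to tokens, then LayerNorm before the kv projection.
sr = nn.Conv2d(C, C, kernel_size=r, stride=r)
norm = nn.LayerNorm(C)

x_ = sr(x)                                  # b c (H/r) (W/r)
x_ = x_.reshape(B, C, -1).permute(0, 2, 1)  # b (h w) c
x_ = norm(x_)                               # normalized reduced tokens

print(kv_mean.shape)  # torch.Size([1, 32, 8, 8])
print(x_.shape)       # torch.Size([1, 64, 32])
```

The second variant adds learnable parameters (the strided conv and the LayerNorm), which is exactly the difference I am asking about.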