Is it possible that the positional encoding, rather than VSA, is what works?

In your code file `ViTAE-VSA\Image-Classification\vitaev2_vsa\NormalCell.py`, line 130:
`self.pos = nn.Conv2d(dim, dim, window_size//2*2+1, 1, window_size//2, groups=dim, bias=True)`
Your `window_size` is 7, so the `self.pos` convolution kernel is also 7×7, which is large compared with the kernels used in most positional encoding extractors.
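
To make the arithmetic explicit, here is a minimal sketch in plain Python of the geometry that line implies. The helper name `pos_conv_geometry`, the channel count `dim=64`, and the 56×56 feature map are my own assumptions for illustration; they are not taken from your code.

```python
def pos_conv_geometry(window_size, dim, height, width):
    """Geometry of a depthwise Conv2d built with the arguments in the quoted line.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    kernel = window_size // 2 * 2 + 1   # kernel_size argument, always odd
    padding = window_size // 2          # padding argument
    stride = 1                          # stride argument
    # Standard convolution output-size formula (dilation = 1).
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    # groups=dim means one kernel x kernel filter per channel, plus one bias each.
    params = dim * kernel * kernel + dim
    return {
        "kernel": kernel,
        "padding": padding,
        "out_size": (out_h, out_w),
        "params": params,
    }


# dim=64 and a 56x56 feature map are assumed values, not taken from the repo.
geom = pos_conv_geometry(window_size=7, dim=64, height=56, width=56)
print(geom)
```

With `window_size=7` the kernel is 7×7 with padding 3, so the spatial size is preserved and each output position aggregates a 7×7 neighbourhood per channel, which is the same extent as one attention window.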

**So is it possible that the positional encoding, rather than VSA, is what is actually working?**