index.md: 2 additions & 8 deletions
@@ -82,9 +82,7 @@ Three specific issues arise from the ViT sparse sampling problem;
These issues hinder efficient optimization of SFS under standard tokenization; in other words, we posit that **grids cannot align with every salient region**.

-<div align="center">
*Figure 1: A $5 \times 5$ patch grid (gray) with three optimal region placements for sparse feature selection. **(a)** The green patch is well aligned (A), yellow straddles two cells (B), and red lies on a corner (C) and leaks into four cells. Translating the grid only swaps which peak is misaligned; one patch is always bad. **(b)** Our subpixel tokenizer drops fixed-size windows (green squares) directly on each peak, eliminating the alignment trade-off while still allowing conventional grid tokens when they **are** well aligned.*
-</div>

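As a rough illustration of placing a fixed-size window directly on a peak, the sketch below samples a window centred at a continuous location using bilinear interpolation. Bilinear interpolation, the border clamping, the convention that integer coordinates are pixel centres, and the names `bilinear` and `extract_window` are assumptions made for this example; the source does not specify how the subpixel tokenizer resamples pixels.

```python
import math


def bilinear(image, x, y):
    """Sample a 2D list-of-lists image at continuous (x, y), clamping at borders."""
    height, width = len(image), len(image[0])
    # Clamp the query point so it always falls inside the image.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = image[y0][x0] * (1 - wx) + image[y0][x1] * wx
    bottom = image[y1][x0] * (1 - wx) + image[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def extract_window(image, cx, cy, size):
    """Return a size x size window centred on the continuous point (cx, cy)."""
    half = (size - 1) / 2.0
    return [[bilinear(image, cx - half + col, cy - half + row)
             for col in range(size)]
            for row in range(size)]
```

A window centred on a pixel centre reproduces the underlying pixels, while a half-pixel offset blends neighbours, which is what lets a window sit on a peak that a fixed grid would split across cells.
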
## Methodology: SPoT in a Nutshell
@@ -115,19 +113,15 @@ This means that models can be evaluated with the exact same features as a standa
By removing the strict adherence to grids in ViTs, we can leverage more continuous spatial priors for token placement to improve feature extraction.
We compare several spatial priors, each encoding different assumptions about feature importance and spatial distribution.

-
-
-<div align="center">
-*Figure 2: An illustration of different spatial priors investigated with SPoT.*
-</div>
-
- *Uniform*: randomly samples locations with no spatial bias, assuming all regions are equally important.
- *Gaussian*: randomly samples locations with a central bias, which encodes a prior belief that subjects are typically centered in images.
- *Sobol*: provides quasirandom sampling aimed at uniform coverage while reducing overlap.
- *Isotropic*: deterministically distributes tokens evenly in a subpixel grid, emphasizing coverage.
- *Center*: deterministically distributes tokens evenly with a slight central bias.
- *Salient*: encodes object-centric bias by placing tokens based on regions identified as visually salient by a pretrained saliency model.

+
+*Figure 2: An illustration of different spatial priors investigated with SPoT.*

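The priors listed above can be sketched as samplers over the unit square. This is a minimal illustration under stated assumptions, not the SPoT implementation: the function names, the `sigma` and `strength` parameters, and the clamping to the image bounds are all hypothetical, and a Halton sequence stands in for Sobol to keep the sketch dependency-free. The *Salient* prior is omitted because it depends on a pretrained saliency model.

```python
import math
import random


def uniform_prior(n, rng):
    """n points drawn uniformly from the unit square [0, 1)^2."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def gaussian_prior(n, rng, sigma=0.2):
    """n points drawn around the image centre; sigma is an assumed spread."""
    def draw():
        # Clamp so every location stays inside the image (an assumption).
        return min(max(rng.gauss(0.5, sigma), 0.0), 1.0)
    return [(draw(), draw()) for _ in range(n)]


def radical_inverse(i, base):
    """Van der Corput radical inverse of the integer i in the given base."""
    value, frac = 0.0, 1.0 / base
    while i > 0:
        value += (i % base) * frac
        i //= base
        frac /= base
    return value


def halton_prior(n):
    """Quasirandom points (Halton, bases 2 and 3), a stand-in for Sobol."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3))
            for i in range(1, n + 1)]


def isotropic_prior(n):
    """Deterministic, evenly spaced cell centres; n must be a perfect square."""
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square")
    return [((col + 0.5) / side, (row + 0.5) / side)
            for row in range(side) for col in range(side)]


def center_prior(n, strength=0.2):
    """Isotropic layout contracted towards the centre by an assumed factor."""
    scale = 1.0 - strength
    return [(0.5 + (x - 0.5) * scale, 0.5 + (y - 0.5) * scale)
            for x, y in isotropic_prior(n)]
```

Each sampler returns normalized `(x, y)` pairs, so the same downstream code could scale them to any image resolution; for example, `isotropic_prior(196)` gives a 14 by 14 layout of cell centres.
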
### Exploring Oracle Neighbourhoods with SPoT-ON
In addition to investigating different spatial priors, we also directly explore differentiable optimization of token placement.