
Post-Training to Remove Positional Bias

Applying Franca's "Removal of Absolute Spatial Attributes" (RASA) post-training (Venkataramanan et al., 2025) to remove positional bias from other SSL-pretrained ViTs is simple and improves downstream performance. I evaluated performance with OverClustering (Ziegler & Asano, 2022) and obtained a 1-3% boost on the validation set across different ViT model sizes. Since the original RASA codebase is not easy to reuse, I adapted it so that any pre-trained encoder can be plugged in. I observed a performance increase for both DINOv2 and DINOv3 encoders.
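To make "positional bias" concrete: if you average patch tokens over many inputs, a position-unbiased encoder should produce nearly the same mean embedding at every patch position, while a biased one shows position-dependent structure. A minimal probe of this idea, using a stand-in encoder with synthetic features (the names and the additive-bias model are illustrative, not this repo's code or the RASA objective):

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 384  # 14x14 patch grid, ViT-S embedding size

# Stand-in "encoder": a fixed per-position offset emulates positional bias.
pos_bias = rng.normal(scale=0.5, size=(num_patches, dim))

def encode(images, with_bias):
    # images: (batch, num_patches, dim) stand-in patch features
    return images + (pos_bias if with_bias else 0.0)

images = rng.normal(size=(512, num_patches, dim))

spread = {}
for with_bias in (True, False):
    tokens = encode(images, with_bias)
    mean_per_pos = tokens.mean(axis=0)  # (num_patches, dim)
    # How far each position's mean token deviates from the global mean:
    spread[with_bias] = np.linalg.norm(
        mean_per_pos - mean_per_pos.mean(axis=0), axis=1
    ).mean()
    print(f"bias={with_bias}: mean positional spread {spread[with_bias]:.2f}")
```

The biased encoder's per-position means spread far from the global mean, while the unbiased one's spread shrinks toward zero as the batch grows; RASA post-training aims to push a real encoder toward the latter behaviour.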

This image displays patch cosine similarity between a selected patch token and all other patches, as on page 4 of the DINOv3 paper (Siméoni et al., 2025). This qualitative evaluation shows how well the model can distinguish between object types in the image. In visualization.ipynb I examine what encoder size and RASA post-training do to the model's similarity maps. Smaller models still struggle to produce good cosine similarities. See the visualization.ipynb notebook or try it yourself in Google Colab.
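The similarity map itself is cheap to compute: L2-normalize the patch tokens, then one matrix-vector product gives the cosine similarity of a query patch to every patch, reshaped to the patch grid. A self-contained sketch with random features standing in for real encoder outputs (function and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h = w = 14                                       # 14x14 patch grid
feats = rng.normal(size=(h * w, 384)).astype(np.float32)

def cosine_map(features, query_idx):
    # Normalize each patch token, then a single matrix-vector product
    # yields the query patch's cosine similarity to every patch.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return (f @ f[query_idx]).reshape(h, w)

sim = cosine_map(feats, query_idx=(h // 2) * w + w // 2)  # centre patch
print(sim.shape)          # (14, 14); self-similarity at the centre is 1.0
```

In the notebook the same map would come from a real encoder's patch tokens, upsampled and overlaid on the input image.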

Setup

Install the packages using the requirements.txt file.

```shell
# using conda
conda create --name dino python=3.11
conda activate dino
pip install -r requirements.txt

# Run the code; adjust ./configs/rasa.yml or the argparse flags
python main.py --exp_name "rasa_vits"
```

Results

Pascal VOC2012
Performance on the validation set with the DINOv3 ViT-S encoder for OverClustering (Ziegler & Asano, 2022) with k={21, 100, 300} on 40/90 batches of the validation set with batch size 16 (due to compute constraints).

| k   | Validation mIoU | After RASA Post-Training | Δ vs Original |
|-----|-----------------|--------------------------|---------------|
| 21  | 15.67%          | 16.12%                   | +0.45%        |
| 100 | 46.56%          | 47.64%                   | +1.08%        |
| 300 | 59.14%          | 59.94%                   | +0.80%        |

Performance on the validation set with the DINOv3 ViT-B encoder for OverClustering (Ziegler & Asano, 2022) with k={21, 100, 300} on 40/90 batches of the validation set with batch size 16 (due to compute constraints).

| k   | Validation mIoU | After RASA Post-Training | Δ vs Original |
|-----|-----------------|--------------------------|---------------|
| 21  | 20.07%          | 21.56%                   | +1.49%        |
| 100 | 51.30%          | 54.22%                   | +2.92%        |
| 300 | 68.03%          | 66.60%                   | -1.43%        |

I picked the best weights based on an intermediate evaluation with $k=21$; the choice might therefore be suboptimal for larger $k$.
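For context on the metric: OverClustering partitions patch embeddings into k clusters with k larger than the number of classes, maps each cluster to the ground-truth class it overlaps most, and scores the resulting segmentation with mIoU. A minimal sketch of that pipeline on synthetic data, using scikit-learn's KMeans (an assumption for illustration; the repo's actual evaluation code may cluster differently):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic stand-in: 2000 "patch embeddings" drawn from 3 classes.
labels = rng.integers(0, 3, size=2000)
centers = rng.normal(scale=5.0, size=(3, 32))
feats = centers[labels] + rng.normal(size=(2000, 32))

# OverClustering: k clusters with k > num_classes, then map each cluster
# to the ground-truth class it overlaps most (majority vote).
k = 10
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(feats)
cluster_to_class = np.array(
    [np.bincount(labels[clusters == c], minlength=3).argmax() for c in range(k)]
)
preds = cluster_to_class[clusters]

# Mean intersection-over-union across the 3 classes.
ious = []
for cls in range(3):
    inter = np.sum((preds == cls) & (labels == cls))
    union = np.sum((preds == cls) | (labels == cls))
    ious.append(inter / union)
miou = float(np.mean(ious))
print(f"mIoU: {miou:.3f}")
```

Because k exceeds the class count, the metric rewards embeddings whose fine-grained cluster structure still respects class boundaries, which is why larger k tends to give higher mIoU in the tables above.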

References

Venkataramanan, S., Pariza, V., Salehi, M., Knobel, L., Gidaris, S., Ramzi, E., Bursuc, A., & Asano, Y. M. (2025). Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning (No. arXiv:2507.14137). arXiv. https://doi.org/10.48550/arXiv.2507.14137

Siméoni, O., Vo, H. V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., … Bojanowski, P. (2025). DINOv3 (No. arXiv:2508.10104). arXiv. https://doi.org/10.48550/arXiv.2508.10104

Ziegler, A., & Asano, Y. M. (2022). Self-Supervised Learning of Object Parts for Semantic Segmentation (No. arXiv:2204.13101). arXiv. https://doi.org/10.48550/arXiv.2204.13101
